It was a treat to be in Germany for the last two weeks attending NVIDIA's GTC Europe in Munich and, before that, the HPC User Forum in Stuttgart. The autumn weather was mostly warm and sunny - great for the Oktoberfest but certainly not “free air cooling” for computer clusters, despite most of Europe’s AI, HPC and Virtual Reality boffins being in town! Once again, here are some observations and insights from my travels...
As usual, GTC started with a bang, with NVIDIA CEO Jensen Huang announcing a host of new products and GPU applications. The keynote is well worth the two and a quarter hours it takes to watch. He does a fabulous job of promoting his products, and he must be pumping iron to be able to hold up the dense DGX-2 motherboard, which without its cooling fins weighs close to 15 kilograms.
Volvo’s announcement that it is going to use the NVIDIA Drive AGX for its next generation of vehicles demonstrates that autonomous vehicles are marching relentlessly toward mainstream production. The Drive AGX incorporates the NVIDIA Xavier system on a chip (SoC), the world’s first processor built for autonomous driving. Architected for safety, the Xavier SoC incorporates six different types of processors for redundant and diverse algorithms.
The GTC sessions run the gamut from AI/deep learning to supercomputing to virtual reality, all exploiting GPUs. The 16-GPU DGX-2 is the heart of many supercomputer sites, while the 1080Ti and recently available 2080Ti are the workhorses of workstation graphics. I focused much of my time on trying to better understand the AI/GPU ecosystem and the evolution of a product through it.
It’s a new wrapper on a familiar product development structure. All products start with an idea, which leads to some prototyping. For deep learning, the prototyping often includes selecting the deep neural network (DNN) training framework and the type of DNN that provide the best results. Often a specific DNN type works well for a specific task, while others, when primed with similar data, take forever to train. Picking the combination of framework and DNN type is often regarded as black magic – the data scientists involved almost always do rigorous A/B testing of many theories before converging on a combination.
Providing a sufficiently powerful compute environment is essential to make good progress. A few years ago, the NVIDIA Digits workstation was popular, and it still is in academia, but many corporate development teams have their own DGX Station or a high-powered server with 4-8 GPUs to experiment with. The prototyping stage will deal with modest datasets; ImageNet is a popular one in the machine vision area. It’s an image dataset organised according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet and on average 1000 images to illustrate each synset. Prototyping will likely train on a subset of synsets.
Images of each concept are quality-controlled and human-annotated, offering tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
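For the curious, training on a subset of synsets might look something like the minimal sketch below. It assumes you already have an index mapping each synset ID to its image files; the function name, the synset IDs and the file names are all made up for illustration and are not part of ImageNet or any particular framework.

```python
import random


def pick_prototype_subset(images_by_synset, num_synsets, max_images_per_synset, seed=0):
    """Return a smaller {synset_id: [image, ...]} mapping for quick experiments."""
    rng = random.Random(seed)  # fixed seed so a prototype run can be repeated
    # Sort the keys first so the sample does not depend on dict insertion order.
    chosen = rng.sample(sorted(images_by_synset), num_synsets)
    subset = {}
    for synset_id in chosen:
        images = images_by_synset[synset_id]
        # Cap the images per synset to keep prototype training runs short.
        k = min(max_images_per_synset, len(images))
        subset[synset_id] = rng.sample(images, k)
    return subset


# Toy stand-in for an ImageNet-style index; IDs and file names are hypothetical.
toy_index = {
    "synset_a": [f"a_{i}.jpg" for i in range(5)],
    "synset_b": [f"b_{i}.jpg" for i in range(2)],
    "synset_c": [f"c_{i}.jpg" for i in range(4)],
}

prototype = pick_prototype_subset(toy_index, num_synsets=2, max_images_per_synset=3)
```

Fixing the seed means two data scientists comparing frameworks or DNN types are at least training on the same slice of data.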
After the prototyping stage comes the product phase, which uses the concepts developed in the prototyping stage but trains on industrial-sized datasets with many millions, if not hundreds of millions, of examples. This phase often needs industrial-scale computing to complete the task in a timeframe short enough to iterate against. It is not unusual to see well over 10 racks of GPU-accelerated servers grinding away for a few days to a month on such a task. Shorter is better, especially if you want to keep your valuable data scientists happy. Much of this product DNN training is not viable near the large urban centers because power and hosting costs devour too much of the budget. Iceland can help here.
NVIDIA extended its development ecosystem with RAPIDS, a new data science tool. It is a suite of open source software libraries that gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs, which will most certainly help accelerate the creation of DNN products.
Once a product has been trained sufficiently for user feedback, the puzzle changes yet again. Often the DNN needs updating on a regular but infrequent basis in response to user feedback, perhaps once a quarter. However, in a very competitive market like natural language, Siri, Alexa, Google Now, Cortana, DeepL and others are retraining continuously to maintain their market positions. Iceland has lots to offer the latter group and others like them, provided that the datasets are not too voluminous, like a million-car sensor dataset prior to its pre-processing and sensor fusion.
All AI training roads lead to Verne Global in Iceland!
Earlier during my German visit I was privileged to see many of the supercomputer centers' future roadmaps. Leonardo Flores of the European Commission presented on “EuroHPC and the European HPC Strategy”. It was an ambitious HPC, AI and data security roadmap extending out to 2027 and combining over €1B of both European Commission and Member State funding. If I remember correctly, it planned to build three centers for each of AI, HPC and data security, to be operated in a similar fashion to today’s supercomputing centers.
In this supercomputer center community, a lot of value is attributed to the Top 500 supercomputer rankings.
This list includes some truly impressive compute facilities chasing the exascale compute goal. Out of interest I compared DeepL’s compute cluster in Iceland to the list and found that they made the top 50! Not bad for a 30+ person start-up from Cologne, Germany. I asked many of the GTC attendees about their Top 500 list ranking, and many of them were in the top 200. I’m sure that the hyperscalers would take half of the top 10 slots. It appears commercial companies would prefer their competitors not to know if they are bulking up on compute power for whatever reason.
The commoditisation and increasing diversity of HPC computing elements have recently introduced an interesting puzzle for the supercomputing ecosystem, which is now mostly GPU-augmented servers. How do you efficiently manage a funding and procurement cycle which is much longer than the average GPU product cycle/market window?
Perhaps the Amazon/Walmart commerce model provides some clues – when the cost to return and repair a smaller product approaches its value, they ask you to recycle it and just send you a new one! This effectively abandons the time-honoured exchange of a broken device for a new one, or the option to repair it. Today’s killer businesses are all about low friction, minimal transaction cost and scale.
Lastly, just a quick note to say I am really looking forward to Verne Global's AI and HPC Field Trip taking place next week in Iceland. Iceland is increasingly becoming a destination for intensive deep neural network training and heavy machine learning workloads. I will be showing an invited group of European and North American AI pioneers around the data center, and I'm as enthused to learn from them as I am to show them our dedicated infrastructure. More on this to follow...