The long-lasting Icelandic winter didn’t impact the momentum of our second AI and HPC Field Trip. Once again we gathered around 15 practitioners from the worlds of AI, machine learning, deep learning and high performance computing to network and brainstorm together, as well as tour our industrial scale campus on Iceland’s former NATO base near Keflavik.
The attendees were a particularly interesting group representing the medical, drug discovery, investment management, machine vision and robotics industries, converging from the United States, Italy, Switzerland, France, the Netherlands, the United Kingdom and Germany. With such an international mix the ideas and insights were varied, but across all delegates the one unifying factor was a clear passion for the possibilities AI can provide business and society.
Our Field Trip started with an introductory social gathering followed by a dinner where we all got to learn more about each other’s AI journey and the challenges we all face. These challenges echoed and expanded upon those raised at the first AI and HPC Field Trip back in autumn 2018, and focused on deep neural network (DNN) evolution tracking, evolving from a prototype to a product, and scaling cost-effectively once in production.
The second day kicked off with the group visiting our campus and viewing our solutions for intensive forms of AI compute: HPC Cloud via our bare metal hpcDIRECT platform, and HPC Colocation via our powerADVANCE and powerDIRECT products. The high density, high power halls associated with powerDIRECT got everyone thinking about industrial scale.
Everyone was impressed by the clean, crisp designs for both traditional, cost-effective Tier-3 equivalent data center colocation and its industrial alternative, which offers savings of up to 70% versus the United States, the UK and continental Europe. powerDIRECT garnered interest from the Monte Carlo and volume DNN training communities, especially once you realise that the savings can often cover the price of the compute hardware in the first year or so of operation!
We also showed the group our HPC Cloud infrastructure, which is becoming increasingly popular with rapid-growth AI and machine learning start-ups such as the UK’s Satavia and VoiceBase in the United States. This dedicated, bare metal HPC platform provides up to 30% faster performance than hyperscale alternatives at 50% of the cost: a fair amount to think on when you’re digesting your Icelandic lunch of lobster soup and roast lamb!
After the data center the group travelled across to the nearby Blue Lagoon, Iceland’s famous geothermally heated spa. It draws on the same geothermal energy that powers the data center to provide super-heated water for a truly awe-inspiring swimming experience.
Over lunch at the Lagoon our brainstorming session discussed some industry trends and we enjoyed brief guest talks from Bart Schneider (Senior Director Business Development, NVIDIA) about the optimum AI training GPUs, Tim Llewellynn (CEO, Nviso) about their new DNN training collaboration infrastructure and Kamil Tamiola (CEO, Peptone) about exploiting quantum computing for drug discovery.
The brainstorming revolved around key AI challenges, compute acceleration beyond CPUs, and HPC versus generic compute clouds. In a couple of hours we covered lots of ground, but here are five key points that impressed me:
1. The top challenge for several attendees was the tracking and explainability of the resulting trained DNN. The consensus on current best practice, until DNN tracking databases become available, is to checkpoint each trained DNN iteration in GitHub, allowing you to backtrack easily if the DNN evolution starts to diverge from its goal.
2. Often, before a DNN can be trained on image data, that data needs to be annotated. Merchandise images, traffic signs, vehicle types and so on all need to be labelled on the images. The wrong annotation technique can have more impact on the final DNN results than the DNN training methods themselves. We contrasted the popular ways to solve this problem, from buying pre-annotated data, to developing an annotation DNN to do it, to crowdsourcing annotation data from the Internet. Crowdsourced annotation has been particularly successful for many companies.
[Image: Annotation from Clickworker]
3. Memory becomes increasingly problematic as the number of parameters grows, and can easily exceed the GPU’s capacity if not managed. The group was evenly split on whether it is better to cull the data set or to throw more capable hardware at the problem as its price decreases over time. My experience from the Internet backbone suggests that increasingly capable and affordable hardware will eventually win out, but that in the early market phase some data optimisation will pay dividends.
4. We tested the concept that any cloud can be used for DNN training. The consensus was that beyond early prototyping the generic compute clouds are expensive and not performant beyond about 3 GPUs. Someone shared a story of a colleague who used AWS for prototyping and then found a major lighthouse customer who was an Azure shop. It took their engineering team 3 months to remove all the AWS dependencies – quite an engineering project slip. Everyone agreed that using containers with the drivers and training framework inside allowed the training to exploit any bare metal solution in short order. Using cloud vendor feature APIs was not recommended for production products.
[Image: Brainstorming AI Best Practices]
5. Kamil shared that simulating quantum algorithms with GPUs in less than real-time was an industry best practice in the interim until you can obtain your own quantum hardware. Quantum computers are likely to be very task-specific, so they will likely be integrated into a hybrid cluster alongside both CPUs and GPUs. I look forward to seeing how this field evolves and I’m planning a future blog on the nuances of quantum computer colocation; my initial research has been fascinating.
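To make the idea of simulating a quantum algorithm on classical hardware concrete, here is a minimal sketch of a state-vector simulator that prepares a two-qubit Bell state. It is an illustration only, not the approach Kamil or Peptone described: it runs in plain Python on a CPU, and all function names are hypothetical. GPU-based simulators apply the same kind of update to far larger state vectors. The last helper shows why memory is the limiting factor, assuming 16 bytes per complex amplitude (double precision).

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector (list of complex)."""
    new_state = list(state)
    stride = 1 << target
    for i in range(len(state)):
        if i & stride == 0:
            # i has the target bit cleared; j is its partner with the bit set
            j = i | stride
            a, b = state[i], state[j]
            new_state[i] = gate[0][0] * a + gate[0][1] * b
            new_state[j] = gate[1][0] * a + gate[1][1] * b
    return new_state

def apply_cnot(state, control, target):
    """Flip `target` wherever `control` is 1 by swapping paired amplitudes."""
    new_state = list(state)
    c, t = 1 << control, 1 << target
    for i in range(len(state)):
        if i & c and not i & t:
            j = i | t
            new_state[i], new_state[j] = state[j], state[i]
    return new_state

# Hadamard gate: puts a qubit into an equal superposition
S = 1 / math.sqrt(2)
HADAMARD = [[S, S], [S, -S]]

def bell_state():
    """Start in |00>, apply H to qubit 0, then CNOT with qubit 0 as control."""
    state = [0j] * 4
    state[0] = 1 + 0j
    state = apply_single_qubit_gate(state, HADAMARD, 0)
    return apply_cnot(state, 0, 1)

def probabilities(state):
    """Measurement probability of each basis state."""
    return [abs(amplitude) ** 2 for amplitude in state]

def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a full state vector: 2**n amplitudes (assumed 16 bytes each)."""
    return (2 ** n_qubits) * bytes_per_amplitude

if __name__ == "__main__":
    print(probabilities(bell_state()))
    print(statevector_bytes(30) / 2 ** 30, "GiB for 30 qubits")
```

Because the state vector doubles with every added qubit, a full simulation outgrows a single device's memory quickly, which is one reason this kind of work gravitates towards large GPU and HPC systems.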
[Image: Quantum computer cryogenic cooling]
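Returning to point 1, the checkpoint-and-backtrack practice can be sketched in a few lines. This is one possible arrangement rather than what any attendee actually runs: each saved checkpoint gets a line in a small text manifest (file hash, iteration, validation loss) that can be committed to a Git repository, and a helper picks the entry to roll back to. The function names, the manifest layout and the use of validation loss as the "goal" metric are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path

def record_checkpoint(manifest_path, checkpoint_path, iteration, val_loss):
    """Append one JSON line describing a checkpoint to the manifest file."""
    digest = hashlib.sha256(Path(checkpoint_path).read_bytes()).hexdigest()
    entry = {
        "iteration": iteration,
        "file": str(checkpoint_path),
        "sha256": digest,      # lets you verify the weights file later
        "val_loss": val_loss,  # assumed metric for "distance from the goal"
    }
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def load_manifest(manifest_path):
    """Read every recorded checkpoint entry, oldest first."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def best_checkpoint(entries):
    """Return the entry with the lowest validation loss: the point to backtrack to."""
    return min(entries, key=lambda entry: entry["val_loss"])
```

The manifest is plain text, so it diffs and versions cleanly; the weight files themselves are usually large and may need to be stored separately, with the recorded hash tying each file to its manifest entry.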
In conclusion, everybody acknowledged that HPC, AI DNN training and quantum computing need a specialised environment to scale. Verne Global is one of the few locations where both HPC Cloud and HPC Colocation services were designed from the ground up as fit for industrial scaling.
Let’s brainstorm about your HPC and AI DNN training requirements and perhaps get you started on your AI journey with a trial of our hpcDIRECT AI-optimised GPU bare metal cloud.