Last week I was privileged to be part of our AI and HPC Field Trip to Iceland. The goal of the trip was to share insight and observations around the evolution of AI deep neural network (DNN) training and to tour our HPC-optimised data center. The attendees spanned DNN veterans like Eric Sigler of OpenAI, large enterprise data science leaders like Pardeep Bassi of Liverpool Victoria Insurance (LVE) and start-up pioneers like Max Gerer of e-bot7. By intention (and very much to the surprise of the attendees) there was no hard selling involved. For sure, the group were impressed by our CTO Tate Cantrell's passion for designing high quality, industrial scale data centers and his healthy obsession with the small details that together make the campus so well suited to intensive compute. Here are my learnings and observations from the Field Trip...
First off, all good 'brainstorming' sessions require fuel, in our case a traditional (and what seemed like a regular, daily supply of...) Icelandic lobster bisque, salted lamb and a selection of fabulous cream cakes. Unlike previous Field Trips, nobody was tempted by the wine with lunch, so I was rather grateful that, to be well prepared, I’d polled the group for their “sleepless night” problems and challenges.
Prior to diving into the challenges, I shared a series of Rumours, Tips and Tricks amassed during 18 months on the AI-trade show circuit. Dave Driggers of Cirrascale also shared his insight into building large-scale DNN training clusters and what impacts their performance. The brainstorming focused on four topic areas:
- DNN training challenges
- Beyond GPUs – is it practical?
- Training in the generic/hyperscale compute clouds
- The 2019 AI market
All of the idea sharing was fascinating but the highlights for me were the following:
- Be very careful in architecting your multi-GPU solution – any mismatch in memory, storage or data I/O will invalidate large investments in GPU hardware. I can just imagine spending $30,000 on a potent 3-GPU server only to find it’s no more productive than the single-GPU one it replaced!
- Data ingest design is frequently the Achilles heel of DNN training infrastructure for multi-GPU solutions. Time invested in ensuring that large training datasets are always available when the GPUs need them is important, perhaps including some form of caching. More than once someone cited the need for pure storage type flash memory storage systems versus spinning disks. Watch out though, as flash storage can easily break the budget.
- One of the attendees working in natural language detailed how his company had evolved from CPU DNN training to GPU training and was now reverting to CPUs. Interestingly, it appears that in their case the large-scale parallelism of GPUs was not the panacea they had hoped for, and their training methodology was better suited to CPUs.
- A couple of the companies in the room with more DNN training experience admitted to tracking more than 19 new hardware alternatives to GPUs. These range from the well-known Graphcore IPU to the much less well-known Singular Computing approximate computing APU. The consensus was that there would need to be a 5-10 times performance improvement to justify porting existing software, but a few in the room thought that 10 times or more was easily possible. I look forward to seeing whatever they have previewed.
- A completely new term for me, but not for many in the room, was “Noisy Neighbour”. A couple of folks explained how, when working on generic compute clouds, they had failed to get anything like the performance they expected or needed to justify their budget. It appears that in a virtual machine (VM) environment each VM can make unprovisioned demands on the shared hardware, which forces the other VMs to wait. This is not such a drama for regular compute tasks like email and web hosting, but it really impacts DNN training, where the compute task runs as fast as the hardware will let it for extended periods.
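The first two highlights above, mismatched data I/O and data ingest as the Achilles heel, can be made concrete with a back-of-envelope estimate. The sketch below is only a rough model under stated assumptions: the function name, its parameters and all the figures are hypothetical and were not discussed on the trip, and it ignores caching, prefetching and any overlap of loading with compute. It simply compares the data rate the GPUs could consume with the rate the storage can deliver.

```python
from dataclasses import dataclass


@dataclass
class IngestEstimate:
    gpu_demand_mb_s: float  # data rate the GPUs could consume if never starved
    utilisation: float      # estimated fraction of time the GPUs are busy (0..1)
    effective_gpus: float   # GPUs' worth of useful work actually delivered


def estimate_ingest_bound(num_gpus, samples_per_sec_per_gpu,
                          sample_size_mb, storage_mb_s):
    """Rough estimate of how far storage throughput caps multi-GPU training.

    Hypothetical helper: assumes every sample is read from storage once per
    pass, with no cache and no overlap between loading and compute.
    """
    if min(num_gpus, samples_per_sec_per_gpu, sample_size_mb, storage_mb_s) <= 0:
        raise ValueError("all inputs must be positive")
    demand = num_gpus * samples_per_sec_per_gpu * sample_size_mb
    # The GPUs cannot be busier than the storage allows.
    utilisation = min(1.0, storage_mb_s / demand)
    return IngestEstimate(demand, utilisation, num_gpus * utilisation)


if __name__ == "__main__":
    # Made-up figures: 1000 samples/s per GPU, 0.25 MB per sample,
    # and storage that sustains 250 MB/s.
    for gpus in (1, 3):
        est = estimate_ingest_bound(gpus, 1000.0, 0.25, 250.0)
        print(f"{gpus} GPU(s): demand {est.gpu_demand_mb_s:.0f} MB/s, "
              f"utilisation {est.utilisation:.0%}, "
              f"effective GPUs {est.effective_gpus:.2f}")
```

With these invented numbers the storage exactly feeds one GPU, so in this model a 3-GPU server delivers roughly one GPU's worth of work, which is the scenario described in the first highlight. Faster storage or a cache raises the ceiling, at the cost noted above.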
There was agreement that DNN product evolution follows a roadmap from prototype to initial product to production product. The initial prototype is typically developed on a GPU-enabled workstation or generic compute cloud, perhaps focusing on DNN framework selection and dataset annotation verification and using hundreds of samples to take the DNN to a proof of concept.
Thereafter the production DNN is trained either on a bare metal cloud or on a dedicated compute cluster in an appropriate colocation facility. This will likely use tens or hundreds of millions of data samples to fully train the DNN. Depending on the market window, this may need some extreme compute resources to meet a deadline. Once a DNN is in production, the market and data type typically define the retraining cadence. In a competitive market like Siri versus Alexa versus Google Now, the DNN is retrained continuously with user feedback and new languages. By contrast, a product-specific DNN for factory robotic vision might be retrained only every quarter.
Some new best practice guidance suggests that doing the initial prototyping in a bare metal cloud or colocated compute cluster will shorten product development time by avoiding the transition from the workstation or generic cloud. I am pleased to say Verne Global can help with both cases.
After the brainstorming session we visited the Icelandic Institute for Intelligent Machines (IIIM) at Reykjavik University for some entertaining presentations on natural language translation and virtual reality technology, enjoyed some more lobster bisque and salted lamb at the superb Nautholl, and then decamped to the excellent Marina Hotel in Reykjavik.
Post dinner I enjoyed enlightening discussions on GPU acceleration of DNN training with Matthew Lamons, who also introduced me to how Skejul are utilising AI-driven data analytics to facilitate global diary scheduling. It appears that he also started tweeting to his extensive followers about the Field Trip before he departed the US - we like that kind of delegate, you can come again Matthew!
I also wagered Hicham Tahiri of SmartlyAI that I would blog about it before he tweeted about it – the clock is running Hicham... 😊 His innovative chat-bot start-up is a great example of the growing number of AI-powered ventures coming out of Paris. I’ve kept my eye on the AI scene in Paris and something is definitely brewing there. I hear spring is a lovely time to be in Paris so maybe that will have to be added to the AI-trade show footprint!
I’d like to thank all the contributors to our AI and HPC Field Trip and especially the attendees who engaged in many lively discussions. One of the key contributors to that networking was Christian Bryndum from VoiceBase, who are now, I am delighted to say, our newest hpcDIRECT customer. You can read more about why they chose hpcDIRECT here.
Verne Global will be hosting another AI Field Trip next February 26-28, 2019. Please reach out to me if you actively train DNNs, or have a particular fondness for lobster bisque and salted lamb, and would like to participate! If you have any questions about the brainstorming catch me at SC18 in Dallas or drop me an email at [email protected]