I’m starting to feel like a Formula 1 driver: every month brings a new venue with huge crowds, but in my case the venues are AI industry events. This autumn I’m alternating between events on both sides of the Atlantic, which gives great insight into the differences in current practice between North America and Europe.
Back in October I attended the excellent World AI Summit in Amsterdam. It was a great event and had a very new-age European feel to it, making extensive use of video, virtual reality and animation with a video DJ as the master of ceremonies. It was quite the AI Grand Prix pit-party!
Despite the glitzy wrapper around the presentations, the event was all business, with many deep neural network (DNN) training practitioners walking the exhibition floor. I enjoyed working the aisles near our booth, talking to folks who had never considered Iceland as the home for their intensive, industrial-scale AI compute, but who after 30 seconds had their lightbulb moment when they suddenly realised the economic savings and the benefit to their compute’s carbon footprint.
Many of the people who came to our booth had made the easy (and many in the high performance computing world would say lazy) choice of just using a hyperscale cloud for their machine learning, but were already complaining of issues. These ranged from performance problems caused by noisy neighbours, through non-stellar technical support, to the obvious one: pricing. Without degrees in hyperscale sliding pricing mechanics, many were struggling, and one guy I spoke to had spent his monthly budget in a couple of days without realising. Ouch.
Since the summer, when I got a deep technical briefing on the NVIDIA T4 GPU’s mixed precision capabilities and we got the opportunity to benchmark it against the P100 and V100, I’ve been asking everyone what precision they train their DNNs at. I’ve noticed a dichotomy between subjective and objective datasets, with voice, language and vision datasets using lower precision than their scientific counterparts. The exceptions I’ve encountered are usually due to historic development tools and are slated for migration to lower precision when their training volumes escalate.
This has created sustained interest in our bare metal cloud NVIDIA T4 GPUs, since our benchmark showed them performing machine vision DNN training at 90% of the V100’s performance with the latest NVIDIA drivers and appropriate tuning. Once you need ultra-speedy GPU-to-GPU connectivity such as NVLink, or FP64 arithmetic, the V100s have no equal, especially when configured in an NVIDIA DGX chassis. Hence the number of DGX-1 and DGX-2 systems being used for autonomous driving development.
Clearly the T4 GPU has a different, less power-hungry floating-point architecture to the V100. I’m seeing the more sophisticated DNN systems engineers carefully examining their DNN operations in both the training and inference hardware domains to test for slight divergences in results, which may be impactful to the final application. I’m sensing a “best practice” evolving to train and run the inference on the same GPU floating-point architecture. If you are training your DNN on V100s and running inference on T4s or FPGAs, be thoughtful.
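To make the “slight divergences” point concrete, here is a minimal sketch in Python of how the same dot product can come out differently at two precisions. It is an illustration only: it emulates IEEE 754 half and single precision rounding on the CPU with the standard library, it is not how any particular GPU accumulates, and the helper names (`round_to`, `dot`, `divergence`) are hypothetical.

```python
import random
import struct


def round_to(fmt, x):
    """Round a Python float to IEEE 754 half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot(weights, inputs, fmt):
    """Dot product with every operand, product and partial sum rounded to fmt.

    Assumption: rounding after each step is a crude stand-in for low-precision
    hardware; real GPUs often accumulate at a higher precision.
    """
    total = 0.0
    for w, x in zip(weights, inputs):
        product = round_to(fmt, round_to(fmt, w) * round_to(fmt, x))
        total = round_to(fmt, total + product)
    return total


def divergence(weights, inputs):
    """Absolute gap between the single and half precision emulations."""
    return abs(dot(weights, inputs, "f") - dot(weights, inputs, "e"))


# One hypothetical "neuron" with 256 inputs, values kept small to avoid overflow.
rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]
inputs = [rng.uniform(-1.0, 1.0) for _ in range(256)]
gap = divergence(weights, inputs)
print(f"FP32 vs FP16 gap for one neuron: {gap:.6f}")
```

A check along these lines, run on real layers and real validation data on both the training and the inference hardware, is one way to decide whether a gap matters for the final application.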
As the number of DNN training users on our bare metal cloud grows, we are seeing an evolution in their thinking about storage. Often the first prototyping or initial proof of concept (POC) compute node is a 2- or 4-GPU, dual-CPU server with a generous internal solid-state drive (SSD). Over time, as the training dataset gets larger, the internal storage is augmented with a 20 – 50TB Network File System (NFS) node with RAID protection and a back-up/replication scheme. It’s then easy to load the appropriate training datasets into the GPU node.
Over time this is in turn augmented with an object storage solution (we provide Ceph-based ones), which is ideal for storing large datasets for later deployment directly into the GPU node or the NFS storage. This hierarchical storage solution keeps the needed data in close proximity at all times and provides the optimum compute performance.
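The storage hierarchy described above can be sketched as a simple staging step in Python. This is an illustration under stated assumptions, not our tooling: the function name `stage_dataset` is hypothetical, and each tier (NFS, object storage) is assumed to be visible as a mounted directory, which will not be true of every object store.

```python
import shutil
from pathlib import Path


def stage_dataset(name, tiers, cache_dir):
    """Return a path on local SSD for `name`, copying it in if needed.

    `tiers` is an ordered list of directories, nearest first (for example the
    NFS mount, then an object-store mount). All paths are hypothetical.
    """
    cached = Path(cache_dir) / name
    if cached.exists():
        return cached  # already on the GPU node's local SSD
    for tier in tiers:
        candidate = Path(tier) / name
        if candidate.exists():
            cached.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(candidate, cached)  # pull from the nearest tier that has it
            return cached
    raise FileNotFoundError(f"{name} not found in any storage tier")
```

The ordering of `tiers` is the design choice that matters: data is pulled from the closest tier that holds it, so the slower, larger tiers are only touched when the faster ones miss.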
This storage blog gives a great summary of the storage system types and their best use cases: “With block storage, files are split into evenly sized blocks of data, each with its own address but with no additional information (metadata) to provide more context for what that block of data is. ... Object storage, by contrast, doesn't split files up into raw blocks of data and does have describing metadata.”
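The distinction in that quote can be shown with a toy Python sketch. The names `to_blocks` and `StoredObject` are hypothetical and do not correspond to any real storage API; padding the final block with zero bytes is also an assumption made to keep the blocks evenly sized.

```python
from dataclasses import dataclass, field


def to_blocks(data, block_size):
    """Block storage view: address -> raw block, with no descriptive metadata."""
    return {
        offset: data[offset:offset + block_size].ljust(block_size, b"\0")
        for offset in range(0, len(data), block_size)
    }


@dataclass
class StoredObject:
    """Object storage view: the whole payload travels with its metadata."""
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)


blocks = to_blocks(b"abcdefghij", 4)
obj = StoredObject("datasets/sample", b"abcdefghij", {"content-type": "text/plain"})
```

Nothing in `blocks` says what the bytes are or which file they belong to, whereas `obj` can be found and described through its key and metadata.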
There are a couple of common land mines poised to kill any embryonic AI product. The first is any proprietary development tool or API that would lock you into a specific and potentially expensive cloud environment. This is particularly important if the product is destined for volume usage, where the higher compute costs associated with using the APIs or toolsets in a cloud become a cost-of-goods (COG) issue with a meaningful impact on the ultimate product pricing.
The second is the choice of training and inference operating system. CentOS is by far our most popular operating system because it is open source. Migrating to it from the other Linux or Unix flavours is a modest task, but moving from Windows Server is a completely different story. It requires careful planning and integration into a busy product roadmap, because it delivers no user functionality or customer value, only reduced development and operations costs once in production.
As you start and progress your AI development odyssey, consider our bare metal cloud or extreme “DGX-ready” colocation, both of which come with ample industry experience, low cost Icelandic renewable energy and a campus built for the job. Steve Jobs famously said in 1996: "Picasso had a saying -- 'good artists copy; great artists steal' -- and we have always been shameless about stealing great ideas." There is no need to steal: just train your DNN with us in Iceland and we’ll provide you with a steady stream of industry best practices to consider.
My Formula 1 calendar has two more US stops this year: Supercomputing 19 in Denver, November 17th – 21st, and the AI Summit in New York City, December 11th – 12th, which is preceded by our HPC and AI meetup on December 10th. Let’s compare notes at one of these.
Bob Fletcher, VP of Artificial Intelligence, Verne Global