Data centers are evolving quickly with the advent of large language models (LLMs) and the need for bigger, more powerful chips. Traditional data centers built around central processing units (CPUs) managed roughly 12 kilowatts (kW) per rack. In AI data centers, however, rack densities are climbing rapidly as graphics processing units (GPUs) dominate the scene and grow increasingly powerful. Nvidia’s H100 chip has a rack density of 41 kW, its Blackwell successor has an expected rack density of 130 kW, and future chips may drive rack densities of 250 kW. More capacity and electricity consumption will create significantly more waste heat, making it harder to maintain an optimal temperature of between 21 and 24 degrees Celsius.
With cooling representing approximately 35% of total data center energy use, the design and choice of cooling systems becomes increasingly critical. Well-chosen systems can improve efficiency and reduce both upfront capital expenditures (CapEx), principally building costs, and ongoing operational expenses (OpEx), principally electricity costs. These savings can be considerable, with some liquid cooling technologies cutting CapEx by over 50% and OpEx by an impressive 90%.
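To make the arithmetic concrete, here is a minimal sketch (using the 35% cooling share and 90% OpEx reduction cited above, with an assumed normalized energy budget) of how a deep cut in cooling energy flows through to total facility energy:

```python
# Illustrative arithmetic, not measured data: if cooling is 35% of total
# facility energy, a 90% cut in cooling energy shrinks the total
# substantially on its own.

total_energy = 100.0            # normalized facility energy (assumed units)
cooling = 0.35 * total_energy   # cooling share cited in the text
other = total_energy - cooling  # IT load, power conversion, etc.

cooling_after = cooling * (1 - 0.90)  # 90% reduction on cooling energy
total_after = other + cooling_after

savings_pct = 100 * (total_energy - total_after) / total_energy
print(f"Total facility energy falls by {savings_pct:.1f}%")
```

In other words, even before counting CapEx effects, the cited cooling savings alone would trim total facility energy use by roughly a third.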
However, the value of reducing cooling energy use is not solely a matter of cost management. Perhaps even more importantly, the ability to reduce energy use, and especially capacity requirements, will help increase access to the power grid. Capacity represents the instantaneous demand (or supply) on the system, and this is where most power grids are constrained: they simply cannot connect more loads to a bulk power system in which both generating and transmission capacity are in short supply.
With the rapid scale-up of AI compute requirements, tens of gigawatts of generating and transmission capacity are being sought, even as time-to-power becomes more important. Therefore, the ability to reduce overall capacity requirements by deploying more efficient systems will increase the opportunities for new data centers to hook up to power, while getting more compute capability for every scarce megawatt of capacity they are able to access.
Most of today’s typical data centers have relatively low power densities and generally utilize forced air cooling. However, these systems require large heatsinks and the ability to manage hot exhaust air. That heat can affect the lifespan of nearby components. Air-cooled systems are also noisy, a concern in populated environments.
By contrast, liquid cooling approaches offer a number of advantages. They are more efficient at carrying heat away from chips, cutting energy use and operating costs. The systems are typically smaller, freeing up space. Liquid cooling has another advantage that has yet to be widely realized: its efficiency in concentrating waste heat allows that heat to be repurposed for secondary uses, such as district heating.
Given the rapid pace at which chip technology is evolving and the resulting demands on the grid, liquid cooling is quickly becoming a necessity. A number of data center developers are incorporating liquid cooling into their data center designs from the outset, and Nvidia announced that its AI-oriented Blackwell architecture will be completely liquid cooled. This dynamic is picking up speed. Last December, Microsoft and Schneider Electric each released high-efficiency liquid cooling system designs to cool AI-dedicated chips. For newer and more energy-intensive data centers, liquid cooling is rapidly becoming the default approach to managing waste heat.
There are a number of ways that liquids are used to cool data centers, and each technology has a sweet spot in the balance between efficiency and cost, relative to the heat management task it performs.
Rear door heat exchangers (RDHx) are currently in use in many facilities. Mounted at the rear of server racks, they transfer heat from fan-driven hot exhaust air to a chilled water loop that pulls heat out of that air before it recirculates. The water is in turn cooled in a chiller or cooling tower before flowing back to the heat exchanger. These systems reduce the need for larger fan-based circulation systems and create more uniform temperatures on the server floor. Today, this technology can effectively be deployed for racks up to 100 kW.
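As a rough illustration of what the chilled water loop must carry, a short sketch (using the 100 kW upper bound cited above and an assumed supply/return temperature rise; not vendor data) applies the sensible-heat relation Q = ṁ · c_p · ΔT:

```python
# Back-of-the-envelope RDHx loop sizing with assumed numbers:
# the water mass flow needed to carry away a rack's heat.

rack_heat_w = 100_000   # 100 kW rack, the upper bound cited in the text
c_p_water = 4186        # specific heat of water, J/(kg*K)
delta_t = 10            # assumed supply/return temperature rise, K

m_dot = rack_heat_w / (c_p_water * delta_t)  # required mass flow, kg/s
print(f"Required water flow: {m_dot:.2f} kg/s (roughly {m_dot:.2f} L/s)")
```

A wider allowable temperature rise reduces the required flow proportionally, which is one reason supply and return temperatures are a key design choice.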
Direct-to-chip cooling transfers heat from the processor to a cold plate that dissipates heat through liquid channels. There are both single-phase and two-phase systems.
In a single-phase system, the heated liquid is channeled to a heat exchanger before the cooled liquid is reintroduced to the cold plate. Single-phase systems remove heat far more effectively than forced air cooling and can address higher heat loads than RDHx systems. They are relatively cost-effective compared with other liquid cooling systems, but a leak can significantly damage computing equipment.
Two-phase systems, by contrast, also harness the cooling power of evaporation. They typically use a fluorocarbon-based coolant that absorbs heat as it boils off, then releases that heat when the vapor condenses in a heat exchanger. That additional complexity introduces higher management and maintenance costs, but leakage is not an issue.
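The advantage of evaporation can be sketched with a rough comparison (all fluid properties below are assumed, illustrative values, not specific to any commercial coolant): because boiling absorbs latent heat, a two-phase loop can move the same heat with far less coolant mass flow than a single-phase sensible loop.

```python
# Illustrative comparison of sensible vs. latent heat transport,
# using assumed property values for a generic fluorocarbon coolant.

heat_w = 50_000   # 50 kW heat load (illustrative)

# Single-phase: sensible heat only, Q = m_dot * c_p * delta_T
c_p = 1100        # J/(kg*K), assumed liquid specific heat
delta_t = 15      # K, assumed allowable temperature rise
m_dot_single = heat_w / (c_p * delta_t)

# Two-phase: latent heat dominates, Q = m_dot * h_fg
h_fg = 100_000    # J/kg, assumed latent heat of vaporization
m_dot_two = heat_w / h_fg

print(f"single-phase: {m_dot_single:.2f} kg/s, two-phase: {m_dot_two:.2f} kg/s")
```

Under these assumptions the two-phase loop needs several times less coolant flow, which is where its efficiency edge comes from.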
Immersion cooling involves submerging the entire system in non-conductive dielectric liquid. This technology offers excellent cooling efficiency, and can address very high heat loads. However, there are drawbacks related to environmental impact (they use PFAS “forever chemicals”), potential leakage, and the fact that the system can become the single point of failure, significantly damaging the entire data processing system.
Just as with direct-to-chip cooling, immersion technologies come in two basic variants: single-phase and two-phase. Single-phase coolants remain in liquid form at all times, while two-phase coolants start as liquids and boil off as gases when heated. The latter boasts higher efficiencies but is more costly and does not lend itself as well to smaller, less dense facilities.
There are also other emerging liquid-based technologies such as microconvective cooling, also called microjet impingement. This approach sprays cooling fluid on specific hotspots on processors and is more effective for high-power applications.
Where the interest lies

As high-performance and AI computing accelerate and rack densities increase, interest in liquid cooling and industry uptake have grown quickly. An early 2024 survey found that 40% of participants who were increasing rack densities were focused on liquid cooling technologies, and almost one-third of respondents anticipated adopting liquid cooling within the coming 12-24 months. Single-phase direct-to-chip was the leading choice, followed by two-phase and single-phase immersion.
As rack densities increase and servers run at higher capacity factors, traditional air-cooling approaches are simply unable to keep up with the heat. Some operators are even retrofitting existing facilities for liquid cooling. These days, liquid cooling isn’t just something to consider; it’s increasingly a necessity.