The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather Archive

Ongoing/Commissioning Liquid Cooling Systems: Case Studies and a Systematic Guide


Authors: David Grant (Oak Ridge National Laboratory (ORNL)), Chris DePrater (Lawrence Livermore National Laboratory (LLNL)), David Martinez (Sandia National Laboratories), John Herboth (Glumac), Sara Hitt (Vertiv), Terry Rodgers (T5DataCenters)

Abstract: Ever-increasing heat density and scale in compute systems are driving the adoption of liquid cooling in cutting-edge supercomputers and large-scale AI training/inference systems. Peak performance and reliability hinge on robust initial commissioning and ongoing commissioning (OCx) of data centers. This session, tailored for operational managers, facility engineers, liquid cooling vendors, architects, and engineers, examines this critical process. Gain firsthand insights from real-world case studies featuring cooling system commissioning at Oak Ridge National Laboratory's Frontier and Lawrence Livermore National Laboratory's El Capitan, alongside Sandia National Laboratories' advanced OCx methodologies. Learn about a community initiative to document a guideline for liquid cooling commissioning.

Long Description: This session will share practical insights into commissioning and ongoing commissioning (OCx) of facility-side cooling systems for large, high-density supercomputers. Drawing on real experience from national laboratories, it will use case studies to support a collaborative discussion on best practices and a draft guide for liquid-cooled system commissioning. Audience feedback will help refine this guide to ensure its relevance and value for HPC sites worldwide.

The session begins with two presentations on commissioning cooling systems for exascale-class systems. Oak Ridge National Laboratory will share experiences from Frontier, and Lawrence Livermore National Laboratory will discuss its experience commissioning El Capitan’s cooling system. Both presentations highlight real-world challenges and solutions for large-scale liquid cooling.

Next, Sandia National Laboratories will present its approach to Ongoing Commissioning (OCx). While initial commissioning brings systems online, OCx keeps them running optimally through routine monitoring, testing, and adjustments. The presentation will show how to move from one-time testing to continuous performance improvement.

A commissioning agent from Glumac will moderate an audience discussion on the draft guide for systematic commissioning. The discussion will focus on clarity, completeness, and adaptability across diverse sites. The aim is to build practical, shared guidance to help engineers and operators manage growing cooling demands as supercomputing scales up.

As HPC moves toward exascale and beyond, robust and efficient cooling is critical, and commissioning and OCx are key to achieving reliable, long-term operation. This session is relevant not only to the HPC community but also to the rapidly expanding field of large-scale AI training and inference data centers. The HPC community has used liquid cooling for high-density compute far longer than the AI sector and has built up invaluable experience in developing efficient, minimal-water cooling strategies for fully liquid-cooled systems. This expertise can guide the AI infrastructure sector, which is now designing AI supercomputers with megawatt-scale racks akin to traditional HPC systems. By sharing direct experience from Frontier, El Capitan, and Sandia’s OCx practice, this session delivers tested, practical knowledge for those managing HPC and AI facilities.

Attendees will be encouraged to ask questions, share insights, and contribute suggestions for the guide — supporting reliable operations, energy savings, and sustainable data centers.

While general commissioning practices are known in the data center field, dedicated sessions on liquid-cooled exascale systems — combined with real-time community input — are rare. Though a similar BoF took place at SC13, technologies and approaches have evolved significantly. This session brings fresh insight from current exascale deployments to help the community adapt.

Key techniques include pressure testing to find leaks, flow tests to verify valve function and flow rates, and load simulation to verify that the system can handle demand. Because scale and logistics can make full-load testing impractical, staged testing and data extrapolation can be used instead to build high confidence in the mechanical, electrical, and control systems supporting next-generation HPC infrastructure.
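As a rough illustration of staged testing and data extrapolation, the sketch below fits a straight line to coolant temperature rise measured at several partial load steps and projects it to a higher design load. It is a minimal sketch under stated assumptions, not a method taken from the session: the function names, the load-step values, and the 2000 kW design load are all hypothetical, the coolant is assumed to be plain water at constant flow, and the linear relationship follows from the standard heat balance Q = m_dot * c_p * delta_T.

```python
# Illustrative only: all numbers below are made-up example values,
# not measurements from any site mentioned in this session.

WATER_CP_KJ_PER_KG_K = 4.18    # approximate specific heat of water
WATER_DENSITY_KG_PER_L = 1.0   # approximate density of water (assumption)


def heat_removed_kw(flow_l_per_s, delta_t_k):
    """Heat carried away by the loop: Q = m_dot * c_p * delta_T (kW)."""
    mass_flow_kg_per_s = flow_l_per_s * WATER_DENSITY_KG_PER_L
    return mass_flow_kg_per_s * WATER_CP_KJ_PER_KG_K * delta_t_k


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate_delta_t(loads_kw, delta_ts_k, target_load_kw):
    """Project the temperature rise at a load that was not tested."""
    slope, intercept = fit_line(loads_kw, delta_ts_k)
    return slope * target_load_kw + intercept


# Hypothetical staged load-bank steps at constant coolant flow.
staged_loads_kw = [250.0, 500.0, 750.0, 1000.0]
measured_delta_t_k = [1.0, 2.0, 3.0, 4.0]

projected = extrapolate_delta_t(staged_loads_kw, measured_delta_t_k, 2000.0)
print(f"Projected temperature rise at 2000 kW: {projected:.2f} K")
```

A straight-line fit is only a first check. Real loops can depart from it as control valves, pumps, and heat exchangers approach their limits, so any projection of this kind would still need to be compared against design data and later measurements.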

Website: https://sites.google.com/lbl.gov/liquid-cooling-commissioning/home


