SC25 Proceedings

Birds of a Feather Archive

Converged HPC-AI Platforms: Navigating the Challenges of Heterogeneous Systems

Authors: Clayton Hughes (Sandia National Laboratories), Edgar Leon (Lawrence Livermore National Laboratory (LLNL)), Patrick Carribault (French Alternative Energies and Atomic Energy Commission (CEA)), Julien Loiseau (Los Alamos National Laboratory (LANL))

Abstract: National laboratories and supercomputing centers are increasingly deploying heterogeneous systems that integrate multi-core CPUs with GPUs, AI accelerators, FPGAs, IPUs, and DPUs. While these converged HPC-AI platforms promise unified infrastructure for large-scale simulation and AI workloads, they introduce complexities in programming, scalability, portability, and optimization. This BoF session examines the novel co-design strategies required to address heterogeneous architectures, focusing on scalable application frameworks, unified programming models, portable workflows, and software-hardware integration. Attendees will share experiences, tools, and emerging architectures to facilitate the convergence of HPC and AI, aiming to equip the community with actionable insights for developing high-performance applications on post-exascale systems.

Long Description: Recent and upcoming large-scale deployments at national laboratories and supercomputing centers worldwide underscore a paradigm shift toward increasingly heterogeneous systems, both at the node and system levels. Systems such as LLNL’s El Capitan, ANL’s Aurora, LANL’s Venado, CSCS’s Alps, BSC’s MareNostrum5, and JSC’s JUPITER exemplify this trend. These platforms integrate CPUs with massive core counts alongside advanced accelerators like GPUs, AI-specific accelerators, FPGAs, IPUs, and DPUs. The diversity of hardware introduces unique programming models, parallelization techniques, and optimization requirements that challenge application developers tasked with achieving peak performance across these platforms.

In this context, converged HPC-AI platforms are emerging as an essential strategy to unify traditional HPC and AI workloads. By integrating the compute power required for large-scale simulations with the data-driven demands of AI workflows, these platforms promise to simplify application deployment, streamline resource utilization, and drive innovation. However, this convergence necessitates significant effort in software co-design to address challenges posed by heterogeneous architectures, including:

1) Seamless Scalability: Ensuring applications can scale efficiently across diverse architectures, leveraging specialized hardware optimally.

2) Unified Programming Ecosystems: Developing abstractions, frameworks, and programming models that bridge the gap between HPC and AI requirements.

3) Portability and Performance Optimization: Establishing workflows that maintain performance across multiple, distinct hardware configurations.

4) Forward-Looking Co-Design: Preparing applications for post-exascale systems and evolving workloads, with a particular focus on AI-driven use cases.

This Birds of a Feather (BoF) session explores the critical role of co-design in advancing the convergence of HPC and AI. Participants will discuss the lessons learned from porting applications to heterogeneous systems, share strategies to mitigate risks, and highlight tools and frameworks that facilitate the integration of AI workloads into HPC environments. Additionally, the session will examine emerging technologies and architectures that are shaping the future of converged platforms.

By focusing on these key themes, we aim to provide attendees with actionable insights into developing robust, high-performance applications that thrive in the era of heterogeneous, converged HPC-AI systems.

Website: https://lanl.github.io/cea-nnsa-codesign/
