Authors: William Magro (Google LLC), Katherine Riley (Argonne Leadership Computing Facility (ALCF)), Satoshi Matsuoka (RIKEN), Rick Stevens (Argonne National Laboratory (ANL), University of Chicago), Costas Bekas (Citadel Securities)
Abstract: The confluence of HPC, AI, and cloud is entering a new phase, catalyzed by the integration of AI and its influence on HPC hardware. Scientific workflows are evolving to treat simulation, AI, and analytics as a deeply connected continuum. In this session, we’ll explore how the need to incorporate large-scale AI models and agentic systems, from training on scientific data to real-time inferencing, is creating new patterns of hybrid HPC-cloud usage. We will also discuss the architectural, software, and policy challenges and opportunities facing HPC professionals in a world where AI and cloud are now driving the evolution of high performance computing infrastructure.
Long Description: The conversation around HPC and cloud has centered on bursting for capacity, cloud suitability for HPC workloads, data management, and cloud-like capabilities for on-premises systems. Today, the driver has fundamentally shifted: Artificial Intelligence is reshaping not only the hardware landscape but the scientific method itself. The rise of "AI for Science" is creating workflows of unprecedented complexity and capability. We see this in the emergence of:
Scientific Foundation Models: Large models trained on vast corpora of scientific data (e.g., proteins, climate data, material properties) that can be fine-tuned for specific research questions.
AI-in-the-loop Simulations: Where AI models act as surrogate models or steer complex simulations in real time, dramatically accelerating discovery in fields from drug design to cosmology.
AI for Hypothesis Generation: Using AI to explore vast parameter spaces and propose novel experiments or molecular structures.
These AI-powered workflows demand new approaches to resource integration. A single scientific campaign may now require the raw, tightly coupled power of an on-premises supercomputer for a core simulation, followed by a burst to a public cloud to leverage specialized AI accelerators for training, and then continuous, low-latency access to a cloud-hosted inference API.
This shift presents critical questions for our community, which we will explore through the lens of key personas:
Supercomputing Centers: How can we securely and efficiently integrate cloud capabilities into our HPC environments? What is the right economic and operational model for supporting hybrid workflows?
Cloud Providers: Beyond providing raw instances, how can we tailor both AI services and simulation infrastructure to better meet the needs of scientific HPC workflows?
Scientific and Industrial Users: How do we design and manage a portable software environment that can leverage and span the best of on-prem HPC and cloud? How do we best address data gravity when datasets and models are dispersed across different locations?
Hardware and Software Vendors: How can we evolve architectures to meet the low-precision math demands of AI while continuing to provide a path to the high-precision math required for scientific simulation?
This BoF will move beyond the "if" to the "how," focusing on the practical challenges and emerging solutions in this new, AI-driven era of HPC.