SC25 Proceedings

Birds of a Feather Archive

Artificial Intelligence and Machine Learning for HPC Workload Analysis (Fifth Session)

Authors: Kadidia Konaté (Lawrence Berkeley National Laboratory (LBNL)), Sergio Iserte (Barcelona Supercomputing Center (BSC)), Kevin Menear (National Renewable Energy Laboratory (NREL)), Terry Jones (Oak Ridge National Laboratory (ORNL)), Torsten Hoefler (ETH Zürich)

Abstract: Modern HPC systems generate massive amounts of monitoring and performance data daily, making manual analysis increasingly impractical. AI and machine learning are emerging as powerful tools to extract insights, detect anomalies, and optimize workload and resource behavior. This BoF brings together experts from HPC, AI, and data science to share current practices, challenges, and emerging solutions in the field. The session aims to foster collaboration and highlight real-world applications of AI/ML for improving system efficiency, reliability, and user understanding in large-scale computing environments.

Long Description: High-Performance Computing (HPC) platforms today operate at an unprecedented scale and complexity. Each day, these systems generate terabytes of data across a wide spectrum of telemetry sources. These include hardware performance counters, job scheduler traces, I/O logs, resource- and energy-use metrics, thermal and fault events, user support tickets, and system administrator notes. This diverse and voluminous data landscape offers rich opportunities to understand and optimize workload behavior, system health, and performance characteristics.
However, the sheer scale and complexity of these systems, and of their associated data, make manual inspection and traditional rule-based monitoring insufficient for reliable decision-making. As HPC facilities evolve toward exascale and incorporate increasingly heterogeneous hardware, the challenge of understanding user behavior, predicting system bottlenecks, and ensuring resilience becomes more pressing. Adding to this complexity is the dynamic and multi-tenant nature of modern supercomputing workloads, which often include jobs of widely varying sizes, runtime behaviors, and I/O demands. These realities demand smarter, automated systems for analysis and control.
Artificial Intelligence (AI) and Machine Learning (ML) have emerged as transformative tools for unlocking the value hidden within HPC system data. By learning from vast and diverse telemetry, AI/ML approaches are capable of discovering patterns, modeling behavior, predicting failures, guiding scheduling decisions, detecting performance anomalies, and even summarizing logs. Already, preliminary applications have shown promise in areas such as job classification, workload forecasting, and real-time monitoring.
However, significant open questions remain around model interpretability, data sharing and privacy, domain adaptation across sites, integration with existing workflow managers and job schedulers, and the balance between predictive accuracy and operational robustness. Furthermore, the AI/ML community and the HPC systems community frequently operate in parallel, with few venues for exchanging ideas, tools, and lessons learned.
To help bridge this gap, the Birds of a Feather on Artificial Intelligence and Machine Learning for HPC Workload Analysis provides a dedicated space for researchers, engineers, system administrators, and AI practitioners to connect, share, and debate the future of intelligent workload analysis. This session builds on the momentum and success of previous editions at ISC 2024, SC 2024, CUG 2024, and ISC 2025, with the intent to foster deeper cross-disciplinary collaboration and highlight real-world applications.
At ISC 2025, Torsten Hoefler, Professor at ETH Zürich and Chief Architect for Machine Learning at the Swiss National Supercomputing Center, shared his latest research on function embeddings and offered insights on optimizing the three core dimensions of HPC: I/O, compute, and communication. Over 120 attendees also engaged with Francesco Antici and Jens Domke's talk on detecting AI training jobs to enhance HPC security.
We invite participation from those developing AI/ML models for workload monitoring, performance prediction, anomaly detection, system resilience, and intelligent scheduling, as well as from developers of open datasets, toolkits, and benchmarking platforms. The goal of this BoF is to move toward a shared understanding of how to operationalize AI/ML methods at scale in production HPC environments, identify key challenges, showcase community-driven solutions, and spark new collaborations that advance smarter, more efficient supercomputing ecosystems.
