The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

From Exploration to Explanation: ML-Driven Causal Discovery for Datacenter Reliability at Scale


Workshop: ISAV25: In Situ AI, Analysis, and Visualization

Authors: Pavana Prakash, Rolando P. Hong Enriquez, and Sergey Serebryakov (Hewlett Packard Labs); David Grant and Wesley Brewer (Oak Ridge National Laboratory (ORNL)); and Dejan Milojicic (Hewlett Packard Labs)

Abstract: Modern datacenters operate at unprecedented scale, supporting HPC and AI workloads while consuming hundreds of megawatts of power. Their reliability is challenged by complex interdependencies across cooling, power, and network subsystems, where failures can cascade into downtime and degraded performance. Existing monitoring approaches, which are largely threshold-based or purely correlation-based, struggle to isolate root causes within high-dimensional, evolving telemetry. We present PACE (Pattern and Causal Exploration), an ML-based framework that combines unsupervised correlation clustering with supervised, lag-aware Granger causality to uncover subsystem structure and directed causal pathways from multivariate telemetry. PACE yields interpretable causal graphs and subsystem heatmaps that align with physical processes and control logic, providing actionable insights for operations. Finally, we discuss how embedding PACE into digital twin architectures enables causally informed what-if reasoning, advancing datacenter reliability and efficiency.
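
The abstract names two general techniques, correlation clustering and lag-aware Granger causality, without giving implementation details. The sketch below illustrates those two ideas on synthetic data and is not PACE itself. Everything in it is an assumption made for illustration: the signal names (fan_speed, fan_speed_b, inlet_temp), the 0.8 correlation threshold, the lag order of 3, and the use of thresholded connected components as the clustering rule. It uses only the Python standard library and reports a raw F statistic rather than a p-value.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def correlation_clusters(series, threshold=0.8):
    """Group signals linked by |zero-lag correlation| >= threshold."""
    names = list(series)
    parent = {name: name for name in names}

    def find(name):  # follow parent links to the cluster representative
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(series[a], series[b])) >= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(X, y))

def granger_f(cause, effect, lags=3):
    """F statistic: do lags of `cause` help predict `effect` beyond its own lags?"""
    n = len(effect)
    y = effect[lags:]
    # Restricted model: intercept plus the effect's own past values.
    restricted = [[1.0] + [effect[t - k] for k in range(1, lags + 1)]
                  for t in range(lags, n)]
    # Unrestricted model: additionally include past values of the cause.
    full = [row + [cause[t - k] for k in range(1, lags + 1)]
            for row, t in zip(restricted, range(lags, n))]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(full, y)
    dof = len(y) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

# Synthetic telemetry (hypothetical): inlet_temp follows fan_speed with a 2-step lag.
rng = random.Random(0)
n = 400
fan = [rng.gauss(0, 1) for _ in range(n)]
fan_b = [v + 0.1 * rng.gauss(0, 1) for v in fan]  # redundant sensor
temp = [0.0, 0.0] + [0.8 * fan[t - 2] + 0.1 * rng.gauss(0, 1)
                     for t in range(2, n)]

clusters = correlation_clusters(
    {"fan_speed": fan, "fan_speed_b": fan_b, "inlet_temp": temp})
f_forward = granger_f(fan, temp)   # fan_speed -> inlet_temp
f_reverse = granger_f(temp, fan)   # inlet_temp -> fan_speed
print(clusters)
print(f_forward, f_reverse)
```

In this toy setup the zero-lag clustering groups the two redundant fan signals but leaves inlet_temp on its own, because the dependence only appears at a lag. The Granger-style comparison recovers the directed, lagged relationship: the F statistic is large in the fan_speed to inlet_temp direction and small in the reverse direction. A real pipeline would more likely rely on an established statistics library and proper significance testing than on this hand-rolled regression.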

