The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Dynamic Topology-Aware Scheduling in HPC Systems with Topograph


Workshop: 7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)

Authors: Dmitry Shmulevich (NVIDIA Corporation)

Abstract: HPC workloads continue to grow in complexity and resource demands, requiring large-scale compute clusters. To achieve maximal efficiency, multi-node workloads should be scheduled on network-adjacent nodes. SLURM supports topology-aware scheduling using a cluster topology configuration file. However, in large or dynamic environments, nodes may be added or removed at any time, making it crucial to maintain an accurate view of the cluster’s network topology. Having up-to-date information about network structure is even more important in cloud environments, where users have less control over compute resources than in on-premises setups. In this talk, we introduce Topograph, an open-source tool that automatically discovers and maintains cluster network topology. Topograph supports both cloud service providers (CSPs) and on-premises environments, and can be deployed in SLURM and Kubernetes clusters, including hybrid SLURM-on-Kubernetes systems. By exposing detailed, real-time network topology, Topograph enables HPC workloads to run on nodes with optimal interconnectivity, improving performance and resource efficiency.
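
To make the "cluster topology configuration file" mentioned in the abstract concrete, the following is a minimal sketch of how a discovered node-to-switch mapping could be rendered in the `SwitchName=... Nodes=...` / `SwitchName=... Switches=...` line format used by SLURM's tree topology file. It is not Topograph's implementation: the function name `render_topology_conf`, the input dictionaries, and the node and switch names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def render_topology_conf(node_to_switch, switch_to_parent):
    """Render a tree topology as SLURM topology.conf-style lines.

    Both arguments are hypothetical inputs for this sketch:
      node_to_switch:   compute node name -> leaf switch it is attached to
      switch_to_parent: switch name -> parent (upper-tier) switch
    """
    # Group compute nodes under the leaf switch they hang off.
    nodes_by_switch = defaultdict(list)
    for node, switch in node_to_switch.items():
        nodes_by_switch[switch].append(node)

    # Group child switches under their parent switch.
    children_by_switch = defaultdict(list)
    for switch, parent in switch_to_parent.items():
        children_by_switch[parent].append(switch)

    lines = []
    # Leaf switches list their directly attached nodes.
    for switch in sorted(nodes_by_switch):
        nodes = ",".join(sorted(nodes_by_switch[switch]))
        lines.append(f"SwitchName={switch} Nodes={nodes}")
    # Upper-tier switches list the switches below them.
    for switch in sorted(children_by_switch):
        children = ",".join(sorted(children_by_switch[switch]))
        lines.append(f"SwitchName={switch} Switches={children}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative two-leaf, one-spine cluster (names are made up).
    node_to_switch = {
        "node01": "leaf1",
        "node02": "leaf1",
        "node03": "leaf2",
        "node04": "leaf2",
    }
    switch_to_parent = {"leaf1": "spine1", "leaf2": "spine1"}
    print(render_topology_conf(node_to_switch, switch_to_parent), end="")
```

In a dynamic cluster, a file like this would need to be regenerated whenever nodes join or leave, which is the maintenance burden the abstract describes.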

