The International Conference for High Performance Computing, Networking, Storage, and Analysis

Research and ACM SRC Posters Archive

Understanding GPU Utilization Using LDMS Data on Perlmutter


Poster Type: ACM Student Research Competition, Graduate

Authors: Onur Cankur (University of Maryland), Brian Austin (Lawrence Berkeley National Laboratory (LBNL)), Abhinav Bhatele (University of Maryland)

Supervisor: Abhinav Bhatele (University of Maryland)

Abstract: GPGPU-based clusters and supercomputers have grown significantly in popularity over the past decade. While numerous GPGPU hardware counters are available to users, their potential for workload characterization remains underexplored. In this work, we analyze previously overlooked GPU hardware counters collected via the Lightweight Distributed Metric Service on Perlmutter. We examine spatial imbalance, defined as uneven GPU usage within the same job, and perform a temporal analysis of how counter values change during execution. Using temporal imbalance, we capture deviations from average usage over time. Our findings reveal inefficiencies and imbalances that can guide workload optimization and inform future HPC system design.
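
The abstract describes spatial and temporal imbalance only in words, so the following is a minimal sketch of one way such metrics could be computed from sampled GPU utilization values. The formulas (max-over-mean across GPUs, and normalized mean absolute deviation over time) and the function names `spatial_imbalance` and `temporal_imbalance` are assumptions chosen for illustration; they are not taken from the poster and may differ from the authors' definitions.

```python
from statistics import mean


def spatial_imbalance(per_gpu_means):
    """Ratio of the busiest GPU's average utilization to the job-wide average.

    Assumed formulation (not from the poster): 1.0 means all GPUs in the job
    were equally busy; larger values mean the load was concentrated on fewer GPUs.
    """
    avg = mean(per_gpu_means)
    if avg == 0:
        return 1.0  # every GPU idle: treated as balanced (an assumption)
    return max(per_gpu_means) / avg


def temporal_imbalance(samples):
    """Mean absolute deviation from the time-averaged value, divided by that average.

    Assumed formulation (not from the poster): 0.0 means the counter was
    constant for the whole run; larger values mean burstier usage.
    """
    avg = mean(samples)
    if avg == 0:
        return 0.0  # counter never moved off zero
    return mean([abs(s - avg) for s in samples]) / avg


# Hypothetical utilization percentages, made up for illustration only.
job_gpu_means = [100, 50, 50, 0]   # per-GPU average utilization in one job
bursty_gpu = [0, 100, 0, 100]      # one GPU sampled at four points in time
steady_gpu = [80, 80, 80]

print(spatial_imbalance(job_gpu_means))
print(temporal_imbalance(bursty_gpu))
print(temporal_imbalance(steady_gpu))
```

Ratio-style metrics like these are scale-free, which makes jobs with different absolute utilization levels comparable, though they become unstable when the average is close to zero.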

Best Poster Finalist (BP): no

