Poster Type: Research Posters
Author: Hari Teja Jajula (The University of Alabama, Lawrence Berkeley National Laboratory (LBNL)), Dhruva Kulkarni (Lawrence Berkeley National Laboratory (LBNL)), Brian Austin (Lawrence Berkeley National Laboratory (LBNL)), Purushotham Bangalore (The University of Alabama)
Abstract: Modern HPC systems generate large amounts of GPU and network telemetry, which is typically used for system health monitoring. At NERSC, we are developing a Performance API/UI that generates a job report card from this telemetry, providing an overview of a job's performance characteristics. Using DCGM counters, we report GPU memory, compute, and power usage, and we present preliminary investigations of job-level network activity. Because it does not rely on traditional profiling tools, this application-agnostic approach helps identify resource utilization imbalances, detect anomalies such as memory leaks, and assess overall performance with no additional effort from the user.
Best Poster Finalist (BP): no
Poster: PDF
Poster Summary: PDF