Workshop: PMBS25: The 16th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems
Authors: Sairam Sri Vatsavai (Brookhaven National Laboratory); Raees Khan Ahmed (University of Pittsburgh); Kuan-Chieh Hsu, Ozgur Kilic, Yihui (Ray) Ren, David Park, and Paul Nilsson (Brookhaven National Laboratory); Tania Korchuganova (University of Pittsburgh); Sankha Dutta (Brookhaven National Laboratory); Joseph Boudreau (University of Pittsburgh); Tasnuva Chowdhury (Brookhaven National Laboratory); Shengyu Feng (Carnegie Mellon University); Fatih Furkan Akman (University of Massachusetts); Adolfy Hoisie (Brookhaven National Laboratory); Scott Klasky (Oak Ridge National Laboratory (ORNL)); Tadashi Maeno (Brookhaven National Laboratory); Verena Ingrid Martinez Outschoorn (University of Massachusetts); Norbert Podhorszki and Frédéric Suter (Oak Ridge National Laboratory (ORNL)); John Rembrandt (Remy) Steele (University of Massachusetts); Wei Yang (SLAC National Accelerator Laboratory); Yiming Yang (Carnegie Mellon University); and Shinjae Yoo and Alexei Klimentov (Brookhaven National Laboratory)
Abstract: Large-scale distributed computing infrastructures like the Worldwide LHC Computing Grid (WLCG) require comprehensive simulation tools for performance evaluation and resource optimization. Existing simulators suffer from limited scalability, hardwired algorithms, a lack of real-time monitoring, and an inability to generate datasets suitable for machine learning. We present CGSim, a simulation framework that addresses these limitations. Built on the validated SimGrid framework, CGSim provides high-level abstractions for modeling heterogeneous grid environments while maintaining accuracy and scalability. Key features include a modular plugin mechanism for testing custom workflow policies, interactive real-time visualization dashboards, and automatic generation of event-level datasets for AI-assisted performance modeling. Comprehensive evaluation using production ATLAS PanDA workloads demonstrates significant calibration accuracy improvements across WLCG sites. Scalability experiments show near-linear scaling for multi-site simulations, with distributed workloads achieving 6× better performance than single-site execution. CGSim enables researchers to simulate WLCG-scale infrastructures with hundreds of sites and thousands of concurrent jobs on commodity hardware within practical time budgets.