The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

WAGES: Workload-Aware GPU Sharing System for Energy-Efficient Serverless LLM Serving


Workshop: ExHetAI: Extreme Heterogeneity and AI Convergence in HPC

Authors: Tianyu Wang (University of Pittsburgh), Gourav Rattihalli and Aditya Dhakal (Hewlett Packard Enterprise (HPE)), Xulong Tang (University of Pittsburgh), and Dejan Milojicic (Hewlett Packard Enterprise (HPE))

Abstract: Serverless LLM serving lowers costs by elastically provisioning GPUs and charging only for usage. However, current systems mostly target cold-start latency, overlooking inefficiencies: (i) static, exclusive GPU allocation that wastes compute resources and increases costs, and (ii) fixed hardware-controlled clock speeds that waste energy. Our analysis shows many LLM workloads can meet SLOs with partial SM allocations and reduced clock speeds, enabling GPU multiplexing and dynamic clock scaling. We present WAGES, a workload-aware GPU sharing system that uses NVIDIA MPS to co-locate LLMs, dynamically adjusting SM partitions and clock speeds to workload needs while meeting SLOs. A two-tier scheduler coordinates global GPU consolidation and local SLO-aware tuning, overlapping model/KV migration with execution to reduce reconfiguration overhead. On real LLM traces, WAGES improves SLO attainment by up to 4% over prior GPU sharing approaches and reduces energy use by up to 26%.
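The local, SLO-aware tuning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Config` type, the `pick_config` helper, and every number in `PROFILE` are hypothetical, and it assumes an offline profile of latency and power per (SM share, clock) setting is available. Among the settings that meet the latency SLO, it picks the one with the lowest energy per request. On NVIDIA GPUs an SM share under MPS is commonly set through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable; whether WAGES uses that exact mechanism is not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    """One profiled operating point (all values are hypothetical)."""
    sm_percent: int    # share of SMs granted to the model
    clock_mhz: int     # GPU core clock
    latency_ms: float  # profiled request latency at this setting
    power_w: float     # profiled average power at this setting

    @property
    def energy_j(self) -> float:
        # Energy per request = average power * latency (in seconds).
        return self.power_w * self.latency_ms / 1000.0


def pick_config(profile: List[Config], slo_ms: float) -> Optional[Config]:
    """Return the lowest-energy setting that meets the SLO, or None."""
    feasible = [c for c in profile if c.latency_ms <= slo_ms]
    if not feasible:
        return None
    # Break energy ties in favor of the smaller SM share, which leaves
    # more of the GPU free for co-located models.
    return min(feasible, key=lambda c: (c.energy_j, c.sm_percent))


# Hypothetical profile, for illustration only.
PROFILE = [
    Config(sm_percent=100, clock_mhz=1400, latency_ms=40.0, power_w=300.0),
    Config(sm_percent=50, clock_mhz=1400, latency_ms=60.0, power_w=200.0),
    Config(sm_percent=50, clock_mhz=1000, latency_ms=80.0, power_w=120.0),
    Config(sm_percent=25, clock_mhz=1000, latency_ms=150.0, power_w=80.0),
]

if __name__ == "__main__":
    print(pick_config(PROFILE, slo_ms=100.0))
```

With a looser SLO the sketch settles on a partial SM share and a reduced clock, and with a tight SLO it falls back to the full GPU at the higher clock, which mirrors the observation that many workloads can meet SLOs without exclusive, full-speed allocation.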

