SC25 Proceedings

To Virtualize or Not to Virtualize: Experiences from Building Two Generations of Virtualized Infrastructure for LLM Training

Workshop: 7th International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC)

Authors: Apoorve Mohan, Ming-hung Chen, I-hsin Chung, Richard Welp, and Seetharami Seelam (IBM Thomas J. Watson Research Center)

Abstract: Large Language Model (LLM) training workloads share computational characteristics with high-performance computing applications: they involve intensive parallel processing, complex matrix operations, and distributed computing with frequent synchronization, and they require specialized hardware to deliver optimal performance.

This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training using commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet (RoCE) networks. The system employs software-defined networking with SR-IOV technology for GPUDirect RDMA, achieving near-bare-metal performance while maintaining the benefits of virtualization.

Based on multiple data center deployments and iterations, we present two case studies examining what it takes for virtualization-based systems to deliver (a) bare-metal RoCE-like performance and (b) bare-metal InfiniBand-like performance for LLM training workloads. The discussion focuses on virtualization challenges, experiences, and runtime optimizations required for optimal performance in cloud-native training infrastructure.
