Workshop: The 12th Annual International Workshop on Innovating the Network for Data-Intensive Science (INDIS)
Authors: Alex Batlle Casellas (Qualcomm Europe, Inc.); Adrián Pérez Diéguez (Qualcomm Technologies, Inc.); Aleix Torres-Camps (Qualcomm Europe, Inc.); Harris Teague (Qualcomm Technologies, Inc.); Arnau Padres (Qualcomm Europe, Inc.); and Jordi Ros-Giralt (Qualcomm Europe, Inc.)
Abstract: We present a comprehensive benchmarking study that evaluates the scaling performance of RDMA over Converged Ethernet (RoCE) and compares it with InfiniBand in the context of large-scale large language model (LLM) training workloads. While InfiniBand is traditionally favored for its low-latency, high-bandwidth characteristics, it imposes significant infrastructure and operational costs. RoCE, which runs RDMA over commodity Ethernet, offers a cost-effective alternative. Through extensive experiments on production clusters, we demonstrate that RoCE, when properly configured, can achieve near-linear scaling performance comparable to InfiniBand. Our analysis spans data sharding strategies, quantization and activation recomputation techniques, batch size tuning, and system-level optimizations, providing practical guidance for designing scalable and efficient AI infrastructure.