Authors: Nicholas Malaya (Advanced Micro Devices, Inc. (AMD)), Nikoli Dryden (Lawrence Livermore National Laboratory (LLNL)), Pavan Balaji (Meta Platforms, Inc.), Jeff Hammond (NVIDIA Corporation), Shelby Lockhart (Advanced Micro Devices, Inc. (AMD))
Abstract: The world's largest supercomputers for scientific discovery are also premier systems for artificial intelligence model training and inference. While traditional HPC compute has predominantly leveraged the MPI standard, AI workloads have increasingly focused on collective communication libraries, such as NVIDIA's NCCL and AMD's RCCL, which are optimized for high-bandwidth throughput. This BoF session at SC25 aims to delve into the intricacies of collective communication libraries, focusing on the comparison between the widely adopted Message Passing Interface (MPI) and NCCL/RCCL, as well as other key messaging libraries such as SHMEM.
Long Description: The world's largest supercomputers for scientific discovery are also premier systems for artificial intelligence model training and inference. However, while traditional HPC compute has predominantly leveraged the MPI standard, AI workloads have increasingly focused on collective communication libraries, such as NVIDIA's NCCL and AMD's RCCL, which are optimized for high-bandwidth throughput. As AI inference has grown in importance, the use of device-side, latency-optimized communication has also increased.
This Birds of a Feather session at SC25 addresses the intricacies of collective communication libraries, focusing on the comparison between the widely adopted Message Passing Interface (MPI) and NCCL/RCCL, as well as other libraries such as SHMEM.
Collective communication libraries are essential for exchanging and synchronizing data across multiple processes, enabling operations such as broadcast, reduction, and gather. While MPI has been the cornerstone of HPC communication for decades, the *CCL libraries present opportunities for performance enhancement through techniques like kernel fusion. These libraries leverage hardware-specific optimizations to accelerate communication, particularly in GPU-centric environments.
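For readers less familiar with the operations named above, the following is a minimal, semantics-only sketch in Python. It makes no MPI, NCCL, or RCCL calls: each "rank" is simply an entry in a list, and the function names (`broadcast`, `allreduce_sum`, `gather`) are hypothetical helpers chosen for illustration, not any library's API.

```python
# Semantics-only sketch: each "rank" is an entry in a Python list.
# No real communication library is used; all names here are hypothetical.
from functools import reduce as fold
import operator


def broadcast(buffers, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def allreduce_sum(buffers):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [fold(operator.add, column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def gather(buffers, root):
    """Only the root rank receives all buffers, concatenated in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [gathered if rank == root else None for rank in range(len(buffers))]


ranks = [[1, 2], [10, 20], [100, 200]]  # three ranks, two elements each
print(broadcast(ranks, root=0))   # [[1, 2], [1, 2], [1, 2]]
print(allreduce_sum(ranks))       # [[111, 222], [111, 222], [111, 222]]
print(gather(ranks, root=1))      # [None, [1, 2, 10, 20, 100, 200], None]
```

Real libraries implement the same input/output contract with ring, tree, or hardware-assisted algorithms; the sketch says nothing about their performance.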
Fine-grained interleaving of compute and communication, e.g., with device-side *SHMEM libraries, has opened the door to multi-fold performance gains through fusion and lower overheads.
Unlike MPI, which is designed to support thread-safe use (for example, at the MPI_THREAD_MULTIPLE support level), AI collective libraries are not inherently thread-safe, posing challenges for developers of multi-threaded applications. This limitation calls for careful programming and clear guidance for application developers to avoid race conditions and ensure data integrity.
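One common way to use a non-thread-safe library from several threads is to funnel every call through a single lock. The sketch below illustrates that pattern only; `UnsafeCommunicator` is a hypothetical stand-in (not NCCL, RCCL, or any real API) that records whether two calls ever overlapped, and `SerializedCommunicator` is an assumed wrapper name.

```python
import threading


class UnsafeCommunicator:
    """Hypothetical stand-in for a communicator that must not be entered concurrently."""

    def __init__(self):
        self.active = 0      # calls currently inside the "library"
        self.overlaps = 0    # number of times a call found another in progress
        self.completed = 0   # total calls finished

    def all_reduce(self, value):
        self.active += 1
        if self.active > 1:
            self.overlaps += 1
        result = value       # placeholder for the real collective operation
        self.completed += 1
        self.active -= 1
        return result


class SerializedCommunicator:
    """Wrapper that lets only one thread at a time enter the communicator."""

    def __init__(self, comm):
        self._comm = comm
        self._lock = threading.Lock()

    def all_reduce(self, value):
        with self._lock:
            return self._comm.all_reduce(value)


def run(num_threads=8, calls_per_thread=1000):
    raw = UnsafeCommunicator()
    comm = SerializedCommunicator(raw)

    def worker():
        for i in range(calls_per_thread):
            comm.all_reduce(i)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return raw


raw = run()
print(raw.completed, raw.overlaps)  # 8000 0
```

A lock of this kind only serializes calls within one process. It does not by itself guarantee that every rank issues its collectives in the same order, which an application still has to arrange.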
The session will cover the following key topics:
* Overview of Collective Communication Libraries: Introduction to MPI, NCCL/RCCL/MSCCL, etc., highlighting their core functionalities and potential use cases in AI and HPC.
* Performance Comparison: Analyzing the performance benefits of NCCL/RCCL, such as kernel fusion, and how they compare to MPI in various scenarios.
* Case Studies and Real-World Applications: Discussion of experiences on leadership computing systems where collective libraries have been used for large-scale AI training runs.
* Tools: Are debuggers and profilers from traditional HPC applications still useful for AI applications? What new technologies should the community design to keep developers productive on AI and HPC workloads?
* Future Directions: Exploring the future of collective communication libraries, and whether they are converging with or diverging from standards. This topic will also cover how collective and device-side libraries, along with the MPI standard, might be extended.
The session is designed to be highly interactive, encouraging participants to share their experiences, challenges, and insights related to collective communication in HPC.
Expected outcomes:
* Enhanced awareness within the HPC community of the capabilities and limitations of MPI vs. AI-focused collective communication libraries.
* Practical strategies for integrating non-thread-safe libraries into HPC applications.
* Insights into performance optimization techniques specific to GPU-based systems.
This session is relevant to the HPC community, including researchers, developers, and system architects, who are exploring the integration of complex workflows that leverage AI with HPC on supercomputers. An open dialogue on the state of the community around communication libraries, along with their benefits and challenges, will educate the SC user community and also provide a lens through which to evaluate future performance goals, software requirements and standards, and candidate system architectures.
Website: https://hpc-ai-comm-libs.github.io/