Authors: Gilad Shainer (NVIDIA Corporation), Jeff Kuehn (AMD), Pavel Shamis (NVIDIA Corporation), Oscar Hernandez (Oak Ridge National Laboratory (ORNL)), Dhabaleswar Panda (The Ohio State University)
Abstract: To exploit the capabilities of new HPC systems and meet their demands for scalability, communication software must scale to millions of cores and support applications with adequate functionality. UCX is a collaboration between industry, national labs, and academia that consolidates multiple technologies to provide a unified open-source framework.
The UCX project is managed by the UCF Consortium (http://www.ucfconsortium.org/) and includes members from Los Alamos National Laboratory, Argonne National Laboratory, Ohio State University, AMD, NVIDIA, and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Long Description: To exploit the capabilities of new HPC systems and meet their demands for scalability, communication software must scale to millions of cores and support applications with adequate functionality to express their parallelism. UCX is a collaboration between industry, national laboratories, and academia that consolidates multiple technologies to provide a unified, open-source framework.
The UCX project is managed by the UCF Consortium (http://www.ucfconsortium.org/) and includes members from LANL, ANL, Ohio State University, AMD, NVIDIA, and more. The open-source development is also supported by other labs, OEMs, and large cloud providers. The session will serve as the UCX community meeting and will introduce the latest developments and specifications to HPC developers and the broader user community.
Modern HPC and AI systems combine very large numbers of compute elements with extremely low-latency interconnection networks. To exploit the capabilities of these architectures and meet their demands for scalability, communication software must scale and support applications with adequate functionality to express their parallelism. Moreover, communication software should add as little overhead as possible to avoid compromising the native performance of the interconnection network. These requirements make the design of high-performance communication software extremely intricate, since it must keep its memory footprint, instruction counts, and cache activity low while meeting stringent performance targets.
High-level programming models for communication (e.g., MPI, SHMEM) can be built on top of middleware, such as Portals, GASNet, UCCS, and ARMCI, or use lower-level network-specific interfaces, often provided by the vendor. While the former offers high-level communication abstractions and portability across different systems, the latter offers proximity to the hardware and minimizes overheads related to multiple software layers. An effort to combine the advantages of both is UCX, a communication framework for high-performance computing systems.
In recognition of its importance to the future of HPC and AI technologies and applications, UCX received a 2019 R&D 100 Award.
The UCF Consortium also manages other open-source projects, including UCC (Unified Collective Communication) and OpenSNAPI (Open Smart NIC API). The session will include a brief overview of these projects and a call for participation.
Beyond the traditional HPC frameworks, the session will cover the latest AI libraries that use UCX for both training and inference workloads, and will describe the benefits UCX brings to these frameworks.