The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

MPI Collectives with Programmable Smart Switches


Workshop: ExaMPI25: Workshop on Extreme Scale MPI

Authors: Thomas Erbesdobler (Technical University of Munich); Amir Raoofy, Ehab Saleh, and Josef Weidendorfer (Leibniz Supercomputing Centre, LRZ)

Abstract: Programmable smart network devices are heavily used by cloud providers but rarely in HPC. However, they provide opportunities for offloading computation, in particular for collective operations, which are important for data-intensive workloads in classic HPC and ML training. In this paper, we present a prototype called mpitofino that offloads MPI collectives (in particular reductions) onto smart switches over an Ethernet fabric. We target Intel’s programmable Ethernet switches equipped with a Tofino ASIC, and we use the P4 programming language to process collective packets on the chip’s low-latency data path. We demonstrate how the flexibility of P4 enables us to use RoCEv2 as the protocol, utilizing RDMA hardware support on the nodes’ NICs. Furthermore, we implement mpitofino as a collective provider in Open MPI and discuss its desirable scaling characteristics. Finally, we demonstrate that mpitofino can achieve data throughput close to the 100 Gbit/s line rate.
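
To make the idea of offloading a reduction onto a switch concrete, the following is a minimal host-side simulation in Python. It is a sketch of the general in-network aggregation pattern, not the mpitofino implementation: the names SwitchSlot and switch_allreduce_sum, the chunk size, the integer payloads, and the duplicate-handling rule are all assumptions made for illustration. Each rank sends its buffer in fixed-size chunks, the simulated switch sums the chunks element-wise, and once every rank has contributed it returns the reduced chunk to all ranks.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class SwitchSlot:
    """Hypothetical per-chunk aggregation state held on the switch."""
    num_ranks: int
    acc: Optional[List[int]] = None
    seen: Set[int] = field(default_factory=set)

    def on_packet(self, rank: int, payload: List[int]) -> Optional[List[int]]:
        """Fold one rank's chunk into the accumulator.

        Returns the reduced chunk once every rank has contributed,
        otherwise None.
        """
        if rank in self.seen:
            # Assumption: a repeated packet from the same rank is ignored.
            return None
        self.seen.add(rank)
        if self.acc is None:
            self.acc = list(payload)
        else:
            self.acc = [a + b for a, b in zip(self.acc, payload)]
        if len(self.seen) == self.num_ranks:
            return self.acc
        return None


def switch_allreduce_sum(buffers: List[List[int]], chunk_len: int = 4) -> List[List[int]]:
    """Simulate a sum-allreduce where the reduction happens in the switch."""
    num_ranks = len(buffers)
    length = len(buffers[0])
    result: List[int] = []
    for start in range(0, length, chunk_len):
        slot = SwitchSlot(num_ranks)  # one aggregation slot per chunk
        reduced = None
        for rank, buf in enumerate(buffers):
            out = slot.on_packet(rank, buf[start:start + chunk_len])
            if out is not None:
                reduced = out
        result.extend(reduced)
    # The switch sends the same reduced data back to every rank.
    return [list(result) for _ in range(num_ranks)]
```

In this model each rank sends its data once and receives the result once regardless of the number of ranks, which is the kind of property that makes switch-side reduction attractive for scaling. A real data path would additionally have to handle packet loss, slot reuse, and the RoCEv2 framing that the sketch leaves out.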

