The International Conference for High Performance Computing, Networking, Storage, and Analysis

Doctoral Showcase Archive

Designing GPU-Aware Collective Communication for Heterogeneous Clusters with Diverse GPUs and Interconnects


Author: Chen-Chun Chen (The Ohio State University)

Advisor: Dhabaleswar K. Panda (The Ohio State University)

Abstract: GPU-accelerated HPC and deep learning workloads now operate at scales of tens to thousands of GPUs, making collective communication a dominant cost. Applications such as Amber, heFFTe, and distributed LLM training require frequent synchronization and exchange of large data partitions. At the same time, systems are increasingly heterogeneous: clusters combine NVIDIA, AMD, and Intel GPUs with interconnects such as NVLink, Infinity Fabric, InfiniBand, and Slingshot. Many MPI runtimes remain tuned for CPU-centric designs: they stage data through the host unnecessarily, add extra copies, and underutilize high-bandwidth device paths and multi-rail topologies. Support for newer stacks, particularly SYCL and Level Zero on Intel GPUs, is also uneven, hindering performance portability.

We present a unified, GPU-aware collective framework that targets portability and efficiency across vendors and networks. For Alltoall, we design IPC-based intra-node paths that avoid host staging and introduce push and pull variants that overlap intra- and inter-node transfers. For Allreduce, we implement on-device reduction kernels with native inter-node GPU support and computation-communication overlap; for medium messages at large scale, we add a direct sendrecv algorithm with throttling to balance bandwidth and latency. The framework extends to Intel GPUs via SYCL and Level Zero, alongside the CUDA and ROCm back ends. To mitigate inter-node bandwidth limits for very large messages, we integrate a lightweight casting-based compression scheme that downcasts data in flight with negligible accuracy loss. Together, these designs provide efficient Alltoall and Allreduce across NVIDIA, AMD, and Intel platforms, improving end-to-end performance while reducing CPU involvement and data movement overhead.
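The casting-based compression idea can be illustrated with a small host-side sketch. It is not the framework's implementation: the abstract does not state which data types are used, so the choice of half precision on the wire is an assumption, and the function names (`downcast_to_fp16`, `upcast_from_fp16`, `simulated_allreduce_sum`) are hypothetical. Python lists stand in for per-rank GPU buffers, and the "network" is just a byte string.

```python
import struct

def downcast_to_fp16(values):
    """Pack values as IEEE 754 half precision, 2 bytes per element.

    Assumption: fp16 is the wire format; the abstract does not name one.
    """
    return struct.pack(f"<{len(values)}e", *values)

def upcast_from_fp16(payload):
    """Unpack a half-precision payload back into Python floats."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))

def simulated_allreduce_sum(per_rank_buffers):
    """Element-wise sum across ranks, with every buffer downcast in flight.

    Hypothetical stand-in for an inter-node Allreduce: each rank's buffer
    is compressed before it is "sent" and restored before the reduction.
    """
    wire = [downcast_to_fp16(buf) for buf in per_rank_buffers]
    received = [upcast_from_fp16(payload) for payload in wire]
    # Reduce after upcasting so the accumulation is not done in fp16.
    return [sum(column) for column in zip(*received)]

# Two ranks, three elements each; these values are exact in fp16.
ranks = [[0.5, 1.25, -2.0], [1.5, 0.75, 4.0]]
result = simulated_allreduce_sum(ranks)

# Bytes on the wire for one rank: fp32 baseline versus fp16 payload.
fp32_bytes = len(struct.pack("<3f", *ranks[0]))
fp16_bytes = len(downcast_to_fp16(ranks[0]))

# A value that is not exact in fp16 shows the rounding error introduced.
roundtrip_error = abs(upcast_from_fp16(downcast_to_fp16([0.1]))[0] - 0.1)
```

In this sketch the payload is half the size of the fp32 baseline, and the round-trip error for 0.1 stays below 1e-3. Whether that loss is negligible depends on the workload, and values outside the half-precision range cannot be packed at all, so a real design would need a policy for such cases.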


Thesis Canvas: pdf


