The International Conference for High Performance Computing, Networking, Storage, and Analysis

Doctoral Showcase Archive

Improving Collective Aggregation for HPC and AI Workloads


Author: Mikaila Gossman (Clemson University)

Advisors: Jon Calhoun (Clemson University), Bogdan Nicolae (Argonne National Laboratory (ANL))

Abstract: High performance computing (HPC) applications generate massive volumes of data, placing sustained pressure on parallel file systems (PFS) that face limited bandwidth and resource contention. While file-per-process I/O allows lock-free access, reducing stripe contention, it creates excessive metadata overhead and poor manageability at scale. Aggregation—consolidating output from many processes into fewer shared files—helps mitigate these issues, but introduces new challenges related to concurrency, resource contention, complex I/O patterns, and their interactions with heterogeneous storage devices.

We identify and evaluate key I/O bottlenecks across these dimensions. To support system-level tuning, we introduce a lightweight OpenMP benchmark that helps users identify optimal aggregation parameters, and we find that interleaved, append-only I/O provides better performance when aggregating to a shared file. Building on this work, we present a novel producer-consumer-based aggregation model designed to balance concurrency and resource usage efficiently. In microbenchmarks, our strategy achieved up to 2× higher write throughput than GIO and 1.6× higher than ADIOS2. In a real-world HPC application (HACC), it delivered 1.2× higher throughput with only 3% checkpoint overhead, compared to ~12% for GIO, which is optimized for HACC. Finally, we demonstrate the limitations of existing checkpointing approaches using DeepSpeed Megatron on the BLOOM 3B model, revealing significant inefficiencies during the restore phase due to excessive reads and seeks.
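
As a rough illustration of the general idea of producer-consumer aggregation with interleaved, append-only writes to a shared file, the following is a minimal single-node sketch. It is not the thesis implementation: Python threads stand in for application processes, and the function name `aggregate`, its parameters, and the index layout are all hypothetical choices made for this example.

```python
import queue
import threading


def aggregate(path, num_producers, chunks_per_producer, chunk_size, queue_depth=8):
    """Producers enqueue chunks; a single consumer appends them to one shared file.

    Returns a list of (producer_id, chunk_id, offset, length) records so the
    interleaved chunks can be located again when the file is read back.
    All names here are illustrative assumptions, not an existing API.
    """
    # A bounded queue caps the amount of data held in memory at once.
    work = queue.Queue(maxsize=queue_depth)
    index = []

    def producer(pid):
        for cid in range(chunks_per_producer):
            # Stand-in for application output: chunk filled with the producer id.
            payload = bytes([pid % 256]) * chunk_size
            work.put((pid, cid, payload))
        work.put(None)  # sentinel: this producer has finished

    def consumer():
        finished = 0
        with open(path, "ab") as f:  # append-only: no seeks while writing
            offset = f.tell()
            while finished < num_producers:
                item = work.get()
                if item is None:
                    finished += 1
                    continue
                pid, cid, payload = item
                f.write(payload)
                index.append((pid, cid, offset, len(payload)))
                offset += len(payload)

    writer = threading.Thread(target=consumer)
    producers = [
        threading.Thread(target=producer, args=(pid,)) for pid in range(num_producers)
    ]
    writer.start()
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    writer.join()
    return index
```

In this sketch only the consumer touches the file, so chunks from different producers land in arrival order, and the returned index is what makes them recoverable. The number of consumers and the queue depth are the kind of aggregation parameters one would tune; the values used here are arbitrary.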

Future work will extend our aggregation framework to large language model (LLM) checkpoint/restart (C/R), which introduces highly concurrent, small, and random I/O patterns that pose new challenges for traditional PFS architectures.


Thesis Canvas: pdf


