Workshop: 2025 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC)
Authors: Esteban M. Rangel and Humza Qureshi (Argonne National Laboratory (ANL))
Abstract: Preparing large-scale scientific applications for diverse GPU architectures requires strategies that balance performance, portability, and long-term maintainability. We introduce a unified kernel abstraction that enables single-source compilation through both CUDA and SYCL toolchains, and we evaluate it using CRK-HACC, a production N-body cosmology code. Our approach uses a thin C++ layer that preserves the original CUDA kernel syntax and launch style while providing SYCL compatibility through a mechanical "functorization" process. This method avoids the complexity of automated source translation, retains architecture-specific optimizations, and reduces maintenance effort by eliminating code duplication. We evaluate the implementation on two DOE leadership systems, Polaris (NVIDIA GPUs) and Aurora (Intel GPUs), comparing kernel-level execution times across backends and architectures. Results show competitive performance for SYCL relative to native CUDA while preserving code clarity and portability. This case study demonstrates a practical path toward sustaining performance in complex, physics-rich codes as HPC hardware continues to evolve.
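The "functorization" idea mentioned in the abstract can be illustrated with a minimal, host-only C++ sketch. This is not the authors' layer: every name here (`LaunchConfig`, `ThreadIndex`, `SaxpyKernel`, `launch`, `run_saxpy_example`) is hypothetical, and the serial loop merely stands in for a real backend. The assumption is that a kernel's arguments become members of a struct and its body moves into `operator()`, so that one launch wrapper could forward the same functor to either a CUDA `<<<...>>>` launch or a SYCL `parallel_for`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical launch geometry, mirroring a CUDA-style <<<blocks, threads>>> call.
struct LaunchConfig {
    std::size_t blocks;
    std::size_t threads_per_block;
};

// Hypothetical per-thread index handed to the kernel body.
struct ThreadIndex {
    std::size_t block;
    std::size_t thread;
    std::size_t block_dim;
    std::size_t global() const { return block * block_dim + thread; }
};

// "Functorized" kernel: former kernel arguments are data members,
// and the former kernel body lives in operator().
struct SaxpyKernel {
    std::size_t n;
    float a;
    const float* x;
    float* y;

    void operator()(const ThreadIndex& idx) const {
        const std::size_t i = idx.global();
        if (i < n) {              // bounds guard, as in a typical GPU kernel
            y[i] = a * x[i] + y[i];
        }
    }
};

// Stand-in backend: runs every (block, thread) pair serially on the host.
// A real layer would dispatch here to CUDA or SYCL instead (assumption).
template <typename Kernel>
void launch(const LaunchConfig& cfg, const Kernel& kernel) {
    for (std::size_t b = 0; b < cfg.blocks; ++b) {
        for (std::size_t t = 0; t < cfg.threads_per_block; ++t) {
            kernel(ThreadIndex{b, t, cfg.threads_per_block});
        }
    }
}

// Small driver: computes y = 2*x + y over five elements using 2 blocks of 4 threads.
std::vector<float> run_saxpy_example() {
    std::vector<float> x{1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    std::vector<float> y{10.0f, 20.0f, 30.0f, 40.0f, 50.0f};
    launch(LaunchConfig{2, 4}, SaxpyKernel{x.size(), 2.0f, x.data(), y.data()});
    return y;
}
```

In this sketch the call site keeps a CUDA-like shape (a launch configuration followed by the kernel and its arguments), while the kernel itself is an ordinary C++ object that a different backend could accept unchanged.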