SC25 Proceedings

Workshops Archive

Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Workshop: 2025 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC)

Authors: Sreeram Venkat (The University of Texas at Austin; Advanced Micro Devices, Inc. (AMD)); Kasia Swirydowicz and Noah Wolfe (Advanced Micro Devices, Inc. (AMD)); and Omar Ghattas (The University of Texas at Austin)

Abstract: The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec - an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.

Back to 2025 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) Archive Listing Back to Full Workshop Archive Listing