The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Enabling Unstructured Sparse Fine-Tuning and Inference for Foundation Models on Wafer-Scale Engine


Workshop: ExHetAI: Extreme Heterogeneity and AI Convergence in HPC

Authors: Haoyu Zheng and Yifan Zeng (Oregon State University), Linghao Song (Yale University), Murali Emani (Argonne National Laboratory (ANL)), and Wenqian Dong (Oregon State University)

Abstract: Adapting foundation models via fine-tuning often negates the benefits of sparsity, because the common sparse-to-dense training approach produces models with high inference cost, measured in floating-point operations (FLOPs). We propose PHOENIX, a framework designed for efficient sparse inference on the Cerebras CS-2 wafer-scale accelerator. PHOENIX merges sparse model weights with low-rank adapters, preserving high levels of sparsity throughout the adaptation process without sacrificing accuracy. It leverages the CS-2's native support for unstructured sparsity to accelerate inference computations.
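The abstract does not spell out how the merge works. As a rough illustration of the general idea only, the sketch below folds a low-rank update B·A into a sparse weight matrix and re-applies the original zero pattern, so the merged matrix stays as sparse as before. The fixed mask taken from the zeros of W, the helper names, and the toy values are all assumptions made for this example; none of it is the PHOENIX implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_adapter_keep_sparsity(w_sparse, b, a):
    """Fold a low-rank update B @ A into W, keeping W's zero pattern.

    Assumption: the sparsity mask is fixed by the zeros of w_sparse.
    This is an illustrative scheme, not the method from the paper.
    """
    delta = matmul(b, a)  # dense low-rank update, same shape as W
    return [
        [(w + d) if w != 0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_sparse, delta)
    ]


def sparsity(m):
    """Fraction of entries that are exactly zero."""
    flat = [v for row in m for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)


# Toy 2x4 weight matrix at 50% sparsity and a rank-1 adapter (hypothetical values).
W = [[0.5, 0.0, -1.0, 0.0],
     [0.0, 2.0, 0.0, 0.25]]
B = [[1.0], [2.0]]            # shape 2x1
A = [[0.1, 0.2, 0.3, 0.4]]    # shape 1x4

# Naive merge: adding the dense update fills in every zero.
W_naive = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, matmul(B, A))]

# Masked merge: the update is applied only where W is already nonzero.
W_merged = merge_adapter_keep_sparsity(W, B, A)

print(sparsity(W), sparsity(W_naive), sparsity(W_merged))
```

In this toy case the naive merge densifies the matrix, which is the sparse-to-dense problem the abstract describes, while the masked merge leaves the fraction of zeros unchanged.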

Across multiple models and tasks, PHOENIX maintains accuracy comparable to dense baselines even at 50–60% sparsity. This level of sparsity enables a nearly 2x reduction in FLOPs and a 1.7x improvement in inference throughput compared to a single NVIDIA A100 GPU, demonstrating a practical path to efficient, deployable sparse models.
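The link between sparsity and FLOPs can be checked with back-of-the-envelope arithmetic. The sketch below assumes an idealized model in which every zero weight is skipped and only weight multiply-accumulates are counted; the function name is hypothetical, and real end-to-end savings also depend on layers and operations that this model ignores.

```python
def flop_reduction(sparsity):
    """Idealized FLOP reduction factor when a fraction `sparsity` of the
    weights is zero and every zero multiply-accumulate is skipped.

    Assumption: only weight matmul FLOPs are counted.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


for s in (0.5, 0.6):
    print(f"sparsity {s:.0%}: about {flop_reduction(s):.2f}x fewer weight FLOPs")
```

Under this assumption, 50% sparsity corresponds to a 2x reduction, which is consistent with the nearly 2x figure reported in the abstract. The 1.7x throughput figure is a cross-hardware comparison and cannot be derived from this arithmetic alone.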

