Poster Type: Research Posters
Author: Prajwal Singhania (University of Maryland), Siddharth Singh (University of Maryland, NVIDIA Corporation), Lannie Dalton Hough (University of Maryland), Ishan Revankar (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory (LLNL)), Charles Jekel (Lawrence Livermore National Laboratory (LLNL)), Abhinav Bhatele (University of Maryland)
Supervisor:
Abstract: As large language models (LLMs) grow in parameter count, efficient generation requires inference to scale beyond a single node. Current approaches use tensor parallelism (TP) or pipeline parallelism (PP), but TP incurs high communication volume, while PP suffers from pipeline bubbles and is unsuitable for latency-critical scenarios. We present Yalis (Yet Another LLM Inference System), a lightweight and modular distributed inference framework that performs comparably to existing state-of-the-art systems for offline inference while enabling rapid prototyping. Using Yalis, we study strong scaling of LLM inference on the Alps and Perlmutter supercomputers and find that existing parallelism strategies scale poorly because of high communication overheads. We further compare the all-reduce performance of NCCL and MPI in the small-message regime, finding that while NCCL is efficient within a node, MPI can outperform it across nodes for messages between 256 and 1024 KB. These results motivate the need for communication-efficient parallelism strategies for multi-node LLM inference.
Best Poster Finalist (BP): no
Poster: PDF
Poster Summary: PDF