Workshop: MEMO’25: International Workshop on Memory System Management and Optimization
Authors: Andrew Tee (University of California, Riverside; Advanced Micro Devices, Inc. (AMD)); Nicholas Curtis and Noah Wolfe (Advanced Micro Devices, Inc. (AMD)); and Daniel Wong (University of California, Riverside)
Abstract: This paper presents an analysis of memory hierarchy latency across AMD Instinct™ MI300A, MI300X, and MI250X GPUs using a fine-grained pointer-chasing microbenchmark. We characterize the scalar L1 (sL1), L2, AMD Infinity Cache™ (referred to as the MALL, for Memory Attached Last Level), and HBM (High Bandwidth Memory), revealing distinct latency levels and architectural trade-offs. MI300A and MI300X, both based on the CDNA3 architecture, exhibit nearly identical latency profiles, while MI250X lacks a MALL, resulting in different performance characteristics. Memory latency remains consistent across compute partitioning modes, but the NUMA Partitioning per Socket (NPS) mode significantly impacts performance. In NPS4 mode, partitioning improves locality, reducing latency by up to 1.42× in the MALL and 1.31× in HBM. We further analyze MALL contention and Translation Lookaside Buffer (TLB) behavior under varying parallelism levels, identifying conditions where MALL performance degrades. These findings provide actionable insights for optimizing memory access patterns and improving performance on AMD’s latest GPU architectures.
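The pointer-chasing idea named in the abstract can be illustrated with a minimal host-side sketch. This is not the authors' benchmark, which targets GPU memory. The function names (`build_chain`, `chase`, `mean_step_time_ns`) are hypothetical, and the use of Sattolo's algorithm to build the chain is an assumption made for illustration. In Python the measured time is dominated by interpreter overhead, so the sketch shows only the dependent-load access pattern, not realistic latencies.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm (an assumption here, not taken from the paper)
    produces a single cycle covering all n slots, so a chase cannot get
    stuck in a short loop that stays resident in a small cache.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is strictly less than i
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index.

    Each load address depends on the previous load's result, so the
    loads cannot overlap; that serialization is what exposes latency.
    """
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx


def mean_step_time_ns(chain, steps):
    """Average wall-clock time per chase step, in nanoseconds."""
    t0 = time.perf_counter_ns()
    chase(chain, steps)
    return (time.perf_counter_ns() - t0) / steps
```

Sweeping the chain length so that the working set crosses each cache capacity is the usual way such a benchmark separates one level of the hierarchy from the next; on a GPU the same loop would run inside a kernel with a device-side timer.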