Workshop: 6th Workshop on Heterogeneity and Memory Systems (HMEM)
Authors: Bin Ma (University of California, Merced); Jie Ren (William & Mary); and Shuangyan Yang and Dong Li (University of California, Merced)
Abstract: Deep learning recommendation models (DLRMs) rely on massive embedding tables that often exceed GPU memory capacity. Tiered memory offers a cost-effective solution but creates challenges for managing irregular access patterns. We introduce RecMG, an ML-guided caching and prefetching system tailored for DLRM inference. RecMG uses separate models for short-term reuse and long-range prediction, with a novel differentiable loss to improve accuracy. In large-scale deployments, RecMG reduces on-demand fetches by up to 2.8× and cuts inference time by up to 43%.