Author: Arjun Kashyap (University of California, Merced)
Advisor: Xiaoyi Lu (University of California, Merced)
Abstract: Modern data center workloads demand substantial server resources, motivating the adoption of data processing units (DPUs) for improved efficiency. Despite increasing deployment, systematic characterization of SoC-based DPUs remains limited. We present a rigorous evaluation of NVIDIA’s BlueField-1, BlueField-2 (BF-2), and BlueField-3 (BF-3) across 15 benchmarks, revealing key idiosyncrasies in their network, DMA, and memory subsystems. We further provide design recommendations and release our artifacts to the community.
However, naively integrating DPUs into workloads often reduces server resource usage without necessarily delivering high performance. In-memory key-value stores (KVS) are widely used for edge data storage, where low latency and high throughput are essential. We explore fine-grained offloading of in-memory, CPU-based KVS to SoC-based DPUs: we decompose the KVS and offload its communication engine, the most CPU-intensive component, to the DPU to enhance performance. We also propose a series of performance optimizations, such as overlapped request/response handling, reduced DMA operations, and dual communication engines. Our design achieves up to 68% lower latency and 36% higher throughput than CPU-only or coarse-grained offloading.
Applications in containers or VMs commonly rely on TCP/IP for communication in HPC clouds and data centers, yet TCP/IP introduces significant bottlenecks for NVMe-over-Fabrics I/O in disaggregated storage. We propose NVMe-over-Adaptive-Fabric (NVMe-oAF), an adaptive communication channel that leverages locality awareness and optimized shared-memory and TCP paths to accelerate I/O-intensive workloads. Co-designed with Intel’s SPDK, NVMe-oAF achieves up to 7.1x higher bandwidth and 4.2x lower latency than TCP/IP over commodity Ethernet (10–100 Gbps), and delivers up to 7x bandwidth gains for HDF5 applications when integrated with H5bench.
Thesis Canvas: pdf