SC25 Proceedings

Birds of a Feather Archive

Advanced Architecture Testbeds: Community Resources for Enhanced HPC Research

Authors: Oscar Hernandez (Oak Ridge National Laboratory (ORNL)), Jeffrey Young (Georgia Institute of Technology), Filippo Spiga (NVIDIA Corporation), Jens Domke (RIKEN Center for Computational Science (R-CCS)), Kristel Michielsen (RWTH Aachen University), Miwako Tsuji (RIKEN Center for Computational Science (R-CCS)), Hal Finkel (DOE Office of Advanced Scientific Computing Research), Amir Shehata (Oak Ridge National Laboratory (ORNL)), Nick Brown (Edinburgh Parallel Computing Centre (EPCC)), Mosè Giordano (University College London (UCL)), Teresa Cervero (Barcelona Supercomputing Center (BSC))

Abstract: Testbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers in the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information and encourages collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.

Long Description: The supercomputing community is in the midst of a period of unprecedented architectural innovation. The explosion in architectural diversity leads to a number of challenges, including understanding the potential performance impact of new architectural technologies on workloads of interest and deriving guidance for architectural design from application and algorithm features.

To address these challenges, a variety of architectural testbed efforts have been established by leading HPC centres and national laboratories worldwide. Examples are CENATE (Pacific Northwest National Laboratory), HAAPS (Sandia National Laboratories), Rogues Gallery (Georgia Tech), OLCF and ExCL (Oak Ridge National Laboratory), ALCF AI testbed (Argonne National Laboratory), RIKEN's “virtual Fugaku” HPC cloud on AWS, and SmartNIC testbeds (Los Alamos National Laboratory). Foremost among the challenges these efforts address is the architectural diversity in processors, memory, and networks that has resulted from designers grappling with increased demands on performance and energy efficiency across processing, memory, storage, and interconnect.

A novel topic for this BoF will be the evaluation of AI workloads and security practices. Generative AI and large language models (LLMs) have proven to be transformational in tackling real-world problems like health analytics for vaccine candidate research and other critical health topics. The massive, fast-paced adoption of these tools has put extra pressure on HPC centers and national laboratories to understand both their hardware and software sides, which have similarities to, but also major differences from, classic HPC workloads in computational science and engineering. The explosion of AI accelerators also raises performance portability challenges in terms of both the accuracy and the performance of AI models. Since deploying any large-scale infrastructure specialized for a specific set of workloads is a huge investment, testbeds represent a viable way to reduce risk prior to adoption by understanding the hardware technology and tracking the evolution of the software ecosystem. The security of data processing on these devices is critical to preserving user privacy, but it is increasingly challenging because of the diverse system designs that exist in emerging architectures.

This BoF brings together researchers and practitioners involved in these programmes to share lessons learned from evaluating diverse architectures, testbed design principles, reproducible benchmarking methodologies, overall system evaluation, and experience with system bring-up and availability. The audience is encouraged to take an active part in the discussion, with topics ranging from application evaluation and programming language suitability to software stack maturity, resilience, and security.

Building on the success of our previous BoF sessions in 2019 (50 attendees), 2021 (48 attendees), and 2023 (35 attendees), we aim to build a vibrant community and foster a collaborative environment to discuss current and future challenges. Attendees will not only hear lessons learned from a group of invited speakers, but also learn how to gain access to our testbeds and how to offer their own in exchange.

Website: https://caatb.github.io/aatb-bofs/
