The International Conference for High Performance Computing, Networking, Storage, and Analysis

Birds of a Feather Archive

An Integrated Deep Reinforcement Learning Agent for Sunfish and HPC Workload Manager Composable Disaggregated Resource Scheduling


Authors: Michael Aguilar (Sandia National Laboratories), Russ Herrell (OpenFabrics Alliance), Phil Cayton (Intel Corporation), Nathan Hanford (Lawrence Livermore National Laboratory (LLNL)), Catherine Appleby (Sandia National Laboratories)

Abstract: The Sunfish Composable Disaggregated Infrastructure framework, combined with a deep reinforcement learning agent for scheduling, integrates with both HPC workload managers and container orchestrators to reduce application run-time latency, increase data center batch run efficiency, dynamically create ephemeral IO burst buffers, and mitigate problems from degraded hardware. Managing disaggregated resource pools with Sunfish minimizes idle resources and allows burst buffer allocations that create optimized execution environments for modern workloads, such as MOD/SIM and AI/ML. We will disclose our work integrating Sunfish with the Flux workload manager on a national lab testbed and discuss additional use cases within the industry.

Long Description: Current HPC architectural solutions implement high-density, resource-rich compute nodes with increased memory per core, core count, and memory per compute node. Most HPC and AI workloads execute on such data center compute nodes with static sets of resources. When applications are executed on these compute nodes, they may not have efficient access to hardware resources (e.g., accelerators, GPUs, memory, storage) resident on other nodes. Without cross-node access, CPU and GPU cores, FPGAs, storage, and memory on these densely packed compute nodes can remain isolated and stranded while still using energy and generating heat. Additionally, these densely packed compute nodes need stable, low-latency, high-bandwidth IO to prevent data starvation.

The emergence of Composable Disaggregated Infrastructure (CDI), where hardware resources can be allocated, configured, and combined by software, provides HPC systems with improved resource utilization and a solution to stranded resources. The OpenFabrics Alliance, along with DMTF and SNIA, is developing Sunfish, an open-source CDI Resource Management Framework. Sunfish configures fabric interconnects and manages CDI resources in dynamic HPC infrastructures using client-friendly abstractions.

The goal of Sunfish is to provide interoperability through common interfaces so that clients (e.g., HPC workload managers, applications, and administrative tools) can efficiently connect workloads with resources in complex heterogeneous ecosystems without having to worry about the underlying network technology. With Sunfish, a CDI management solution can associate and augment remote compute resources from available shared pools over network fabrics. Shared pools can provide resources to different types of running batch jobs as they are needed. Once batch jobs complete, Sunfish provides up-to-date information on released resources that can be allocated to other batch jobs.
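
The allocate-and-release cycle described above can be illustrated with a minimal sketch. It is a toy model only: the `SharedPool` class, its `compose` and `release` methods, and the resource and job names are all hypothetical and do not represent the Sunfish API or the Flux interface.

```python
from dataclasses import dataclass, field


@dataclass
class SharedPool:
    """Hypothetical pool of one resource type (e.g., GPUs or NVMe namespaces)."""
    kind: str
    free: set = field(default_factory=set)
    in_use: dict = field(default_factory=dict)  # job_id -> set of resource ids

    def compose(self, job_id: str, count: int) -> set:
        """Attach `count` free resources to a job, or raise if the pool is short."""
        if count > len(self.free):
            raise RuntimeError(f"only {len(self.free)} {self.kind} resources free")
        granted = {self.free.pop() for _ in range(count)}
        self.in_use.setdefault(job_id, set()).update(granted)
        return granted

    def release(self, job_id: str) -> set:
        """Return all of a finished job's resources to the free set."""
        returned = self.in_use.pop(job_id, set())
        self.free |= returned
        return returned


# Illustrative run: job-a borrows three GPUs; once it completes, the
# released GPUs become available again and job-b can take all four.
pool = SharedPool("gpu", free={"gpu0", "gpu1", "gpu2", "gpu3"})
first = pool.compose("job-a", 3)
pool.release("job-a")
second = pool.compose("job-b", 4)
```

In this sketch the pool itself tracks which resources are free, which loosely mirrors the idea that a workload manager needs current information on released resources before it schedules the next batch job.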

The Sunfish project is now focusing on increasing hardware vendor support for Sunfish Hardware Agents and on extending the framework to support more fabrics and client libraries. Sunfish developers are defining a reference composition manager that can be used to apply policies for automating the composition of resources available in a system. These automatic CDI compositions integrate well with some HPC Workload Managers and Container Orchestrators. We are now working with the Flux Workload Manager and NVMe-oF Agents.

This BoF is targeting communities of:

* fabric developers and providers
* developers of CDI solutions and tools including automation, composition, and orchestration tools
* developers and users of parallel programming libraries and applications relying on composable resources
* developers of solutions that rely on accurate, easy access to fabric information, e.g., workload managers, task brokers, telemetry services, operations management, performance tuning applications

This BoF is a Call to Action for those communities to discuss CDI management and HPC management use cases, provide feedback on the most urgent problems facing them, and help collect requirements for steering the next efforts for the Sunfish framework. We are also calling for members to join the OFA Sunfish Working Group to participate in the design and development of the framework.

Website: https://github.com/openfabrics


