The International Conference for High Performance Computing, Networking, Storage, and Analysis

Doctoral Showcase Archive

AI-Driven Resource Optimization for High Performance Computing: A Comprehensive Framework


Author: Manikya Swathi Vallabhajosyula (The Ohio State University)

Advisor: Rajiv Ramnath (The Ohio State University)

Abstract: Shared HPC centers are often underutilized because jobs are commonly mis-specified for walltime, memory, and accelerators. This mis-specification causes queue churn, idle hardware, and long turnaround times. The main challenge is structural: researchers face a steep learning curve across different nodes, policies, and cost models. As a result, they often "guess and submit" with limited guidance. This work introduces the following center-focused solutions that use predictive models to guide scheduling.

(A) Estimators (black-box + white-box): Two complementary predictors estimate runtime and memory usage based on hardware and configuration. Black-box learners are fit to prior runs; white-box models use operator/graph features and scaling laws to generalize. Used together, they predict resource needs even when little training data is available.

(B) HARP framework: HARP systematizes data generation, model building, and selection. It selects estimators based on measured error under site policy (queue limits, billing), resulting in a policy-compliant plan for walltime, memory, and devices.
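
As a rough illustration of the selection step described in (B), the sketch below picks the estimator with the lowest measured error and turns its prediction into a plan that respects queue limits. It is not HARP's actual code or API: the class names, the error metric, and the rule of padding predictions by the measured error are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Estimator:
    """Hypothetical record of one estimator's prediction for a job."""
    name: str
    mean_abs_pct_error: float   # measured on held-out runs, e.g. 0.25 = 25%
    predicted_runtime_s: float
    predicted_memory_gb: float


@dataclass
class QueuePolicy:
    """Hypothetical site limits for one queue."""
    max_walltime_s: int
    max_memory_gb: int


def select_estimator(estimators: list[Estimator]) -> Estimator:
    # Choose the estimator with the lowest measured error.
    return min(estimators, key=lambda e: e.mean_abs_pct_error)


def build_plan(est: Estimator, policy: QueuePolicy) -> Optional[dict]:
    # Assumed rule: pad each prediction by the estimator's measured error.
    margin = 1.0 + est.mean_abs_pct_error
    walltime_s = math.ceil(est.predicted_runtime_s * margin)
    memory_gb = math.ceil(est.predicted_memory_gb * margin)
    # Reject plans that exceed the queue's limits.
    if walltime_s > policy.max_walltime_s or memory_gb > policy.max_memory_gb:
        return None
    return {"estimator": est.name, "walltime_s": walltime_s, "memory_gb": memory_gb}


candidates = [
    Estimator("black_box", 0.5, 3600.0, 10.0),
    Estimator("white_box", 0.25, 4000.0, 12.0),
]
best = select_estimator(candidates)
plan = build_plan(best, QueuePolicy(max_walltime_s=7200, max_memory_gb=64))
```

A real selection step would presumably also account for billing, as the abstract notes; that is omitted here to keep the sketch short.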

(C) Estimator with Scheduler integration: A scheduler composes estimator outputs with TAPIS to produce valid submissions, select queues/partitions, and trade off time and cost. It also supports resubmission strategies and "what-if" planning.
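
The time/cost trade-off in (C) can be pictured as a small "what-if" comparison across queues. The sketch below is only an assumption about how such a comparison might look: it makes no TAPIS calls, and the queue names, rates, and the "cheapest option that meets the deadline" rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueOption:
    """Hypothetical what-if candidate: one queue plus the estimator's prediction."""
    name: str
    predicted_runtime_h: float
    max_walltime_h: float
    cost_per_hour: float

    @property
    def predicted_cost(self) -> float:
        return self.predicted_runtime_h * self.cost_per_hour


def what_if(options: list[QueueOption], deadline_h: float) -> Optional[QueueOption]:
    # Keep only queues where the job fits the walltime limit and the deadline.
    feasible = [
        o for o in options
        if o.predicted_runtime_h <= min(o.max_walltime_h, deadline_h)
    ]
    if not feasible:
        return None
    # Assumed objective: the cheapest feasible option.
    return min(feasible, key=lambda o: o.predicted_cost)


options = [
    QueueOption("cpu", predicted_runtime_h=8.0, max_walltime_h=24.0, cost_per_hour=0.5),
    QueueOption("gpu", predicted_runtime_h=2.0, max_walltime_h=12.0, cost_per_hour=3.0),
]
relaxed = what_if(options, deadline_h=10.0)   # loose deadline
urgent = what_if(options, deadline_h=4.0)     # tight deadline
```

With these made-up numbers, the loose deadline favors the cheaper CPU queue and the tight deadline forces the faster, more expensive GPU queue.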

(D) Closed-Loop Orchestration and Path to Agentic Scheduler: Kafka streams job and filesystem signals to the Intelligence Plane, where estimators enforce policies that drive scheduler daemons, data-generation, and orchestration tasks. Future work extends this loop with goal/constraint inference, as well as drift-triggered self-updates, enabling autonomous model training. This is accompanied by an optional LLM for user interaction and decision explanation, as well as an MCP-ready design for adaptive scheduling and planning.
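
The drift-triggered self-update named as future work in (D) might be sketched as a rolling error check over completed jobs. Everything here is an assumption for illustration: the `DriftMonitor` class, the window size, and the threshold are hypothetical, and the Kafka stream is replaced by plain function calls.

```python
from collections import deque


class DriftMonitor:
    """Hypothetical monitor: flags retraining when recent prediction error is high."""

    def __init__(self, window: int = 50, threshold: float = 0.3):
        self.errors = deque(maxlen=window)   # most recent relative errors
        self.threshold = threshold

    def observe(self, predicted_s: float, actual_s: float) -> bool:
        # Record the relative runtime error of one completed job.
        self.errors.append(abs(predicted_s - actual_s) / actual_s)
        # Only judge drift once the window is full.
        if len(self.errors) < self.errors.maxlen:
            return False
        mean_error = sum(self.errors) / len(self.errors)
        return mean_error > self.threshold


monitor = DriftMonitor(window=2, threshold=0.25)
signals = [
    monitor.observe(100.0, 100.0),   # accurate prediction
    monitor.observe(100.0, 100.0),   # accurate prediction
    monitor.observe(100.0, 200.0),   # job ran twice as long as predicted
    monitor.observe(100.0, 200.0),   # and again
]
```

In a closed loop, a `True` signal would be the point at which a retraining task is queued; how that task is scheduled is outside this sketch.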


Thesis Canvas: pdf


