SC25 Proceedings

Workshops Archive

Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

Workshop: Frontiers in Generative AI for HPC Science and Engineering: Foundations, Challenges, and Opportunities

Authors: Ozan Gokdemir (University of Chicago, Argonne National Laboratory (ANL)) and Neil Getty, Robert Underwood, Sandeep Madireddy, Franck Cappello, Arvind Ramanathan, Ian Foster, and Rick Stevens (Argonne National Laboratory (ANL))

Abstract: As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B–14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.

Back to Frontiers in Generative AI for HPC Science and Engineering: Foundations, Challenges, and Opportunities Archive Listing Back to Full Workshop Archive Listing