SC25 Proceedings

Workshops Archive

Beyond End-to-End: Understanding the Limits of LLMs in Scientific Problem Solving

Workshop: Frontiers in Generative AI for HPC Science and Engineering: Foundations, Challenges, and Opportunities

Authors: Youyuan Liu (Temple University); Sheng Di, Neil Getty, Tanwi Mallick, and Robert Underwood (Argonne National Laboratory (ANL)); and Sian Jin (Temple University)

Abstract: Multimodal large language models (MLLMs) are now widely used across many applications, including scientific question answering that requires combining visual and textual inputs. However, existing benchmarks in this area are mostly end-to-end, making it difficult to pinpoint where models fail. To address this gap, we design an evaluation framework that decomposes scientific question answering into subtasks for fine-grained assessment. We evaluate two MLLMs, Gemini 2.5 Pro and Qwen2.5-VL-32B-Instruct, on questions involving high-resolution visual data. Results show that accurate answers are unattainable without scripting or tool use. Although both models can solve individual subtasks, such as mapping cities to coordinates or computing pixel positions, they often fail to integrate these abilities in end-to-end reasoning, producing large deviations. Our findings highlight the importance of benchmarks that expose reasoning bottlenecks and suggest that agent-based or multi-model approaches may be required to achieve reliable performance on complex scientific tasks.
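To make the "computing pixel positions" subtask concrete, the following is a minimal sketch of how a latitude/longitude pair could be mapped to a pixel on a global image, and how the deviation of a predicted position could be measured. It is illustrative only and is not the authors' framework: the equirectangular (plate carrée) layout, the image size, and the function names `latlon_to_pixel` and `pixel_deviation` are all assumptions introduced here.

```python
import math


def latlon_to_pixel(lat, lon, width, height):
    """Map (lat, lon) in degrees to (col, row) on an image.

    Assumption: the image is an equirectangular global grid whose top-left
    corner is (90 N, 180 W) and whose bottom-right corner is (90 S, 180 E).
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("latitude or longitude out of range")
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    # Clamp so the bottom and right edges map to the last valid pixel index.
    col = min(int(x), width - 1)
    row = min(int(y), height - 1)
    return col, row


def pixel_deviation(predicted, reference):
    """Euclidean distance, in pixels, between two (col, row) positions."""
    return math.hypot(predicted[0] - reference[0], predicted[1] - reference[1])


if __name__ == "__main__":
    # Hypothetical 3600 x 1800 image, i.e. 0.1 degree per pixel.
    reference = latlon_to_pixel(0.0, 0.0, 3600, 1800)
    print(reference)
    # A hypothetical model answer that is off by a few pixels.
    print(pixel_deviation((1803, 904), reference))
```

Under these assumptions, each step (city to coordinates, coordinates to pixel, pixel to value) can be checked separately, which is the kind of fine-grained assessment the abstract describes.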
