The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

PathLlama: A Language Model for Automated Cancer Surveillance


Workshop: The 11th Computational Approaches for Cancer Workshop (CAFCW25)

Authors: Patrycja Krawczuk, John Gounley, Abhishek Shivanna, and Mayanka Chandrashekar (Oak Ridge National Laboratory (ORNL)); Elizabeth Hsu (National Cancer Institute); and Heidi Hanson (Oak Ridge National Laboratory (ORNL))

Abstract: Transforming unstructured information into structured common data models (CDMs) is a critical step for enabling cancer surveillance and advancing precision medicine. CDMs standardize the structure and content of oncologic data extracted from electronic health records. Unfortunately, traditional Extract, Transform, Load (ETL) processes for electronic health data capture are generally rule-based and error-prone, and they produce static datasets unsuitable for near real-time information retrieval.

The Modeling Outcomes using Surveillance Data and Scalable AI for Cancer (MOSSAIC) project developed and deployed a hierarchical self-attention (HiSAN) model capable of autocoding approximately 30% of National Cancer Institute Surveillance, Epidemiology, and End Results (SEER) registry cancer pathology reports [1], [2]. While a significant step forward, this falls short of the broader goal of automatically coding all pathology reports. Fully automating CDM conversion would facilitate clinical trial matching, decision support dashboards, real-time case ascertainment, and population health surveillance.

The distribution of cancer phenotypes in real-world data is highly imbalanced. While HiSAN performs well on classes well-represented during training, its accuracy and confidence degrade substantially for less common categories. Large language models (LLMs) offer a promising solution for underrepresented oncological entities, owing to their ability to leverage context and pretraining. Rather than relying solely on general-purpose models, domain adaptation or continual pretraining of LLMs may further improve performance by helping models learn the specialized vocabulary, abbreviations, and context typical of clinical text. In this study, we finetune LLMs for SEER pathology report classification, with and without additional domain-adaptive pretraining, and compare the results to the HiSAN baseline [2].

Based on Llama 3 8B, PathLlama was developed by finetuning for cancer pathology report classification, with and without domain adaptation. The domain adaptation task was next-token prediction, and the pretraining dataset was a large corpus of approximately 10M cancer pathology reports and abstracts from SEER and about 500k clinical notes and radiology reports from MIMIC [3]. The PathLlama models were finetuned to classify site (70 categories), subsite (330), laterality (7), histology (677), and behavior (4). The finetuning dataset comprised 4,052,951 reports from six SEER registries: Kentucky, Louisiana, New Jersey, New Mexico, Seattle/Puget Sound, and Utah. The finetuning dataset was randomly split into 80%/10%/10% for training, test, and validation, ensuring that all reports associated with a single case belong to the same split.
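The case-level split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_by_case`, the `report_case_ids` input, and the seed are all hypothetical. The sketch shuffles cases rather than reports, so the 80%/10%/10% proportions hold over cases and only approximately over reports.

```python
import random
from collections import defaultdict


def split_by_case(report_case_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign report indices to train/test/val so no case spans two splits.

    report_case_ids[i] is the (hypothetical) case identifier of report i.
    """
    # Group report indices by the case they belong to.
    by_case = defaultdict(list)
    for idx, case_id in enumerate(report_case_ids):
        by_case[case_id].append(idx)

    # Shuffle the cases (not the reports) reproducibly.
    cases = sorted(by_case)
    random.Random(seed).shuffle(cases)

    # Cut the shuffled case list into three contiguous groups.
    n = len(cases)
    cut1 = round(fractions[0] * n)
    cut2 = cut1 + round(fractions[1] * n)
    groups = {
        "train": cases[:cut1],
        "test": cases[cut1:cut2],
        "val": cases[cut2:],
    }

    # Expand each group of cases back into report indices.
    return {
        name: sorted(i for case in group for i in by_case[case])
        for name, group in groups.items()
    }
```

Because whole cases are assigned to a split, a report in the test or validation set never shares a case with a training report, which avoids leakage between reports that describe the same tumor.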

Finetuning results are shown in Table I. We observe that the micro F1 scores, dominated by majority classes due to the imbalance in the dataset, improve only slightly from the HiSAN to either of the PathLlama models. The most notable improvements in micro F1 come from the domain-adapted PathLlama for subsite and laterality. In contrast, larger improvements occur for macro F1, particularly for subsite, laterality, and histology. For these three tasks, the domain-adapted PathLlama model also substantially outperforms the PathLlama base model. From these macro F1 results, we find that the contextual and pretraining advantages of Llama itself are indeed sufficient to markedly improve classification performance on underrepresented classes. However, domain adaptation offers an additional benefit, and the further gain in performance justifies the increased computational cost associated with extended pretraining.
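To illustrate why micro F1 is dominated by majority classes while macro F1 exposes weak performance on rare ones, the sketch below computes both for single-label predictions. The labels and counts are invented toy data, not results from this study, and the helper name `micro_macro_f1` is hypothetical.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal class weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data: 95 reports of a common class, 5 of a rare class.
# The classifier gets every common report right but only 1 of 5 rare ones.
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 95 + ["common"] * 4 + ["rare"] * 1

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints: micro F1 = 0.960, macro F1 = 0.656
```

In this toy case the rare class has an F1 of only about 0.33, yet micro F1 stays at 0.96 because the common class supplies most of the pooled counts. Macro F1 weights both classes equally, so it is the more sensitive indicator of gains on underrepresented categories.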

