199: Reporting Standards for Medical Foundation and Language Models

Paper Discussed in this Episode:

Reporting checklist for foundation and large language models in medical research (REFINE): an international consensus guideline. Mese I, Akinci D’Antonoli T, Bluethgen C, et al. Diagn Interv Radiol 2026.

Episode Summary: In this special journal club edition of the Digital Pathology Podcast, we tackle a structural problem in medical imaging and AI: foundation models and large language models (LLMs) are being adopted faster than our traditional evaluation frameworks can keep up. We examine the 2026 REFINE consensus guideline, which addresses the opaque and stochastic nature of generative AI and asks researchers to change how they report on these tools, moving away from black-box unpredictability and toward reproducibility.

In This Episode, We Cover:

• The "Wooden Ruler" Problem: Traditional AI reporting tools, such as CLAIM and TRIPOD-AI, were built under the assumption that algorithms are deterministic, meaning the same input gives the exact same output every time. Generative AI is inherently stochastic and sensitive to subtle variables such as prompt wording, so the old checklists function like rigid wooden rulers trying to measure a fluid target.

• The REFINE Framework: Created via a rigorous Delphi consensus process by 57 contributors from 17 countries, this 44-item, 6-section checklist is a broad international effort. It features a deliberate "N/A" filtering mechanism, so that items that do not apply to a given study can be set aside and the checklist can accommodate highly diverse text, imaging, and multimodal study designs.

• Prompting is the New Coding: We explore why researchers must now treat prompt engineering with the exact same rigor as traditional source code. The guideline requires full transparency on prompting strategies, session memory policies, and precisely how patient clinical context (like BI-RADS or ICD codes) is integrated into the model.

• Corralling the Chaos (Stochasticity & The Human Element): Controlling an LLM's variability starts with detailing generation parameters like "temperature," which controls how random the model's token sampling is (often described as its "creativity"). Crucially, studies must also document the prompt operator's characteristics: a senior attending radiologist will intuitively guide a model very differently than a first-year resident, which can substantially skew the output.

• The Contamination Crisis: We discuss the serious threat of dataset contamination, which occurs when an LLM has already seen, and possibly memorized, public test datasets (like MIMIC-CXR) during its pre-training phase, inflating its apparent performance. The guideline demands rigorous checks against the model's knowledge cut-off dates and full transparency regarding the use of synthetic data.

• Clinical Reality Check: A model's performance in a vacuum means little if it cannot integrate into a hospital's clinical workflow and systems, such as its PACS. We detail why researchers must now explicitly outline clinical non-use cases, map out data privacy safeguards, and conduct formal failure analyses to categorize errors like hallucinations.
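
For listeners who want a concrete feel for the "temperature" setting mentioned under Corralling the Chaos, here is a minimal sketch of the standard idea behind it: the model's raw scores are divided by the temperature before being turned into probabilities. This is an illustration only, not code from the REFINE guideline or from any particular model; the function name and the example scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature.

    Hypothetical helper for illustration: lower temperatures concentrate
    probability on the top-scoring option, higher temperatures spread it out.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words.
example_logits = [2.0, 1.0, 0.5]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature nearly all of the probability lands on the top-scoring option, so repeated runs tend to agree; at a higher temperature the alternatives become more likely and outputs vary more from run to run. That is why an unreported temperature makes a study hard to reproduce.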

Key Takeaway: The REFINE guideline marks a critical maturation point for medical AI research. By rigorously addressing the distinctive sources of unpredictability in generative AI (prompt sensitivity, stochastic generation, and dataset contamination), this framework helps ensure that future medical AI studies provide a trustworthy, reproducible foundation of evidence that frontline clinicians can safely rely on for patient care.
