239: Can AI Copilots Keep Up with Pathologists?


Can AI copilots really keep up with pathologists when the cases are new, the workflow is messy, and the benchmark is actually protected from leakage?

In this episode of DigiPath Digest #48, I focus on one paper: DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset. I chose this paper because I think the field needs more of this kind of work. Less hype. More evaluation. Less “look what AI can do.” More “how do we test it in a way that actually means something?”

In this session, I look at what makes DALPHIN important for pathologists, lab leaders, and digital pathology trailblazers trying to make sense of pathology AI right now. The paper benchmarks three models against human pathologists: two general-purpose models, Gemini 2.5 Pro and GPT-5, and one pathology-specific model, PathChat+. The dataset includes 1,236 images from 300 cases, covering 130 diagnoses, 14 pathology subspecialties, and cases from six countries. Human performance is benchmarked with 31 pathologists from 10 countries.

What I like about this paper is that it does not stop at top-line performance. It deals with the benchmarking problem itself. The authors built a sequestered, indirectly accessible ground truth so the evaluation data could not simply be scraped into model training. That matters because without that protection, benchmarking can become an illusion of genius rather than a real test of generalization.

The results are interesting and more nuanced than a simple win-or-lose story. PathChat+ reached expert-level performance in four of six tasks, Gemini 2.5 Pro in two of six, and GPT-5 in one of six. That already tells us something important: pathology-specific training matters. But it does not mean pathology is solved. In organ recognition, expert pathologists still outperformed all the models. In rare cancers, none of the models reached expert-level performance. And in ambiguous cases, the models still struggled with something human pathologists do all the time: expressing uncertainty.

I also spend time on one of the most practical parts of the paper: model behavior. Gemini 2.5 Pro tended to overcall. GPT-5 tended to undercall. PathChat+ was more balanced. That matters in practice. A pathologist using a copilot needs to know the tool’s calibration bias before they can safely interpret what it is telling them. I also talk about anchoring bias in conversational interfaces, where early hallucinations can propagate through later answers if memory is not reset between questions. That is not just a technical curiosity. That is a workflow and safety issue.
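
To make overcalling versus undercalling concrete, here is a minimal Python sketch of how one could check a copilot's tendency on a yes/no task such as neoplasm detection. It is not taken from the DALPHIN paper: the function name `call_profile`, the 0.05 margin, and the toy labels are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    sensitivity: float
    specificity: float
    tendency: str  # "overcalls", "undercalls", or "balanced"


def call_profile(truth, calls, margin=0.05):
    """Summarize a model's yes/no calls against ground truth.

    truth and calls are equal-length lists of booleans
    (True = neoplasm present / model called neoplasm).
    margin is a hypothetical tolerance, not a published threshold.
    """
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    fp = sum(1 for t, c in pairs if not t and c)
    tn = sum(1 for t, c in pairs if not t and not c)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")

    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (tn + fp)

    # More false alarms than misses suggests overcalling, and vice versa.
    if false_positive_rate - false_negative_rate > margin:
        tendency = "overcalls"
    elif false_negative_rate - false_positive_rate > margin:
        tendency = "undercalls"
    else:
        tendency = "balanced"

    return CallProfile(
        sensitivity=1 - false_negative_rate,
        specificity=1 - false_positive_rate,
        tendency=tendency,
    )


# Toy data: four neoplastic cases followed by four benign cases.
truth = [True] * 4 + [False] * 4
eager_model = [True] * 4 + [True, True, False, False]
cautious_model = [True, True, False, False] + [False] * 4

print(call_profile(truth, eager_model).tendency)
print(call_profile(truth, cautious_model).tendency)
```

The point of a check like this is that two tools with similar headline accuracy can err in opposite directions, and a pathologist reads a positive call differently depending on which direction that is.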

Why should you listen? Because this episode is really about a bigger question: What kind of evidence should pathologists demand before AI copilots enter real workflows? If you want to understand validation, data leakage, rare-case performance, uncertainty, and why these tools should still be treated as co-pilots rather than autopilots, this is a useful paper to know.

Episode Highlights

01:20 – Why I chose the DALPHIN preprint and why benchmarking matters right now.

05:38 – What is in the DALPHIN dataset: 300 cases, 130 diagnoses, 14 subspecialties, 6 countries.

07:57 – Top-line performance: PathChat+ reaches expert-level performance in 4 of 6 tasks.

09:41 – The benchmarking trap of data leakage and why DALPHIN’s sequestered ground truth matters.

12:19 – Why real pathology diagnosis is not text-only and why macro + micro context matters.

15:26 – Tissue recognition, neoplasm detection, ambiguity, and conversational memory: how the testing was structured.

21:29 – The diagnostic personalities of the models: overcalling, undercalling, and balanced behavior.

24:36 – Rare cancers: where AI copilots still fall short of expert human performance.

28:00 – Why binary outputs are not enough when pathology often lives in uncertainty.

31:37 – Anchoring bias and conversational memory: how early hallucinations can keep propagating.

37:11 – Why these tools should be treated as co-pilots, not autopilots.

40:29 – Resources for beginners: Digital Pathology 101 and continued AI literacy.

Resources mentioned

DALPHIN preprint: arXiv:2605.03544v1
DALPHIN evaluation platform: dalphin.grand-challenge.org
PathChat+ pathology-specific AI model discussed in the benchmark.
Digital Pathology 101 free eBook by Dr. Aleksandra Zuraw.
Educational streams on tissue recognition and computer vision literacy mentioned in the session.

