Korfmann K et al., Proceedings of the National Academy of Sciences (PNAS) - This episode examines cxt, a decoder-only transformer that performs next-coalescence prediction by translating local mutational context into pairwise TMRCA estimates. Trained on stdpopsim simulations, cxt delivers rapid, scalable coalescence-time inference, calibrated posteriors, and practical adaptations for empirical data. Key terms: language models, coalescent theory, uncertainty, stdpopsim, simulation-based inference.
Study Highlights:
The authors develop cxt, an autoregressive transformer that predicts discretized pairwise coalescence times from SFS-weighted mutation windows, framing TMRCA inference as a translation task. Trained on extensive stdpopsim simulations, cxt matches state-of-the-art accuracy in well-specified settings and generalizes to many out-of-sample species with some loss of accuracy. The model produces well-calibrated approximate posteriors, enables rapid GPU inference (millions of predictions in minutes), and can be fine-tuned or adapted for large Ne, missing data, or small sample sizes. Applications to human and Anopheles genomes recover known signals at LCT, HLA, inversion regions, and the Rdl insecticide-resistance locus.
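To make "discretized pairwise coalescence times" concrete, here is a minimal sketch of how continuous TMRCA values could be mapped to integer tokens and back. The log-spaced bins, the bin count, the time range, and all function names are illustrative assumptions; the episode notes do not describe the binning scheme cxt actually uses.

```python
import math
from bisect import bisect_right


def make_log_bins(t_min: float, t_max: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 log-spaced bin edges between t_min and t_max."""
    step = (math.log(t_max) - math.log(t_min)) / n_bins
    return [math.exp(math.log(t_min) + i * step) for i in range(n_bins + 1)]


def time_to_token(t: float, edges: list[float]) -> int:
    """Map a coalescence time to the index of its bin, clamping out-of-range values."""
    idx = bisect_right(edges, t) - 1
    return min(max(idx, 0), len(edges) - 2)


def token_to_time(token: int, edges: list[float]) -> float:
    """Map a token back to the geometric midpoint of its bin."""
    return math.sqrt(edges[token] * edges[token + 1])


# Hypothetical settings (not from the paper): 8 bins spanning 10 to 1e6 generations.
EDGES = make_log_bins(10.0, 1e6, 8)

# Times below or above the range fall into the first or last bin.
tokens = [time_to_token(t, EDGES) for t in (5.0, 250.0, 40_000.0, 2e6)]
print(tokens)
```

Once times are tokens, predicting the next coalescence time becomes a classification over bins, which is what lets a decoder-only transformer treat the task like next-token prediction.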
Conclusion:
cxt reframes coalescent inference as a language-modeling problem, providing a fast, scalable, and adaptable tool that learns priors from simulations to infer local TMRCA and aggregate demography while offering uncertainty quantification through approximate posteriors.
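As a rough illustration of what uncertainty quantification from an approximate posterior can look like, the sketch below turns a categorical distribution over time bins into a central credible interval and measures how often such intervals cover the true value. The bin edges, probabilities, and function names are made up for the example and are not taken from cxt or the article.

```python
from itertools import accumulate


def credible_interval(probs, edges, level=0.9):
    """Central credible interval from a categorical posterior over time bins."""
    total = sum(probs)
    cdf = [c / total for c in accumulate(probs)]
    lo_q = (1 - level) / 2
    hi_q = 1 - lo_q
    last = len(cdf) - 1
    # First bin whose cumulative mass reaches each quantile (fall back to last bin).
    lo_bin = next((i for i, c in enumerate(cdf) if c >= lo_q), last)
    hi_bin = next((i for i, c in enumerate(cdf) if c >= hi_q), last)
    return edges[lo_bin], edges[hi_bin + 1]


def coverage(truths, intervals):
    """Fraction of true values that fall inside their intervals."""
    hits = sum(lo <= t <= hi for t, (lo, hi) in zip(truths, intervals))
    return hits / len(truths)


# Made-up example: four bins between 10 and 100,000 generations.
edges = [10.0, 100.0, 1_000.0, 10_000.0, 100_000.0]
probs = [0.02, 0.18, 0.60, 0.20]

wide = credible_interval(probs, edges, level=0.9)
narrow = credible_interval(probs, edges, level=0.5)
print(wide, narrow)
```

For a well-calibrated model, `coverage` computed over many simulated sites with 90% intervals should land close to 0.9; with bins this coarse the intervals are conservative, so the check is only indicative.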
Music:
Enjoy the music based on this article at the end of the episode.
Article title:
Coalescence and translation: A language model for population genetics
First author:
Korfmann K
Journal:
Proceedings of the National Academy of Sciences (PNAS)
DOI:
10.1073/pnas.2518956123
Reference:
Korfmann K., Pope N. S., Meleghy M., Tellier A., Kern A. D. Coalescence and translation: A language model for population genetics. Proc. Natl. Acad. Sci. U.S.A. 2026;123:e2518956123. doi:10.1073/pnas.2518956123
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
On PaperCast Base by Base you'll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.com/episodes/cxt-language-model-for-population-genetics
QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2026-04-11.
QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Audited transcript sections describing the cxt model (architecture and next-coalescence prediction), training on stdpopsim, performance benchmarks, generalization to unseen species, empirical data applications (LCT, HLA, Rdl), missing data handling/adapters, and environmental considerations.
- transcript topics: ARG basics and coalescent theory as context; cxt architecture: decoder-only transformer and next-coalescence prediction; input encoding: mutational densities, SFS, rotary embeddings; training data: stdpopsim simulations and catalog breadth; benchmark comparisons: Singer+Polegon and SMC++; generalization to stdpopsim v0.3 and unseen species
QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0
Metadata Audited:
- article_doi
- article_title
- article_journal
- license
Factual Items Audited:
- cxt is a decoder-only transformer that autoregressively predicts local coalescence times (TMRCA) via next-coalescence prediction
- Trained on stdpopsim simulations; generalizes across demographies, including unseen species
- Inference is fast (millions of TMRCAs in minutes) on a single NVIDIA A100 GPU
- cxt yields well-calibrated approximate posteriors for TMRCA
- Compared to Singer+Polegon and SMC++, cxt is competitive and often superior in well-specified/out-of-distribution scenarios
- Empirical data show clear LCT and HLA signals in humans and Rdl dynamics in Anopheles; missing data handling improves robustness
QC result: Pass.
Base by Base with Gustavo Barra is available on several platforms.
