Feng H et al., Nat Commun - A comprehensive, unbiased benchmark compares five DNA foundation models across 57 datasets and multiple tasks, finding mean token embeddings improve classification and that model strengths vary by task and pre-training. Key terms: DNA foundation models, mean token embedding, sequence classification, variant effect, gene expression.
Study Highlights:
The study evaluated DNABERT-2, NT-v2, HyenaDNA, Caduceus-Ph, and GROVER on 57 datasets spanning sequence classification, gene expression prediction, variant effect quantification, and TAD recognition. Mean token embedding consistently and significantly outperformed summary-token and max pooling for sequence classification. Model performance was task-dependent: Caduceus-Ph excelled at human TFBS and promoter tasks, NT-v2 led pathogenic variant identification, and HyenaDNA scaled efficiently and benefited from multi-species pre-training, while specialized models outperformed general-purpose foundation models on QTL prediction. Zero-shot embeddings gave only modest gene expression prediction performance, and NT-v2 attention patterns did not reveal inherent TAD recognition.
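To make the three pooling strategies compared above concrete, here is a minimal sketch in Python with NumPy. It is not the benchmark's code: the function name `pool_embeddings`, the mask convention, and the toy numbers are all hypothetical, and placing the summary token at position 0 is an assumption (the position of a summary token differs between models).

```python
import numpy as np

def pool_embeddings(token_embeddings, attention_mask, strategy="mean"):
    """Collapse per-token embeddings into one sequence-level vector.

    token_embeddings: array of shape (seq_len, hidden_dim)
    attention_mask:   array of shape (seq_len,), 1 for real tokens, 0 for padding
    strategy:         "mean", "max", or "summary"
    """
    emb = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=bool)
    if emb.ndim != 2 or mask.shape != (emb.shape[0],):
        raise ValueError("expected (seq_len, hidden_dim) embeddings and a (seq_len,) mask")
    if not mask.any():
        raise ValueError("attention_mask selects no tokens")

    if strategy == "mean":
        # Average over real (non-padding) tokens only.
        return emb[mask].mean(axis=0)
    if strategy == "max":
        # Element-wise maximum over real tokens.
        return emb[mask].max(axis=0)
    if strategy == "summary":
        # ASSUMPTION: the summary token is the first token ([CLS]-style).
        return emb[0]
    raise ValueError(f"unknown strategy: {strategy!r}")

# Toy input (illustrative numbers only): three tokens, the last one is padding.
toy_embeddings = np.array([[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]])
toy_mask = np.array([1, 1, 0])

for name in ("mean", "max", "summary"):
    print(name, pool_embeddings(toy_embeddings, toy_mask, name))
```

Masking out padding matters for the mean and max variants: without it, padded positions would leak into the sequence-level vector.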
Conclusion:
Mean token pooling yields more robust sequence-level representations, and model choice should align with task, input length, and pre-training data for best genomic performance.
Music:
Enjoy the music based on this article at the end of the episode.
First author:
Feng H
Journal:
Nat Commun
DOI:
10.1038/s41467-025-65823-8
Reference:
Feng H, Wu L, Zhao B, Huff C, Zhang J, Wu J, Lin L, Wei P & Wu C. Benchmarking DNA foundation models for genomic and genetic tasks. Nat Commun. 2025;16:10780. https://doi.org/10.1038/s41467-025-65823-8
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.com/episodes/dna-foundation-models-benchmark
QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2025-12-31.
QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Audited the transcript's coverage of core scientific claims: DNA foundation models benchmarking, pooling strategies (mean token embedding), zero-shot embeddings with a downstream classifier, VEQ dichotomy, multispecies pre-training and cross-species generalization, long-sequence performance, TAD recognition limitations
- transcript topics: DNA foundation models and zero-shot embeddings; Pooling strategies for sequence representations (mean token embedding vs summary/max pooling); Downstream classification using zero-shot embeddings (random forest); Variant effect quantification: pathogenic vs QTL (VEQ dichotomy); Multispecies pre-training and cross-species generalization; Cross-species transfer in promoter identification (Arabidopsis example)
QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0
Metadata Audited:
- article_doi
- article_title
- article_journal
- license
Factual Items Audited:
- Mean token embedding consistently improves sequence classification across all foundation models and yields measurable AUROC gains.
- Zero-shot embeddings with frozen weights are evaluated with a downstream random forest classifier to measure embedding quality without fine-tuning.
- VEQ dichotomy: generalists excel at pathogenic variant identification; specialized models excel at tissue-specific QTL predictions.
- Multispecies pre-training improves generalization; 14 of 49 tasks show statistically significant improvements, with some tasks favoring human-only pre-training.
- Cross-species transfer evidence: HyenaDNA pre-trained on human genomes shows cross-species transfer advantages (e.g., Arabidopsis promoter identification).
- NT-v2 self-attention does not inherently recognize higher-order chromatin structures (TADs) in zero-shot mode.
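The zero-shot evaluation named in the audited items (frozen embeddings, a downstream random forest, AUROC as the metric) can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the helper `evaluate_frozen_embeddings`, the 70/30 split, the forest size, and the synthetic Gaussian "embeddings" are all hypothetical stand-ins for real model outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_frozen_embeddings(embeddings, labels, seed=0):
    """Train a random forest on fixed embeddings and return the held-out AUROC.

    The foundation model is never updated here; only the small downstream
    classifier is fit, so the score reflects the quality of the embeddings.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    # Probability of the positive class (label 1) for each held-out sequence.
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)

# SYNTHETIC stand-in for pooled embeddings: two Gaussian clusters, 8 dimensions.
rng = np.random.default_rng(0)
negatives = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
positives = rng.normal(loc=2.0, scale=1.0, size=(100, 8))
X = np.vstack([negatives, positives])
y = np.array([0] * 100 + [1] * 100)

auroc = evaluate_frozen_embeddings(X, y)
print(f"AUROC on synthetic embeddings: {auroc:.3f}")
```

With real data, `X` would hold one pooled embedding per sequence, and comparing pooling strategies amounts to running the same evaluation on each set of embeddings.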
QC result: Pass.
Base by Base with Gustavo Barra is available on multiple platforms.
