Martin et al., Nat Commun (2025) - Protein Set Transformer (PST) is a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets. Key terms: viromics, protein-language-model, genome-embeddings, triplet-loss, host-prediction.
Study Highlights:
PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings through learnable weighted pooling in the decoder. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation, then evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering and boosted viral host-species prediction when used in a graph link-prediction framework.
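The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random vectors stand in for ESM2 embeddings, a single attention head replaces the multi-head encoder, and every name and dimension (`build_inputs`, `self_attention`, `weighted_pool`, `pos_table`, `strand_table`, `w_score`) is a hypothetical chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_inputs(protein_emb, positions, strands, pos_table, strand_table):
    # Concatenate each protein vector with its positional and strand vectors.
    return np.concatenate(
        [protein_emb, pos_table[positions], strand_table[strands]], axis=1
    )

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over the proteins of one genome.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    return scores @ v

def weighted_pool(h, w_score):
    # Genome embedding as a weighted average of contextualized protein vectors.
    weights = softmax(h @ w_score, axis=0)
    return weights @ h, weights

rng = np.random.default_rng(0)
n_proteins, d_plm, d_extra = 5, 8, 2                  # toy sizes (assumed)
protein_emb = rng.normal(size=(n_proteins, d_plm))    # stand-in for ESM2 vectors
positions = np.arange(n_proteins)                     # protein order in the genome
strands = np.array([0, 0, 1, 1, 0])                   # coding strand per protein
pos_table = rng.normal(size=(n_proteins, d_extra))    # would be learned in training
strand_table = rng.normal(size=(2, d_extra))          # would be learned in training

x = build_inputs(protein_emb, positions, strands, pos_table, strand_table)
d = x.shape[1]
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_score = rng.normal(size=d)

h = self_attention(x, w_q, w_k, w_v)
genome_emb, weights = weighted_pool(h, w_score)
```

Because order enters only through the concatenated positional vectors, shuffling the rows of `x` leaves `genome_emb` unchanged in this sketch, which is one way to read the "set of proteins" framing.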
Conclusion:
PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.
Music:
Enjoy the music based on this article at the end of the episode.
First author:
Martin
Journal:
Nat Commun (2025)
DOI:
10.1038/s41467-025-66049-4
Reference:
Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.com/episodes/protein-set-transformer
QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2025-12-01.
QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Substantive audit of the core scientific claims and results described in the PST work (architecture, training, data, evaluation, functional insights, host prediction, generalizability, and biosafety/licensing) as presented in the transcript.
- transcript topics: PST architecture: genome modeled as set of proteins with context; Protein embeddings and genome-position/strand augmentation; Triplet loss training: Chamfer distance and PointSwap; Pretraining data scale: >100k viral genomes, >6M proteins; Performance: PST-TL outperforms PST-CTX and PST-MLM; Remote relation detection: ASI correlations with PST embeddings
QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0
Metadata Audited:
- article_doi
- article_title
- article_journal
- license
Factual Items Audited:
- Genomes are modeled as sets of proteins with context (not as linear sequences)
- ESM2 protein embeddings are augmented with two learnable vectors representing protein position and coding strand
- Training uses triplet loss with Chamfer distance and PointSwap augmentation
- Pretraining dataset scales: >100k viral genomes encoding >6M proteins
- PST-TL outperforms alternative methods (PST-CTX, PST-MLM, etc.)
- PST embeddings correlate with ASI for remote relationships (positive ASI correlation when AI is near zero)
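For readers unfamiliar with the two training terms audited above, here is a minimal NumPy sketch of a Chamfer distance between two sets of vectors and a margin-based triplet loss. It makes several assumptions: this is one common form of each quantity, the function names and the `margin` value are hypothetical, and the notes do not say how the Chamfer distance enters the paper's training procedure, so it is shown here only as a generic set-to-set distance. PointSwap is left out because its mechanics are not described in these notes.

```python
import numpy as np

def chamfer_distance(a, b):
    # Symmetric Chamfer distance between two sets of vectors (one vector per row):
    # mean nearest-neighbor distance from a to b, plus the same from b to a.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Hinge-style triplet loss: pull the positive closer to the anchor than the
    # negative by at least `margin` (Euclidean distance assumed here).
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy protein sets for two hypothetical genomes (rows are protein vectors).
genome_a = np.array([[0.0, 0.0], [1.0, 0.0]])
genome_b = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
set_distance = chamfer_distance(genome_a, genome_b)

# Toy genome embeddings for one triplet.
anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
far_negative = np.array([3.0, 0.0])
near_negative = np.array([1.0, 0.0])
loss_easy = triplet_loss(anchor, positive, far_negative, margin=1.0)
loss_hard = triplet_loss(anchor, positive, near_negative, margin=1.0)
```

In this toy case the far negative already satisfies the margin and contributes no loss, while the near negative does not and is penalized.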
QC result: Pass.
Base by Base with Gustavo Barra is available on multiple platforms.
