Base by Base

215: Protein Set Transformer for high-diversity viromics

20 min | 1 December 2025

Martin et al., Nat Commun (2025) - Protein Set Transformer (PST) is a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets. Key terms: viromics, protein-language-model, genome-embeddings, triplet-loss, host-prediction.

Study Highlights:
PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes the proteins with a multi-head attention encoder, and produces genome embeddings through learnable weighted pooling in the decoder. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation, and were evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote ones, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering, and boosted viral host-species prediction when used in a graph link-prediction framework.
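To make the pooling step concrete, here is a minimal sketch of how a set of protein embeddings can be collapsed into one genome embedding with softmax weights. It is an illustration only, not the PST implementation: the names `pool_genome` and `weight_vector` are hypothetical, the per-protein score is assumed to be a plain dot product with a learned vector, and the tiny 2-dimensional embeddings stand in for real ESM2 vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_genome(protein_embeddings, weight_vector):
    """Collapse a set of protein embeddings into one genome embedding.

    Assumption: each protein gets a scalar score from a dot product with
    `weight_vector` (a stand-in for learned parameters); the scores are
    normalized with softmax and used as averaging weights.
    """
    scores = [
        sum(x * w for x, w in zip(protein, weight_vector))
        for protein in protein_embeddings
    ]
    weights = softmax(scores)
    dim = len(protein_embeddings[0])
    return [
        sum(w * protein[i] for w, protein in zip(weights, protein_embeddings))
        for i in range(dim)
    ]


# Toy genome with three proteins, each a 2-dimensional embedding.
toy_genome = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [0.5, -0.2]  # hypothetical learned scoring vector
genome_embedding = pool_genome(toy_genome, toy_weights)
```

Because the weights depend only on each protein's own vector, the pooled result does not change when the proteins are reordered, which is the property that lets a genome be treated as a set. In this sketch, any positional or strand information would have to be carried inside the protein vectors themselves.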

Conclusion:
PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.

Music:
Enjoy the music based on this article at the end of the episode.

First author:
Martin

Journal:
Nat Commun (2025)

DOI:
10.1038/s41467-025-66049-4

Reference:
Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website: https://basebybase.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.com/episodes/protein-set-transformer

QC:
This episode was checked against the original article PDF and publication metadata for the episode release published on 2025-12-01.

QC Scope:
- article metadata and core scientific claims from the narration
- excludes analogies, intro/outro, and music
- transcript coverage: Substantive audit of the core scientific claims and results of the PST work (architecture, training, data, evaluation, functional insights, host prediction, generalizability, and biosafety/licensing) as presented in the transcript.
- transcript topics: PST architecture: genome modeled as set of proteins with context; Protein embeddings and genome-position/strand augmentation; Triplet loss training: Chamfer distance and PointSwap; Pretraining data scale: >100k viral genomes, >6M proteins; Performance: PST-TL outperforms PST-CTX and PST-MLM; Remote relation detection: ASI correlations with PST embeddings

QC Summary:
- factual score: 10/10
- metadata score: 10/10
- supported core claims: 8
- claims flagged for review: 0
- metadata checks passed: 4
- metadata issues found: 0

Metadata Audited:
- article_doi
- article_title
- article_journal
- license

Factual Items Audited:
- Genomes are modeled as sets of proteins with context (not as linear sequences)
- ESM2 protein embeddings are augmented with two learnable vectors representing protein position and coding strand
- Training uses triplet loss with Chamfer distance and PointSwap augmentation
- Pretraining dataset scales: >100k viral genomes encoding >6M proteins
- PST-TL outperforms alternative methods (PST-CTX, PST-MLM, etc.)
- PST embeddings correlate with ASI for remote relationships (positive ASI correlation when AI is near zero)
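
The audited training item (triplet loss with Chamfer distance) can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: it uses one common form of the Chamfer distance (mean nearest-neighbour squared Euclidean distance, summed over both directions), a hypothetical `margin` of 1.0, and toy 2-dimensional vectors. PointSwap augmentation is not shown.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def chamfer_distance(xs, ys):
    """Symmetric Chamfer distance between two sets of vectors.

    Assumption: each direction averages the squared distance from every
    vector to its nearest neighbour in the other set.
    """
    forward = sum(min(sq_dist(x, y) for y in ys) for x in xs) / len(xs)
    backward = sum(min(sq_dist(y, x) for x in xs) for y in ys) / len(ys)
    return forward + backward


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss over genomes represented as protein sets.

    The loss is zero once the negative genome is farther from the anchor
    than the positive genome by at least `margin`.
    """
    gap = chamfer_distance(anchor, positive) - chamfer_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy genomes: each is a set of 2-dimensional protein embeddings.
anchor = [[0.0, 0.0], [1.0, 0.0]]
positive = [[0.0, 0.1], [1.0, 0.1]]   # close to the anchor
negative = [[5.0, 5.0], [6.0, 5.0]]   # far from the anchor

loss_good = triplet_loss(anchor, positive, negative)  # well separated triplet
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped
```

With these toy sets the well separated triplet contributes no loss, while swapping the positive and negative genomes produces a large penalty. That is the signal that pulls related genomes together in embedding space.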

QC result: Pass.

Base by Base with Gustavo Barra is available on several platforms. The information on this page comes from public podcast feeds.