Intellectually Curious

The Perception Encoder: A Unified Path to Robust Vision-Language Learning

21 min · 7 May 2025
We unpack a groundbreaking approach called the Perception Encoder (PE), a single, scalable model trained with global vision-language contrastive learning on images and videos. Learn how PE surprisingly learns task-relevant features for OCR, object detection, depth estimation, and tracking without task-specific pretraining. We break down the training recipe and its key ablations (progressive resolution, high-resolution training, RoPE, attention pooling), and explain why robustness matters beyond standard benchmarks. Plus: how a three-phase video data engine builds high-quality captions to train PE on video, and what this could mean for the future of universal visual pre-training.
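The description names global vision-language contrastive learning as PE's training objective. As a rough illustration only, here is a minimal sketch assuming the standard CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings; the function names and the temperature of 0.07 are illustrative assumptions, not PE's published settings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Assumed CLIP-style objective: row i of image_emb is paired with row i of text_emb.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (N, N) image-to-text similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image must pick out its own caption among all captions in the batch.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption must pick out its own image among all images in the batch.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Matched pairs give a loss near zero, while shuffling the captions so that each image is paired with the wrong text drives the loss up. The "global" part is that every other pair in the batch serves as a negative, which is why such recipes typically favor large batches.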


Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.

Sponsored by Embersilk LLC

Intellectually Curious with Mike Breault is available on multiple platforms. The information on this page comes from public podcast feeds.