
MLG 033 Transformers

43 min • 9 February 2025

Links:

Notes and resources at ocdevel.com/mlg/33
3Blue1Brown videos: https://3blue1brown.com/
Try a walking desk: stay healthy & sharp while you learn & code
Try Descript: audio/video editing with AI power-tools

Background & Motivation

RNN Limitations: Sequential processing prevents full parallelization across time steps, even with attention tweaks, making RNNs inefficient on modern parallel hardware.
Breakthrough: “Attention Is All You Need” replaced recurrence with self-attention, unlocking massive parallelism and scalability.

Core Architecture

Layer Stack: Consists of alternating self-attention and feed-forward (MLP) layers, each wrapped in residual connections and layer normalization.
Positional Encodings: Self-attention has no built-in notion of token order (it is permutation equivariant: shuffling the inputs just shuffles the outputs), so sinusoidal or learned positional embeddings are added to inject sequence order.
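
A minimal sketch of the sinusoidal variant, assuming the formulation from the original paper (base 10000, sine on even feature indices, cosine on odd ones). The function name and toy sizes are illustrative, and pure Python is used to keep the example dependency-free:

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Even feature indices use sine, odd indices use cosine, with wavelengths
    that grow geometrically across the feature dimension.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
# Position 0 gives sin(0) = 0 on even indices and cos(0) = 1 on odd indices.
print(pe[0])
```

In a real model this table would be added element-wise to the token embeddings before the first layer.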

Self-Attention Mechanism

Q, K, V Explained:
- Query (Q): The representation of the token seeking contextual info.
- Key (K): The representation of tokens being compared against.
- Value (V): The information to be aggregated based on the attention scores.
Multi-Head Attention: Splits Q, K, V into multiple “heads” to capture diverse relationships and nuances across different subspaces.
Dot-Product & Scaling: Computes the similarity between Q and K as a dot product, scaled by 1/√d_k so that large scores don't saturate the softmax and shrink its gradients, then applies softmax to weight V accordingly.
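
The Q/K/V steps above can be sketched for a single head as softmax(Q Kᵀ / √d_k) V. This is a toy, pure-Python illustration with arbitrarily chosen inputs, not a production implementation. Multi-head attention would run the same routine once per head on separately projected slices of Q, K, and V, then concatenate the results:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (one row per token).
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V)))
               for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, so every output row is a convex combination of the rows of `V`.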

Masking

Causal Masking: In autoregressive models, prevents a token from attending to (“seeing”) future tokens, so each prediction depends only on earlier positions, matching how text is generated at inference time.
Padding Masks: Ignore padded (non-informative) parts of sequences to maintain meaningful attention distributions.
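
Both masks are typically applied by setting disallowed scores to negative infinity before the softmax, which gives those positions zero weight. A small sketch under that assumption follows. The helper names are hypothetical, and it assumes padding sits at the end of the sequence so every query keeps at least one allowed key:

```python
import math

NEG_INF = float("-inf")

def masked_softmax(scores, allowed):
    """Softmax over scores, giving zero weight where allowed[j] is False.

    Assumes at least one position is allowed.
    """
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def build_mask(seq_len, is_padding):
    """allowed[i][j] is True when query i may attend to key j.

    Combines the causal rule (j <= i) with the padding rule
    (key j must be a real token).
    """
    return [[j <= i and not is_padding[j] for j in range(seq_len)]
            for i in range(seq_len)]

# Toy example: 4 positions, the last one is padding; raw scores all equal.
is_padding = [False, False, False, True]
mask = build_mask(4, is_padding)
scores = [[0.0] * 4 for _ in range(4)]
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for row in weights:
    print(row)
```

With equal raw scores, each query spreads its attention uniformly over the real tokens at or before its own position, and the padded key never receives any weight.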

Feed-Forward Networks (MLPs)

Transformation & Storage: Post-attention MLPs apply non-linear transformations; many argue they’re where the “facts” or learned knowledge really get stored.
Depth & Expressivity: Their layered nature deepens the model’s capacity to represent complex patterns.
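
A rough sketch of the position-wise MLP: expand to a wider hidden layer, apply a non-linearity, project back. ReLU is assumed here (as in the original paper; many later models use other activations), and the weights are hand-picked toy values, not anything learned:

```python
def linear(x, W, b):
    """Compute x @ W + b for a vector x, weight matrix W (in x out), bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand, apply a non-linearity, project back."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hypothetical toy weights: d_model = 2, hidden width = 4.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.0, -1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5]]
# The same weights are applied to every token independently.
outputs = [feed_forward(t, W1, b1, W2, b2) for t in tokens]
print(outputs)
```

Unlike attention, this step never mixes information between tokens; it transforms each position on its own.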

Residual Connections & Normalization

Residual Links: Crucial for gradient flow in deep architectures; the skip path gives gradients a direct route around each sublayer, which helps mitigate vanishing gradients.
Layer Normalization: Stabilizes training by normalizing across features, enhancing convergence.
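
How the two pieces wrap a sublayer can be sketched as LayerNorm(x + sublayer(x)), the post-norm arrangement; many later models instead normalize before the sublayer (pre-norm). The learned scale and shift of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the MLP:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance.

    Learned scale and shift parameters are omitted for brevity.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Post-norm arrangement: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sublayer standing in for attention or the MLP.
def toy_sublayer(x):
    return [0.1 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
out = residual_block(x, toy_sublayer)
print(out)
```

The output has roughly zero mean and unit variance across features, regardless of the scale of the input.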

Scalability & Efficiency Considerations

Parallelization Advantage: Entire architecture is designed to exploit modern parallel hardware, a huge win over RNNs.
Complexity Trade-offs: Self-attention’s quadratic time and memory cost in sequence length remains a challenge and has spurred innovations like sparse or linearized attention.
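
To make the quadratic growth concrete, here is a back-of-the-envelope count of query-key scores per layer for standard (dense) attention. The helper name and the sequence lengths are illustrative only:

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores computed per layer: heads * n * n."""
    return num_heads * seq_len * seq_len

# Doubling the sequence length quadruples the number of scores.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n))
```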

Training Paradigms & Emergent Properties

Pretraining & Fine-Tuning: Massive self-supervised pretraining on diverse data, followed by task-specific fine-tuning, is the norm.
Emergent Behavior: With scale come abilities like in-context learning and few-shot adaptation, which are still being studied.

Interpretability & Knowledge Distribution

Distributed Representation: “Facts” aren’t stored in a single layer but are embedded throughout both attention heads and MLP layers.
Debate on Attention: While some see attention weights as interpretable, a growing view is that real “knowledge” is diffused across the network’s parameters.
