Mustango: Toward Controllable Text-to-Music Generation

2025-07-25T00:00:00+00:00

Melechovsky, Jan, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. “Mustango: Toward Controllable Text-to-Music Generation.” arXiv:2311.08355. Preprint, arXiv, June 3, 2024. https://doi.org/10.48550/arXiv.2311.08355.

Fundamentals of Music
MusicBench
Mustango
Inference
Experiments

Overview

Mustango is a diffusion-based text-to-music generation system that goes beyond general text conditioning. It introduces fine-grained control over musical attributes such as chords, beats, tempo, and key, enabling more structured and musically meaningful audio synthesis from natural language prompts.

MuNet: Music-Aware UNet Denoiser

At the core of Mustango is MuNet, a music-domain-informed UNet module that guides the reverse diffusion process. It integrates:

General text embeddings (from FLAN-T5), and
Predicted musical features (chords, beats, etc.) via hierarchical cross-attention layers.

This enables Mustango to generate music that faithfully follows the structural elements described in the input text.

MusicBench: A Richly Augmented Dataset

To train controllable models, the authors introduce MusicBench, a dataset built on top of MusicCaps with:

Over 52,000 samples
Text captions enhanced with automatically extracted and paraphrased descriptions of chords, tempo, key, and rhythm
Audio augmentations including pitch shifts, tempo changes, and volume variations

Challenges in Diffusion-based Music Generation

Musical structure enforcement: Music must obey formal rules like key signatures and chord progressions, which are difficult to evaluate and condition on.
Data scarcity: High-quality paired text-music datasets are limited and often lack rich musical annotations.
Representational depth: Most captions lack structural and harmonic detail. Mustango tackles this by learning to infer and control these aspects from text.

Fundamentals of Music

To understand musical features, we will use "We Will Rock You" by Queen as our example, assuming we have no prior music knowledge.

What is Music Made Of?

Think of music like a recipe with four main ingredients:

Beats = The steady pulse (like your heartbeat)
Chords = Multiple notes played together (like harmony)
Key = The musical "home base"
Tempo = How fast or slow the music moves

1. Beats and Downbeats - The Musical Heartbeat

A beat is like the steady tick of a clock in music. It's the pulse you naturally tap your foot to.

The most common time signature is 4/4, which means:

4 beats per measure
Each beat gets a quarter note duration

A measure is a recurring pattern of beats that creates the rhythmic structure of music. Think of it as a rhythmic "sentence" that repeats throughout the song.

Beat Type 1 | Beat Type 2 | Beat Type 3 | Beat Type 4
     ↓           ↓           ↓           ↓
   STRONG       weak      medium       weak
  (downbeat)

Beat:     1    2    3    4  |  1    2    3    4
Pattern: STOMP STOMP CLAP -- | STOMP STOMP CLAP --
Sound:   "WE"  "WILL" "ROCK" | "YOU"  (rest) (clap)
Type:     1     2     3    4 |   1     2     3    4

STOMP - STOMP - CLAP - (silence)
  1   -   2   -   3   -    4

Beat Types Explained

Beat 1 (Downbeat): Strongest, first stomp, marching step
Beat 2: Weaker, second stomp
Beat 3: Medium strength, the clap
Beat 4: Weakest, often silence

Measure 1: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)
Measure 2: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)
Measure 3: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)

Genre Emphasis

Rock/Pop: Emphasis on beats 2 and 4
Classical/Folk: Emphasis on beats 1 and 3
Reggae: Off-beat emphasis (skank)

Output of Beat Timings:

Beat Type | Time (seconds)
    1     |     0.0
    2     |     0.5
    3     |     1.0
    4     |     1.5
    1     |     2.0
    2     |     2.5

2. Chords - Musical Building Blocks

A chord is when you play multiple notes at the same time.

Single note = one voice
Chord = choir

The Musical Alphabet: A, B, C, D, E, F, G (then repeats)

Basic Chord Types

Major Chords: Happy, bright (e.g., C Major = C, E, G)

Minor Chords: Sad, emotional (e.g., A Minor = A, C, E)

Piano Keys: C  D  E  F  G  A  B  C  D  E  F  G
            |     |     |        |     |
C Major:    C     E     G        
G Major:                   G     B     D

"Twinkle, Twinkle, Little Star"
Twinkle, twinkle    = C major
little star         = G major
How I wonder        = C major
what you are        = G major

Chord Progression: Most songs change chords to create tension and release. "We Will Rock You" mostly stays on one chord (E minor).

Chord Inversion: Rearranging note order (same notes, different stacking)

3. Keys - The Musical Home Base

Key = Home note everything gravitates toward

We Will Rock You key = E minor

E feels stable and complete
Dark, serious tone (minor)
You can hear the key by humming the final note of the song

4. Tempo - The Speed of Music

Tempo = Beats Per Minute (BPM)

"We Will Rock You" tempo = 114 BPM

~2 beats per second
Moderate tempo (80–120 BPM)
Easy for crowd participation

How All Four Features Work Together

Time:     0s    1s    2s    3s    4s    5s    6s    7s
Beats:    1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4
Pattern:  STOMP STOMP CLAP - | STOMP STOMP CLAP -
Chord:    [---- E minor ----][---- E minor ----]
Key:      [------------- E minor throughout -------------]
Tempo:    [------------- 114 BPM steady ----------------]

Why This Combination is Powerful

Simple beat pattern = Easy for crowds
Single chord = Hypnotic repetition
Minor key = Powerful emotion
Moderate tempo = Accessible and energetic

MusicBench

Feature Extraction

MusicBench extracts musical features from audio and converts them into text-based control information to improve music generation systems. Four features are extracted:

Beats and downbeats
Chords
Keys
Tempo

These features serve a dual purpose: they enhance text prompts with specific musical information and guide the music generation process during the reverse diffusion phase.

1. Beat and Downbeat Extraction

Uses BeatNet which outputs:

\[b \in \mathbb{R}^{(L_{beats} \times 2)}\]

This mathematical notation means: $b$ is a matrix with $L_{beats}$ rows and 2 columns

$L_{beats}$ represents the total number of beats detected in the audio → Each row is one beat event.
$2$: Beat Type $\in \{1, 2, 3, 4\}$ and Time

Data Structure:

First dimension (column 1): Beat type according to meter
Second dimension (column 2): Precise timing in seconds when each beat occurs

2. Tempo Extraction

Calculation Method: Averaging the reciprocal of time intervals between beats

Mathematical Process:

Measure time intervals between consecutive beats
Take the reciprocal (1/interval) to get instantaneous tempo
Average these values across the entire piece
Convert to BPM (beats per minute)

This approach accounts for tempo variations within a song rather than assuming constant tempo.
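The averaging procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `estimate_tempo_bpm` is hypothetical, and the beat times are assumed to come from the second column of $b$.

```python
import numpy as np

def estimate_tempo_bpm(beat_times):
    """Estimate tempo as the mean reciprocal inter-beat interval, in BPM."""
    beat_times = np.asarray(beat_times, dtype=float)
    if beat_times.size < 2:
        raise ValueError("need at least two beats to estimate tempo")
    intervals = np.diff(beat_times)              # seconds between consecutive beats
    if np.any(intervals <= 0):
        raise ValueError("beat times must be strictly increasing")
    beats_per_second = np.mean(1.0 / intervals)  # average instantaneous tempo
    return float(60.0 * beats_per_second)        # convert to beats per minute

# Beats spaced 0.5 s apart correspond to 120 BPM.
print(estimate_tempo_bpm([0.0, 0.5, 1.0, 1.5, 2.0, 2.5]))  # 120.0
```

Averaging the reciprocals (rather than taking the reciprocal of the average interval) gives each instantaneous tempo equal weight.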

3. Chord Extraction

Uses Chordino which outputs:

\[c \in \mathbb{R}^{(L_{chords} \times 3)}\]

This means:

c is a matrix with $L_{chords}$ rows and 3 columns
$L_{chords}$ represents the number of chord segments identified

Data Structure:

First dimension (column 1): Chord roots
- Examples: C, D♭, F#, etc. (the fundamental note of each chord)
Second dimension (column 2): Chord quality/type
- Examples: major, minor, maj7, dim, sus4, etc.
Third dimension (column 3): Inversion information
- Indicates whether the chord is in root position or inverted
- Example: C major could be C-E-G (root), E-G-C (first inversion), or G-C-E (second inversion)

4. Key Extraction

Tool Used: Essentia’s KeyExtractor algorithm

Purpose: Identifies the overall tonal center or key signature of the piece

Examples: C major, A minor, F# major, etc.

Description Enrichment

The extracted numerical features are converted into natural language descriptions using predefined templates.

Example Control Sentences:

"The song is in the key of A minor. The tempo of this song is Adagio.
The beat counts to 4. The chord progression is Am, Cmaj7, G."
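A template-based conversion of this kind might look like the sketch below. The sentence wording is copied from the example above; the function name `control_sentences` and the idea of one template per feature are assumptions, since the authors' full template set is not shown here.

```python
def control_sentences(key, tempo_marking, beat_count, chords):
    """Fill fixed templates (wording taken from the example) with extracted features."""
    return [
        f"The song is in the key of {key}.",
        f"The tempo of this song is {tempo_marking}.",
        f"The beat counts to {beat_count}.",
        f"The chord progression is {', '.join(chords)}.",
    ]

sentences = control_sentences("A minor", "Adagio", 4, ["Am", "Cmaj7", "G"])
print(" ".join(sentences))
```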

Augmentation and Music Diversification

The augmentation described below resulted in an 11-fold increase in training data.

Why Standard Audio Augmentation Fails for Music?

Traditional Approach (e.g., Tango model): Take two audio samples → Normalize them to similar audio levels → Superimpose (layer) the audio tracks → Concatenate their text descriptions

Why This Fails for Music:

Overlapping rhythms: Two different rhythmic patterns create chaotic, unmusical results
Harmonic dissonance: Combining different chord progressions creates unpleasant harmonic clashes
Conceptual mismatch: Mixing a “sad piano ballad” with an “upbeat rock song” creates conceptually incoherent training examples

The Three-Dimensional Augmentation Strategy

Instead of combining multiple audio sources, they modify individual music samples along three fundamental musical dimensions:

Pitch Augmentation (Melodic Dimension)
1. Tool: PyRubberband, Range: ±3 semitones, Distribution: Uniform
2. Technical Details:
  1. Semitone range rationale: Keeps instrument timbre relatively untouched
  2. Larger pitch shifts would cause unnatural timbre changes (e.g., making a piano sound artificially high or low)
  3. 3 semitones = approximately a minor third interval (musically significant but not timbre-destroying)
3. Musical Impact:
  1. Changes the perceived pitch/key of the music
  2. Maintains the relative intervals between notes (preserving melody shape)
  3. Creates training examples in different keys from the same source material
Speed Augmentation (Rhythmic Dimension)
1. Range: ±(5% to 25%) speed change, Distribution: Uniform
2. Technical Implications:
  1. 5-25% range represents musically meaningful tempo variations
  2. Slower speeds (−25%) create more relaxed, contemplative versions
  3. Faster speeds (+25%) create more energetic, urgent versions
  4. Maintains pitch relationships while altering rhythmic feel
3. Musical Impact:
  1. Changes the perceived energy and mood
  2. Affects the rhythmic groove and feel
  3. Creates training examples spanning different tempo ranges from single sources
Volume Augmentation (Dynamic Dimension)
1. Method: Gradual volume changes (crescendo and decrescendo), Minimum volume: 0.1 to 0.5 times original amplitude (uniform distribution), Maximum volume: Kept untouched
2. Technical Design:
  1. Gradual changes rather than sudden volume jumps (more musically natural)
  2. Crescendo: Gradual volume increase
  3. Decrescendo: Gradual volume decrease
  4. Amplitude range: 10-50% of original volume for minimum, preserving dynamic range
3. Musical Impact:
  1. Simulates different recording conditions and mixing styles
  2. Creates variations in perceived intensity and drama
  3. Maintains musical expressiveness while varying dynamic profiles
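As a rough illustration of the volume augmentation, the sketch below applies a gradual gain ramp with NumPy. The minimum gain is drawn uniformly from 0.1 to 0.5 and the maximum is left at 1.0, as described above; the linear ramp shape and the function name `volume_ramp` are assumptions, not details taken from the paper.

```python
import numpy as np

def volume_ramp(audio, rng, crescendo=True):
    """Apply a gradual gain ramp between a random minimum gain and 1.0."""
    audio = np.asarray(audio, dtype=float)
    min_gain = rng.uniform(0.1, 0.5)                    # minimum volume (fraction of original)
    gain = np.linspace(min_gain, 1.0, audio.shape[-1])  # assumption: linear ramp
    if not crescendo:
        gain = gain[::-1]                               # decrescendo: loud to quiet
    return audio * gain

rng = np.random.default_rng(0)
clip = np.ones(16000)                                   # stand-in for one second of audio
louder = volume_ramp(clip, rng, crescendo=True)
quieter = volume_ramp(clip, rng, crescendo=False)
```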

MusicBench

Derived from MusicCaps, which contains:

5,521 music audio clips (10 seconds each)
Clips include ~4-sentence English captions
Sourced from AudioSet
Due to missing audio, the usable dataset was reduced to 5,479 samples

Dataset Splitting

Initial Split:
1. Divided into TrainA and TestA
Control Prompts Addition: TrainB - TestB
1. Appended 0–4 control sentences (describing music features) to form:
  1. TrainB from TrainA
  2. TestB from TestA
2. Control prompt count probabilities: 0 → 25%, 1 → 30%, 2 → 20%, 3 → 15%, 4 → 10%
Paraphrasing (Text Robustness):
1. TrainC created by paraphrasing TrainB captions using ChatGPT
2. Later, all captions (original & augmented) were also paraphrased using ChatGPT
3. Final training prompts use 85% paraphrased / 15% original
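The control-sentence step can be sketched with the standard library. The count probabilities are the ones listed above; how the individual sentences are picked (here, at random without replacement) and the function name `append_control_sentences` are assumptions made for illustration.

```python
import random

COUNTS = [0, 1, 2, 3, 4]
WEIGHTS = [0.25, 0.30, 0.20, 0.15, 0.10]   # probabilities for 0..4 control sentences

def append_control_sentences(caption, control_sentences, rng):
    """Append a randomly sized subset of control sentences to a caption."""
    k = rng.choices(COUNTS, weights=WEIGHTS, k=1)[0]
    k = min(k, len(control_sentences))       # cannot append more than are available
    chosen = rng.sample(control_sentences, k)  # assumption: sampled without replacement
    return " ".join([caption] + chosen)

rng = random.Random(0)
controls = [
    "The song is in the key of A minor.",
    "The tempo of this song is Adagio.",
    "The beat counts to 4.",
    "The chord progression is Am, Cmaj7, G.",
]
print(append_control_sentences("A calm piano piece.", controls, rng))
```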

Data Filtering (Quality Control)

Low-quality samples filtered out using keyword filter:

Removed any sample whose caption contains: “quality” (often refers to “poor quality”) and “low fidelity”.
Remaining high-quality subset: 3,413 samples

Audio Augmentation (Total ~37k samples):

Each high-quality sample augmented into 11 variants:
- Pitch shifts (6 total): ±1, ±2, ±3 semitones (excl. 0)
- Tempo alterations (4 total)
- Volume alteration (1 total)
These augmented samples form a dataset of 37,543 samples (3,413 × 11)

Final Training Set Construction

Final set created by combining:
- TrainA + TrainB + TrainC
- All augmented audio samples (with paraphrased and original captions)
Resulting in a total of 52,768 samples → This final training set is called MusicBench

Test Set Details

TestA/TestB contain:
- 200 low-quality samples
- 200 high-quality samples
Purpose: Designed to be challenging for evaluating controllability of the Mustango model

Mustango

Mustango is a text-to-music generation model composed of two main components:

Latent Diffusion Model (LDM)
MuNet (Music-informed UNet Denoiser)

We have a latent audio representation — basically, a compressed version of music ($z_0$) created using a VAE (a type of encoder).

The model corrupts this clean music into random noise step by step (forward process), and then learns to reverse that and rebuild the music from noise — step by step — using a smart denoiser.

Latent Diffusion Model (LDM)

The goal is to reduce computation while retaining expressivity by operating in a compressed latent space instead of on raw audio.

VAE (Variational Autoencoder):
- Encodes raw waveform to latent representation $z_0$
- Pretrained VAE from AudioLDM is used
Diffusion Process:
- Forward: Corrupts $z_0$ into noise $z_N \sim \mathcal{N}(0, I)$ using a Gaussian noise schedule
  \[q(z_n | z_{n-1}) = \mathcal{N}(\sqrt{1 - \beta_n} z_{n-1}, \beta_n I)\]
- Reverse: Reconstructs $z_0$ from $z_n$ using MuNet (denoiser) conditioned on music and text
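To make the forward process concrete, here is a minimal NumPy sketch that applies the transition $q(z_n | z_{n-1})$ repeatedly. The linear $\beta$ schedule, its endpoint values, the number of steps, and the latent shape are all assumptions (common DDPM-style defaults), not settings confirmed for Mustango.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Assumed schedule: beta grows linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(z_prev, beta_n, rng):
    """Sample z_n ~ N(sqrt(1 - beta_n) * z_{n-1}, beta_n * I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_n) * z_prev + np.sqrt(beta_n) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
z = 3.0 * rng.standard_normal((8, 16))   # stand-in for a VAE latent z_0
for beta_n in betas:
    z = forward_step(z, beta_n, rng)     # after the loop, z is close to N(0, I)

print(np.prod(1.0 - betas))              # fraction of the original signal scale left (squared)
```

With this schedule almost none of the original latent survives after the last step, which is why sampling can start from pure Gaussian noise.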

Objective

Train the denoiser $\hat{\epsilon}_\theta(z_n, C)$ to estimate noise via a noise prediction loss:

\[\mathcal{L}_{\text{LDM}} = \sum_{n=1}^N \gamma_n \mathbb{E}_{\epsilon_n, z_0} \left[ \| \epsilon_n - \hat{\epsilon}_\theta^{(n)}(z_n, C) \|^2 \right]\]

MuNet

Acts as the core denoiser in reverse diffusion. It’s designed to:

Integrate music-domain knowledge
Accept joint conditions:
\[C := \{\tau \text{ (text)}, b \text{ (beats)}, c \text{ (chords)}\}\]

\[\begin{aligned}&\textbf{Input: } z_n \\&\Downarrow \\&\text{Apply Cross Attention:} \\&\quad \text{Text } \tau \rightarrow \text{FLAN-T5} \rightarrow A_\tau \\&\quad \text{Beats } b \rightarrow \text{Enc}^b(b) \rightarrow A_b \\&\quad \text{Chords } c \rightarrow \text{Enc}^c(c) \rightarrow A_c \\&\Downarrow \\&\text{UNet}(A_c) \rightarrow \text{Output: } \hat{\epsilon}_\theta(z_n, C)\end{aligned}\]

MHA is the multi-headed attention block for the cross attentions, where Q, K, and V are query, key, and value, respectively
FLAN-T5 is the text encoder model adopted from Tango.
Cross-attention is applied to the beat first, as a consistent rhythm is a fundamental basis for the generated music.
MuNet consists of a UNet, with a total of L downsampling, middle, and upsampling blocks, and multiple conditioning cross-attention blocks.
Both $Enc^b$ and $Enc^c$ leverage the state-of-the-art Fundamental Music Embedding (FME) and Music Positional Encoding (MPE)
Beat Encoder $Enc^b$:
- One-hot beat type: $\text{OH}_b(b[:, 0])$
- Music Positional Encoding on beat time: $\text{MPE}(b[:, 1])$
- Combined and passed through linear projection $W_b$
\[\text{Enc}^b(b) = W_b(\text{OH}_b(b[:,0]) \oplus \text{MPE}(b[:,1]))\]
Chord Encoder $Enc^c$:
- $\text{FME}(c[:,0])$: Fundamental Music Embedding of chord root
- $\text{OH}_t(c[:,1])$: One-hot chord type
- $\text{OH}_i(c[:,2])$: One-hot chord inversion
- $\text{MPE}(c[:,3])$: Positional encoding of chord timing
- Combined then projected: $W_c$
  \[\text{Enc}^c(c) = W_c(\text{FME}(c[:,0]) \oplus \text{OH}_t(c[:,1]) \oplus \text{OH}_i(c[:,2]) \oplus \text{MPE}(c[:,3]))\]
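The beat encoder formula can be illustrated with a small NumPy sketch. The one-hot and concatenation steps follow the formula for $\text{Enc}^b$; the sinusoidal function below is only a stand-in for MPE (its exact form is an assumption), and the dimensions and the random matrix `W_b` are hypothetical placeholders for learned parameters.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One-hot encode integer class indices."""
    return np.eye(num_classes)[indices]

def sinusoidal_encoding(times, dim):
    """Stand-in for MPE (assumption): sinusoidal encoding of real-valued times."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = np.asarray(times, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def encode_beats(b, W_b, pe_dim):
    """Enc^b(b) = W_b(OH_b(b[:, 0]) concatenated with MPE(b[:, 1]))."""
    beat_types = b[:, 0].astype(int) - 1              # beat types 1..4 -> indices 0..3
    features = np.concatenate(
        [one_hot(beat_types, 4), sinusoidal_encoding(b[:, 1], pe_dim)], axis=1
    )
    return features @ W_b                             # linear projection

rng = np.random.default_rng(0)
b = np.array([[1, 0.0], [2, 0.5], [3, 1.0], [4, 1.5], [1, 2.0], [2, 2.5]])
W_b = rng.standard_normal((4 + 8, 16))                # hypothetical projection weights
print(encode_beats(b, W_b, pe_dim=8).shape)           # (6, 16)
```

The chord encoder follows the same pattern with more parts concatenated before the projection $W_c$.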

Reverse Diffusion Process

The reverse diffusion process reconstructs the latent audio prior $z_0$ from pure noise $z_N \sim \mathcal{N}(0, I)$, step-by-step, using a parametrized denoiser $\hat{\epsilon}_\theta^{(n)}(z_n, C)$

Reverse Transition Distribution

\[p_\theta^{\text{mus}}(z_{n-1} \mid z_n, C) = \mathcal{N}(\mu_\theta^{(n)}(z_n, C), \tilde{\beta}_n)\]

At each step n, the model samples $z_{n-1}$ from a Gaussian with:

Mean $\mu_\theta^{(n)}$
Variance $\tilde{\beta}_n$
1. Mean for Reverse Step

\[\mu_\theta^{(n)}(z_n, C) = \frac{1}{\sqrt{\alpha_n}} \left[z_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}} \hat{\epsilon}_\theta^{(n)}(z_n, C)\right]\]

Where:

\[\alpha_n = 1 - \beta_n\]
\[\bar{\alpha}_n = \prod_{i=1}^n \alpha_i\]
$\hat{\epsilon}_\theta^{(n)}$: the noise predicted by the model at step n
This formula adjusts $z_n$ by subtracting estimated noise and rescales it to predict $z_{n-1}$
2. Variance of Reverse Step

\[\tilde{\beta}_n = \frac{1 - \bar{\alpha}_{n-1}}{1 - \bar{\alpha}_n} \beta_n\]

This scales the forward diffusion variance $\beta_n$ into an appropriate reverse-time variance.
3. Diffusion Coefficients

\[\alpha_n = 1 - \beta_n,\quad \qquad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i\]

$\alpha_n$ controls how much signal is preserved at each step.
$\bar{\alpha}_n$ is the cumulative product of all previous $\alpha$’s and is used in rescaling.
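Putting the mean and variance formulas together, a single reverse step could be sketched as follows. The function name `reverse_step`, the 1-indexed step convention, and the example schedule are assumptions; `eps_hat` stands in for the output of the denoiser, which is not implemented here.

```python
import numpy as np

def reverse_step(z_n, eps_hat, n, betas, rng):
    """Sample z_{n-1} from N(mu, beta_tilde * I); n is 1-indexed as in the text."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    alpha_n, alpha_bar_n, beta_n = alphas[n - 1], alpha_bars[n - 1], betas[n - 1]
    # Mean: remove the predicted noise, then rescale.
    mean = (z_n - (1.0 - alpha_n) / np.sqrt(1.0 - alpha_bar_n) * eps_hat) / np.sqrt(alpha_n)
    if n == 1:
        return mean  # alpha_bar_0 = 1, so beta_tilde is zero at the last step
    alpha_bar_prev = alpha_bars[n - 2]
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_n) * beta_n
    return mean + np.sqrt(beta_tilde) * rng.standard_normal(z_n.shape)

# Sanity check: with the true noise as the prediction, the last step recovers z_0.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 10)       # assumed schedule
z0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
z1 = np.sqrt(1.0 - betas[0]) * z0 + np.sqrt(betas[0]) * eps
print(np.allclose(reverse_step(z1, eps, 1, betas, rng), z0))  # True
```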

Training Details

The authors used three types of input dropout during training:

Dropout #1 – Drop everything (5% chance)
1. The model sees no text, no beat, and no chord input.
2. Why? So it learns to generate music even with zero guidance (like unconditional generation).
Dropout #2 – Drop one input at a time (5% chance)
1. It might drop just text, or just beats, or just chords.
2. Helps the model handle incomplete input (e.g., caption but no beat info).
Dropout #3 – Mask parts of a prompt
1. The longer the prompt, the more likely it is to be partially masked.
  \[\text{Mask chance} = \min\left(100, \frac{10N}{M}\right)\%\]
  where:
  - N = number of sentences in the current prompt
  - M = average sentences across prompts
2. Then, remove 20–50% of sentences randomly
3. This teaches the model to deal with short or incomplete captions.
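The three dropout schemes can be sketched with the standard library as below. The probabilities and the mask formula come from the list above; treating the two 5% events as mutually exclusive, rounding the number of removed sentences, and the function names are assumptions made for illustration.

```python
import random

def mask_probability(num_sentences, avg_sentences):
    """Mask chance = min(100, 10 * N / M) percent, returned as a fraction."""
    return min(100.0, 10.0 * num_sentences / avg_sentences) / 100.0

def mask_prompt(sentences, avg_sentences, rng):
    """Dropout #3: with the mask chance, remove 20-50% of the sentences."""
    if rng.random() >= mask_probability(len(sentences), avg_sentences):
        return list(sentences)
    fraction = rng.uniform(0.2, 0.5)
    num_remove = round(fraction * len(sentences))   # assumption: round to nearest
    removed = set(rng.sample(range(len(sentences)), num_remove))
    return [s for i, s in enumerate(sentences) if i not in removed]

def drop_conditions(text, beats, chords, rng):
    """Dropouts #1 and #2: drop all inputs (5%) or exactly one input (5%)."""
    r = rng.random()
    if r < 0.05:
        return None, None, None
    if r < 0.10:                                     # assumption: exclusive with #1
        dropped = rng.choice(["text", "beats", "chords"])
        return (
            None if dropped == "text" else text,
            None if dropped == "beats" else beats,
            None if dropped == "chords" else chords,
        )
    return text, beats, chords

print(mask_probability(4, 4))  # 0.1
```

A prompt of average length is therefore masked about 10% of the time, and a prompt ten times longer than average is always masked.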

Hardware

Training used 4× Tesla V100 GPUs and 8× RTX 8000s.
Training took 5–10 days
Effective batch size: 32

Inference

During training, the model is given the actual (ground-truth) beats and chords.

But during inference (real-world use), it must predict those from just the text description.

Given only a text caption like:

“Soothing techno music with 120 bpm in the key of D major.”

The model must automatically figure out:

Where the beats are (when each beat happens)
What the chords are (and when they change)

To do this, Mustango uses two separate pre-trained transformer models:

1. Beat Predictor — Using DeBERTa Large

Input: Text caption
Output:
1. Beat Count (1 to 4): The beat count (meter) of corresponding music
  - Predicted using classification on the first token (4-class classification)
2. Beat Timings (the rhythm): The sequence of interval durations between the beats (their timing)
  - Predicted as float values from the next tokens (durations between beats)

Example: Suppose the model predicts:

Beat Count: 2
Intervals: t1, t2, t3, t4, …

Then the beat positions will be:

Beat 1 at t1
Beat 2 at t1 + t2
Beat 1 again at t1 + t2 + t3
Beat 2 again at t1 + t2 + t3 + t4
And so on…

(Repeated in alternating fashion for 10 seconds)
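The conversion from predicted intervals to beat positions is a running sum. A minimal sketch, assuming the predictor's outputs are already available as a beat count and a list of interval durations (the function name and the cut-off at the clip length are illustrative choices):

```python
from itertools import accumulate

def beats_from_intervals(beat_count, intervals, clip_seconds=10.0):
    """Return (beat_type, time) pairs, cycling beat types 1..beat_count."""
    beats = []
    for i, t in enumerate(accumulate(intervals)):   # t1, t1 + t2, t1 + t2 + t3, ...
        if t > clip_seconds:
            break                                   # stop at the end of the clip
        beats.append((i % beat_count + 1, t))
    return beats

print(beats_from_intervals(2, [0.5, 0.5, 0.5, 0.5]))
# [(1, 0.5), (2, 1.0), (1, 1.5), (2, 2.0)]
```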

2. Chord Predictor — Using FLAN-T5 Large

Predict the chords used in the music, and when each chord happens.

Input:
- The text caption
- Verbalized beat sequence from DeBERTa:
  
  Timestamps: t1, t1 + t2, t1 + t2 + t3 . . . , Max Beat: 2
Output:
- Chord progression over time in natural language
- For example:
  
  “Am at 1.11; E at 4.14; C#maj7 at 7.18”

This is a sequence-to-sequence generation task, where the model outputs something that looks like music sheet annotations.

Any chords predicted after 10 seconds are ignored (since all music samples are only 10 seconds long)
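Turning the generated string back into a chord sequence is a simple parsing step. The sketch below assumes the `chord at time; ...` format shown in the example and drops anything after the 10-second mark; the function name and the choice to skip malformed items silently are assumptions.

```python
def parse_chords(text, clip_seconds=10.0):
    """Parse 'Am at 1.11; E at 4.14' into [(chord, time), ...] within the clip."""
    chords = []
    for item in text.split(";"):
        name, sep, time_str = item.strip().rpartition(" at ")
        if not sep or not name:
            continue                      # skip malformed items
        try:
            time = float(time_str)
        except ValueError:
            continue                      # skip items without a numeric time
        if time <= clip_seconds:
            chords.append((name, time))   # ignore chords after the clip ends
    return chords

print(parse_chords("Am at 1.11; E at 4.14; C#maj7 at 7.18; G at 10.5"))
# [('Am', 1.11), ('E', 4.14), ('C#maj7', 7.18)]
```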

How Good Are the Beat & Chord Predictors?

During inference, Mustango predicts beats & chords from text. But do these predicted features work well?

When Control Sentences Are Present (TestB): Predictors do very well, with 94.5% accuracy
When Control Sentences Are Missing (TestA): Performance dips, but still better than Tango
- This means the predictors don’t hurt Mustango’s quality when control is missing

Final Output

Now you have:

Beat sequence: when each beat hits
Chord sequence: when chords start and change

These are passed into the MuNet denoiser, and the final music is generated using reverse diffusion.

Classifier-Free Guidance at Inference

\[\hat{\epsilon}_\theta^{(n)}(z_n, C) = w \cdot \epsilon_\theta^{(n)}(z_n, C) + (1 - w) \cdot \epsilon_\theta^{(n)}(z_n)\]

Purpose: Improves generation quality and controllability
Explanation:
- The model is trained to predict noise both with and without conditions C
- At inference, both versions are interpolated using a guidance scale $w$
  - $w > 1$: more faithful to the condition
  - $w = 0$: unconditional generation
  - $w = 1$: purely conditional prediction, no extra guidance
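The guidance formula is a one-line interpolation. In this sketch the two noise predictions are plain arrays standing in for the conditional and unconditional outputs of the denoiser; the function name `guided_noise` is hypothetical.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: w * eps(z, C) + (1 - w) * eps(z)."""
    return w * eps_cond + (1.0 - w) * eps_uncond

eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.0, 1.0])
print(guided_noise(eps_cond, eps_uncond, 0.0))  # [0. 1.]  unconditional
print(guided_noise(eps_cond, eps_uncond, 1.0))  # [1. 2.]  conditional
print(guided_noise(eps_cond, eps_uncond, 3.0))  # [3. 4.]  pushed past the conditional prediction
```

For $w > 1$ the result extrapolates away from the unconditional prediction, which is what makes the output follow the condition more strongly.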

Experiments

Questions explored:

How good is the music quality produced by Mustango?
Is Mustango better than other models like Tango, MusicGen, AudioLDM2?
Can Mustango follow control instructions well (like beat, chord, key)?
Is their dataset (MusicBench) strong enough to train a model from scratch?

Models Compared

Mustango Variants:

Model Name	Description
Tango on MusicCaps	Simple baseline, no MuNet, small dataset
Tango on MusicBench	Same architecture, better data
Mustango on MusicBench	Adds MuNet + good data
Pretrained Tango → AudioCaps → MusicBench	Transfer learning version
Pretrained Mustango → MusicBench	Strongest variant (MuNet + pretrained)

Other State-of-the-Art Models:

Model	Notes
MusicGen (small, medium)	Text-to-music model
AudioLDM2 (music version)	Text-to-audio model trained on music

Training & Evaluation Dataset

All models were trained using the AdamW optimizer with a learning rate of 4.5e−5
The beat and chord predictors were trained separately
Because some models already saw MusicCaps during pretraining, the authors created a new fair test set called FMACaps (1,000 music clips from Free Music Archive with AI-generated captions)

Inference Setup

All models generated 10-second audio clips
Used 200 diffusion steps
Classifier-free guidance scale = 3

Inference times on V100 GPU:

Model	Time
Tango	34 sec
MusicGen-M	51 sec
Mustango	76 sec

Objective Evaluation

How Did They Measure Quality?

They used 2 types of metrics:

Audio Quality Metrics

Metric	What it tells us
FD (Fréchet Distance)	Statistical similarity to real music
FAD (Fréchet Audio Distance)	Human-perception-inspired metric
KL (Kullback-Leibler)	Divergence between feature distributions

Controllability Metrics

Measured how well the generated music follows the prompt (especially for beats, chords, tempo, key):

Metric	Meaning
Tempo Bin (TB)	Tempo (bpm) falls in correct bin
TBT	Tempo within bin or neighbor bin
CK / CKD	Correct key (exact or equivalent)
PCM / ECM / CMO / CMOT	Chord match (with various leniencies)
BM	Correct beat count

Objective Results

Audio Quality Findings:

Mustango (even when trained from scratch) performed as well or better than large pretrained models
Mustango had the best FAD, which means better musicality
The augmentation strategy (MusicBench) really works — it’s a solid alternative to large-scale pretraining

Controllability:

Control Type	Who won
Tempo	MusicGen slightly better
Beats	Similar across all models
Key	Mustango (trained on MusicBench) best
Chords	Mustango wins by a large margin (especially on FMACaps)

Mustango excels in Key and Chord control, which are musically important.

Subjective Evaluation

Two Groups Evaluated:

General audience — 48 people in Round 1, 17 in Round 2
Experts — 4 trained musicians per round

They listened to samples and rated:

Metric Name	Meaning
AQ	Audio quality
REL	Relevance to caption
OMQ	Overall musical quality
RC	Rhythm consistency
HC	Harmony and consonance
MCM	Musical Chord Match
MTM	Musical Tempo Match

All ratings used a 7-point scale.

Subjective Results

Round 1:

Mustango from scratch had the best ratings overall
Experts confirmed: Mustango had the best chord match (MCM)
Conclusion: MusicBench works, MuNet helps, Mustango is very controllable

Round 2:

Mustango beat MusicGen and AudioLDM2 in REL (relevance to text)
Similar performance in OMQ, HC, MTM
MusicGen won in RC (rhythm) — slightly better rhythm matching
Mustango won in Chord Matching (MCM)

Is Pretraining Mustango Necessary?

What they tried:
- Used a Tango model that was pre-trained on 1.2 million audio-text pairs (from AudioCaps etc.) and then fine-tuned it on Mustango’s data
What they found:
- It didn’t help Mustango generate better music because the pretraining was on general audio, not music specifically.
- However: It might help for music + environmental sounds, like: “Jazz with thunder in the background”

How well does Mustango really do?

Strengths:
- Great controllability — far better than previous models
- Very good music quality, even though it was trained only on a small-ish public dataset while competing models (like MusicGen) used huge private datasets
Still, other models have some advantages:
- MusicGen produces higher audio quality in some cases and longer musical structure (beyond 10 seconds)

Limitations:

Model works mainly on Western music styles — control info like “chord” and “key” might not apply to Indian or Chinese music
Can only generate 10 seconds of music due to compute limits
Not yet optimized for long-form pieces (verse-chorus etc.)

Noise2Music: Text-conditioned Music Generation with Diffusion Models

2025-07-25T00:00:00+00:00

6 Mar 2023

1 Summary

Goal: Turn a plain‑language prompt (“a slow lo‑fi guitar ballad for a rainy afternoon”) into a 30‑second, 24 kHz stereo clip.

Approach – Train several diffusion models that run one after another (a cascade). The early stages sketch a coarse spectral “layout”; later stages fill in detail so the final waveform sounds clean and full‑bandwidth.

Why a cascade of diffusion models?

A single diffusion model that jumps straight from noise→high‑fidelity audio would need huge compute and might blur fine structure. Splitting the job lets each stage specialise:

Generator – predicts a low‑resolution latent audio representation conditioned on the text.
Cascader(s) – progressively upsample and refine that latent into the final waveform (16kHz), optionally re‑checking the text each step.
Super‑resolution: A final superresolution cascader is used to generate the 24kHz audio from the 16kHz waveform.
- All models are based on 1D U-Nets

Two options for the intermediate representation:

Spectrogram (log-mel)
Audio with lower fidelity (3.2kHz waveform)

Results:

Generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
The models also ground fine-grained semantics of the prompt.

Data

Text labels for the audio are generated by employing a pair of pretrained deep models:

Use a large language model to generate a large set of generic music descriptive sentences as caption candidates;
Use a pre-trained music-text joint embedding model to score each unlabeled music clip against all the caption candidates and select the captions with the highest similarity score as pseudo labels for the audio clip.
Annotate O(150K) hours of audio sources
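The scoring-and-selection step can be sketched with cosine similarity over embeddings. The vectors below are toy values standing in for the outputs of the joint embedding model, which is not called here; the function name `top_k_captions` and the choice of `k` are assumptions.

```python
import numpy as np

def top_k_captions(audio_emb, caption_embs, captions, k=3):
    """Rank caption candidates by cosine similarity to one audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ a                                # cosine similarity per caption
    best = np.argsort(scores)[::-1][:k]           # indices of the k highest scores
    return [(captions[i], float(scores[i])) for i in best]

captions = ["slow lo-fi guitar", "fast techno beat", "mellow acoustic ballad"]
caption_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy embeddings
audio_emb = np.array([1.0, 0.1])                               # toy audio embedding
for caption, score in top_k_captions(audio_emb, caption_embs, captions, k=2):
    print(f"{score:.3f}  {caption}")
```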

Over the past five years, the recipe for dramatic jumps in sample quality has been simple: bigger datasets + bigger models.

How models ingest “what I want”

Fixed, human‑interpretable vocabularies
- Jukebox encodes each clip as one of ~8 k artist/genre labels extracted from its metadata.
- Mubert maps a user prompt onto a hand‑curated tag set (e.g., “chill”, “EDM”, “focus‑music”).
- Pros: Easy to reason about.
- Cons: Can’t express “dreamy underwater lo‑fi.”
Free‑form natural language embeddings
- AudioGen, MusicLM and Noise2Music feed the raw prompt through a frozen text encoder (e.g., MuLan or a language‑model encoder).
- Pros: Unlimited expressiveness; prompts can describe mood, setting, instrumentation, era, etc.
- Cons: The mapping from prose → sound is learned, not predefined, so training data must be rich.

3 Methods

3.1 Diffusion models in a nutshell

Diffusion models turn pure noise into a data sample by iterative denoising. Two ingredients go in at each step:

Conditioning signal $c$ – here, the text‑prompt embedding.
Noisy input $x_t$ – a corruption of the target waveform at “time” t, where t ∈ [0, 1]. Noise magnitude is set by a schedule $σ_t$.

During training the model $θ$ learns to predict the exact noise vector $ϵ$ that was added:

\[\mathcal{L} \;=\; \mathbb{E}_{x,c,\epsilon,t}\!\bigl[ w_t \,\lVert \theta(x_t, c, t) \;-\; \epsilon \rVert^2 \bigr],\tag{1}\]

where $w_t$ is a hand‑chosen weight (details below).

Choosing the loss weight $w_t$:

Option	Rationale
Simplified ($w_t$ = 1)	Easiest to implement, works well for many tasks.
Sigma‑scaled ($w_t$ = $σ_t^2$)	Weights each step by its noise variance, so noisier timesteps (large $σ_t$) count more.

Noise‑schedule variants

Schedule	Shape	Typical use‑case
Linear	$σ_t$ grows linearly with t	Classic DDPM baseline.
Cosine	Slower rise near t = 0, faster near t = 1	Often yields crisper samples with fewer steps.

Sampling: knobs you can turn

At inference we start with pure noise at t = 1 and march back to t = 0 (“ancestral” or DDPM sampling). Two important dials:

Dial	Symbol	Effect
Stochasticity	$γ ∈ \{0, 1\}$	$γ$ = 0 gives deterministic DDIM‑like steps; $γ$ = 1 keeps full randomness.
Denoising step schedule	$\{t_0, …, t_n\}$	Any partition of [0, 1] works, e.g., 50, 100, or 1 000 steps.

$γ$ : Sets how much fresh Gaussian noise is re‑added at every reverse‑diffusion step.

Classifier‑free guidance (CFG)

To better align outputs with the text prompt, the authors adopt CFG:

During training: Randomly drop the prompt for a subset of samples → model learns both conditional $θ(x_t, c)$ and unconditional $θ(x_t, ·)$.
During sampling: Blend the two predictions:

\[\hat\epsilon \;=\; w\;\theta(x_t, c) \;+\; (1-w)\;\theta(x_t, \cdot), \quad w>1.\]

Larger w tightens adherence to the prompt but risks clipping; the paper counters this with dynamic clipping that scales intermediate values to a safe range.

3.2 Model Architecture — “Efficient U‑Net 1‑D”

Backbone. A 1‑D adaptation of the Efficient U‑Net:

Down / Up blocks: They shrink the audio signal to a smaller size (down‑sampling) or stretch it back up (up‑sampling). Inside each block, the model mixes basic convolutions with attention layers to learn both local details and long‑range relationships.
Combine layer: The combine layer enables a single vector to interact with a sequence of vectors, where the single vector is used to produce a channel-wise scaling and bias.
- A single vector (like the “time‑step” embedding) can turn channels up or down and add a bias, letting the model adapt its behaviour at each diffusion step.
More on Combine Layers
1. Inputs
  - Feature map: a sequence of vectors coming from the convolution/attention stack. Shape: (length, channels).
  - Condition vector $z$: a single 1‑D vector (e.g., the diffusion‑time embedding, or any global conditioning info). Shape: (channels).
2. Learned transform
  - The layer passes $z$ through two tiny neural nets (often single linear layers) to produce
    - Scale s - one value per channel
    - Bias b - one value per channel
3. Channel‑wise modulation
  - For every position $i$ in the sequence and every channel c
    \[\text{output}_{i,c}=s_c \times \text{feature}_{i,c}+b_c\]
  - This is just an affine transform (scale + shift), but the scales/biases change with $z$.
4. Why it matters
  - Lets a global signal (time step, overall prompt embedding, etc.) instantly tweak the local activations without extra convolutions.
  - Makes conditioning cheap and expressive—similar in spirit to FiLM layers used in vision models.
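The channel-wise modulation described above can be written in a few lines of NumPy. This is a sketch under stated assumptions: the two "tiny neural nets" are taken to be single linear layers, the weights are random placeholders for learned parameters, and the function name `combine` and all shapes are illustrative.

```python
import numpy as np

def combine(features, z, W_s, b_s, W_b, b_b):
    """Scale and shift every position of `features` using a single vector z."""
    scale = z @ W_s + b_s                 # one scale per channel
    bias = z @ W_b + b_b                  # one bias per channel
    return features * scale[None, :] + bias[None, :]

rng = np.random.default_rng(0)
length, channels, cond_dim = 100, 8, 8
features = rng.standard_normal((length, channels))   # (length, channels)
z = rng.standard_normal(cond_dim)                     # e.g. a diffusion-time embedding
W_s, W_b = rng.standard_normal((2, cond_dim, channels))
b_s, b_b = np.ones(channels), np.zeros(channels)
print(combine(features, z, W_s, b_s, W_b, b_b).shape)  # (100, 8)
```

Because the scale and bias are computed once and broadcast over the whole sequence, the cost of conditioning does not grow with extra convolutions.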

Four conditioning routes

Noise input $x_t$ (always left‑most in the stack).
Diffusion‑time embedding fed via Combine layers.
Text prompt sequence enters through cross‑attention.
Low‑resolution audio or spectrogram (aligned) can be injected at the U‑Net bottleneck.

3.3 Cascaded Diffusion: three‑stage pipeline

Noise2Music follows the Generator → Cascader → Super‑Resolution recipe.

3.3.1 Waveform Model

Generator

Input: Text prompt
A sequence of vectors derived from the text input is produced and fed into the network as a cross-attention sequence
Outputs: 3.2 kHz waveform

Cascader

Inputs: Conditioned on both the text prompt and the low-fidelity audio generated by the generator model based on the text prompt.
Outputs: 16 kHz waveform
Method:
- The text conditioning takes place via cross attention.
- Low-fidelity audio is upsampled and stacked with $x_t$ and fed into the model.
- The upsampling is done by applying fast Fourier transform (FFT) to the low-fi audio sequence and then applying inverse FFT to obtain the high-fi audio from the low-fi Fourier coefficients.
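The FFT-based upsampling can be sketched as zero-padding the spectrum before inverting it. The code below is a toy version under stated assumptions: it uses a naive O(n²) DFT from the standard library instead of a real FFT, the function names are invented for this example, and `factor` is assumed to be at least 2 (5 would correspond to 3.2 kHz → 16 kHz).

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for tiny toy inputs)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def fft_upsample(x, factor):
    """Upsample x by an integer factor >= 2 via spectral zero-padding."""
    n, m = len(x), len(x) * factor
    spec = dft(x)
    padded = [0j] * m
    pos = (n + 1) // 2  # count of non-negative frequencies below Nyquist
    for f in range(pos):
        padded[f] = spec[f]            # DC and positive frequencies
    for f in range(1, pos):
        padded[m - f] = spec[n - f]    # matching negative frequencies
    if n % 2 == 0:
        # Split the Nyquist bin between its positive and negative slots.
        padded[n // 2] = padded[m - n // 2] = spec[n // 2] / 2
    # Inverse DFT at the new length, normalised by the original length.
    return [sum(padded[g] * cmath.exp(2j * math.pi * g * j / m)
                for g in range(m)).real / n
            for j in range(m)]


low_fi = [math.sin(2 * math.pi * k / 8) for k in range(8)]  # one sine cycle
high_fi = fft_upsample(low_fi, 5)
```

Every fifth sample of `high_fi` reproduces the original low-fi sample, and the samples in between are the band-limited interpolation.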

3.3.2 Spectrogram Model

Generator

Outputs: log-mel spectrogram with 80 channels at 100 frames per second
Pixel values of the log-mel spectrogram are normalized to lie within [−1, 1]

Vocoder

Output: 16kHz audio that is conditioned only on the spectrogram

3.3.3 SUPER-RESOLUTION CASCADER

Generate 24kHz audio from the 16kHz waveform produced by either model.
The 16kHz audio is up-sampled and stacked with $x_t$ as input to the model.
Text conditioning is not used for this model.

3.4 Text Understanding

T5 encoder:

The prompt’s token-level embeddings, without pooling, are fed into cross-attention layers throughout the U-Net.

3.5 Pseudo‑Labeling for Music Data [DATA Creation]

3.5.1 Why pseudo‑labels are needed

High‑quality music + free‑form caption pairs are rare.
Without them, a text‑to‑music model can’t learn subtle descriptors like “laid‑back highway‑driving synthwave.”
Solution: auto‑generate rich captions for millions of unlabeled tracks instead of hand‑annotating them.

3.5.2 Models Used

MuLan: A contrastive model with audio and text encoders that share an embedding space.

Lets you measure “text–audio similarity” with cosine similarity, which enables zero-shot classification.

LaMDA: LLM trained for dialogue.

Used here to write human‑style music descriptions.

3.5.3 Building three caption vocabularies

Name	Size	How it’s made	Purpose / style
LaMDA‑LF	4 M long‑form sentences	Prompt LaMDA with title + artist of 150 000 popular songs → clean & deduplicate.	Conversational, user‑prompt‑like prose.
Rater‑LF	35 333 sentences	Split 10 028 expert captions from MusicCaps into single sentences.	Human‑written, descriptive.
Rater‑SF	23 906 short tags	Collect all short aspect tags from the same raters (mood, genre, instrument, etc.).	Compact, label‑like keywords.

3.5.4 Assigning captions to an unlabeled clip

Segment clip into 10‑s windows → feed each window to MuLan’s audio encoder.
Average those embeddings → one vector per clip.
Encode every caption in the vocabularies with MuLan’s text encoder.
Retrieve the K = 10 closest captions (cosine similarity).
Sample K′ of those 10 (K′ = 3 for the long-form vocabularies, 6 for Rater-SF), with probability proportional to 1 / global_frequency (rare captions get a boost).
- Balances the label distribution and increases diversity.
Store the selected captions as pseudo‑labels for that clip.
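The retrieval-and-sampling steps above can be sketched as follows. This is a toy version, not the authors' code: the embeddings are made-up 2-D vectors standing in for MuLan outputs, all function names are invented here, and drawing the K′ captions without replacement is an assumption.

```python
import math
import random


def mean_embedding(window_embs):
    """Average per-window audio embeddings into one vector per clip."""
    n = len(window_embs)
    return [sum(col) / n for col in zip(*window_embs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pseudo_label(clip_emb, caption_embs, global_freq, k=10, k_prime=3, rng=random):
    """Take the k nearest captions, then sample k_prime with p ∝ 1 / frequency."""
    ranked = sorted(caption_embs,
                    key=lambda c: cosine(clip_emb, caption_embs[c]),
                    reverse=True)
    pool, chosen = ranked[:k], []
    for _ in range(min(k_prime, len(pool))):
        weights = [1.0 / global_freq[c] for c in pool]  # rare captions weigh more
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)  # assumption: sample without replacement
    return chosen


# Toy vocabulary: similarity to the clip decreases as i grows.
caption_embs = {f"cap{i}": [1.0, 0.1 * i] for i in range(12)}
global_freq = {f"cap{i}": i + 1 for i in range(12)}
clip_emb = mean_embedding([[1.0, 0.1], [1.0, -0.1]])
labels = pseudo_label(clip_emb, caption_embs, global_freq, rng=random.Random(0))
```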

Net effect: each 30‑s clip can receive up to 12 pseudo‑labels (3 from LaMDA‑LF, 3 from Rater‑LF, 6 from Rater‑SF) in addition to any inherent metadata.

3.5.5 Warm‑up experiment: MuLaMCap

Source: AudioSet’s music subtree — 388 262 train clips + 4 497 test clips (each 10 s).
Labels per clip: 3 from LaMDA-LF + 3 from Rater-LF + 6 from Rater-SF = 12.
Purpose: sanity‑check the pipeline before scaling to millions of tracks.

3.6 Training‑Data Mining at Scale [DATA]

Raw audio pool
- 6.8 million full‑length music tracks are collected.
- Each track is chopped into six non‑overlapping 30‑second clips → ~340 000 hours total.
Sample rates
- 24 kHz clips train the super‑resolution stage (it must output 24 kHz).
- 16 kHz versions of the same clips train every other model stage (saves compute).

Text labels attached to every clip

Label source	Count per clip	What it adds
Song title	1	“Hotel California”
Named‑entity tags	variable	Genre, artist, instrument, year, etc.
LaMDA‑LF pseudo‑labels	3	Rich sentences like “slow acoustic ballad for a summer evening.”
Rater‑SF pseudo‑labels	6	Compact tags such as “laid‑back,” “highway‑driving,” “lo‑fi beats.”

Why skip Rater‑LF?

Those captions appear in the MusicCaps evaluation set; excluding them avoids train‑test leakage.

Why mix “objective” and “subjective” labels?
- Objective tags (genre, artist) nail down obvious metadata.
- Pseudo‑labels add nuances—mood, activity, fine‑grained compositional hints.
- Together they give the model both facts and feelings to learn from.
Quality anchor inside the noisy sea
- The authors add ≈ 300 hours of internally curated, attribution‑free tracks.
- Each track’s rich metadata is concatenated into a single prompt string.
- Acts as a clean, high‑signal subset to stabilize training.

Net result: a gigantic, diverse corpus where every 30-s clip carries 10+ text descriptors ranging from “objective metadata” to “subjective vibes,” providing the breadth of supervision a text-to-music diffusion model needs.

4 Experiments and Results

4.1 Model training details

Models trained: 4 separate 1‑D U‑Nets

Waveform Generator (3.2 kHz)
Waveform Cascader (16 kHz)
Spectrogram Generator (80 × 100 fps log‑mel)
Spectrogram Vocoder (16 kHz)

Final 24 kHz “super‑res” U‑Net is a light extension of the cascader.

Loss Weighting

σ²‑weighted MSE for spectrogram generator (critical for convergence)
- Weights each timestep’s loss by $\sigma_t^2$, the noise variance at that step, so the loss counts more where the input is noisier.
Either σ² or constant 1 for others

Note: All the models, with the exception of the vocoder, are trained on audio-text pairs, while the vocoder is only trained on audio.

Text batch per clip

Long-form descriptions (3 items) come from the LaMDA-LF vocabulary and are stored as three separate strings.
Short tags and metadata are concatenated, then chunked to fit:
1. All of them are concatenated into one string.
2. If this string exceeds the token limit fixed in Table 2 (say 64 tokens), it is split into equal-length chunks so that each chunk fits the limit.
3. Every chunk counts as an additional candidate caption.
Total = 3 long sentences + 1–2 short chunks (depending on length).

During training the loader randomly picks one element from that list and feeds it to the U‑Net as the text conditioning for this audio example.

So across epochs the network sees the same audio paired sometimes with a rich prose description, other times with a terse tag bundle—helping it learn both broad language and concise labels.
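A minimal sketch of how such a candidate list could be built and sampled. Whitespace-separated words stand in for real tokenizer tokens, the 64-token limit is the illustrative value from the notes above, and `caption_candidates` is a name invented for this example.

```python
import random


def caption_candidates(long_form, short_items, token_limit=64):
    """Three long-form strings plus equal-length chunks of the joined tags."""
    tokens = " ".join(short_items).split()  # stand-in for real tokenization
    if not tokens:
        return list(long_form)
    n_chunks = -(-len(tokens) // token_limit)  # ceiling division
    size = -(-len(tokens) // n_chunks)         # equal chunks that fit the limit
    chunks = [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
    return list(long_form) + chunks


long_form = ["caption a", "caption b", "caption c"]
short_items = [f"tag{i}" for i in range(100)]  # 100 tokens -> two chunks of 50
candidates = caption_candidates(long_form, short_items)

# One candidate is drawn per training example.
conditioning = random.choice(candidates)
```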

More Details

Optimizer:

Adam, $β_1$ = 0.9, $β_2$ = 0.999

LR schedule:

Cosine LR Scheduler, Max LR: 1 × 10⁻⁴
End Point: Step 2.5 M, Warm-Up steps: 10 k
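To make the schedule concrete, here is one common way to combine warm-up with cosine decay. The notes only give the peak rate, warm-up length, and end point; the linear warm-up from zero and the decay to exactly zero are assumptions of this sketch.

```python
import math

MAX_LR = 1e-4
WARMUP_STEPS = 10_000
END_STEP = 2_500_000


def lr_at(step):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (END_STEP - WARMUP_STEPS))
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))


halfway_warmup = lr_at(5_000)
peak = lr_at(WARMUP_STEPS)
final = lr_at(END_STEP)
```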

Exponential Moving Average (EMA)

Individual parameter updates from each mini‑batch are noisy. Averaging them over time gives a smoother, typically better‑generalising set of weights for inference.

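\[\bar{\theta}_{t} = (1-\alpha)\,\bar{\theta}_{t-1} + \alpha\,\theta_{t}\]

Here $\theta_t$ are the online weights after step $t$ and $\bar{\theta}_t$ is their running average. The decay factor is d = 1 − α = 0.9999; the averaged weights are the ones used at inference time.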
Keep a second weight copy while training (EMA)
Snapshot those EMA weights to disk
At inference time → Load only the EMA copy (ignore the noisy “online” weights).
Why this works
- Smooths out the step-to-step noise in the weights.
- EMA weights have seen every past setting of the model, so outliers cancel out.
- Empirically they yield crisper audio and fewer artifacts, especially for diffusion and GAN‑style generators.
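The bookkeeping is small enough to sketch in plain Python. Weights are flat lists of floats here, and `ema_update` is a name invented for this example; real training code would apply the same rule to every tensor in the model.

```python
def ema_update(ema, online, decay=0.9999):
    """One EMA step: keep `decay` of the old average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, online)]


# Toy run with a small decay so the effect is visible after two steps.
ema = [0.0]
for online in ([1.0], [1.0]):
    ema = ema_update(ema, online, decay=0.5)
```

With decay 0.9999 the average moves far more slowly, which is exactly what damps the mini-batch noise.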

Batch Size

4096 for the super-res cascader (since it is lightweight)
2048 for rest

CFG During Training

Hide prompt for 10 % of samples (cross‑attention outputs zeroed)
Teaches model to handle both conditional and unconditional cases, enabling CFG at inference.
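A rough sketch of the prompt-hiding step, assuming it is decided independently per training sample. The function name and the nested-list representation of the cross-attention output are made up for this example.

```python
import random


def maybe_drop_prompt(cross_attn_out, p_drop=0.1, rng=random):
    """With probability p_drop, zero the cross-attention output for a sample."""
    if rng.random() < p_drop:
        return [[0.0] * len(row) for row in cross_attn_out]
    return cross_attn_out


# Count how often the prompt is hidden over many simulated samples.
rng = random.Random(0)
sample = [[1.0, 2.0]]
dropped = sum(maybe_drop_prompt(sample, rng=rng) == [[0.0, 0.0]]
              for _ in range(10_000))
```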

Sequence length seen by each model

Generators: full 30 s clip
Cascader & vocoder: random 3–4 s windows
- Cascader/vocoder don’t use self‑attention → can train on snippets, saving memory.

Data augmentations (for cascader & vocoder)

Randomly corrupt the conditioning low-fidelity audio or the spectrogram input by applying diffusion noise

Random diffusion time is chosen within [0, $t_{max}$] and applied to the intermediate representation of the audio, i.e., the upsampled low-fi audio or the spectrogram.
Cascader $t_{max}$: 0.5
Vocoder and super-res $t_{max}$: 1.0

Blur Augmentation of conditioning input

For the cascader model, a 1D Gaussian blur kernel of size 10 is used, with a standard deviation ranging from 0.1 to 5.0.
For the vocoder model, a 2D 5x5 blur kernel is applied with the standard deviation ranging from 0.2 to 1.0.
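The 1D case can be sketched as below. Only the kernel size and the standard-deviation range come from the notes; sampling the standard deviation uniformly, centring the kernel, and zero-padding at the edges are assumptions made for this illustration.

```python
import math
import random


def gaussian_kernel(size, sigma):
    """Normalised 1D Gaussian kernel centred on the window."""
    center = (size - 1) / 2.0
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, kernel):
    """Same-length convolution with zero padding at the edges."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out


sigma = random.Random(0).uniform(0.1, 5.0)  # assumed uniform over the range
kernel = gaussian_kernel(10, sigma)
blurred = blur([1.0] * 30, kernel)
```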

4.2 Model inference and serving

4.2.1 Model Inference

Three knobs you can turn
- Denoising schedule – how you spread the diffusion steps along time t∈[0, 1].
- Stochasticity γ – 0 = deterministic (DDIM‑style), 1 = full randomness (DDPM‑style).
- CFG scale w – how strongly the result must obey the text prompt (larger w → tighter match, but riskier artefacts).
What “denoising schedule” really means
- Imagine you have N small time jumps $\delta_1, \dots, \delta_N$ that must add up to 1.
- Front‑heavy: many tiny steps right at the start (when audio is still noisy).
- Uniform: equal spacing throughout.
- Back‑heavy: more steps near the end (when audio is already fairly clean).
- Given a fixed budget of steps, choosing where to spend them is a trade‑off between global structure (benefits from early steps) and fine detail (benefits from late steps).
Hyper‑parameter sets actually used
- Each of the four U‑Nets (generator, cascader, spectrogram generator, vocoder) gets its own trio of settings (schedule type, γ, CFG scale).
- Those exact numbers live in Table 3 of the paper; the principle is the same: early models lean slightly “front‑heavy” and higher γ for creativity, while later refiners go “back‑heavy” and lower γ for polish.
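One way to picture the three schedule shapes is to generate step sizes from a geometric series and normalise them to sum to 1. The `growth` parameter and both function names are hypothetical and are not the parameterisation used in the paper. "Front-heavy" follows the wording above, meaning many tiny steps at the start.

```python
def step_sizes(n_steps, growth=1.0):
    """Step sizes proportional to growth**i, normalised to sum to 1."""
    raw = [growth ** i for i in range(n_steps)]
    total = sum(raw)
    return [r / total for r in raw]


def time_grid(deltas):
    """Diffusion times visited when denoising from t = 1 down to t = 0."""
    t, grid = 1.0, [1.0]
    for d in deltas:
        t -= d
        grid.append(t)
    return grid


uniform = step_sizes(10)                    # equal spacing
front_heavy = step_sizes(10, growth=1.5)    # tiny steps first, larger later
back_heavy = step_sizes(10, growth=1 / 1.5) # large steps first, tiny at the end
```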

4.3 Evaluation

Parameter Selection for the Models
- Team used a handful of private “dev prompts,” listened, and chose the versions that subjectively sounded best within their compute budget.
- All metrics are computed on the 16 kHz outputs straight from the cascader/vocoder — the 24 kHz super-resolution stage is skipped during evaluation.
4.3.2 Evaluation Metrics
1. Fréchet Audio Distance (FAD): same idea as FID for images. Three encoders give three flavours:
  - VGGish → general sonic quality.
  - Trill → vocal-centric quality.
  - MuLan audio encoder → high-level musical semantics.
2. MuLan similarity: cosine similarity in the MuLan embedding space. Used two ways:
  - Text ↔ generated audio (how well the clip matches its prompt).
  - Ground-truth audio ↔ generated audio.
  - Randomly shuffled pairs give a “chance level” baseline.
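For reference, both metrics reduce to short formulas. These are the standard definitions, not anything specific to Noise2Music, and the symbols are chosen here for illustration. With $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ the mean and covariance of the encoder embeddings for reference and generated audio, the Fréchet distance is

\[\text{FAD} = \lVert \mu_r - \mu_g \rVert^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)\]

and the similarity between two MuLan embeddings $u$ and $v$ is the cosine

\[\text{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert\,\lVert v \rVert}\]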
Evaluation datasets
- MagnaTagATune (MTAT) — 21 638 clips with up to 188 tag labels concatenated into a single prompt; model generates a full 29-s clip.
- AudioSet-Music-Eval — 1 482 ten-second clips; tags concatenated; model generates 30 s, middle 10 s are scored.
- MusicCaps — 5.5 K ten-second clips with rater-written free-form captions; model generates 30 s, middle 10 s are scored.

4.4 Results

4.5 Inference-time ablations

Classifier-free guidance (CFG) scale
- Sweet spot around 5–10; beyond that, FAD rises and audio gets over-compressed or distorted.
- Generator’s CFG weight matters more than its denoising schedule; for the cascader it’s the opposite.
Denoising schedule shape
- Cascader is very sensitive: front-heavy schedules hurt quality; back-heavy gives best FAD & similarity.
- Generator is less sensitive; uniform vs. mildly front-heavy are both acceptable.
Step count vs. quality (cost curve)
- More steps in the cascader/vocoder nearly always help; extra steps in the generator give diminishing returns after a point.
- Plot shows the elbow where doubling steps adds little perceptual gain — useful for setting latency targets.

5 More

Spectrogram vs. waveform cascades

Spectrogram path
- Much cheaper to train and serve because the input sequence is short.
- Naturally keeps high-frequency detail that a 3.2 kHz low-fi waveform cannot contain.
- Down‑side: intermediate representations are hard for engineers to interpret/debug.
Waveform path
- Every intermediate output is an actual audio snippet, which makes debugging and hyper‑parameter tuning easier.
- Training/serving is costlier and sequence length limits scalability to very long clips.

Open research directions

Better interpretability and controllability.
Stronger text–audio alignment (fewer “off‑prompt” generations).
Lower training and inference cost.
Longer outputs, plus tasks such as music in‑painting or style transfer—analogous to image editing with diffusion “paint‑over” techniques.

Text-to-Audio-Models

2025-07-25T00:00:00+00:00

My Journey into Text to Audio Models

I am studying text-to-audio models, with a focus on music generation models.

Text-To-Music Models

1. Mustango: Toward Controllable Text-to-Music Generation

Mustango is a diffusion-based text-to-music model that enables structured control over chords, beats, tempo, and key directly from natural-language prompts.

MusicBench — Dataset Pipeline

Seed corpus: 5521 MusicCaps clips (10 s + captions).
Control sentences: append 0–4 beat/chord/key/tempo lines
Paraphrase: ChatGPT rephrasing
Filter: drop “poor‑quality/low‑fidelity” captions
11× augment: ±1‑3 semitones, ±5–25 % speed, crescendo/decrescendo volume → ≈37 k new samples.

Mustango Model

Latent space: AudioLDM VAE → latent z.
MuNet denoiser: UNet + hierarchical cross‑attention.
- Inputs: FLAN‑T5 text emb; beat & chord encodings. (Beat encoder and Chord encoder)
Inference helpers:
- DeBERTa beat predictor (meter + intervals).
- FLAN‑T5 chord predictor (time‑stamped chords).
Output: 10‑s waveform obeying tempo, key, chords, beats when provided; graceful fallback when not.

2. Noise2Music: Text-conditioned Music Generation with Diffusion Models

Generate a 30-second, 24 kHz stereo music clip from a plain-language prompt.

Training‑Data Pipeline

Raw audio pool: 6.8 M full‑length tracks → chopped into 30 s clips (~340 k h).
Caption vocabularies (built offline)
- LaMDA‑LF – 4M rich sentences (LLM‑generated).
- Rater‑LF / SF – 35k long + 24k short human sentences/tags from MusicCaps.
Embedding space scoring: Encode every clip (MuLan‑audio) & every caption (MuLan‑text).
Pseudo‑labelling: For each clip pick top‑10 captions by cosine sim → sample 3 low‑frequency ones from each vocab (bias toward rarer labels).
Extra metadata: Append title, artist, genre, year, instrument tags.
Quality anchor: Inject ~300 h curated, attribution‑free tracks with rich manual metadata.
Dual‑rate storage: Keep 24 kHz (for super‑res stage) + 16 kHz copies (for the rest).
Final payload: Every 30 s clip carries 10+ text descriptors, from objective tags to subjective vibes.

Model Stack (three‑stage diffusion cascade)

Stage	I/O	Role	Key details
Waveform Generator	Text → 3.2 kHz audio	Sketch global structure.	1‑D Efficient‑U‑Net; text fed via cross‑attention; CFG during sampling.
Waveform Cascader	Text + 3.2 kHz → 16 kHz audio	Upsample & refine.	Receives up‑sampled low‑fi audio + prompt; blur/noise augmentation during training.
Super‑Res Cascader	16 kHz → 24 kHz audio	Restore full bandwidth.	No text conditioning; lightweight U‑Net.

Spectrogram path (alt): parallel generator + vocoder pair that works in log‑mel space; cheaper but less interpretable.

3. Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion

A convolutional VAE compresses and reconstructs long stereo audio efficiently.
Latent diffusion denoises in the VAE’s compressed latent space instead of on raw audio.
Timing embeddings let the user set the output duration.

Dataset Construction

Collect 806284 stereo tracks (≈ 19500 h) from AudioSparx.
Pre‑process audio
- Resample to 44.1 kHz, stereo.
- Slice / pad each file to a fixed 95.1 s window (4 194 304 samples).
Build text prompts from metadata on‑the‑fly
- Randomly sample descriptors (genre, mood, BPM, instruments).
- Emit either free‑form or structured text strings.
Final sets
- Same corpus trains the VAE, CLAP (text encoder), and latent diffusion U‑Net.

Model Pipeline

Stage	Key points
1. VAE	32× compression
2. Text encoder (CLAP, the authors’ own)	trained from scratch
3. Timing embeddings	seconds_start, seconds_total; concatenated with text features
4. Latent U-Net diffusion	907M params
5. Inference	DPMSolver++

Outcome: 44.1 kHz stereo audio, up to 95 s, fast (latent) diffusion with precise duration control via timing conditioning.

More blog posts coming soon as I continue my learning journey…

Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion

2025-07-25T00:00:00+00:00

Evans, Zach, C. J. Carr, Josiah Taylor, Scott H. Hawley, and Jordi Pons. “Fast Timing-Conditioned Latent Audio Diffusion.” arXiv:2402.04825. Preprint, arXiv, May 13, 2024. https://doi.org/10.48550/arXiv.2402.04825.

Model-Code - Metrics - Demo

Summary
Related Work
Architecture
Training
Methodology
Experiments
Conclusions

1 Summary

The Problem with Audio Diffusion

Raw audio is massive in size and complexity. That means:

Training is slow and memory-intensive.
Inference (actually generating audio) is even slower, especially for long clips or stereo output.
Another practical issue: most audio diffusion models only generate fixed-length clips.
- A model trained on 30-second chunks will always give you exactly 30 seconds — even when your prompt suggests something shorter or longer. This is unnatural, especially for:
  - Music, which has structure (like intros and outros)
  - Sound effects, which can be quick or long

Stable Audio

A convolutional VAE that efficiently compresses and reconstructs long stereo audio.
It uses latent diffusion — meaning it learns to denoise in a compressed, lower-dimensional space (the latent space), not on raw audio. This is much faster and allows for longer generation.
It adds timing embeddings — so you can tell the model how long the output should be.

That combo allows it to:

Generate up to 95 seconds of full-quality audio in just 8 seconds
Offer precise control over the duration and content
Render stereo audio at 44.1kHz — the same sample rate used in CDs

Rethinking Audio Evaluation

Fréchet Distance with OpenL3: Measures how realistic the audio sounds by comparing it to real-world audio using perceptual embeddings.
KL Divergence: Quantifies how well the semantic content of the generated audio matches a reference.
CLAP Score: Assesses how well the generated audio aligns with the text prompt.

They also go a step further by assessing:

Musical structure (does it feel like a song, or just a loop?)
Stereo correctness (do the left and right channels make sense?)
Human perception (via qualitative studies)

2.1 Autoregressive Models: Great Sound, Painfully Slow

What they are:

Autoregressive models generate audio one step (or token) at a time. Think of it like writing a sentence word by word — each decision depends on what came before.

Examples:

WaveNet (2016): Generated audio from scratch using raw waveform values — high-quality but painfully slow.
Jukebox (2020): Compressed music into multi-scale latent tokens, then used transformers to model them.
MusicLM / MusicGen / AudioLM: Modern versions that use text prompts instead of artist/genre metadata and work on compressed audio tokens.

Problem: Even with compression, these models are slow to generate audio because of their step-by-step nature.

2.2 Non-Autoregressive Models: Faster, Still Limited

What they are:

These models try to speed up generation by skipping the step-by-step process.

Examples:

Parallel WaveNet and GAN-based methods: Early non-autoregressive approaches, the latter relying on adversarial training.
VampNet / MAGNeT / StemGen: Use masked modeling (like BERT) or other tricks to avoid sequential generation.
Flow-matching models: Try to morph noise into data in a more direct way.

Problem: Many are limited in the duration they can handle (up to 20 seconds) or don’t focus on structured or stereo music.

2.3 Diffusion Models

End-to-End Diffusion: These generate raw waveforms directly (e.g., CRASH, DAG, Noise2Music). Powerful, but training is costly and slow.
Spectrogram Diffusion: Generate images of sound (spectrograms) and convert them back to waveforms.
1. Riffusion: Generated audio by tweaking Stable Diffusion for spectrograms.
2. Needs a separate vocoder (like HiFi-GAN) to reconstruct audio.
Latent Diffusion (Stable Audio’s Approach)
1. Moûsai, AudioLDM, JEN-1: All use VAE-based latents to make the job easier and faster.
2. AudioLDM: Generates spectrograms first, then inverts to audio.
3. Moûsai: Diffuses latents and directly decodes into audio.
4. JEN-1: A multitask model with dimensionality-reduced latents.
5. Stable Audio’s Differentiator: Also uses latent diffusion, but focuses on 44.1kHz stereo audio, supports up to 95 seconds, and introduces timing conditioning — something none of these others do.

2.4 High Sampling Rate & Stereo Audio

Most past models:

Work in mono or low sample rates (16–24kHz).
Or generate short clips.

Only a few (e.g., Moûsai, JEN-1) can handle stereo and high quality — but not efficiently, and not with variable length.

Stable Audio’s edge: One of the first models to combine:

44.1kHz stereo
Up to 95 seconds
Variable-length control via timing conditioning

2.5 Timing Conditioning

Introduced by Jukebox, which used it in an autoregressive way (e.g., where in the song a chunk came from).

Stable Audio’s innovation: Brings timing embeddings into the world of latent diffusion — a first. These embeddings help the model control duration of output precisely, which is crucial for realistic music or sound effects.

2.6 Evaluation Metrics

Problem: Most metrics (e.g., from Kilgour et al.) were designed for 16kHz, mono, short-form audio.

Stable Audio introduces:

OpenL3 Fréchet Distance: Like FID for music — checks realism.
KL divergence for semantic alignment: Checks if generated audio matches the idea.
CLAP score: Measures text-to-audio alignment.
Qualitative assessments: Musicality, stereo image, structure.

2.7 Multitask Generation

Some recent models (e.g., JEN-1) try to generate speech + music + sound in one system.

Stable Audio’s focus: Just music and sound effects — not speech — for better domain-specific quality.

3 Architecture

At a high level, it consists of three core components:

A Variational Autoencoder (VAE) to compress and decompress the audio
A Conditioning system using text and timing embeddings
A U-Net-based diffusion model that learns how to turn noise into music — fast and controllably

Let’s walk through each of them.

3.1 Variational Autoencoder (VAE): Compressing Audio for Fast Diffusion

Training and sampling on raw 44.1kHz stereo audio would be painfully slow and memory-intensive. That’s why Stable Audio uses a VAE to shrink the raw audio into a learnable latent space — a compact, lossy representation that still retains musical essence.

Key Features:

Input: Stereo audio (2 channels) of arbitrary length.
Output: Latent tensor with 64 channels and 1/1024th the original length. Going from 2 channels at full length to 64 channels at 1/1024 length is a 2 × 1024 / 64 = 32× compression in size.
Architecture: Based on the Descript Audio Codec, but without quantization.
Activations: Uses Snake activations, which give better reconstruction at high compression rates than alternatives such as EnCodec, at the cost of higher VRAM use.
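The shape bookkeeping behind these numbers can be checked with a few lines of arithmetic. The function name is invented for this example; the 4,194,304-sample window is the training window length quoted elsewhere in these notes.

```python
def latent_shape(num_samples, in_channels=2, latent_channels=64, downsample=1024):
    """Latent channels, latent length, and overall compression ratio."""
    latent_len = num_samples // downsample
    ratio = (in_channels * num_samples) / (latent_channels * latent_len)
    return latent_channels, latent_len, ratio


# One 95.1 s training window at 44.1 kHz.
channels, length, ratio = latent_shape(4_194_304)
```

So the diffusion model works on a sequence of 4,096 latent frames instead of over four million raw samples per channel.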

This design allows the model to handle long-form stereo audio efficiently, which would otherwise be computationally infeasible.

🐍 What is Snake Activation?

Snake activation is a type of activation function introduced to help neural networks better represent periodic and high‑frequency patterns — like those commonly found in audio or waveforms.

The function is defined as:

$$\text{Snake}(x) \;=\; x \;+\; \frac{1}{\alpha}\,\sin^2(\alpha x)$$

x — input value
α — learnable parameter controlling the sinusoid’s frequency

First proposed by Ziyin et al., 2020, the layer excels on continuous signals (e.g. audio).

Why Use Snake?

Standard activations don’t natively capture oscillations.
Audio is highly periodic and rich in high‑frequency detail.
Snake helps models learn and preserve those details during encoding/decoding.

Intuition

$$x + \frac{1}{\alpha}\sin^2(\alpha x)$$

Linear term x → stable gradients.
Sinusoid → adaptive “wiggle”.
α → learns the optimal frequency per neuron.
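The formula translates directly into code. This scalar sketch treats α as a fixed argument; in a real network it would be a learnable parameter, and the per-channel layout is left out here.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: identity plus a scaled squared sine."""
    return x + (math.sin(alpha * x) ** 2) / alpha


# The periodic term is never negative, so snake(x) >= x everywhere.
samples = [snake(0.1 * i - 3.0) for i in range(61)]
peak_offset = snake(math.pi / 2) - math.pi / 2  # sin^2 reaches its maximum here
```

Because the derivative is 1 + sin(2αx), which never drops below zero, the function stays non-decreasing while still adding a periodic ripple.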

Comparison to Other Activations

Activation	Pros	Cons
ReLU	Simple, fast	Cannot model periodic signals
GELU	Smooth gradients	Still not ideal for oscillations
Sinusoidal (SIREN)	Excellent for periodic data	Fixed frequency, harder to train
Snake	Learnable periodicity + linear term	Slightly higher compute / VRAM

3.2 Conditioning: Telling the Model What and How Long to Generate

To steer the model’s output, Stable Audio uses two kinds of conditioning signals: Text prompts and Timing embeddings.

📝 Text Encoder: CLAP to the Rescue

The team uses a CLAP-based encoder — a contrastive language-audio pretraining model.
It’s trained from scratch on their own dataset (not just the open-source CLAP).
Instead of using the final layer (as many do), they use the next-to-last hidden layer, inspired by practices in visual-language models like CLIP and Stable Diffusion. This layer tends to preserve more useful context for generation.
These text embeddings are passed to the U-Net via cross-attention layers.

Why not T5 or MuLan?

Because CLAP learns audio-text relationships, making it more suitable for describing sound-rich prompts like “ambient rainforest with tribal drums”.

🕒 Timing Embeddings: Fine-Grained Control Over Duration

Stable Audio pioneers the idea of timing-aware diffusion for audio. Here’s how it works:

From each training clip, two timing values are recorded:
- seconds_start: Where the chunk begins in the original audio
- seconds_total: The full duration of the original audio

📌 Example:

If you sample a 95-sec chunk from a 180-sec track starting at 14s:

seconds_start = 14
seconds_total = 180

These are then turned into learned per-second embeddings, and concatenated with the text features. They are fed into the model via cross-attention.

During inference, you can set:

seconds_start = 0, seconds_total = 30 to get a 30-sec output
The remaining time (e.g. 65 sec) is padded with silence in the latent space

💡 Why this matters:

Supports variable-length generation
Eliminates hardcoded clip lengths
Allows users to request specific durations

And yes — silence padding is easy to trim afterward.
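A small sketch of the two timing values and of the post-hoc trim. Rounding down to whole seconds is an assumption made here because the embeddings are described as per-second. The helper names are invented, and audio is a plain mono list for brevity.

```python
SAMPLE_RATE = 44_100


def timing_values(chunk_start_sample, track_num_samples):
    """Conditioning values for a training chunk, in whole seconds."""
    return {
        "seconds_start": chunk_start_sample // SAMPLE_RATE,
        "seconds_total": track_num_samples // SAMPLE_RATE,
    }


def trim_to_requested(audio, seconds_total):
    """Drop the trailing silence after the requested duration."""
    return audio[: seconds_total * SAMPLE_RATE]


# Training example from above: chunk starts at 14 s in a 180 s track.
cond = timing_values(14 * SAMPLE_RATE, 180 * SAMPLE_RATE)

# Inference example: ask for 30 s, then trim a longer generated buffer.
generated = [0.0] * (40 * SAMPLE_RATE)
clip = trim_to_requested(generated, 30)
```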

3.3 Diffusion Model: The Brain Behind the Music

The actual denoising (i.e. generation) happens in a U-Net diffusion model with 907M parameters. It’s inspired by Moûsai and tailored to scale up with long latents.

U-Net Design

4 Levels of encoder-decoder blocks
Downsampling factors: 1×, 2×, 2×, 4× (i.e. progressively compress along length)
Channel sizes: 1024, 1024, 1024, 1280
Skip connections between encoder and decoder layers maintain resolution-specific features

Inside Each Block

2 Conv residual layers
1 to 3 attention layers:
- Self-attention
- Cross-attention for text + timing
Bottleneck block between encoder and decoder with 1280 channels
Fast attention kernels (from Dao et al., 2022) to optimize memory and speed

Conditioning Layers

FiLM (Feature-wise Linear Modulation) layers inject timestep noise level info (i.e. how noisy the latent currently is)
Cross-attention layers inject text + timing information

🎞️ FiLM (Feature‑wise Linear Modulation)

FiLM, introduced by Perez et al., 2017, lets a neural network adapt its internal features using an external input — e.g. text, labels, or (for diffusion models) the timestep.

The Core Idea

Given a feature map $F \in \mathbb{R}^{C \times H \times W}$ and a conditioning vector $c$, FiLM learns per‑channel scale & shift:

$$\text{FiLM}(F;\gamma,\beta) \;=\; \gamma(c)\,F \;+\; \beta(c)$$

$\gamma(c)$ — MLP outputs channel‑wise scales
$\beta(c)$ — MLP outputs channel‑wise shifts

In Diffusion Models

The timestep $t$ is embedded, passed through MLPs to get $\gamma(t)$ and $\beta(t)$, then applied:

$$\text{FiLM}(x) = \gamma(t)\,x + \beta(t)$$

Effect: The network “knows” how noisy the input is and modulates its features accordingly: coarse structure at high noise levels, fine detail at low noise levels.

Why Not Just Concatenate the Timestep?

More expressive — can amplify or suppress specific channels per step.
Modular — injects conditioning exactly where needed.
Widely adopted — Imagen, Muse, Latent Diffusion, etc.

3.4 Inference: Fast, Controlled Sampling

During inference, Stable Audio uses:

DPMSolver++: A fast, high-quality diffusion sampler
Classifier-free guidance (CFG): Amplifies the conditioning signal (scale = 6)
100 diffusion steps: Chosen as a balance between speed and audio quality (details in Appendix A)
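Classifier-free guidance combines a conditional and an unconditional prediction at every sampling step. The sketch below uses one common convention, where a scale of 1 reproduces the conditional prediction. Whether Stable Audio's scale of 6 follows exactly this convention is an assumption, and `cfg` is a name made up for the example.

```python
def cfg(uncond, cond, scale=6.0):
    """Push the prediction away from the unconditional one, toward the prompt."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy model output without the prompt
cond = [1.0, 3.0]    # toy model output with the prompt
guided = cfg(uncond, cond)
```

Larger scales exaggerate the difference the prompt makes, which tightens prompt adherence but can introduce artifacts.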

💡 The final audio:

Can be up to 95 sec
Will contain silence after your specified seconds_total
Silence can be trimmed post-hoc — works reliably due to strong timing embeddings (as validated in Section 6.3)

⚡ DPMSolver++ (Fast Diffusion Sampler)

DPMSolver++ (a solver for diffusion probabilistic models, or DPMs) is a fast and accurate sampler for diffusion models, introduced by Lu et al., 2022.

Why Sampling Matters

Diffusion starts with pure noise and denoises over T steps.
Each step = one forward pass → speed bottleneck.
Vanilla DDPM needs 1000+ steps; DPMSolver++ can deliver high‑quality samples in ≈ 15 – 100 steps.

What Makes DPMSolver++ Special?

ODE-based formulation: treats sampling as solving the diffusion ODE instead of simulating a long stochastic chain.
Higher‑order solvers — 2nd / 3rd‑order integration for accuracy at large step sizes.
Explicit update rules — maintain the diffusion process’s statistical properties.
Outperforms DDIM, PLMS, etc., at similar step counts.

Practical Upshot

Swap in DPMSolver++ → ~10× faster inference with negligible (or no) loss in perceptual quality.

4 Training

4.1 Dataset: The Backbone

Stable Audio is trained on a massive dataset of 806,284 audio files totaling 19,500 hours from AudioSparx, a stock music provider.

Dataset Breakdown:

Music: 66% of the files (or 94% of total audio hours)
Sound effects: 25% of files (5% of hours)
Instrument stems: 9% of files (1% of hours)

Each file comes with rich text metadata, including:

Descriptions (e.g., “epic orchestral cinematic rise”)
BPM
Genre
Mood
Instrument labels

📌 The dataset is public for consultation — a win for transparency and reproducibility.

4.2 Training the VAE: Compressing Without Losing Musicality

The VAE (used to compress audio into latents) was trained on 16 A100 GPUs using automatic mixed precision (AMP) for 1.1 million steps.

AMP

What is Automatic Mixed Precision (AMP)?

AMP is a technique that allows deep learning models to use both 16-bit (float16) and 32-bit (float32) floating-point numbers during training — automatically.

Traditionally, models are trained in float32 precision (a.k.a. FP32), which is precise but:

Slower to compute
Uses more GPU memory

With AMP:

Some operations (like matrix multiplications) are done in float16 (FP16) — faster and smaller
Others (like loss computation or gradient updates) stay in float32 — more stable and accurate

The “automatic” part means you don’t need to manually specify which ops use which precision — your framework (like PyTorch or TensorFlow) figures it out for you.

Pros:

Faster training: On GPUs like NVIDIA A100s or V100s, FP16 operations are 2–8× faster than FP32.
Lower memory usage: FP16 uses half the memory, so you can train larger models or bigger batches.
Same or similar accuracy: Thanks to dynamic loss scaling and smart casting, AMP usually retains almost all the performance of full-precision training.

Challenges:

FP16 has a narrower range of values (can underflow or overflow), which may cause instability if used naively.
That’s why AMP keeps sensitive operations in FP32, like:
- Loss calculation
- Gradient accumulation
- Batch norm updates
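The underflow problem, and why loss scaling helps, can be shown with the standard library alone. Python's `struct` module supports IEEE half precision through the `"e"` format. The gradient value and the scale factor below are arbitrary illustrative numbers, not settings from the paper.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8                          # a tiny gradient value
naive = to_fp16(grad)                # too small for FP16: becomes zero

scale = 65536.0                      # hypothetical loss-scale factor
scaled = to_fp16(grad * scale)       # now inside the FP16 range
recovered = scaled / scale           # unscale in higher precision
```

Scaling the loss before the backward pass and unscaling afterwards keeps small gradients from vanishing, which is what dynamic loss scaling automates.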

Strategy

Phase 1: Train both encoder and decoder for 460,000 steps.
Phase 2: Freeze the encoder, fine-tune the decoder for 640,000 more steps — this improves reconstruction fidelity without changing latent space.

Loss Functions

They used a carefully crafted loss mix focused on stereo audio fidelity:

Loss Type	Description
🎧 STFT Loss	Multi-resolution sum-and-difference STFT (to ensure left/right stereo correctness), applied after A-weighting to match human hearing
🧠 Adversarial Loss	From a multi-scale STFT discriminator with patch-based hinge loss (encourages realism)
🧪 Feature Matching	Matches internal features of real vs generated audio
📉 KL Loss	Keeps the latent space well-behaved

Window sizes for STFT:

[2048, 1024, 512, 256, 128, 64, 32] (for reconstruction) and

[2048, 1024, 512, 256, 128] (for adversarial discriminator)

Loss weights:

STFT loss: 1.0
Adversarial: 0.1
Feature matching: 5.0
KL divergence: 1e-4

This blend ensures high fidelity, structure, and stereo realism in reconstruction.

4.3 Training the Text Encoder: CLAP, from Scratch

They trained their CLAP model (contrastive language-audio pretraining) from scratch on the same dataset.

Setup:

100 epochs
Batch size: 6,144
Hardware: 64 A100 GPUs
Uses the original CLAP configuration:
- RoBERTa-based text encoder (110M parameters)
- HTSAT-based audio encoder (31M parameters)
Loss: Language-audio contrastive loss
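As a rough sketch of what a language-audio contrastive loss computes, the snippet below implements a symmetric cross-entropy over cosine similarities for a toy batch. The fixed `temperature` of 0.07 and the two-dimensional embeddings are assumptions for illustration, not values from the paper.

```python
import math

def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio -> text: row i should select column i.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text -> audio: column j should select row j.
    t2a = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Matching pairs give a loss near zero, while swapping the captions makes the loss large, which is the signal that pulls paired audio and text together.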

🎯 Result: A multimodal text encoder aligned with their dataset, which slightly outperformed the open-source CLAP and T5 encoders in the text-encoder comparison (Section 6.2).

4.4 Training the Diffusion Model

Once the VAE and CLAP were ready, they trained the latent diffusion model.

Setup:

640,000 steps
64 A100 GPUs
Batch size: 256
Exponential moving average (EMA) of model weights
AMP enabled for memory-efficient training

Audio Preparation:

Resample to 44.1kHz
Slice to exactly 95.1 seconds (4,194,304 samples)
- Crop long files from random point
- Pad short ones with silence

Objective:

v-objective (Salimans & Ho, 2022): A more stable variant of denoising objective
Cosine noise schedule (smoothly decays noise over time)
Continuous timestep sampling
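For intuition, the v-objective can be written out for a single scalar sample. The sketch below uses the trigonometric form from Salimans & Ho (2022) with $\alpha_t = \cos(\pi t/2)$ and $\sigma_t = \sin(\pi t/2)$; that Stable Audio's implementation matches this parameterization exactly is an assumption.

```python
import math

def schedule(t: float):
    """Cosine schedule for a continuous timestep t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def v_target(x0: float, eps: float, t: float):
    """Return the noisy sample x_t and the regression target v."""
    alpha, sigma = schedule(t)
    x_t = alpha * x0 + sigma * eps
    v = alpha * eps - sigma * x0
    return x_t, v

def x0_from_v(x_t: float, v: float, t: float) -> float:
    """Recover the clean sample from x_t and a (perfect) v prediction."""
    alpha, sigma = schedule(t)
    return alpha * x_t - sigma * v

x_t, v = v_target(x0=0.8, eps=-0.3, t=0.4)
print(x0_from_v(x_t, v, 0.4))
```

Because $\alpha_t^2 + \sigma_t^2 = 1$, the clean sample is recovered exactly from $x_t$ and $v$, which is part of why the target stays well-scaled at every noise level.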

💡 Dropout (10%) applied to the conditioning inputs → this enables classifier-free guidance during inference (a trick borrowed from image models).
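A minimal sketch of both halves of classifier-free guidance is shown below: dropping the conditioning during training, and mixing conditional and unconditional predictions at inference. The function names and the `guidance_scale` parameter are hypothetical; the mixing rule is the standard one from the classifier-free guidance literature.

```python
import random

def maybe_drop_conditioning(cond, drop_prob=0.1, rng=random):
    """Training: replace the conditioning with None about 10% of the time."""
    return None if rng.random() < drop_prob else cond

def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Inference: extrapolate from the unconditional toward the conditional output."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

print(guided_prediction([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0))
```

A scale of 1 reproduces the conditional prediction, and larger scales push the output further toward what the prompt asks for.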

Note: Text encoder was frozen during diffusion training — so only the U-Net learns how to use its features.

4.5 Prompt Preparation: How Text Prompts Were Created

Each audio file had rich metadata, but not all of it was equally useful all the time.

So they used dynamic prompt construction during training:

Create synthetic natural-language prompts by randomly sampling metadata fields.
Two styles:
1. Structured:
  
  Instruments: Guitar, Drums | Moods: Uplifting, Energetic
2. Free-form:
  
  Guitar, Drums, Bass Guitar, Uplifting, Energetic
Shuffle the items to prevent the model from overfitting to order.

This makes the model robust — it can understand both natural text and structured metadata during inference.
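The dynamic prompt construction described above might look roughly like this. It is a sketch under stated assumptions: the 50/50 split between the two styles, the field-subset sampling, and the `build_prompt` name are all hypothetical.

```python
import random

def build_prompt(metadata: dict[str, list[str]], rng: random.Random) -> str:
    """Build a synthetic prompt from a random subset of metadata fields."""
    fields = list(metadata)
    rng.shuffle(fields)
    chosen = fields[: rng.randint(1, len(fields))]  # random subset of fields
    if rng.random() < 0.5:
        # Structured style: "Field: a, b | Field: c, d"
        parts = []
        for field in chosen:
            values = metadata[field][:]
            rng.shuffle(values)
            parts.append(f"{field}: {', '.join(values)}")
        return " | ".join(parts)
    # Free-form style: one shuffled, comma-separated list of values.
    values = [v for field in chosen for v in metadata[field]]
    rng.shuffle(values)
    return ", ".join(values)

meta = {"Instruments": ["Guitar", "Drums"], "Moods": ["Uplifting", "Energetic"]}
prompt = build_prompt(meta, random.Random(0))
print(prompt)
```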

5 Methodology

Generating high-quality, realistic, and text-aligned music or sound effects is already hard — but measuring how good that generation is? Even harder. Especially when you’re dealing with long-form, stereo, high-fidelity audio.

5.1 Quantitative Metrics

1. FDOpenL3 — Realism

The Fréchet Distance (FD) is a go-to metric in generative modeling. It checks how similar the statistics (mean, covariance) of generated content are to real content — in a learned feature space.
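For two Gaussians the Fréchet distance is $\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2})$. The sketch below assumes diagonal covariances, which reduces the trace term to a sum of squared differences of standard deviations. That simplification is ours for illustration; real implementations use full covariance matrices.

```python
import math
from statistics import fmean, pvariance

def frechet_distance_diag(real, generated):
    """Fréchet distance between feature sets, assuming diagonal covariances."""
    total = 0.0
    for d in range(len(real[0])):
        r = [v[d] for v in real]
        g = [v[d] for v in generated]
        total += (fmean(r) - fmean(g)) ** 2                               # mean term
        total += (math.sqrt(pvariance(r)) - math.sqrt(pvariance(g))) ** 2  # spread term
    return total

real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[1.0, 0.0], [3.0, 2.0]]  # same spread, mean moved by 1 in one dimension
print(frechet_distance_diag(real, shifted))
```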

Stable Audio’s Twist

Instead of projecting audio into VGGish features (which are 16kHz and mono), they use OpenL3, which handles up to 48kHz and stereo.
Stereo-aware: They feed left and right channels separately, get OpenL3 features for each, and concatenate.
For mono baselines, they simply copy the features to both sides.

✅ FDOpenL3 evaluates:

Realism of generated long-form
Full-band stereo audio at 44.1kHz

2. KLPaSST — Semantic Alignment

How much do the generated sounds semantically match their reference content?

They use:

PaSST: A strong audio tagging model trained on AudioSet
Compute the KL divergence between the label probabilities of generated vs real audio

Stable Audio’s Twist:

PaSST only supports up to 32kHz, so they resample from 44.1kHz
Audio is segmented into overlapping chunks, logits are averaged, and softmax is applied
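The chunk-average-softmax-KL pipeline can be sketched with toy logits as below. The three-class logits are made up, and the direction of the divergence (reference relative to generated) is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clip_level_probs(chunk_logits):
    """Average per-chunk logits across the clip, then apply softmax once."""
    n_chunks = len(chunk_logits)
    n_classes = len(chunk_logits[0])
    mean_logits = [
        sum(chunk[c] for chunk in chunk_logits) / n_chunks for c in range(n_classes)
    ]
    return softmax(mean_logits)

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

reference = clip_level_probs([[2.0, 0.0, -1.0], [1.0, 0.5, -1.0]])
generated = clip_level_probs([[0.0, 2.0, -1.0], [0.5, 1.0, -1.0]])
print(kl_divergence(reference, generated))
```

Averaging logits before the softmax is what lets the metric handle clips of any length: the number of chunks changes, but the output is always one probability vector per clip.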

✅ KLPaSST captures:

Tag-level alignment (e.g., “rock”, “violin”, “clapping”)
Works for variable-length audio, not just 10-second snippets

3. CLAPscore — Prompt Adherence

CLAP (Contrastive Language-Audio Pretraining) is used to measure how well the generated audio matches the text prompt.

Stable Audio’s Twist:

Instead of using just a single 10s crop (like prior works), they use feature fusion:
- A global downsampled version of the full audio
- Plus 3 random 10s crops from beginning, middle, and end
This fused signal is encoded using CLAP-LAION (trained on 48kHz)
Both the text and audio embeddings are compared via cosine similarity

✅ CLAPscore tests:

How well long-form stereo audio adheres to the prompt
Works across full audio — intro, middle, and end

5.2 Qualitative Metrics

Beyond math and embeddings — what do humans think?

Human evaluation criteria:

Metric	Description
🎧 Audio quality	Is it high-fidelity or noisy/low-res?
✍️ Text alignment	Does the sound match the prompt?
🎵 Musicality	Are melodies/harmonies coherent?
🔊 Stereo correctness	Does the left/right channel sound appropriate?
🏗️ Musical structure	Does the music have an intro, middle, and outro?

Ratings Collected:

Audio quality, Text alignment, Musicality: Rated on a 0–4 scale (bad → excellent)
Stereo correctness & Musical structure: Binary (Yes/No)

Special rules:

Musicality/structure: Only evaluated for music
Stereo correctness: Only for stereo signals
Non-music: Only quality, alignment, stereo correctness

Evaluations were run using webMUSHRA, a standardized perceptual testing framework.

5.3 Evaluation Data

They used two popular text-audio benchmarks:

MusicCaps

5,521 music clips with 1 text caption each
YouTube-based, mostly stereo
Only 10-second clips — so Stable Audio generated longer clips (up to 95 sec)

AudioCaps

979 clips with 4,875 total captions
Also YouTube-based, mostly stereo
Focuses on environmental sounds and effects

Challenge:

Captions only describe the first 10 seconds, so reference comparisons are limited.
Stable Audio still generates longer audio — showing its ability to go beyond what’s seen during training.

5.4 Baselines

Some top models (e.g., Moûsai, JEN-1) weren’t comparable due to lack of open-source weights.

So they compared against open-source SOTA:

Model	Type	Notes
AudioLDM2	Latent Diffusion	48kHz mono and 16kHz variants
MusicGen	Autoregressive	Small and large models, stereo version available
AudioGen	Autoregressive	Medium-sized, for sound effects

Notes:

AudioLDM2 = best non-autoregressive open baseline
MusicGen-stereo = best autoregressive stereo baseline
MusicGen doesn’t model vocals, so vocal prompts were filtered in some tests

6 Experiments

6.1. How Good Is the Autoencoder?

To test how much audio quality is lost in the compression and decompression process (via the VAE), the authors:

Passed real training audio through the encoder → decoder pipeline
Compared the output to the original using FDOpenL3

🧠 Result: The autoencoded audio showed slightly worse FD scores than the original, but the degradation was minimal. Informal listening confirmed the fidelity is transparent — meaning humans barely notice the difference.

6.2. Which Text Encoder Works Best?

They tested:

CLAP-LAION (open-source)
CLAPours (trained from scratch on their dataset)
T5 (text-only encoder)

Each version was frozen during training, and the base model was trained for 350K steps.

🧠 Result: All performed comparably, but CLAPours slightly outperformed the others. Since it was trained on the same dataset as the diffusion model, it offered better vocabulary alignment and semantic grounding.

✔️ Final Choice: CLAPours — for consistency and performance.

6.3. How Accurate Is the Timing Conditioning?

They tested if the model could:

Generate audio of exactly the length requested via timing embeddings.
Do this across many durations (from short to long).

They used a simple energy-based silence detector to find where the real content ended in the generated 95s audio window.
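A simple energy-based detector of this kind could look like the sketch below. The frame length and threshold are hypothetical values, and the paper's actual detector may differ.

```python
def content_length_seconds(samples, sample_rate, frame_len=1024, threshold=1e-4):
    """Estimate where audible content ends using per-frame mean energy."""
    last_active_end = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            last_active_end = start + len(frame)  # end of the latest non-silent frame
    return last_active_end / sample_rate

# Toy signal: 3 s of constant-amplitude content followed by 2 s of silence.
signal = [0.5] * 3000 + [0.0] * 2000
print(content_length_seconds(signal, sample_rate=1000, frame_len=100))
```

A detector like this misreads quiet passages as endings, which is consistent with the misreadings noted in the results below.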

🧠 Result:

The model closely follows the expected duration
Most accurate at short (≤30s) and long (≥70s) durations
Some variability around 40–60 seconds, likely due to fewer training examples of this length
Some misreadings caused by limitations of the silence detection method

6.4. How Does It Compare to State-of-the-Art?

Benchmarks are shown in Tables 1–3 (not included here), comparing Stable Audio against:

AudioLDM2
MusicGen (small, large, stereo)
AudioGen

Key Observations:

Best in audio quality and text alignment on MusicCaps
Slightly weaker on AudioCaps for text alignment, possibly due to fewer sound effects in its training set
Competitive in musicality and musical structure
Good at stereo rendering for music but weaker on stereo correctness for effects — possibly because some prompts don’t require spatial diversity
Importantly, it’s the only model consistently capable of generating intro → development → outro — real musical structure, not just loops

6.5. How Fast Is It?

They benchmarked inference time on a single A100 GPU (batch size = 1).

🧠 Result:

Much faster than autoregressive models (e.g., MusicGen, AudioGen)
Faster than AudioLDM2, even when generating higher-quality audio (44.1kHz stereo vs. 16kHz or mono)
Particularly faster than AudioLDM2-48kHz, which works at a similar bandwidth but takes longer

✅ Latent diffusion + optimized architecture + DPMSolver++ = speed with quality

7 Conclusions

Stable Audio proves that it’s possible to build a system that is:

🎵 Flexible (supports music and sound effects)
⏱️ Fast (generates up to 95s in just 8s)
🎧 High-fidelity (44.1kHz stereo)
🧠 Controllable (via text + timing conditioning)
🧪 Well-evaluated (with new long-form-aware metrics)

It pushes the frontier in multiple areas:

One of the first systems to consistently generate structured music
Among the few to generate stereo sound effects
Introduces new metrics for evaluating long-form, full-band, stereo generation
Outperforms or competes with the state of the art on multiple benchmarks

Denoising Diffusion Probabilistic Models

2025-07-23T00:00:00+00:00

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models.” arXiv:2006.11239. Preprint, arXiv, December 16, 2020. https://doi.org/10.48550/arXiv.2006.11239.

Code

Abstract
Diffusion Models
- The Forward Process (Data → Noise)
- The Reverse Process (Noise → Data)
Training Objective
Training Algorithm
- Connection to Score Matching
- More
Experiments

Abstract

The paper presents high-quality image synthesis results using diffusion probabilistic models.
- Latent-variable model: the model assumes unobserved variables (here the noisy intermediates $x_1, \dots, x_T$, which have the same dimensionality as the image $x_0$) from which the image is produced.
The model is trained on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics.

Diffusion Models

The paper uses diffusion probabilistic models (a.k.a. diffusion models). They are a class of generative models that learn to create high-quality samples by reversing a gradual corruption process.

More Details on Diffusion models → diffusion-models.html

The Forward Process (Data → Noise)

The forward process is a Markov chain that gradually corrupts the original data by adding Gaussian noise according to a variance schedule $β_1, \dots, β_T$:

\[q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})\] \[q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-β_t}\,x_{t-1}, β_tI)\]

Key aspects:

$x_0$: Original clean data
$x_1, x_2, \dots, x_T$: Progressively noisier versions
$β_t$: Variance schedule controlling how much noise is added at each step
This process is fixed and doesn’t require learning

The Reverse Process (Noise → Data)

The reverse process learns to undo the forward corruption:

\[p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^T p_θ(x_{t-1}|x_t)\] \[p_\theta(x_{t-1} \mid x_t) \;=\; \mathcal{N}\!\Bigl( x_{t-1} \;;\; \mu_\theta(x_t, t), \;\Sigma_\theta(x_t, t)\Bigr)\]

Key aspects:

Starts from pure noise: $p(x_T) = N(x_T; 0, I)$
Each step is a learned Gaussian transition
$μ_θ$ and $Σ_θ$ are neural network predictions

Training Objective

The model is trained by optimising the variational bound:

\[\mathcal{L} \;=\; \mathbb{E}_q \left[ -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]\]

Efficient sampling property: The forward process allows sampling at any timestep t directly:

\[q(x_t \mid x_0) \;=\;\mathcal{N}\!\Bigl( x_t \;;\; \sqrt{\bar{\alpha}_t} \, x_0,\; (1 - \bar{\alpha}_t) I\Bigr)\]

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_t = 1 - \beta_t$
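The closed-form jump to any timestep is easy to sketch for scalar data. The code below assumes the linear schedule quoted later in these notes ($β_1 = 10^{-4}$, $β_T = 0.02$, $T = 1000$); the function names are our own.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t-1] is the product of (1 - beta_s) for s = 1..t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one step; t is 1-indexed."""
    eps = rng.gauss(0.0, 1.0)
    ab = alpha_bar[t - 1]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps, eps

betas = linear_beta_schedule()
alpha_bar = cumulative_alpha_bar(betas)
x_t, eps = q_sample(x0=0.5, t=300, alpha_bar=alpha_bar, rng=random.Random(0))
print(alpha_bar[0], alpha_bar[-1], x_t)
```

By the last step $\bar{\alpha}_T$ is tiny, so $x_T$ carries almost no information about $x_0$ and is close to a standard normal sample.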

Variational Lower Bound Loss

We want to model realistic image data. Our goal is to maximize the likelihood of real images under a generative model:

\[\max\; p_\theta(x_0) = \max \int p_\theta(x_{0:T})\; dx_{1:T}\]

But this marginalization over all possible noise trajectories is intractable. So, we approximate it using variational inference.

Starting from the log-likelihood:

\[\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\; dx_{1:T}\]

We reformulate the intractable log-likelihood using a known forward process (diffusion) $q(x_{1:T} \mid x_0)$:

\[\log p_\theta(x_0) = \log \int \frac{p_\theta(x_{0:T})\; q(x_{1:T} \mid x_0)}{q(x_{1:T} \mid x_0)} dx_{1:T}\] \[= \log \mathbb{E}_q\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]\] \[\geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \quad \text{[Jensen's inequality]}\]

Negating this lower bound gives the loss that we minimize during training:

\[\mathcal{L} = \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = \mathbb{E}_q\left[-\log p_\theta(x_{0:T}) + \log q(x_{1:T} \mid x_0)\right]\]

In context of image generation: We’re finding the best denoising path to generate realistic images from noise.

🧮 Derivation of Above Loss

How we got to the above loss:

Original Integral:
$$I = \int f(x)\, dx \quad \text{where } f(x) = p_\theta(x_{0:T})$$
Multiply and Divide by $ g(x) $:
$$I = \int f(x) \cdot \frac{g(x)}{g(x)}\, dx \quad \text{where } g(x) = q(x_{1:T} \mid x_0)$$
Rearranged Form:
$$I = \int \frac{f(x)}{g(x)} \cdot g(x)\, dx$$
Recognize as Expectation:
$$I = \mathbb{E}_g\left[\frac{f(x)}{g(x)}\right]$$
Apply Jensen's Inequality:
For a convex function $ \varphi $ and a random variable $ X $:
$$\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$$
For a concave function like $ \log $, the inequality flips:
$$\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$$
Therefore:
$$\log \mathbb{E}_q\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \quad \text{[Jensen's inequality]}$$

Expanding the Variational Bound

Reverse Process (learned):

\[p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\]

This defines how we turn noise into an image.

Forward Process (known):

\[q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})\]

This adds noise step by step to a clean image.

The Full Loss Function after using p and q from above:

Breaking the bound into individual terms:

\[\mathcal{L} = \mathbb{E}_q\left[-\log p(x_T) - \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t) + \sum_{t=1}^{T} \log q(x_t \mid x_{t-1})\right]\] \[= \mathbb{E}_q\left[-\log p(x_T) - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]\]

This measures how well our reverse process undoes the forward noise corruption.

Rewriting the Loss

By Bayes’ Rule:

\[q(x_t \mid x_{t-1}) \cdot q(x_{t-1} \mid x_0) = q(x_t \mid x_0) \cdot q(x_{t-1} \mid x_t, x_0)\]

This allows us to rewrite the loss as:

\[\mathcal{L} = \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))\right] + \sum_{t=2}^{T} \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right] + \mathbb{E}_q\left[-\log p_\theta(x_0 \mid x_1)\right]\]

How?

1. Rewriting each log term as a KL divergence

We use the identity:

$$\mathrm{D_{KL}}(q(z) \| p(z)) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z)]$$

This allows us to convert pairs of log terms into KL divergences, wherever we can express a tractable pair of distributions.

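2. Final timestep prior matching

The prior term $-\log p(x_T)$ is paired with a $\log q(x_T \mid x_0)$ term that is left over when the sum in step 3 telescopes:

$$\mathbb{E}_q\left[\log q(x_T \mid x_0) - \log p(x_T)\right] = \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))\right]$$

Here $q(x_T \mid x_0)$ is the marginal of the forward chain:

$$q(x_T \mid x_0) = \int q(x_T \mid x_{T-1}) q(x_{T-1} \mid x_0) \, dx_{T-1}$$

It is a tractable Gaussian, so these terms merge into one KL term.

3. KL terms for t = 2 to T

For steps t = 2 to T, using Bayes' rule:

$$q(x_t \mid x_{t-1}) \cdot q(x_{t-1} \mid x_0) = q(x_t \mid x_0) \cdot q(x_{t-1} \mid x_t, x_0)$$

Taking logs:

$$\log q(x_t \mid x_{t-1}) - \log p_\theta(x_{t-1} \mid x_t) = \log q(x_{t-1} \mid x_t, x_0) - \log p_\theta(x_{t-1} \mid x_t) + \log q(x_t \mid x_0) - \log q(x_{t-1} \mid x_0)$$

Summing over t = 2 to T, the last two terms telescope to $\log q(x_T \mid x_0) - \log q(x_1 \mid x_0)$. The first piece joins the prior term in step 2, and the second cancels the $\log q(x_1 \mid x_0)$ contributed by the t = 1 term. What remains at each step is, in expectation, a KL divergence:

$$\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t))$$

4. Final step t = 1

After that cancellation, only the reconstruction term is left at t = 1. There is no meaningful posterior $ q(x_0 \mid x_1, x_0) $ to compare against, so we leave the log term as-is:

$$-\log p_\theta(x_0 \mid x_1)$$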

Interpretation for image generation: We encourage our model to align with the known noise process and accurately reconstruct the original image.

Named Components:

\[\mathcal{L} = \mathcal{L}_T + \sum_{t=2}^{T} \mathcal{L}_{t-1} + \mathcal{L}_0\]

Each term in the loss plays a specific role:

Prior Matching: $L_T$

KL between final noisy state and prior

\[\mathcal{L}_T = \mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))\]

Pushes noisy images to align with Gaussian noise
Ensures the final forward process state:
$q(x_T \mid x_0)$ matches the prior: $p(x_T) = \mathcal{N}(0, I)$
When $\beta_t$ are fixed (not learned), this becomes a constant and can be ignored during training
No parameters to optimize here!

Denoising KL Terms: $L_{t-1}$

KL between forward and reverse process at each step

\[\mathcal{L}_{t-1} = \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right]\]

Forward posterior is tractable. We can compute it exactly:

\[q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\right)\]

With:

\[\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t\] \[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t\]

Ensures reverse denoising steps are accurate.
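The posterior formulas above translate directly into code. The sketch below works on scalars with the linear schedule used elsewhere in these notes; the function name `posterior_mean_var` is hypothetical.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def posterior_mean_var(x0, x_t, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for 1-indexed t >= 2."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean, var

x0, eps, t = 0.7, -1.2, 500
x_t = math.sqrt(alpha_bar[t - 1]) * x0 + math.sqrt(1.0 - alpha_bar[t - 1]) * eps
mean, var = posterior_mean_var(x0, x_t, t)
print(mean, var, betas[t - 1])
```

The posterior variance $\tilde{\beta}_t$ is always a little smaller than $\beta_t$, because knowing $x_0$ removes some of the uncertainty about $x_{t-1}$.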

What it does:

Ground truth target:
$q(x_{t-1} \mid x_t, x_0)$ — the “true” way to denoise $x_t$ when we know $x_0$
Model prediction:
$p_\theta(x_{t-1} \mid x_t)$ — what our model thinks is the right way to denoise
Training signal:
Make the model’s denoising match the ground truth denoising

Reconstruction loss for final denoising step - $L_0$

\[\mathcal{L}_0 = \mathbb{E}_q\left[-\log p_\theta(x_0 \mid x_1)\right]\]

Handles the final step from slightly noisy image to clean discrete pixels
Uses a discrete decoder to ensure proper pixel values {0,1,…,255}

Parameterisation Trick: Predicting Noise instead of Image

Traditional approach: Directly predict $μ_θ(x_t, t)$

Loss:

\[\mathcal{L}_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \|^2\right] + C\]

Noise Prediction

Instead of predicting the mean directly, we reparameterize $x_t$ in terms of the noise:

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \varepsilon\]

We then train the model to predict the noise $\varepsilon$, and the loss becomes:

\[\mathcal{L}_{t-1} = \mathbb{E}_{x_0, \varepsilon}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2\right]\]

What this means:

Instead of predicting the posterior mean directly, predict the noise
The model learns: “Given a noisy image, what noise was added?”
Much more stable and effective training signal!

The Simplified Objective

The full variational bound has complex weighting terms: $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$

The paper proposes ignoring these weights:

\[\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \varepsilon}\left[ \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2 \right]\]

Where:

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \varepsilon\] \[t \sim \text{Uniform}(\{1, \dots, T\})\]

This balances all noise levels equally and avoids overfitting to low-noise timesteps.

Problem with original weighting:

Small t (little noise): Large weight, easy task
Large t (lots of noise): Small weight, hard task
Model focuses on easy denoising tasks!

Solution with uniform weighting:

Equal attention to all noise levels
Model learns difficult denoising better
Better sample quality in practice
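To see how uneven the original weighting is, we can evaluate the weight $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$ at an early and a late step. The sketch assumes $\sigma_t^2 = \beta_t$ and the linear schedule with $T = 1000$; `bound_weight` is a hypothetical helper.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def bound_weight(t):
    """Weight on ||eps - eps_theta||^2 at 1-indexed step t, with sigma_t^2 = beta_t."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    sigma_sq = beta_t
    return beta_t ** 2 / (2.0 * sigma_sq * alpha_t * (1.0 - alpha_bar[t - 1]))

w_early, w_late = bound_weight(2), bound_weight(T)
print(w_early, w_late)
```

Under these assumptions the early-step weight is more than an order of magnitude larger than the late-step weight, which is the imbalance the simplified objective removes.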

Training Algorithm

Training

Sample real data: Get a training image from your dataset
Random timestep: Choose how much noise to add (uniform across all levels)
Sample noise: Generate the specific noise to add
Create noisy version: Apply the forward process in one step
Predict noise: Ask the model “what noise was added?”
Update: Make the model better at noise prediction

Single-step training: Instead of running the full T-step forward process, we can jump directly to any timestep t using the closed-form formula.
Stochastic training: Each batch sees different noise levels, so the model learns to denoise across all levels simultaneously.
Simple objective: Just predict noise - no complex distributions or adversarial training.
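The data side of one training step can be sketched without any deep learning framework. Here `training_example` builds the noisy input and the regression target, and `simple_loss` is the squared error from the simplified objective; the network itself is left out, and all names are hypothetical.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def training_example(x0, rng):
    """Sample a timestep and noise, then jump straight to x_t in closed form."""
    t = rng.randint(1, T)                       # uniform over all noise levels
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # the noise the model must predict
    ab = alpha_bar[t - 1]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return x_t, t, eps

def simple_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

x0 = [0.2, -0.5, 0.9]                           # toy "image" with three pixels
x_t, t, eps = training_example(x0, random.Random(0))
print(t, simple_loss(eps, [0.0, 0.0, 0.0]))     # loss of a model that predicts zero
```

In a real setup, `eps_pred` would come from the U-Net evaluated at `(x_t, t)`, and the loss would be backpropagated through it.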

Connection to Score Matching

The noise prediction objective is equivalent to denoising score matching:

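\[\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}} \quad \Longrightarrow \quad \nabla_{x_t} \log p(x_t) \approx -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}\]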

What this means:

Predicting noise $\varepsilon$ is equivalent to predicting the gradient of the log-probability
The model learns the “gradient field” pointing toward high-probability regions
Sampling follows these gradients to find realistic images
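The link between noise and score can be checked numerically for the conditional density $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t))$ on scalar data. The values of `x0`, `eps`, and `ab` below are arbitrary choices for illustration.

```python
import math

def log_q(x_t, x0, alpha_bar_t):
    """Log-density of q(x_t | x_0) for scalar data."""
    var = 1.0 - alpha_bar_t
    mean = math.sqrt(alpha_bar_t) * x0
    return -0.5 * math.log(2.0 * math.pi * var) - (x_t - mean) ** 2 / (2.0 * var)

x0, eps, ab = 0.3, 1.5, 0.6
x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

h = 1e-5  # step for the central finite difference
numeric_score = (log_q(x_t + h, x0, ab) - log_q(x_t - h, x0, ab)) / (2.0 * h)
analytic_score = -eps / math.sqrt(1.0 - ab)
print(numeric_score, analytic_score)
```

The two values agree, so a network that predicts the noise is, up to the factor $-1/\sqrt{1 - \bar{\alpha}_t}$, predicting this score.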

Why This Matters

Theoretical foundation: Connects diffusion models to the rich theory of score-based generative models.
Sampling interpretation: The reverse process becomes Langevin dynamics following learned gradients.
Stability: Score matching is known to be more stable than adversarial training.

Variance Schedule Choice

The paper fixes the forward process variances $β_t$ to constants rather than learning them.

This simplification means the forward process q has no learnable parameters
The term $L_T$ becomes constant and can be ignored during training
- Linear schedule: $β_t$ increases linearly from $β_1$ to $β_T$
- Typical values: $β_1 = 0.0001$, $β_T = 0.02$
- T steps: Usually T = 1000 for training

Covariance Choice:

The model sets $\Sigma_\theta(x_t,t) = \sigma_t^2 I$ (diagonal, time-dependent constants):

$\sigma_t^2 = \beta_t$: Optimal when $x_0 \sim N(0,I)$
$\sigma_t^2 = \tilde{\beta}_t$: Optimal when $x_0$ is deterministic
Both choices gave similar empirical results

Image Preprocessing:

Images are scaled from {0,1,…,255} to [-1,1]
Ensures consistent neural network input scaling
Starting point is standard normal prior $p(x_T)$

Practical Implementation Details

Training Tips:
- EMA for sampling
- Gradient clipping
- Cosine learning rate schedule
- Data augmentation

Experiments

Experiment Setup

T = 1000: Number of diffusion steps, matching prior work to keep neural network evaluations comparable.
Noise schedule: Linearly increasing variances from $β_1 = 10^{-4}$ to $β_T = 0.02$, which keeps added noise small but enough to reach near-complete destruction of the original signal by the end.
Signal-to-noise control: With this schedule, $L_T = \mathrm{D_{KL}}(q(x_T \mid x_0) \parallel \mathcal{N}(0, I)) \approx 10^{-5}$ bits per dimension, so $x_T$ is nearly indistinguishable from the Gaussian prior.

Model Architecture

U-Net: Based on unmasked PixelCNN++ with group normalization and shared weights across time.
Time embeddings: Injected using Transformer sinusoidal embeddings.
Self-attention: Added at 16×16 feature resolution.

Training Objective Ablation

True Variational Bound: Best for compression (lossless codelength).
Simplified Objective: Best for sample quality.
Predicting mean $\tilde{\mu}$:
- Works well with variational bound.
- Performs worse with simple MSE objective.
Learned variance: Leads to instability and poor quality.
Fixed variance: More stable.

Progressive Generation

Generate images progressively from random bits (reverse process).
Large-scale features appear early; details come later.
Shows that Gaussian diffusion allows coarse-to-fine image generation.

Interpolation in Latent Space

Interpolate two images in latent space (at same timestep t), then decode via reverse process.
Results:
- Smooth and meaningful transitions in pose, hair, background, etc.
- Eyewear remains unchanged, showing model’s bias or lack of variation in that feature.
- Larger t → blurrier but more varied (i.e., creative) results.

Connection to Autoregressive Models

Rewriting the variational bound shows diffusion is like autoregressive decoding with a continuous and generalized bit ordering.
Gaussian noise acts like masking but may be more natural and effective.
Unlike true autoregressive models, diffusion can use T < data dimension, allowing flexibility in sampling speed or model power.

Key Takeaways

Diffusion models:
- Achieve high-quality image synthesis even without conditioning.
- Show strong lossy compression ability (good perceptual reconstructions).
- Can act as a generalization of autoregressive models.
Model architecture and training objective greatly affect performance.
Progressive generation, interpolation, and decoding are all efficient and visually plausible.

Diffusion Models

2025-07-17T00:00:00+00:00

My Journey into Diffusion Models

I am studying diffusion models from scratch, diving deep into the mathematical foundations and practical implementations. This blog serves as a central hub that summarizes all my readings, notes, and reference blogs that I’m writing on diffusion models. As I explore this fascinating field, I’ll be documenting my learnings through detailed posts that break down complex concepts into digestible explanations.

Blog Posts on Diffusion Models

1. Step-by-Step Diffusion: An Elementary Tutorial

This covers the fundamentals of diffusion models including:

How diffusion models work by gradually adding and removing noise
DDPM (stochastic sampling) and DDIM (deterministic sampling) algorithms
Flow matching as a generalization beyond Gaussian noise
Practical implementation details and best practices

2. Denoising Diffusion Probabilistic Models

Covers an image generation model built with diffusion.
Focuses on the training loss and the mathematical derivations.

More blog posts on diffusion models coming soon as I continue my learning journey…

Step-by-Step Diffusion: An Elementary Tutorial

2025-07-17T00:00:00+00:00

Nakkiran, Preetum, Arwen Bradley, Hattie Zhou, and Madhu Advani. “Step-by-Step Diffusion: An Elementary Tutorial.” arXiv, June 23, 2024. https://doi.org/10.48550/arXiv.2406.08929.

Fundamentals of Diffusion
Stochastic Sampling: DDPM
Deterministic Sampling: DDIM
Flow Matching
Diffusion in Practice
Further Reading and Resources

1. Fundamentals of Diffusion

Goal of Generative Modelling: Given i.i.d. samples from an unknown distribution $p^*$, we create a method that can generate new samples by sampling from an approximation of $p^*(x)$.

i.i.d. samples: Independent and identically distributed samples

Each sample was drawn independently and all samples come from same underlying distribution $p^*$.

Example: We have a training set of 10,000 dog photos:

These photos represent samples from some true distribution $p_{dog}(x)$ over all possible dog images and we don’t know the mathematical form of $p_{dog}(x)$
Our goal is to create a system that can generate new, realistic dog images that look like they could have come from the same distribution

Idea: Learn a transformation from some easy-to-sample distribution (such as Gaussian noise) to our target distribution $p^*$.

Diffusion models offer a general framework for learning such transformations.
The clever trick of diffusion is to reduce the problem of sampling from distribution $p^{*}(x)$ into a sequence of easier sampling problems.

1.1 Gaussian Diffusion

Forward Pass

Systematically transforms target data (like images of dogs) into pure noise through a series of small, random steps.

Starting point: We have some data $x_0$ sampled from target distribution $p^*$ (e.g., real dog images).

The forward process: You create a sequence $x_0 \rightarrow x_1 \rightarrow x_2 \rightarrow \ldots \rightarrow x_T$ by repeatedly adding small amounts of Gaussian noise:

\[x_{t+1} = x_t + \eta_t, \quad \text{where } \eta_t \sim \mathcal{N}(0, \sigma^2)\]

This means each step adds independent Gaussian noise with variance $\sigma^2$.

Final state: After $T$ steps, the distribution $p_T$ becomes approximately Gaussian $\mathcal{N}(0, T\sigma^2)$.

This happens because the sum of $T$ independent Gaussian noise terms is itself Gaussian with variance $T\sigma^2$, so $x_T$ is $x_0$ plus $\mathcal{N}(0, T\sigma^2)$ noise. The variance grows linearly with the number of steps, and once $T\sigma^2$ is much larger than the scale of the data, the contribution of $x_0$ becomes negligible.

\[\text{(See figure below)}\]

So we can approximately sample from $p_T$ by just sampling a Gaussian.
We can directly sample $x_t$ given $x_0$ without computing all intermediate steps. (Sum of Gaussians is Gaussian)
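Both claims can be checked with a quick simulation on scalar data: running the chain step by step and jumping directly to step $t$ should give the same spread. The sample count, step count, and $\sigma$ below are arbitrary choices.

```python
import math
import random
from statistics import pvariance

def forward_stepwise(x0, steps, sigma, rng):
    """Add independent N(0, sigma^2) noise one step at a time."""
    x = x0
    for _ in range(steps):
        x += rng.gauss(0.0, sigma)
    return x

def forward_direct(x0, steps, sigma, rng):
    """Jump to the same step in one draw: the noise terms sum to N(0, steps * sigma^2)."""
    return x0 + rng.gauss(0.0, sigma * math.sqrt(steps))

rng = random.Random(0)
steps, sigma, n = 50, 0.1, 2000
var_stepwise = pvariance([forward_stepwise(0.0, steps, sigma, rng) for _ in range(n)])
var_direct = pvariance([forward_direct(0.0, steps, sigma, rng) for _ in range(n)])
print(var_stepwise, var_direct, steps * sigma ** 2)
```

Both empirical variances land close to `steps * sigma ** 2`, which is why training never needs to simulate the intermediate steps.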

Reverse Sampling

Strategy

The authors propose to solve generative modelling by decomposing it into many simpler “reverse sampling” steps:

Instead of: Learning to generate samples from $p^*$ directly (very hard)
Do this: Learn to go backwards one step at a time: $p_T \rightarrow p_{T-1} \rightarrow p_{T-2} \rightarrow \ldots \rightarrow p_0 = p^*$

Why This Decomposition Helps

The key insight is that adjacent distributions ($p_{t-1}, p_t$) are very similar because we only add a small amount of noise $\sigma$ at each step. This makes the reverse step much easier to learn than the full generative problem.

Think of it like this:

Hard: Transform pure noise into a realistic dog image in one step
Easy: Remove a tiny bit of noise from an almost-clean dog image

The DDPM Reverse Sampler

DDPM: Denoising Diffusion Probabilistic Models

The “obvious” approach is to learn the conditional distribution $p(x_{t-1} \mid x_t)$ for each step. Given a noisy sample $x_t$, we want to predict what the slightly less noisy version $x_{t-1}$ should be.

Fact 1: When $\sigma$ is small, the conditional distribution $p(x_{t-1} \mid x_t)$ is approximately Gaussian.

This means:

\[p(x_{t-1} \mid x_t = z) \approx \mathcal{N}(\mu_{t-1}(z), \sigma^2)\]

So instead of learning an arbitrary complex distribution, we only need to learn the mean function $\mu_{t-1}(z)$.

\[\text{(See figure below)}\]

The Regression Formulation

Since we know the distribution is Gaussian with known variance $\sigma^2$, learning the mean is equivalent to solving a regression problem:

\[\mu_{t-1} = \arg\min_{f} \mathbb{E}\left[\|f(x_t) - x_{t-1}\|^2\right]\]

This can be rewritten as:

\[\mu_{t-1} = \arg\min_{f} \mathbb{E}\left[\|f(x_{t-1} + \eta_t) - x_{t-1}\|^2\right]\]

where $\eta_t \sim \mathcal{N}(0, \sigma^2)$ is the noise we added.

Theorem: For any joint distribution over random variables $(X, Y)$, the conditional expectation $\mathbb{E}[Y \mid X]$ is the function that minimizes the mean squared error:

\[\mathbb{E}[Y \mid X] = \arg\min_{f} \mathbb{E}\left[(f(X) - Y)^2\right]\]
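A toy case makes the theorem tangible. If the clean signal is itself Gaussian with variance $s^2$, the conditional expectation of the clean value given the noisy one is linear, with slope $s^2 / (s^2 + \sigma^2)$, and a least-squares fit should recover that slope. The Gaussian data assumption and the specific numbers are ours, chosen only to keep the example closed-form.

```python
import random

rng = random.Random(0)
s, sigma, n = 1.0, 0.5, 20000

clean = [rng.gauss(0.0, s) for _ in range(n)]        # plays the role of x_{t-1}
noisy = [x + rng.gauss(0.0, sigma) for x in clean]   # plays the role of x_t

# Least-squares slope of the regression "predict clean from noisy" (no intercept).
slope = sum(a * b for a, b in zip(noisy, clean)) / sum(a * a for a in noisy)
expected_slope = s ** 2 / (s ** 2 + sigma ** 2)
print(slope, expected_slope)
```

The fitted denoiser shrinks the noisy input toward the mean, which is exactly what the conditional expectation prescribes here.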

The Beautiful Connection to Denoising

Notice what this regression objective is asking: given a clean signal $x_{t-1}$ plus some noise $\eta_t$, predict the original clean signal.

This is exactly the image denoising problem! We can use standard denoising techniques (like convolutional neural networks) to solve it.

The authors have reduced the complex problem of generative modeling to the well-understood problem of regression/denoising.

Instead of learning to generate realistic images from scratch, we learn to remove small amounts of noise—doing this many times in sequence to gradually transform pure noise into realistic samples.

This is why diffusion models work so well: they break down an impossibly hard problem into many manageable denoising steps that neural networks are already good at solving.

1.2 Diffusions in the Abstract

Diffusion models follow a universal pattern that works across many different settings—not just Gaussian noise, but also discrete domains, deterministic processes, and more.

Discrete Domains: Instead of working with continuous values (like pixel intensities 0.0 to 1.0), we work with discrete, finite sets of possibilities. For example, text generation where each position can be one of a finite vocabulary.
Deterministic Processes: The reverse sampler produces the same output every time you give it the same input—there’s no randomness involved.

The Abstract Recipe

Step 1: Choose your endpoints

Start with target distribution $p^*$ (what you want to generate)
Choose a base distribution $q$ that’s easy to sample from (e.g., Gaussian noise, random bits)

Step 2: Create an interpolating sequence

Build a sequence of distributions that smoothly connects these endpoints:

\[p_0 = p^* \rightarrow p_1 \rightarrow p_2 \rightarrow \ldots \rightarrow p_T = q\]

The key requirement is that adjacent distributions ($p_{t-1}, p_t$) are “close” in some meaningful sense.

Step 3: Learn reverse samplers

For each step $t$, learn a function $F_t$ that can transform samples from $p_t$ back to $p_{t-1}$.

The Reverse Sampler Definition

This is the formal definition of what we need to learn:

Definition: A reverse sampler $F_t$ is a function such that if you:

Take a sample $x_t$ from distribution $p_t$
Apply $F_t$ to get $F_t(x_t)$
The result is distributed according to $p_{t-1}$

Mathematically:

\[F_t(z) : z \sim p_t \implies F_t(z) \sim p_{t-1}\]

Why This Abstraction is Powerful

Flexibility: This framework works for:

Continuous domains (images with Gaussian noise)
Discrete domains (text, categorical data)
Deterministic processes (no randomness in the reverse step)
Stochastic processes (with randomness)

Multiple implementations: The same abstract framework gives us:

DDPM (stochastic, Gaussian-based)
DDIM (deterministic version)
Flow-matching (continuous-time generalization)

The Key Insight About “Closeness”

The magic happens because adjacent distributions are “close.” This means:

The reverse sampling step $F_t$ doesn’t need to do much work
Learning becomes feasible because we’re making small adjustments rather than dramatic transformations

The Coupling Perspective

Given the marginal distributions $\{p_t\}$, there are many possible ways to define the joint relationships between consecutive steps. These are called “couplings” in probability theory.

This means we have freedom in how we design the reverse sampler—we can choose whichever coupling is most convenient for learning or sampling.

Why This Matters

This abstraction shows that diffusion models aren’t just about “adding noise”—they’re about:

Interpolation: Creating smooth paths between complex and simple distributions
Decomposition: Breaking hard problems into many easier steps
Flexibility: Adapting the same core idea to many different domains and applications

1.3 Discretisation

We need to be more precise about what we mean by adjacent distributions $p_t$, $p_{t-1}$ being “close”.

The Continuous-Time Perspective

The authors are shifting from thinking about discrete steps ($x_0$, $x_1$, $x_2$, …) to a continuous-time process $p(x,t)$ where:

$t = 0$: We have our target distribution $p^*$
$t = 1$: We have our base distribution (noise)
$t \in [0,1]$: We have intermediate distributions

The discrete steps are just a discretisation of this continuous process:

\[p_k(x) = p(x, k \cdot \Delta t) \qquad \text{where} \; \Delta t = 1/T\]

Finer discretisation = closer adjacent distributions:

Large $T \rightarrow$ small $\Delta t \rightarrow$ many small steps $\rightarrow$ adjacent distributions are very close
Small $T \rightarrow$ large $\Delta t \rightarrow$ few big steps $\rightarrow$ adjacent distributions are farther apart

This explains why diffusion models work better with more steps!

The Variance Scaling Problem and $\sqrt{\Delta t}$ Scaling

Here’s a subtle but crucial issue: If we naively add noise $\sigma^2$ at each step, then after $T$ steps we’d have total variance $T \cdot \sigma^2$. This means:

More steps $\rightarrow$ higher final variance
Fewer steps $\rightarrow$ lower final variance

But we want the final distribution to be the same regardless of how many steps we take.

Solution

To fix this, they scale the noise variance by $\Delta t$:

\[\sigma = \sigma_q \sqrt{\Delta t} = \sigma_q \sqrt{1/T}\]

Why this works: After $T$ steps, the total variance becomes:

\[\text{Total variance} = T \times \sigma_q^2 \Delta t = T \times \sigma_q^2 \times (1/T) = \sigma_q^2\]

So regardless of $T$, the final variance is always $\sigma_q^2$!
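The scaling argument can be checked by simulation. The sketch below runs the forward process from $x_0 = 0$ for two different step counts and estimates the final variance; the values of `sigma_q`, the sample count, and the seed are arbitrary choices for illustration.

```python
import math
import random

def forward_final_samples(T, sigma_q=2.0, n=4000, seed=0):
    """Run x_{t+dt} = x_t + N(0, sigma_q^2 * dt) for T steps, starting from x_0 = 0."""
    rng = random.Random(seed)
    dt = 1.0 / T
    step_std = sigma_q * math.sqrt(dt)  # per-step standard deviation
    return [sum(rng.gauss(0.0, step_std) for _ in range(T)) for _ in range(n)]

def variance(values):
    """Plain (biased) sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

var_10 = variance(forward_final_samples(T=10))
var_100 = variance(forward_final_samples(T=100))
# Both estimates should sit near sigma_q^2 = 4.0, regardless of T.
```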

The New Notation

This scaling ensures that as $T \rightarrow \infty$ (continuous limit), the process converges to a well-defined continuous-time stochastic process.

From now on:

$t$ represents continuous time in $[0,1]$, not discrete steps
$\Delta t = 1/T$ is the step size
$x_t$ means “$x$ at time $t$” (not “$x$ at step $t$”)

The forward process becomes:

\[x_{t+\Delta t} = x_t + \eta_t, \qquad \text{where} \; \eta_t \sim N(0, \sigma_q^2 \Delta t)\]

The Cumulative Effect

\[x_t \sim N(x_0, \sigma_t^2) \qquad \text{where} \; \sigma_t := \sigma_q \sqrt{t}\]

This beautiful formula shows that:

At $t = 0$: $\sigma_0 = 0$ (no noise, original data)
At $t = 1$: $\sigma_1 = \sigma_q$ (full noise level)
At $t = 0.5$: $\sigma_{0.5} = \sigma_q \sqrt{0.5}$ (intermediate noise)

This discretization framework:

Unifies discrete and continuous views of diffusion
Ensures consistency across different numbers of steps
Enables theoretical analysis of the continuous limit
Connects to stochastic differential equations (SDEs)

2. Stochastic Sampling: DDPM

This section introduces the DDPM (Denoising Diffusion Probabilistic Models) sampler, the classic stochastic approach to diffusion sampling.

The DDPM sampler learns to predict what the previous (less noisy) timestep looked like given the current (more noisy) timestep. Specifically, it learns:

\[\mu_t(z) := E[x_t \mid x_{t+\Delta t} = z]\]

This means: “Given that we observe value $z$ at time $t+\Delta t$, what was the expected value at the previous time $t$?”

The Training Process

Objective: Learn the conditional expectation functions $\{\mu_t\}$ by solving a regression problem:

\[\mu_t = \arg\min_{f} \; E[||f(x_{t+\Delta t}) - x_t||^2]\]

What this means:

Take pairs of ($x_t$, $x_{t+\Delta t}$) from the forward diffusion process
Train a neural network to predict the cleaner version $x_t$ given the noisier version $x_{t+\Delta t}$
This is literally a denoising problem!

Practical implementation: Instead of learning separate functions for each timestep, we typically train a single neural network $f_\theta(x, t)$ that takes both the noisy sample and the time $t$ as input.

Sampling Algorithm 1: Stochastic Reverse Sampler (DDPM-like Sampler)

Once trained, the reverse sampler works as follows:

For input sample $x_t$, and timestep $t$, output:

\[\hat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}(x_t) + N(0, \sigma_q^2 \Delta t)\]

Breaking this down:

$\mu_{t-\Delta t}(x_t)$: Use the learned function to predict the mean of the previous timestep
$+ N(0, \sigma_q^2 \Delta t)$: Add Gaussian noise with the same variance as the forward process
The result is a sample from the previous timestep

The Full Generation Process

Step 1: Start with pure noise: $x_1 \sim N(0, \sigma_q^2)$
Step 2: Apply Algorithm 1 repeatedly:

\[x_1 \rightarrow x_{1-\Delta t} \rightarrow x_{1-2\Delta t} \rightarrow ... \rightarrow x_0\]

Step 3: The final $x_0$ is your generated sample

Why This Works (Conceptually)

The magic relies on Fact 1: that the true conditional distribution $p(x_{t-\Delta t} \mid x_t)$ is approximately Gaussian when $\Delta t$ is small.
If this is true, then:
- We only need to learn the mean $\mu_{t-\Delta t}(x_t)$ (since we know the variance is $\sigma_q^2 \Delta t$)
- We can sample from this conditional by taking the predicted mean plus Gaussian noise
- Each step undoes a small amount of the forward corruption

The Stochastic Nature

Notice that this sampler is stochastic - even if you start with the same noise $x_1$, you’ll get different samples $x_0$ because of the added noise at each step. This is different from deterministic samplers like DDIM.

2.1 Correctness of DDPM: Look in paper for the proof

The Problem: We needed to prove that DDPM’s reverse sampler actually works - that it can successfully generate samples from our target distribution.

The Key Question: Why is the reverse process (going from noisy to clean) approximately Gaussian?

The Answer:

Used Bayes’ rule to express the reverse conditional probability $p(x_{t-\Delta t} \mid x_t)$
Applied Taylor expansion around the current point
Completed the square to show it has Gaussian form

The Result:

\[p(x_{t-\Delta t} \mid x_t) \approx N(\text{mean}, \sigma_q^2 \Delta t)\]

where the mean involves the “score” (the gradient of the log probability): $\text{mean} = x_t + \sigma_q^2 \Delta t \, \nabla \log p_t(x_t)$.

Why This Matters:

Since the reverse process is Gaussian, we only need to learn its mean
Learning the mean is just a regression problem (predicting clean from noisy)
This justifies why DDPM works: each reverse step is a simple denoising operation

The Bottom Line: DDPM works because when you add small amounts of noise, reversing that process is approximately Gaussian, which makes it learnable through standard regression techniques.

2.2 Algorithms

Pseudocode 1: DDPM Training

What it does: Trains the neural network to do denoising regression.

Step by step:

Get clean data: Sample $x_0$ from target distribution (e.g., real images)
Pick random time: Sample $t$ uniformly from $[0,1]$
Add noise up to time t: Create $x_t = x_0 + N(0, \sigma_q^2 t)$
Add one more step of noise: Create $x_{t+\Delta t} = x_t + N(0, \sigma_q^2 \Delta t)$
Train to denoise: $\text{Loss} = \left\| f_\theta(x_{t+\Delta t}, t+\Delta t) - x_t \right\|^2$

Key insight: The network learns to predict the cleaner version $x_t$ given the noisier version $x_{t+\Delta t}$ and the time $t+\Delta t$.
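The training steps can be sketched in a few lines of scalar Python. Everything here is a toy under stated assumptions: the data is a single point at 0, so the exact denoiser is known in closed form and stands in for a trained network; the function names are hypothetical; and $t$ is drawn from $[0, 1-\Delta t]$ rather than $[0,1]$ so that $t + \Delta t$ stays inside the time range.

```python
import math
import random

def ddpm_training_pair(x0, sigma_q, T, rng):
    """Build one (noisier input, its time, cleaner target) triple from the forward process."""
    dt = 1.0 / T
    t = rng.uniform(0.0, 1.0 - dt)                          # random time, leaving room for one step
    x_t = x0 + rng.gauss(0.0, sigma_q * math.sqrt(t))       # add noise up to time t
    x_next = x_t + rng.gauss(0.0, sigma_q * math.sqrt(dt))  # add one more step of noise
    return x_next, t + dt, x_t

def ddpm_loss(f, x0, sigma_q, T, rng):
    """Squared error between the model's guess f(x_{t+dt}, t+dt) and the cleaner x_t."""
    x_next, t_next, x_t = ddpm_training_pair(x0, sigma_q, T, rng)
    return (f(x_next, t_next) - x_t) ** 2

T, SIGMA_Q, N = 10, 1.0, 5000
DT = 1.0 / T

def f_opt(x, t):
    """Exact E[x_{t-dt} | x_t = x] when all data sits at 0 (stand-in for a trained network)."""
    return x * (t - DT) / t

def f_identity(x, t):
    """A naive denoiser that returns its input unchanged."""
    return x

# Same seeds for both, so the two denoisers are scored on identical training pairs.
avg_opt = sum(ddpm_loss(f_opt, 0.0, SIGMA_Q, T, random.Random(i)) for i in range(N)) / N
avg_identity = sum(ddpm_loss(f_identity, 0.0, SIGMA_Q, T, random.Random(i)) for i in range(N)) / N
```

In a real setup `f_opt` would be replaced by $f_\theta$ and the loss minimized by gradient descent; the comparison here only shows that the conditional mean scores lower than a naive guess.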

Pseudocode 2: DDPM Sampling

What it does: Generates new samples using the trained model.

Step by step:

Start with pure noise: $x_1 \sim N(0, \sigma_q^2)$
Go backwards in time: For $t = 1, 1-\Delta t, 1-2\Delta t, ..., \Delta t$
Predict + add noise: $x_{t-\Delta t} = f_\theta(x_t, t) + N(0, \sigma_q^2 \Delta t)$
Return final result: $x_0$ is your generated sample

Key insight: Each step predicts the cleaner version, then adds noise to account for uncertainty (this is the stochastic part).
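To see the sampling loop run end to end without training anything, we can pick a target for which the conditional mean is known exactly. The sketch below assumes a Gaussian target $x_0 \sim N(m, s_0^2)$ (a toy choice, not from the source), for which $x_t \sim N(m, s_0^2 + \sigma_q^2 t)$ and $E[x_{t-\Delta t} \mid x_t]$ has a closed form that plays the role of $f_\theta$.

```python
import math
import random

M, S0, SIGMA_Q = 2.0, 1.0, 5.0   # assumed toy target N(M, S0^2) and noise scale

def exact_mean(x, t, dt):
    """Closed-form E[x_{t-dt} | x_t = x] for the Gaussian toy target."""
    v_t = S0**2 + SIGMA_Q**2 * t
    v_prev = S0**2 + SIGMA_Q**2 * (t - dt)
    return M + (v_prev / v_t) * (x - M)

def ddpm_sample(mean_fn, sigma_q, T, rng):
    """Start from N(0, sigma_q^2), then repeat: predict the mean, add fresh noise."""
    dt = 1.0 / T
    step_std = sigma_q * math.sqrt(dt)
    x = rng.gauss(0.0, sigma_q)
    for k in range(T, 0, -1):        # t = 1, 1 - dt, ..., dt
        x = mean_fn(x, k * dt, dt) + rng.gauss(0.0, step_std)
    return x

rng = random.Random(0)
samples = [ddpm_sample(exact_mean, SIGMA_Q, 200, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

The sample mean and standard deviation should land near $m = 2$ and $s_0 = 1$. They will not match exactly: the starting distribution $N(0, \sigma_q^2)$ only approximates the true $p_1$, and the step size is finite.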

Pseudocode 3: DDIM Sampling (Preview)

What it does: Deterministic version of sampling (no added noise).

Key difference: Instead of adding random noise, it uses a deterministic update rule with a mixing coefficient $\lambda$.

Important Notes

Training is simultaneous: The network learns to denoise at ALL timesteps at once.
Sampling goes backwards: We go from $t=1$ (pure noise) to $t=0$ (clean data)
Same network for all steps: $f_\theta(x,t)$ handles all timesteps using the time input $t$

2.3 Variance Reduction: Predicting $x_0$

This section explains an important practical trick used in diffusion models: train the network to predict the clean data $x_0$ rather than the previous timestep.

The Two Training Approaches

Original approach: Train the network to predict $E[x_{t-\Delta t} \mid x_t]$ - the previous timestep
Alternative approach: Train the network to predict $E[x_0 \mid x_t]$ - the original clean data

Why This Works (Claim 2):

We have:

\[E[(x_{t-\Delta t} - x_t) \mid x_t] = \frac{\Delta t}{t} E[(x_0 - x_t) \mid x_t]\]

which is equivalent to:

\[E[x_{t-\Delta t} \mid x_t] = \left(\frac{\Delta t}{t}\right) E[x_0 \mid x_t] + \left(1 - \frac{\Delta t}{t}\right) x_t\]

This means: if you can predict the clean image $x_0$, you can easily compute what the previous timestep $x_{t-\Delta t}$ should be.

The Intuitive Explanation

The noise symmetry argument:

When you observe $x_t$, it’s the sum of $x_0$ and $t/\Delta t$ independent noise increments, one per step
You can’t tell which increment came from which step, because they all “look the same”
So instead of predicting the single last increment, you can predict the average of all of them: the total noise $x_t - x_0$ divided by $t/\Delta t$
The average has much lower variance than individual steps!

Why This is Better (Variance Reduction)

Problem with predicting $x_{t-\Delta t}$: You’re trying to estimate one noisy step from another noisy observation—high variance.

Solution with predicting $x_0$: You’re averaging over all the noise steps, which reduces variance significantly.

Think of it like this:

High variance: “Given this noisy image, what did the slightly less noisy version look like?”
Low variance: “Given this noisy image, what did the original clean image look like?”

The second question is easier because you’re not trying to distinguish between very similar noise levels.

Important Warning

Critical point: The model predicts $E[x_0 \mid x_t]$, which is the expected value, not a sample!

What this means:

If you’re generating faces, $E[x_0 \mid x_t]$ might be a blurry average of all possible faces
It won’t look like a real face—it’s a mathematical expectation
This is normal and expected!

Common misconception: People think “predicting $x_0$” means the model outputs something that looks like a real sample. It doesn’t—it outputs the average of all possible samples.

Practical Implementation

In practice:

Train the model to predict $E[x_0 \mid x_t]$ (better variance)
During sampling, use the relationship in Claim 2 to convert this back to $E[x_{t-\Delta t} \mid x_t]$
Apply the sampling algorithm as usual
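The conversion step is a one-line formula. The sketch below implements Claim 2 as a helper (the function name is hypothetical) and checks it against a Gaussian toy target $x_0 \sim N(m, s_0^2)$, an assumption made only because both conditional means are then available in closed form.

```python
def prev_mean_from_x0(x0_hat, x_t, t, dt):
    """Claim 2: E[x_{t-dt} | x_t] = (dt/t) * E[x_0 | x_t] + (1 - dt/t) * x_t."""
    w = dt / t
    return w * x0_hat + (1.0 - w) * x_t

# Toy check: for x_0 ~ N(m, s0^2) we have x_t ~ N(m, s0^2 + sigma_q^2 * t).
m, s0, sigma_q = 2.0, 1.0, 5.0
t, dt, x_t = 0.4, 0.01, 3.7
v_t = s0**2 + sigma_q**2 * t
v_prev = s0**2 + sigma_q**2 * (t - dt)

x0_hat = m + (s0**2 / v_t) * (x_t - m)     # exact E[x_0 | x_t]
direct = m + (v_prev / v_t) * (x_t - m)    # exact E[x_{t-dt} | x_t]
via_claim2 = prev_mean_from_x0(x0_hat, x_t, t, dt)
# direct and via_claim2 agree up to floating-point rounding.
```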

The Mathematical Relationship

The division by $\left(\frac{t}{\Delta t}\right)$ in the formula represents the number of steps taken so far. Since we’ve accumulated $\left(\frac{t}{\Delta t}\right)$ noise steps, we divide the total predicted noise by this amount to get the average per step.

3. Deterministic Sampling: DDIM

DDIM: Denoising Diffusion Implicit Model → A deterministic alternative to the stochastic DDPM sampler.

Algorithm 2: Deterministic Reverse Sampler (DDIM-like)

Instead of using the stochastic sampler that adds random noise at each step, DDIM uses a deterministic function that always produces the same output for the same input.

For input sample $x_t$, and step index $t$, output:

\[\hat{x}_{t-\Delta t} = x_t + \lambda \left( \mu_{t-\Delta t}(x_t) - x_t \right)\]

Where:

\[\lambda = \frac{\sigma_t}{\sigma_{t-\Delta t} + \sigma_t}\]

and

\[\sigma_t = \sigma_q \sqrt{t}\]

$\mu_{t-\Delta t}(x_t) = E[x_{t-\Delta t} \mid x_t]$ is the conditional expectation (what we’d predict on average)
$\lambda = \frac{\sigma_t}{\sigma_{t-\Delta t} + \sigma_t}$ is a scaling factor
$\sigma_t = \sigma_q \sqrt{t}$ from the noise schedule

Understanding the Formula

Let’s interpret what this update is doing:

Step 1: $\mu_{t-\Delta t}(x_t) - x_t$

This is the “direction” we need to move to get from the current noisy sample to the predicted less-noisy sample.

Step 2: $\lambda (\mu_{t-\Delta t}(x_t) - x_t)$

We scale this direction by factor $\lambda$. This determines how far we actually move.

Step 3: $x_t + \lambda (\mu_{t-\Delta t}(x_t) - x_t)$

We take a step in that direction from our current position.

Why This Scaling Factor $\lambda$?

The scaling factor $\lambda$ has a nice interpretation:

When $\sigma_{t-\Delta t} \approx \sigma_t$ (small time step), then $\lambda \approx \frac{1}{2}$ (take a moderate step)
When $\sigma_{t-\Delta t} \ll \sigma_t$ (large time step), then $\lambda \approx 1$ (take the full predicted step)
When $\sigma_{t-\Delta t} \gg \sigma_t$ (this shouldn’t happen in forward process), then $\lambda \approx 0$
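The first two cases are easy to confirm numerically. The sketch below writes the update from Algorithm 2 as two small helpers (the function names are hypothetical) and evaluates $\lambda$ for a tiny step and for the final step, where $\sigma_{t-\Delta t} = \sigma_0 = 0$.

```python
import math

def ddim_lambda(t, dt, sigma_q=1.0):
    """lambda = sigma_t / (sigma_{t-dt} + sigma_t), with sigma_t = sigma_q * sqrt(t)."""
    sigma_t = sigma_q * math.sqrt(t)
    sigma_prev = sigma_q * math.sqrt(t - dt)
    return sigma_t / (sigma_prev + sigma_t)

def ddim_step(x_t, mean_prev, t, dt, sigma_q=1.0):
    """Deterministic update: move a fraction lambda of the way toward the predicted mean."""
    return x_t + ddim_lambda(t, dt, sigma_q) * (mean_prev - x_t)

lam_small_step = ddim_lambda(t=0.5, dt=1e-6)   # sigma_{t-dt} is almost sigma_t
lam_last_step = ddim_lambda(t=0.01, dt=0.01)   # sigma_{t-dt} = 0, so lambda = 1
```

Note that $\sigma_q$ cancels in the ratio, so $\lambda$ depends only on $t$ and $\Delta t$.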

Deterministic vs Stochastic

DDPM (Stochastic):

Samples from $p(x_{t-\Delta t} \mid x_t)$
Same input can give different outputs
Adds randomness at each step

DDIM (Deterministic):

Uses a fixed function $F_t(x_t)$
Same input always gives same output
No randomness in the reverse process

The Transport Map Perspective

Instead of thinking about sampling from conditional distributions, DDIM thinks about transport maps—functions that transform one distribution into another.

The goal is to show that the function $F_t$ defined by the DDIM update “pushes” the distribution $p_t$ to $p_{t-\Delta t}$:

\[F_t \,\sharp\, p_t \approx p_{t-\Delta t}\]

The notation $F\,\sharp\,p$ means “the distribution you get when you apply function $F$ to samples from distribution $p$”.

Advantages of DDIM:

Faster sampling: Can take bigger steps since it’s deterministic
Reproducible: Same starting noise always gives same result
Interpolation: Can smoothly interpolate between samples
Fewer steps: Often works well with far fewer steps than DDPM

Connection to other methods: This deterministic approach connects to flow-matching and other continuous-time methods.

We need to prove that DDIM is correct:

The authors will prove this works by:

Point-mass case: Show it works for the simplest distributions (single points)
Marginalization: Extend to full distributions by considering all possible points

This is similar to how flow-matching methods are analyzed—by showing the transport map works pointwise and then extending to distributions.

The key insight is that even though we’re not sampling from $p(x_{t-\Delta t} \mid x_t)$, we can still achieve the same marginal distribution $p_{t-\Delta t}$ through this deterministic transport.

3.1 Case 1: Single Point

Avoiding complicated math: Refer to paper

What are we trying to prove?

We want to show that DDIM (the deterministic sampler) actually works. But proving it for complicated distributions is hard, so we start with the simplest possible case.

The simplest case: One dot

Imagine our target is just a single dot at position 0. That’s it—we want to generate samples that are exactly at position 0.

What happens when we add noise?

Start: We have a dot at position 0
After some time: The dot has moved randomly and is now somewhere else (due to noise)
Our job: Figure out how to move it back toward 0

The obvious solution

If we know the dot started at 0, and now it’s at some noisy position, the obvious thing to do is shrink it back toward 0.
If the dot is currently at position 10, and we know it should be closer to 0, we should move it to maybe position 7 or 5 (somewhere closer to 0).

The key insight

The fancy DDIM formula is actually just doing this simple shrinking!

\[\text{New position} = \text{Old position} + \lambda (\text{Predicted position} - \text{Old position})\]

But in the simple case, this reduces to:

\[\text{New position} = (\text{shrink\_factor}) \times \text{Old position}\]

Where $\text{shrink\_factor}$ is less than 1 (it works out to $\sigma_{t-\Delta t}/\sigma_t$), so we’re moving the dot closer to 0.
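The reduction can be verified directly. When the target is a point mass at 0 we have $E[x_0 \mid x_t] = 0$, so Claim 2 gives $E[x_{t-\Delta t} \mid x_t] = (1 - \Delta t / t)\, x_t$. Plugging that into the DDIM update, the sketch below (with arbitrary illustrative values for $t$ and $\Delta t$) recovers a shrink factor of $\sqrt{(t - \Delta t)/t} = \sigma_{t-\Delta t}/\sigma_t$.

```python
import math

def ddim_step_point_mass(x_t, t, dt):
    """One DDIM step when the whole target distribution is a single point at 0."""
    mean_prev = (1.0 - dt / t) * x_t    # E[x_{t-dt} | x_t], since E[x_0 | x_t] = 0
    lam = math.sqrt(t) / (math.sqrt(t - dt) + math.sqrt(t))
    return x_t + lam * (mean_prev - x_t)

x_t, t, dt = 10.0, 0.5, 0.1
x_prev = ddim_step_point_mass(x_t, t, dt)   # about 8.94
shrink_factor = x_prev / x_t                # sqrt(0.4 / 0.5), about 0.894
```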

Why this matters

This proves that DDIM works correctly in the simplest case. It’s doing exactly what we’d expect—gradually shrinking the noise to bring samples back to the target.

The bigger picture

DDIM looks complicated with all its formulas and Greek letters
But in the simplest case, it’s just gradually shrinking noisy samples back toward the target
This gives us confidence that it’s doing something sensible in more complex cases too

Think of it like this: if you wanted to guide a lost person back to their house, you’d tell them to walk in the direction of their house. DDIM is doing the same thing—it’s figuring out which direction to move to get closer to the target, then taking a step in that direction.

3.2 Velocity Fields and Gases

Instead of thinking about DDIM as a mathematical formula, we can think of it as a velocity field—like wind patterns that tell particles which way to move.

The DDIM update can be rewritten as:

\[\hat{x}_{t-\Delta t} = x_t + v_t(x_t) \cdot \Delta t\]

Where:

\[v_t(x_t) = \frac{\lambda}{\Delta t} \left( E[x_{t-\Delta t} \mid x_t] - x_t \right)\]

This looks just like physics: position = old position + velocity × time!

The Gas Analogy

Imagine a gas made of particles:

Each particle represents a possible sample
The density of particles at any location represents the probability of that sample
The gas starts with density pattern $p_t$ (more spread out/noisy)
We want it to end up with density pattern $p_{t-\Delta t}$ (less spread out/noisy)

How the Velocity Field Works

The velocity field $v_t(x)$ tells each particle at position $x$ which direction to move:

Direction: Toward where that particle “should” be (based on $E[x_{t-\Delta t} \mid x_t]$)
Speed: Proportional to how far it needs to move

When all particles move according to this velocity field, the overall gas density transforms from $p_t$ to $p_{t-\Delta t}$.
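The gas picture can be tried out on a case where the right answer is known. The sketch below assumes a Gaussian target $N(0, s_0^2)$ (a toy choice), so the velocity field is linear in $x$ and every particle is simply scaled by the same factor. Transporting $p_1 = N(0, s_0^2 + \sigma_q^2)$ onto $p_0 = N(0, s_0^2)$ requires a total contraction of $s_0 / \sqrt{s_0^2 + \sigma_q^2}$, and the discretized flow should come close to that.

```python
import math

S0, SIGMA_Q, T = 1.0, 5.0, 1000   # assumed toy target N(0, S0^2) and settings
DT = 1.0 / T

def velocity(x, t):
    """v_t(x) = (lambda / dt) * (E[x_{t-dt} | x_t = x] - x) for the Gaussian toy target."""
    v_t = S0**2 + SIGMA_Q**2 * t
    v_prev = S0**2 + SIGMA_Q**2 * (t - DT)
    mean_prev = (v_prev / v_t) * x
    lam = math.sqrt(t) / (math.sqrt(t - DT) + math.sqrt(t))
    return (lam / DT) * (mean_prev - x)

def flow_particle(x):
    """Move one particle from t = 1 down to t = 0 using x <- x + v_t(x) * dt."""
    for k in range(T, 0, -1):
        x = x + velocity(x, k * DT) * DT
    return x

contraction = flow_particle(1.0)                 # how much a unit offset shrinks
ideal = S0 / math.sqrt(S0**2 + SIGMA_Q**2)       # exact ratio of standard deviations
```

The small gap between `contraction` and `ideal` comes from the finite step size and should narrow as `T` grows.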

Note: Skipping Proofs

3.3 Case 2: Two Points

3.4 Case 3: Arbitrary Distributions

3.5 The Probability Flow ODE [Optional]

3.6 Discussion: DDPM vs DDIM

DDPM (Stochastic):

Takes a sample and produces a random output from $p(x_{t-\Delta t} \mid x_t)$
Same input can give different outputs each time

DDIM (Deterministic):

Takes a sample and produces the same output every time
Creates a fixed mapping from input to output

The Iteration Behaviour

When you run these algorithms from start to finish, they behave very differently:

DDPM: Independence from Starting Point

Key insight: If you start DDPM from different initial noise samples $x_1$, you’ll get samples that are essentially independent of where you started.
Why: The forward process “mixes” well—it scrambles the original data so much that the final noise $x_1$ contains almost no information about the original $x_0$.
Result: $p(x_0 \mid x_1) \approx p(x_0)$—the output doesn’t depend on the starting noise!
Analogy: Like shuffling a deck of cards so thoroughly that the final order tells you nothing about the original order.

DDIM: Strong Dependence on Starting Point

Key insight: DDIM creates a deterministic function from noise to data.
Why: Since it’s deterministic, the same starting noise $x_1$ always produces the same final output $x_0$.
Result: Different starting points lead to different, but predictable outputs.
Analogy: Like having a specific recipe—same ingredients always give the same dish.
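The contrast can be demonstrated with both samplers started from the same $x_1$. The sketch below again assumes a Gaussian toy target $N(0, s_0^2)$ so that the conditional mean is exact; the function names and parameter values are illustrative choices.

```python
import math
import random

S0, SIGMA_Q, T = 1.0, 5.0, 100   # assumed toy target N(0, S0^2) and settings
DT = 1.0 / T

def mean_prev(x, t):
    """Exact E[x_{t-dt} | x_t = x] for the Gaussian toy target."""
    return x * (S0**2 + SIGMA_Q**2 * (t - DT)) / (S0**2 + SIGMA_Q**2 * t)

def run_ddpm(x1, rng):
    """Stochastic sampler: predicted mean plus fresh Gaussian noise at every step."""
    x = x1
    for k in range(T, 0, -1):
        x = mean_prev(x, k * DT) + rng.gauss(0.0, SIGMA_Q * math.sqrt(DT))
    return x

def run_ddim(x1):
    """Deterministic sampler: move a fraction lambda toward the predicted mean."""
    x = x1
    for k in range(T, 0, -1):
        t = k * DT
        lam = math.sqrt(t) / (math.sqrt(t - DT) + math.sqrt(t))
        x = x + lam * (mean_prev(x, t) - x)
    return x

x1 = 3.0
ddim_a, ddim_b = run_ddim(x1), run_ddim(x1)                            # identical
ddpm_a, ddpm_b = run_ddpm(x1, random.Random(1)), run_ddpm(x1, random.Random(2))  # differ
```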

The Mapping Perspective

This reveals something profound about DDIM:

DDIM as a Special Map

What it does: Creates a deterministic function from Gaussian noise $\rightarrow$ target distribution
Sounds familiar: This is similar to GANs and Normalizing Flows, which also map noise to data. But there’s a key difference…

The Constraint Makes It Special

- GANs: Can learn any mapping that works, with complete freedom
- DDIM: Must learn the specific mapping determined by the target distribution

Why this matters:

- Supervised vs Unsupervised: DDIM has a “correct answer” to learn toward
- Smoothness: The DDIM map inherits smoothness from the target distribution
- Structure: The mapping respects the geometry of the data

Practical Implications

DDPM Advantages:

Sample diversity: Randomness can help explore different modes
Robustness: Less sensitive to the exact starting point

DDIM Advantages:

Reproducibility: Same noise always gives same result
Interpolation: Can smoothly interpolate between samples
Speed: Often works with fewer steps
Control: Deterministic nature enables better control

The Learning Trade-off

Easier aspects of DDIM:

Has a “ground truth” target function to learn
Inherits nice properties from the target distribution
Supervised learning setup

Harder aspects of DDIM:

Must learn the specific “correct” mapping
Less flexibility than arbitrary mappings
May miss easier-to-learn alternatives

Visual Intuition

DDPM: Like a skilled artist who can paint many different dogs from the same reference photo—each painting is different but all are valid dogs.
DDIM: Like a precise photocopier that always produces the exact same copy from the same input—deterministic but perfectly reproducible.

The Philosophical Difference

DDPM: “Generate samples that look like they came from the target distribution”
DDIM: “Learn the specific transformation that the diffusion process implies”

This fundamental difference in philosophy leads to all the practical differences we observe in how these methods behave!

3.7 Remarks on Generalization

This section addresses a crucial practical issue that often gets overlooked in theoretical discussions of diffusion models: How do we actually learn these models from real data without just memorizing the training set?

The Core Problem

What we want: A model that learns the underlying distribution and can generate new, similar samples.

What we might get: A model that just memorizes the training data and can only reproduce exact copies of what it saw.

The Empirical Risk Minimization Trap

Standard approach: Train by minimizing prediction error on the training set.

The problem: If we minimize this error perfectly, we get a model that:

Perfectly predicts the training data
Only generates samples that are exactly from the training set
Never creates anything genuinely new

Why this fails: Perfect memorization of finite training data doesn’t help us learn the true underlying distribution.
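A toy experiment makes the trap visible. By the regression theorem, the exact minimizer of the denoising loss on a finite training set is the conditional expectation under the empirical distribution, that is, a posterior-weighted average of the training points. The sketch below assumes a three-point 1-D "training set" (purely illustrative), plugs that exact minimizer into the deterministic sampler, and measures how far each generated sample lands from the nearest training point.

```python
import math
import random

TRAIN = [-2.0, 0.0, 3.0]     # a tiny "training set" (assumed toy data)
SIGMA_Q, T = 5.0, 500
DT = 1.0 / T

def x0_posterior_mean(x, t):
    """Exact empirical-loss minimizer: E[x_0 | x_t = x] over the training points."""
    var = SIGMA_Q**2 * t
    logits = [-(x - a) ** 2 / (2.0 * var) for a in TRAIN]
    top = max(logits)                                  # subtract max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return sum(w * a for w, a in zip(weights, TRAIN)) / sum(weights)

def ddim_generate(x):
    """Deterministic sampling driven by the memorizing denoiser."""
    for k in range(T, 0, -1):
        t = k * DT
        mean_prev = (DT / t) * x0_posterior_mean(x, t) + (1.0 - DT / t) * x   # Claim 2
        lam = math.sqrt(t) / (math.sqrt(t - DT) + math.sqrt(t))
        x = x + lam * (mean_prev - x)
    return x

rng = random.Random(0)
outputs = [ddim_generate(rng.gauss(0.0, SIGMA_Q)) for _ in range(300)]
distance_to_train = [min(abs(y - a) for a in TRAIN) for y in outputs]
```

Nearly every entry of `distance_to_train` should be close to zero: the perfectly optimized model only hands back its training data.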

Imagine learning to draw dogs:

Bad approach: Memorize every pixel of 1000 dog photos and only reproduce those exact photos
Good approach: Learn what makes something “dog-like” and generate new dog images

The Regularization Solution

The key insight: We need to prevent perfect memorization through regularization.

Explicit regularization: Add penalties to prevent overfitting
Implicit regularization: Natural limitations prevent memorization:

Finite model capacity: The neural network can’t memorize everything
Optimization randomness: SGD doesn’t find the perfect memorizing solution
Early stopping: We don’t train to perfect convergence

Why This Matters

For researchers: Understanding that perfect optimization isn’t the goal—we want controlled generalization.
For practitioners:
- Larger datasets help prevent memorization
- Some “imperfection” in training is actually beneficial
- Need to balance fitting the data vs. generalizing

The Security/Copyright Issue

Real concern: Models trained on copyrighted or private data might reproduce it exactly.
Evidence: Researchers have shown they can extract training images from models like Stable Diffusion with carefully crafted prompts.

Practical Takeaways

Don’t aim for a perfect training loss: a model that reaches the exact minimum on a finite dataset has memorized it
Use larger datasets when possible to reduce memorization
Implicit regularization from neural network training often helps naturally
Be aware of privacy/copyright implications of potential memorization

4 Flow Matching

Flow matching is a generalization of DDIM that provides much more flexibility in designing generative models.

The core ideas behind DDIM don’t actually require:

Gaussian noise
The specific Gaussian forward process
Any particular base distribution

Instead, the fundamental concept is about transporting distributions using vector fields.

The Two-Step Construction from DDIM

Looking back at how DDIM worked, there were really two key steps:

Step 1: Point-to-Point Transport

For any single target point $a$, we can construct a vector field $v[a]_t$ that transports a sample from the base distribution (like standard Gaussian) to exactly that point $a$.

Think of this as: “How do I move a particle from random noise to land exactly at point $a$?”

Example:

Target point $a$ = “golden retriever sitting”
Vector field $v[a]_t$ = instructions for how to move a noise sample to become exactly this image

Step 2: Combining Vector Fields

When we have multiple target points (or a whole distribution), we combine the individual vector fields into a single effective vector field.

This is like: “If I want to transport noise to match a complex distribution, I combine the ‘instructions’ for reaching each individual point.”

Example: If we have many target points, we need to combine all these individual vector fields into one unified vector field that can generate the entire distribution.

$v[a_1]_t$ → path to “golden retriever sitting”
$v[a_2]_t$ → path to “beagle running”
$v[a_3]_t$ → path to “poodle sleeping”
etc.

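\[v_t(x) = \int v[a]_t(x) \cdot p(a \mid x_t = x) \, da\]

Here $p(a \mid x_t = x) \propto p_t(x \mid a) \, p^*(a)$ is the posterior probability that a particle found at $x$ at time $t$ is on the path to target $a$.

Or in discrete terms:

\[v_t(x) = \sum_a v[a]_t(x) \cdot P(\text{target} = a \mid x_t = x)\]

What This Means Intuitively

At any point $x$ and time $t$, the combined vector field tells you:

“Move in the direction that’s the weighted average of all individual directions”
“Weight each direction by how likely it is that a particle sitting at $x$ is headed for that target: how common the target is in your dataset, times how likely its path is to pass through $x$”

Suppose at some point $x$ during the denoising process:

$v[a_1]_t(x)$ says “move right” (toward golden retriever)
$v[a_2]_t(x)$ says “move left” (toward beagle)
$v[a_3]_t(x)$ says “move up” (toward poodle)

And your dataset has:

50% golden retrievers
30% beagles
20% poodles

If the paths to all three targets are equally likely to pass through $x$, the posterior weights equal the dataset frequencies, and the combined vector field would be:

\[v_t(x) = 0.5 \times \text{"right"} + 0.3 \times \text{"left"} + 0.2 \times \text{"up"}\]

If $x$ sits much closer to one target’s path, that target’s weight grows and the others shrink.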

The Learning Process

In practice, we don’t know all the individual vector fields $v[a]_t$ ahead of time. Instead:

Sample pairs: Take samples $(x_0, x_1)$ where $x_0$ is from the target distribution and $x_1$ is from the base distribution
Construct path: For each pair, define a path from $x_1$ to $x_0$ (like a straight line)
Learn average: Train a neural network to predict the average velocity along all these paths

Connection to DDIM

In DDIM, this combination happens implicitly:

The conditional expectation $E[x_{t-\Delta t} \mid x_t]$ is already the result of combining all possible paths
The Gaussian assumptions make this combination mathematically tractable
The vector field emerges from the denoising objective
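This implicit combination can be written out for a small discrete target. In the sketch below the weights are the posterior probabilities $P(x_0 = a \mid x_t = x)$, which is what the conditional expectation uses; the pointwise velocity for a single target $a$ follows from Claim 2 with $E[x_0 \mid x_t] = a$. The two-point target and all function names are illustrative assumptions.

```python
import math

def pointwise_velocity(x, a, t, dt):
    """DDIM velocity v[a]_t(x) when the whole target is the single point a."""
    mean_prev = (dt / t) * a + (1.0 - dt / t) * x     # Claim 2 with E[x_0 | x_t] = a
    lam = math.sqrt(t) / (math.sqrt(t - dt) + math.sqrt(t))
    return (lam / dt) * (mean_prev - x)

def posterior_weights(x, targets, priors, t, sigma_q):
    """P(x_0 = a | x_t = x), proportional to prior(a) * N(x; a, sigma_q^2 * t)."""
    var = sigma_q**2 * t
    scores = [p * math.exp(-(x - a) ** 2 / (2.0 * var)) for a, p in zip(targets, priors)]
    total = sum(scores)
    return [s / total for s in scores]

def combined_velocity(x, targets, priors, t, dt, sigma_q):
    """Posterior-weighted average of the pointwise velocities."""
    weights = posterior_weights(x, targets, priors, t, sigma_q)
    return sum(w * pointwise_velocity(x, a, t, dt) for w, a in zip(weights, targets))

targets, priors = [-1.0, 1.0], [0.5, 0.5]
v_middle = combined_velocity(0.0, targets, priors, t=0.5, dt=0.01, sigma_q=1.0)
v_near = combined_velocity(0.9, targets, priors, t=0.01, dt=0.001, sigma_q=1.0)
```

Exactly between two equally likely targets the pulls cancel (`v_middle` is zero). Close to one target at low noise, the posterior puts almost all weight on it, and `v_near` matches the pointwise velocity toward that target.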

The Generalization

Flow matching asks: What if we drop all the Gaussian assumptions?

Instead of being limited to:

Gaussian base distributions
Gaussian forward processes
Specific noise schedules

We can now think about:

Any two points $x_0$ and $x_1$
Any two distributions $p$ (data) and $q$ (base)
Any smooth path connecting them

Why This Matters

In traditional diffusion models (DDPM/DDIM), the paths are curved because of how Gaussian noise is added and removed.

Why curved? The forward process adds noise gradually: clean $\rightarrow$ slightly noisy $\rightarrow$ more noisy $\rightarrow$ pure noise. The reverse process follows the same curved trajectory backwards.

Imagine a ball rolling down a curved hill—it doesn’t go straight down, it follows the curved surface.

More flexible paths: Instead of the specific curved paths that Gaussian diffusion creates, we can design:

1. Straight lines (rectified flows)

Instead of curved paths, we connect each noise sample to its corresponding data sample with a straight line.

If you start at noise point $x_1$ (at $t = 1$) and want to reach data point $x_0$ (at $t = 0$):

\[x_t = (1-t) \, x_0 + t \cdot x_1\]

Why this is better:

Faster sampling: Straight lines are the shortest distance between two points
Fewer steps needed: You can take bigger steps along a straight path
More predictable: Easier to control and understand
Less computation: Simpler math than curved trajectories
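A straight path between one known pair can be sketched in a few lines. The code below uses the time convention from Section 1.3 ($t = 0$ is data, $t = 1$ is noise) and the reverse-time update $x_{t-\Delta t} = x_t + v \cdot \Delta t$, under which the velocity along the line is the constant $x_0 - x_1$. The function names are hypothetical, and this covers only the pointwise path; a trained model would instead predict the average velocity over many such pairs.

```python
def straight_line_point(x0, x1, t):
    """Point on the straight path at time t: data x0 at t = 0, noise x1 at t = 1."""
    return (1.0 - t) * x0 + t * x1

def straight_line_velocity(x0, x1):
    """Constant velocity of that path under the update x_{t-dt} = x_t + v * dt."""
    return x0 - x1

def euler_to_data(x0, x1, n_steps):
    """Integrate from t = 1 (noise) down to t = 0 (data) with n_steps Euler steps."""
    dt = 1.0 / n_steps
    x = x1
    for _ in range(n_steps):
        x = x + straight_line_velocity(x0, x1) * dt
    return x

one_step = euler_to_data(x0=2.0, x1=-1.0, n_steps=1)     # a single big step
four_steps = euler_to_data(x0=2.0, x1=-1.0, n_steps=4)   # several smaller steps
midpoint = straight_line_point(2.0, -1.0, 0.5)
```

Because the velocity is constant along a straight line, one Euler step and four Euler steps arrive at the same endpoint, which is the intuition behind the "fewer steps" claim.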

Used in Stable Diffusion 3: SD3 adopts a rectified-flow formulation along these lines.

2. Custom trajectories

Design paths that are optimized for your specific data type or use case.

For images:

Paths that preserve image structure early in generation
Trajectories that handle different frequency components separately
Paths optimized for specific image types (faces, landscapes, etc.)

For text:

Paths that maintain syntactic structure while changing semantics
Trajectories that respect language hierarchies (words $\rightarrow$ sentences $\rightarrow$ paragraphs)

For 3D shapes:

Paths that preserve geometric constraints
Trajectories that respect physical laws (like gravity for fluid simulations)

For audio:

Paths that preserve harmonic structure
Trajectories optimized for different types of sounds (speech, music, etc.)

3. Paths that avoid low-probability regions

This is a sophisticated optimization that’s really powerful:

The problem: In high-dimensional spaces, there are regions where data almost never appears. Traditional diffusion might accidentally pass through these “impossible” regions.

Example with faces:

Low-probability region: Images with eyes in impossible positions, or faces that morph unnaturally
Good path: Stays in regions that look like plausible faces throughout the generation process

Visual analogy: Imagine you’re hiking from point A to point B. You could:

Take a straight line (might go through dangerous cliffs)
Take a curved path that stays on safe, well-traveled trails

How it works:

Instead of: noise $\rightarrow$ weird intermediate states $\rightarrow$ final image
Design: noise $\rightarrow$ always plausible-looking states $\rightarrow$ final image

Benefits:

Better intermediate results: Every step looks reasonable
More stable training: Less likely to get stuck in impossible configurations
Higher quality: Final results are more realistic
Conditional generation: Better control over the generation process

Different base distributions:

We’re not limited to Gaussian noise. We could use:

Uniform distributions
Other structured noise patterns
Even data-dependent base distributions

Broader applications:

This framework works for:

Continuous data (images, audio)
Discrete data (with appropriate metrics)
Structured data (graphs, molecules)
Any domain where you can define smooth interpolation

The Mathematical Framework

The core mathematical object is a vector field $v_t(x)$ that tells you:

At time $t$
At position $x$
Which direction and how fast to move

The flow is generated by solving the ODE:

\[\frac{dx}{dt} = v_t(x)\]

Modern Applications

Conditional flows: Generate samples conditioned on additional information (text, class labels, etc.)

This framework has become the foundation for many state-of-the-art generative models because of its flexibility and mathematical elegance.

4.1 Flows

This section formalizes the mathematical foundation of flows.

What is a Flow?

A flow is a collection of time-indexed vector fields:

\[v = \{ v_t \}_{t \in [0,1]}\]

Think of it as a velocity field that tells particles how to move at each point in space and time.

Physical analogy: Imagine a river with currents. At each location $(x, y)$ and time $t$, the current has a specific velocity and direction. The flow tells you: “If you’re at position $x$ at time $t$, move in direction $v_t(x)$.”

The Flow ODE

Any flow defines how particles move via the differential equation:

\[\frac{dx}{dt} = -v_t(x_t)\]

Starting condition: Begin at $x_1$ at time $t = 1$
Ending condition: End at $x_0$ at time $t = 0$

Note the negative sign: This is because time runs backwards from 1 to 0 (following diffusion convention where $t=0$ is clean data).

RunFlow Function

The $\text{RunFlow}(v, x_1, t)$ function solves the ODE and tells you:

Input: Starting point $x_1$, flow $v$, target time $t$
Output: Where the particle ends up at time $t$

Intuitive meaning: “If I start at $x_1$ and follow the flow $v$, where will I be at time $t$?”
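
To make RunFlow concrete, here is a minimal sketch that approximates it with small Euler steps, using the backward-time convention above ($t$ runs from 1 down to the target time, and $dx/dt = -v_t(x_t)$). The function name `run_flow`, the 1-D setting, and the constant example field are illustrative assumptions, not something from a library.

```python
from typing import Callable

# A flow in 1-D: (x, t) -> v_t(x). One dimension keeps the sketch readable.
Velocity = Callable[[float, float], float]


def run_flow(v: Velocity, x1: float, t_target: float, n_steps: int = 1000) -> float:
    """Approximate RunFlow(v, x1, t_target) with Euler steps from t = 1 down to t_target."""
    x, t = x1, 1.0
    dt = (1.0 - t_target) / n_steps
    for _ in range(n_steps):
        # dx/dt = -v_t(x) and time decreases by dt, so x changes by +v_t(x) * dt.
        x = x + v(x, t) * dt
        t -= dt
    return x


# Hypothetical example: a constant field pointing from x1 = 3 toward x0 = -1.
x1, x0 = 3.0, -1.0
constant_v = lambda x, t: x0 - x1

print(run_flow(constant_v, x1, 0.5))  # position halfway through the journey
print(run_flow(constant_v, x1, 0.0))  # position at the end of the journey
```

With a constant field the Euler steps are exact up to rounding; for a field that changes with $x$ or $t$, the step count controls the approximation error.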

Flows don’t just move individual points—they transport entire distributions:

Individual point: $x_1 \rightarrow \text{RunFlow}(v, x_1, 0) = x_0$
Entire distribution: $p_1 \rightarrow p_0$

The Ultimate Goal

We want to learn a flow $v^*$ such that:

\[q \xrightarrow{v^*} p\]

Where:

$q$: Easy-to-sample base distribution (like Gaussian noise)
$p$: Target distribution (like dog images)
$v^*$: The optimal flow that connects them

Generation Process

Once we have $v^*$, generating samples is simple:

Sample: $x_1 \sim q$ (sample from base distribution)
Transport: $x_0 = \text{RunFlow}(v^*, x_1, 0)$ (follow the flow)
Output: $x_0$ (this is your generated sample)

Connection to DDIM

DDIM is actually a special case of flow matching!

DDIM’s flow: The continuous-time limit of DDIM corresponds to the flow:

\[v_t(x_t) = \frac{1}{2t} E[x_0 - x_t \mid x_t]\]

Components:

Base distribution: Gaussian
DDIM sampling: Discretized method for evaluating RunFlow
DDPM training: Method for learning $v^*$ (but relies on Gaussian structure)

4.2 Pointwise Flows

Core idea: A pointwise flow connects one specific point $x_1$ to one specific point $x_0$.

What it does: Given any path from $x_1$ to $x_0$, the pointwise flow describes the velocity at each point along that path.

Mathematical definition: $v^{[x_1, x_0]}$ is a flow that satisfies the ODE with boundary conditions:

Starts at $x_1$ when $t = 1$
Ends at $x_0$ when $t = 0$

Key insight: Pointwise flows are not unique. You can choose different paths between the same two points: straight line, curved path, any smooth trajectory.

4.3 Marginal Flows

The problem: We have many individual pointwise flows, but we need one unified flow that handles the entire distribution.

The setup:

Pick a coupling $\Pi_{q,p}$ (way to pair noise samples with data samples)
For each pair $(x_1, x_0)$, use pointwise flow $v^{[x_1, x_0]}$
This gives us a “collection of particle trajectories”

The solution: Combine all pointwise flows into one marginal flow $v^*$ using weighted averaging:

\[v^*_t(x_t) = E[ v^{[x_1, x_0]}_t(x_t) \mid x_t ]\]

Intuitive meaning: At any point $x_t$ and time $t$, the marginal flow velocity is the average velocity of all particles that happen to be at $x_t$ at that time.

Why this works:

Individual particles follow their own pointwise flows
The bulk behavior emerges from averaging all individual behaviors
Result: one flow that transports $q \rightarrow p$

Gas analogy: Instead of tracking every individual gas particle, we describe the bulk fluid motion—the average velocity at each location.
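
The averaging can be computed exactly in a toy case. The sketch below assumes a 1-D setup that is not in the notes: base $q = N(0, 1)$, a target $p$ that is uniform on two points $\{-1, +1\}$, independent coupling, and straight-line paths $x_t = t x_1 + (1-t) x_0$. For each candidate $x_0$ there is exactly one $x_1$ whose line passes through $x_t$, so the conditional expectation is a weighted average over two particles.

```python
import math


def gaussian_pdf(x: float) -> float:
    """Density of the standard normal base distribution q."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def marginal_velocity(x_t: float, t: float, targets=(-1.0, 1.0)) -> float:
    """v*_t(x_t) = E[x0 - x1 | x_t] for the toy setup described above (requires t > 0)."""
    weights, velocities = [], []
    for x0 in targets:
        # The only noise sample whose straight line to x0 passes through x_t at time t.
        x1 = (x_t - (1.0 - t) * x0) / t
        # How likely that particle is. p is uniform and the 1/t Jacobian is shared,
        # so both cancel in the normalization below.
        weights.append(gaussian_pdf(x1))
        # That particle's pointwise velocity for a linear path.
        velocities.append(x0 - x1)
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, velocities)) / total


print(marginal_velocity(0.0, 0.5))   # at the midpoint the two particles cancel
print(marginal_velocity(0.5, 0.5))   # to the right of center
print(marginal_velocity(-0.5, 0.5))  # to the left of center
```

At $x_t = 0$ the two particles passing through are equally likely and move in opposite directions, so the bulk velocity vanishes; away from the center, the particle heading to the nearer target dominates the average.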

Remaining challenges:

Which pointwise flow to choose? (straight lines? curves?)
How to compute $v^*$ in practice?

These questions drive the practical algorithms we’ll see next.

4.4 A Simple Choice of Pointwise Flow

The Three Design Choices

To build a flow matching model, we need to choose:

Base distribution $q$: What we sample from initially
1. Gaussian (most common)
2. Uniform
3. Annular (ring-shaped)
Coupling $\Pi_{q,p}$: How we pair base samples with target samples. Independent sampling—just sample from $p$ and $q$ separately and pair them randomly.
Pointwise flow: How we connect each pair

Linear Pointwise Flow

The simplest pointwise flow is straight-line interpolation:

\[v^{[x_1, x_0]}_t(x_t) = x_0 - x_1\]

This gives a constant velocity pointing from $x_1$ to $x_0$.

The resulting trajectory:

\[\text{RunFlow}(v^{[x_1, x_0]}, x_1, t) = t x_1 + (1-t) x_0\]

This is just linear interpolation between the two points!

At different times $t$:

$t = 1$: Position is $x_1$ (base distribution sample)
$t = 0.5$: Position is $0.5 x_1 + 0.5 x_0$ (halfway between)
$t = 0$: Position is $x_0$ (target distribution sample)

Physical interpretation: A particle moves at constant speed from $x_1$ to $x_0$, taking exactly 1 time unit to complete the journey.
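
A quick numerical check of the linear pointwise flow, as a sketch with made-up 1-D endpoints: the trajectory hits the right points at $t = 1, 0.5, 0$, and its time derivative equals $-v$, which is what the flow ODE with the negative sign requires.

```python
def linear_path(x1: float, x0: float, t: float) -> float:
    """Position at time t on the straight line from x1 (t = 1) to x0 (t = 0)."""
    return t * x1 + (1.0 - t) * x0


def linear_velocity(x1: float, x0: float) -> float:
    """Constant pointwise velocity v = x0 - x1 (independent of t and x_t)."""
    return x0 - x1


x1, x0 = 2.0, -4.0  # hypothetical noise and data points
for t in (1.0, 0.5, 0.0):
    print(t, linear_path(x1, x0, t))

# Finite-difference estimate of dx/dt at t = 0.3; it should match -v.
h = 1e-6
dxdt = (linear_path(x1, x0, 0.3 + h) - linear_path(x1, x0, 0.3)) / h
print(dxdt, -linear_velocity(x1, x0))
```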

4.5 Flow Matching

We want to compute the optimal vector field $v^*_t(x_t)$, but naively this requires sampling from $p(x_0 \mid x_t)$—which is exactly the hard problem we’re trying to solve! It’s circular reasoning.

The DDPM Trick Applied to Flow Matching

Just like in DDPM, we can avoid this circular problem by using regression:

Instead of trying to sample from $p(x_0 \mid x_t)$, we:

Sample from the joint distribution $(x_0, x_1)$—this is easy!
Compute $x_t$ deterministically using our chosen flow
Set up a regression problem to learn the expected vector field

The key insight is that:

\[v^*_t(x_t) = E[ v^{[x_1, x_0]}_t(x_t) \mid x_t ]\]

And because the conditional expectation is exactly the function that minimizes mean squared error:

\[v^*_t = \arg\min_f E\left[ \| f(x_t) - v^{[x_1, x_0]}_t(x_t) \|^2 \right]\]

This means we can learn $v^*_t$ by minimizing squared error!

The Training Process

Pseudocode 4: Flow-matching train loss, generic pointwise flow [or linear flow]

\[\text{(See figure below)}\]

Let me walk through each step:

Step 1: $(x_1, x_0) \leftarrow \text{Sample}(\Pi_{q,p})$

Sample a source point $x_1$ from base distribution $q$ (e.g., Gaussian noise)
Sample a target point $x_0$ from data distribution $p$ (e.g., real image)
These form a training pair

Step 2: $t \leftarrow \text{Unif}[0, 1]$

Pick a random time point during the flow

Step 3: $x_t \leftarrow \text{RunFlow}(v^{[x_1, x_0]}, x_1, t)$

Starting from $x_1$, run the pointwise flow for time $t$ to get $x_t$
For linear flows: $x_t = t \cdot x_1 + (1-t) \cdot x_0$

Step 4: $L \leftarrow \| f_\theta(x_t, t) - v^{[x_1, x_0]}_t(x_t) \|^2$

$f_\theta(x_t, t)$: What our neural network predicts the velocity should be
$v^{[x_1, x_0]}_t(x_t)$: What the true velocity should be for this specific flow
For linear flows: $v^{[x_1, x_0]}_t(x_t) = x_0 - x_1$
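
The four steps above can be sketched in plain Python. This is a toy under stated assumptions, not the pseudocode from the figure: the data distribution is a point mass at a hypothetical value `A`, everything is 1-D, and instead of training a network the loss is evaluated for two hand-written candidate fields. For this toy target the exact marginal flow is $(A - x_t)/t$, so it should reach zero loss.

```python
import random

A = 2.0  # hypothetical toy target: every data sample equals A


def sample_pair(rng: random.Random):
    """Step 1: independent coupling of q = N(0, 1) and the point-mass target."""
    x1 = rng.gauss(0.0, 1.0)
    x0 = A
    return x1, x0


def flow_matching_loss(f, rng: random.Random) -> float:
    x1, x0 = sample_pair(rng)          # Step 1: sample a training pair
    # Step 2: random time. The lower bound avoids t = 0 only because the toy
    # optimum below divides by t.
    t = rng.uniform(1e-3, 1.0)
    x_t = t * x1 + (1.0 - t) * x0      # Step 3: RunFlow for the linear flow
    target = x0 - x1                   # linear pointwise velocity
    return (f(x_t, t) - target) ** 2   # Step 4: squared regression error


def average_loss(f, n_samples: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(flow_matching_loss(f, rng) for _ in range(n_samples)) / n_samples


def optimal_field(x_t: float, t: float) -> float:
    """Exact marginal flow for the point-mass target."""
    return (A - x_t) / t


def zero_field(x_t: float, t: float) -> float:
    """A deliberately bad candidate that never moves anything."""
    return 0.0


print(average_loss(optimal_field))  # essentially zero
print(average_loss(zero_field))     # clearly positive
```

In a real model, `f` would be a neural network $f_\theta$ and the averaged loss would be minimized by gradient descent; with a richer target distribution the minimum loss is no longer zero, because many pairs pass through the same $x_t$.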

The Sampling Process

Pseudocode 5: Flow-matching sampling

\[\text{(See figure below)}\]

Step 1: $x_1 \leftarrow \text{Sample}(q)$

Start with a random sample from the base distribution (noise)

Steps 2-4: Iterative integration

For each time step, update: $x_{t-\Delta t} \leftarrow x_t + f_\theta(x_t, t) \Delta t$
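This is Euler integration of the ODE $\frac{dx}{dt} = -f_\theta(x, t)$, run backwards in time from $t = 1$ to $t = 0$ (stepping $t$ down by $\Delta t$ flips the sign, giving the $+ f_\theta \Delta t$ update)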
We’re following the learned vector field from noise to data
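
Here is a minimal sketch of that sampling loop. It assumes a 1-D toy in which the data distribution is a point mass at a hypothetical value `A`, and it uses the exact marginal flow $(A - x)/t$ as a stand-in for a trained network $f_\theta$.

```python
import random

A = 2.0  # hypothetical toy target: every data sample equals A


def toy_field(x: float, t: float) -> float:
    """Stand-in for a trained f_theta: the exact marginal flow for the point-mass target."""
    return (A - x) / t


def sample(f, rng: random.Random, n_steps: int = 100) -> float:
    x = rng.gauss(0.0, 1.0)        # Step 1: x_1 ~ q
    dt = 1.0 / n_steps
    for k in range(n_steps):       # Steps 2-4: Euler integration from t = 1 down to t = 0
        t = 1.0 - k * dt
        x = x + f(x, t) * dt       # x_{t - dt} <- x_t + f(x_t, t) * dt
    return x


rng = random.Random(0)
samples = [sample(toy_field, rng) for _ in range(5)]
print(samples)  # every sample lands on A, up to rounding
```

The field is only ever evaluated at $t \geq \Delta t$, so the division by $t$ in this toy field never hits zero.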

The Beautiful Simplicity: This framework is elegant because:

No complex probability calculations—just regression
Flexible path design—choose any pointwise flow you want
Efficient sampling—straightforward ODE integration
Scalable training—standard neural network optimization

The key insight is that by breaking the problem into pointwise flows and then learning their average, we can solve generative modeling using simple, well-understood techniques.

https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html: helpful visualizations of flows, and uses notation more consistent with the current literature.

5 Diffusion in Practice

Samplers in Practice

The Speed Problem

DDPM and DDIM samplers are essentially the “Model T” of diffusion sampling. Each sampling step requires an expensive neural network forward pass, and even today’s best samplers need around 10 steps minimum.
This is a massive bottleneck. Imagine waiting 10+ seconds for a single image generation when users expect near-instantaneous results.

The SDE/ODE Connection Unlocks Better Samplers

Since DDPM and DDIM are discretizations of the reverse SDE and Probability Flow ODE respectively, we can leverage decades of numerical methods research.

Any ODE/SDE solver becomes a potential diffusion sampler:

Euler methods
Heun’s method
Runge-Kutta variants
Custom solvers designed for diffusion’s specific structure

This perspective transformed sampler development from ad-hoc tweaking to principled numerical analysis.
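
To see why the choice of solver matters, here is a small comparison of Euler and Heun steps on a generic test ODE, $dx/dt = x$ with $x(0) = 1$, whose exact value at $t = 1$ is $e$. This is an illustrative stand-in, not a diffusion model: the point is that Heun spends two function evaluations per step (two network passes in a real sampler) and still wins at an equal evaluation budget.

```python
import math


def euler_step(f, x, t, dt):
    """First-order step: one evaluation of f."""
    return x + dt * f(x, t)


def heun_step(f, x, t, dt):
    """Second-order step: average the slope at the start and at the Euler-predicted end."""
    k1 = f(x, t)
    k2 = f(x + dt * k1, t + dt)
    return x + 0.5 * dt * (k1 + k2)


def integrate(step, f, x, t0, t1, n_steps):
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = step(f, x, t, dt)
        t += dt
    return x


growth = lambda x, t: x  # test ODE dx/dt = x

# Equal budget of 20 function evaluations: 20 Euler steps vs 10 Heun steps.
euler_error = abs(integrate(euler_step, growth, 1.0, 0.0, 1.0, 20) - math.e)
heun_error = abs(integrate(heun_step, growth, 1.0, 0.0, 1.0, 10) - math.e)
print(euler_error, heun_error)
```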

The Distillation Revolution

Distillation methods train student models to match multi-step diffusion teachers in just one step (or a handful of steps). Two prominent families:

Consistency Models
Adversarial Distillation

⚠️ Important caveat: These distilled models aren’t technically diffusion models anymore—they’re neural networks trained to mimic diffusion output, but they’ve abandoned the iterative denoising process entirely.

Noise Schedules

Why Schedules Matter

The noise schedule ($\sigma_t$) determines how much noise gets added at each timestep. This seemingly simple choice has profound implications for training stability, sample quality, and convergence speed.

Variance Exploding vs. Variance Preserving

Simple diffusion has $p(x_t \mid x_0) = N(x_0, \sigma_t^2)$ with $\sigma_t \propto \sqrt{t}$, so the variance grows without bound as $t$ increases. This is one of two major paradigms:

Variance Exploding (VE): Noise variance grows unboundedly
Variance Preserving (VP): Noise variance stays controlled

The Ho et al. Schedule (Still Industry Standard)

The most popular schedule comes from the original DDPM paper:

\[x_t = \sqrt{1 - \beta(t)} \cdot x_{t-1} + \sqrt{\beta(t)} \cdot \varepsilon_t\]

Where $\beta(t)$ is carefully chosen so that:

$t = 0$: Nearly clean data
$t = 1$: Pure noise
Variance remains bounded throughout
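
A small sketch of what "carefully chosen" means in practice. It assumes a discrete linear $\beta$ schedule with 1000 steps from $10^{-4}$ to $0.02$, the values commonly quoted for the original DDPM setup; treat them as illustrative. Unrolling the recursion gives $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \varepsilon$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$, so the signal and noise variances always sum to 1 for unit-variance data.

```python
def linear_betas(n_steps: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linearly spaced per-step noise variances (illustrative default values)."""
    return [beta_start + (beta_end - beta_start) * i / (n_steps - 1) for i in range(n_steps)]


def cumulative_signal(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: the share of signal variance left."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


alpha_bar = cumulative_signal(linear_betas())
print(alpha_bar[0])    # first step: almost all signal survives
print(alpha_bar[-1])   # last step: almost no signal survives

# For unit-variance data, Var(x_t) = alpha_bar_t + (1 - alpha_bar_t) = 1 at every step.
variances = [a + (1.0 - a) for a in alpha_bar]
print(min(variances), max(variances))
```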

The Karras Reparameterization

Karras et al. [2022] introduced a more intuitive way to think about schedules using:

Overall scaling: $s(t)$
Noise level: $\sigma(t)$ (the standard deviation of the added noise)

Their suggested schedule: $s(t) = 1, \sigma(t) = t$

This framework makes it much easier to reason about and experiment with different noise schedules.

The SDE Framework: Maximum Flexibility

The general SDE formulation gives us incredible flexibility:

\[dx_t = f(x_t, t)dt + g(t)dw_t\]

Examples of what this enables:

Our simple diffusion: $f = 0$, $g = \sigma_q$
Ho et al. schedule: $f = -\frac{1}{2}\beta(t) x_t$, $g = \sqrt{\beta(t)}$
Karras schedule: $f = 0$, $g = \sqrt{2t}$
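
For the two drift-free examples ($f = 0$), the link between $g$ and the noise level can be checked numerically: the noise variance accumulated by $dx_t = g(t)\,dw_t$ up to time $T$ is $\int_0^T g(t)^2\,dt$. The sketch below integrates this with a midpoint rule; the value of `sigma_q` is a made-up example.

```python
import math


def noise_variance(g, T: float, n: int = 1000) -> float:
    """Midpoint-rule estimate of the integral of g(t)^2 over [0, T] (drift-free SDE only)."""
    dt = T / n
    return sum(g((i + 0.5) * dt) ** 2 for i in range(n)) * dt


sigma_q = 1.5  # hypothetical noise scale for the simple diffusion
simple_g = lambda t: sigma_q
karras_g = lambda t: math.sqrt(2.0 * t)

# Simple diffusion: variance sigma_q^2 * T, i.e. sigma_t = sigma_q * sqrt(T).
print(math.sqrt(noise_variance(simple_g, 2.0)), sigma_q * math.sqrt(2.0))

# Karras: variance T^2, i.e. sigma(T) = T, matching the schedule sigma(t) = t.
print(math.sqrt(noise_variance(karras_g, 3.0)), 3.0)
```

The Ho et al. case has a nonzero drift that shrinks $x_t$ over time, so its noise level cannot be read off from $g$ alone in this way.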

Likelihood Interpretations and VAEs

Diffusion as Hierarchical VAE

Here’s a perspective that fundamentally changed how we think about diffusion models: they’re actually a special case of deep hierarchical VAEs. This isn’t just theoretical elegance—it has profound practical implications.

The key insight: Each diffusion timestep corresponds to one “layer” of a VAE decoder, with the forward diffusion process acting as a fixed (non-learned) encoder that produces the sequence of noisy latents $\{x_t\}$.

Why This Perspective Revolutionized Training

Traditional deep VAEs suffer from notorious training instability because gradients must flow through all layers. Diffusion’s Markovian structure breaks this dependency—each layer can be trained in isolation without forward/backward passing through previous layers.

This is why diffusion models train so much more stably than traditional deep generative models.

The Likelihood Advantage

The VAE interpretation gives us something incredibly valuable: actual likelihood estimates via the Evidence Lower Bound (ELBO). This means we can train diffusion models with principled maximum-likelihood objectives.

Plot twist: The ELBO for diffusion VAEs reduces to exactly the L2 regression loss we’ve been using, but with specific time-weighting that treats regression errors differently at different timesteps.

⚠️ The practical dilemma: The “principled” VAE-derived time-weighting doesn’t always produce the best samples. Ho et al. [2020] famously just dropped the time-weighting and uniformly weighted all timesteps—sometimes theory and practice diverge!

Parametrization: The $x_0$ / $\varepsilon$ / $v$-Prediction Wars

What Should Your Network Actually Predict?

This is one of the most important practical decisions you’ll make, and it’s not obvious. You have four main options:

1. Direct Prediction (What We’ve Been Doing)

\[\min \| f_\theta(x_t, t) - x_{t-\Delta t} \|^2\]

Network predicts the partially-denoised data.

2. $x_0$-Prediction

\[\min \| f_\theta(x_t, t) - x_0 \|^2\]

Network predicts the fully-denoised original data. This is nearly equivalent to direct prediction, differing only by a time-weighting factor of $1/t$.

3. $\varepsilon$-Prediction

\[\min \| f_\theta(x_t, t) - \varepsilon_t \|^2\]

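Network predicts the noise that was added. Because the network only sees $x_t$, the best it can do is the conditional average $(1/\sigma_t) E[x_0 - x_t \mid x_t]$: the rescaled offset between the clean and noisy points, up to the sign convention chosen for $\varepsilon$.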

4. $v$-Prediction
Network predicts $v = \alpha_t \varepsilon - \sigma_t x_0$—essentially predicting data at high noise levels and noise at low noise levels.
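
The parametrizations are interchangeable once you know $x_t$ and the schedule. The sketch below assumes a variance-preserving setup that the notes do not spell out, $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with $\alpha_t^2 + \sigma_t^2 = 1$, and shows that a $v$-prediction can be converted back into an $x_0$-prediction or an $\varepsilon$-prediction. The function names are made up for illustration.

```python
import math


def to_v(x0: float, eps: float, alpha: float, sigma: float) -> float:
    """v-target: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0


def x0_from_v(x_t: float, v: float, alpha: float, sigma: float) -> float:
    """Recover x0 from (x_t, v); relies on alpha^2 + sigma^2 = 1."""
    return alpha * x_t - sigma * v


def eps_from_v(x_t: float, v: float, alpha: float, sigma: float) -> float:
    """Recover eps from (x_t, v); relies on alpha^2 + sigma^2 = 1."""
    return sigma * x_t + alpha * v


# Hypothetical values: any angle gives a valid (alpha, sigma) pair on the unit circle.
alpha, sigma = math.cos(0.7), math.sin(0.7)
x0, eps = 1.25, -0.4
x_t = alpha * x0 + sigma * eps
v = to_v(x0, eps, alpha, sigma)

print(x0_from_v(x_t, v, alpha, sigma), x0)
print(eps_from_v(x_t, v, alpha, sigma), eps)
```

When $\sigma_t$ is near 1 (high noise), $v \approx -x_0$, and when $\sigma_t$ is near 0 (low noise), $v \approx \varepsilon$, which is the "data at high noise, noise at low noise" behavior described above.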

Why This Choice Matters Enormously

Mathematically, these are equivalent—they differ only by time-weightings. In practice, they behave very differently because:

Learning is imperfect—certain objectives may be more robust to errors
Different parametrizations have different failure modes
Some combinations are fundamentally problematic

Example failure case: $x_0$-prediction with schedules that heavily weight low noise levels often fails because the identity function achieves low loss but produces terrible samples.

The Error Landscape: What Actually Goes Wrong

Training-Time Errors

These are standard statistical learning errors in approximating the population-optimal regression function:

Approximation error: Your network architecture isn’t expressive enough
Estimation error: You don’t have enough training data
Optimization error: Your training procedure doesn’t find the global optimum

Sampling-Time Errors

These are discretization errors from using finite step-sizes $\Delta t$:

For DDPM: Error in the Gaussian approximation of the reverse process
For DDIM/Flow Matching: Error in simulating continuous-time flows discretely

The Interaction Problem

Here’s what makes this challenging: these errors interact and compound in complex, poorly understood ways. We don’t fully understand how regression errors translate into distributional errors of the final generative model.

Surprising twist: These “errors” can actually be beneficial on small datasets, acting as regularization that prevents the model from just memorizing training samples.

Key Practical Takeaways

VAE Perspective Guides Training Strategy: Understanding diffusion as hierarchical VAE explains why they train so stably and provides principled likelihood-based objectives (even if you sometimes ignore the principled weighting).
Parametrization Choice Is Critical: The $x_0$/$\varepsilon$/$v$-prediction choice significantly impacts training dynamics and sample quality. There’s no universal best choice—it depends on your specific use case and schedule.
Error Sources Are Inevitable But Manageable: Both training-time and sampling-time errors are unavoidable, but understanding their sources helps you make informed trade-offs between speed, quality, and robustness.
Theory vs. Practice Tension: The “principled” choices from theory don’t always win in practice. Be prepared to empirically validate theoretical insights rather than blindly following them.

Structure, Layout and Markdown for maintaining this self-notes website

2025-07-04T00:00:00+00:00

Markdown

Introduction
Basic Markdown Elements
Extended Markdown Features
Advanced Formatting
Quick Reference

Introduction

Markdown is a lightweight markup language that transforms plain text into beautifully formatted documents. This guide covers everything from basic syntax to advanced features.

Note: This guide follows the Markdown Crash Course methodology with enhanced formatting and organization.

Basic Markdown Elements

Headings: Creating Document Structure

Markdown provides six levels of headings, each serving a specific purpose in document hierarchy:

# Primary Title (H1)
## Section Headers (H2)
### Subsection Headers (H3)
#### Minor Headers (H4)
##### Small Headers (H5)
###### Smallest Headers (H6)

Output:

Primary Title (H1)

Section Headers (H2)

Subsection Headers (H3)

Minor Headers (H4)

Small Headers (H5)

Smallest Headers (H6)

Paragraphs and Line Breaks

Understanding paragraph formatting is crucial for readable content:

This is a standard paragraph. Text flows naturally within paragraph boundaries.

A blank line separates paragraphs, creating distinct content blocks.

For line breaks within paragraphs,  
add two spaces at the end of a line  
to create soft breaks without paragraph separation.

Output:

This is a standard paragraph. Text flows naturally within paragraph boundaries.

A blank line separates paragraphs, creating distinct content blocks.

For line breaks within paragraphs,
add two spaces at the end of a line
to create soft breaks without paragraph separation.

Extended Markdown Features

Text Styling and Emphasis

Create visual hierarchy and emphasis with various text formatting options:

*Italic text* or _italic text_
**Bold text** or __bold text__
***Bold and italic*** or ___bold and italic___
~~Strikethrough text~~
Highlighted text
Regular text with superscript and subscript

Output:

Italic text or italic text
Bold text or bold text
Bold and italic or bold and italic
~~Strikethrough text~~
Highlighted text
Regular text with ^superscript and _subscript

Code Display

Inline Code

Use backticks for inline code within sentences:

Use the `console.log()` function for debugging JavaScript applications.

Output: Use the console.log() function for debugging JavaScript applications.

Code Blocks

To display a larger block of code you can wrap your code in three ` characters.

You can also specify the language of your code block by adding the language name after the three ` characters.

// JavaScript example with syntax highlighting
function greetUser(name) {
    return `Hello, ${name}! Welcome to Markdown.`;
}

const message = greetUser("Developer");
console.log(message);

# Python example
def calculate_area(radius):
    """Calculate the area of a circle."""
    import math
    return math.pi * radius ** 2

area = calculate_area(5)
print(f"Circle area: {area:.2f}")

Advanced Formatting

Create various types of links for enhanced navigation:

[External link](https://blog.webdevsimplified.com)
[Relative link](/2023-06/markdown-crash-course)
[Reference link][1]
<https://direct-url-display.com>

[1]: https://example.com "Reference link tooltip"

Output:

External link
Relative link
Reference link
https://direct-url-display.com

Images and Media

![Descriptive alt text](/assets/images/google.png "The Google Logo")

Blockquotes and Citations

Create elegant quotations and nested content:

> "The best way to predict the future is to create it."
> — Peter Drucker

> Primary quotation with important information
>> Nested quotation for additional context
>>> Deep nesting for complex hierarchies

Output:

“The best way to predict the future is to create it.”
— Peter Drucker

Primary quotation with important information

Nested quotation for additional context

Deep nesting for complex hierarchies

Lists and Organization

Unordered Lists

- **Primary item** with emphasis
- Secondary item
  - Nested sub-item
  - Another sub-item
    - Deep nesting example
- Final primary item

Output:

Primary item with emphasis
Secondary item
- Nested sub-item
- Another sub-item
  - Deep nesting example
Final primary item

Ordered Lists

1. **First step** (numbers auto-increment)
1. Second step with detailed explanation
   1. Sub-step A
   1. Sub-step B
1. Final step

Output:

First step (numbers auto-increment)
Second step with detailed explanation
1. Sub-step A
2. Sub-step B
Final step

Task Lists

- [x] ✅ Completed task
- [x] ✅ Another finished item
- [ ] ⏳ Pending task
- [ ] ⏳ Future task

Output:

✅ Completed task
✅ Another finished item
⏳ Pending task
⏳ Future task

Tables and Data Presentation

Below the first row you need to add a row where each column consists of at least three -s and optionally a : character on either side of the -s.
- The : character is used to align the text in the column.
- If you add a : character on the left side of the -s then the text will be left aligned.
- If you add a : character on the right side of the -s then the text will be right aligned.
- If you add a : character on both sides of the -s then the text will be center aligned.
Finally, you can continue to add rows to your table with the same format as your first row.

| Feature | Description | Status |
|:--------|:------------|-------:|
| **Basic Syntax** | Core Markdown elements | ✅ Complete |
| **Extended Features** | GitHub Flavored Markdown | ✅ Complete |
| **Advanced Topics** | Complex formatting | 🔄 In Progress |
| **Best Practices** | Professional guidelines | ⏳ Planned |

Output:

Feature	Description	Status
Basic Syntax	Core Markdown elements	✅ Complete
Extended Features	GitHub Flavored Markdown	✅ Complete
Advanced Topics	Complex formatting	🔄 In Progress
Best Practices	Professional guidelines	⏳ Planned

Horizontal Rules and Separators

Create visual breaks in your content:

Content above separator

---

Content between separators

***

Content below separator

Output:

Content above separator

Content between separators

Content below separator

Quick Reference

Essential Syntax

# Headers              → # H1, ## H2, ### H3
*Emphasis*             → *italic*, **bold**, ***both***
`Code`                 → `inline` or ```block```
[Links](url)           → [text](url)
![Images](url)         → ![alt](url)
> Blockquotes          → > text
- Lists                → - item or 1. item
| Tables |             → | col1 | col2 |
---                    → Horizontal rule

GitHub Flavored Markdown

~~Strikethrough~~      → ~~text~~
- [ ] Tasks            → - [ ] todo, - [x] done

Conclusion

Mastering Markdown enables you to create professional, readable documentation with minimal effort. This guide provides the foundation for beautiful content creation across platforms like GitHub, documentation sites, and blogs.

Happy writing! 📝✨

Last updated: July 4, 2025
Version: 1.0

Google Gemini updates: Flash 1.5, Gemma 2 and Project Astra

2024-05-14T00:00:00+00:00

May 14, 2024

We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.

In December, we launched our first natively multimodal model Gemini 1.0 in three sizes: Ultra, Pro and Nano. Just a few months later we released 1.5 Pro, with enhanced performance and a breakthrough long context window of 1 million tokens.

Developers and enterprise customers have been putting 1.5 Pro to use in incredible ways and finding its long context window, multimodal reasoning capabilities and impressive overall performance incredibly useful.

We know from user feedback that some applications need lower latency and a lower cost to serve. This inspired us to keep innovating, so today, we’re introducing Gemini 1.5 Flash: a model that’s lighter-weight than 1.5 Pro, and designed to be fast and efficient to serve at scale.

Both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. And now, 1.5 Pro is also available with a 2 million token context window via waitlist to developers using the API and to Google Cloud customers.

We’re also introducing updates across the Gemini family of models, announcing our next generation of open models, Gemma 2, and sharing progress on the future of AI assistants, with Project Astra.

Figure: Context lengths of leading foundation models compared with Gemini 1.5’s 2 million token capability

1.5 Flash is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It’s optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve and features our breakthrough long context window.

While it’s a lighter weight model than 1.5 Pro, it’s highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size.

Figure: The new Gemini 1.5 Flash model is optimized for speed and efficiency, is highly capable of multimodal reasoning and features our breakthrough long context window.

1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more. This is because it’s been trained by 1.5 Pro through a process called “distillation,” where the most essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.

Read more about 1.5 Flash in our updated Gemini 1.5 technical report, on the Gemini technology page, and learn about 1.5 Flash’s availability and pricing.

Over the last few months, we’ve significantly improved 1.5 Pro, our best model for general performance across a wide range of tasks.

Beyond extending its context window to 2 million tokens, we’ve enhanced its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances. We see strong improvements on public and internal benchmarks for each of these tasks.

1.5 Pro can now follow increasingly complex and nuanced instructions, including ones that specify product-level behavior involving role, format and style. We’ve improved control over the model’s responses for specific use cases, like crafting the persona and response style of a chat agent or automating workflows through multiple function calls. And we’ve enabled users to steer model behavior by setting system instructions.

We added audio understanding in the Gemini API and Google AI Studio, so 1.5 Pro can now reason across image and audio for videos uploaded in Google AI Studio. And we’re now integrating 1.5 Pro into Google products, including Gemini Advanced and in Workspace apps.

Read more about 1.5 Pro in our updated Gemini 1.5 technical report and on the Gemini technology page.

Gemini Nano is expanding beyond text-only inputs to include images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do — not just through text, but also through sight, sound and spoken language.

Read more about Gemini 1.0 Nano on Android.

Today, we’re also sharing a series of updates to Gemma, our family of open models built from the same research and technology used to create the Gemini models.

We’re announcing Gemma 2, our next generation of open models for responsible AI innovation. Gemma 2 has a new architecture designed for breakthrough performance and efficiency, and will be available in new sizes.

The Gemma family is also expanding with PaliGemma, our first vision-language model inspired by PaLI-3. And we’ve upgraded our Responsible Generative AI Toolkit with LLM Comparator for evaluating the quality of model responses.

Read more on the Developer blog.

As part of Google DeepMind’s mission to build AI responsibly to benefit humanity, we’ve always wanted to develop universal AI agents that can be helpful in everyday life. That’s why today, we’re sharing our progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).

To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do — and take in and remember what it sees and hears to understand context and take action. It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay.

While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge. Over the past few years, we’ve been working to improve how our models perceive, reason and converse to make the pace and quality of interaction feel more natural.

Building on Gemini, we’ve developed prototype agents that can process information faster by continuously encoding video frames, combining the video and speech input into a timeline of events, and caching this information for efficient recall.

By leveraging our leading speech models, we also enhanced how they sound, giving the agents a wider range of intonations. These agents can better understand the context they’re being used in, and respond quickly, in conversation.

With technology like this, it’s easy to envision a future where people could have an expert AI assistant by their side, through a phone or glasses. And some of these capabilities are coming to Google products, like the Gemini app and web experience, later this year.

We’ve made incredible progress so far with our family of Gemini models, and we’re always striving to advance the state-of-the-art even further. By investing in a relentless production line of innovation, we’re able to explore new ideas at the frontier, while also unlocking the possibility of new and exciting Gemini use cases.

Learn more about Gemini and its capabilities.


Displaying External Posts on Your al-folio Blog

2022-04-23T23:20:09+00:00

External Posts on Your al-folio Blog

If you prefer publishing blog posts on medium.com or other external sources, starting with version v0.5.0, al-folio lets you display your external posts in the blog feed of your website! 🎉🎉

Configuring external sources is super simple. After upgrading to v0.5.0, just add the following section to your _config.yml:

external_sources:
  - name: medium.com  # name of the source (arbitrary string)
    rss_url: https://medium.com/@/feed

The example above adds your medium.com blog post feed as an external source. But you can add arbitrary RSS feeds as sources.

Any questions or suggestions? 👉 Start a discussion on GitHub!

Option	Rationale
Simplified (\(w_t\) = 1)	Easiest to implement, works well for many tasks.
Sigma‑scaled (\(w_t\) = \(σ_t^2\))	Weights each timestep by its noise variance, so noisier timesteps (larger \(σ_t\)) count more than cleaner ones.

Schedule	Shape	Typical use‑case
Linear	\(σ_t\) grows linearly with t	Classic DDPM baseline.
Cosine	Slower rise near t = 0, faster near t = 1	Often yields crisper samples with fewer steps.

Dial	Symbol	Effect
Stochasticity	\(γ ∈ \{0, 1\}\)	\(γ = 0\) gives deterministic DDIM‑like steps; \(γ = 1\) keeps full randomness.
Denoising step schedule	\(\{t_0, …, t_n\}\)	Any partition of [0, 1] works, e.g. 50, 100, or 1,000 steps.
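
The loss-weight options and the step schedule in the tables above can be sketched in a few lines of Python. This is only a minimal illustration, not the Noise2Music implementation: the function names (`loss_weight`, `weighted_noise_loss`, `step_schedule`) are hypothetical, the loss is assumed to be a mean squared error on the predicted noise, and the schedule is assumed to be a uniform partition of [0, 1] visited from t = 1 down to t = 0.

```python
from typing import List, Sequence


def loss_weight(sigma_t: float, mode: str = "simplified") -> float:
    """Return w_t for the two options in the table (hypothetical helper)."""
    if mode == "simplified":
        return 1.0  # w_t = 1
    if mode == "sigma":
        return sigma_t ** 2  # w_t = sigma_t^2
    raise ValueError(f"unknown weighting mode: {mode}")


def weighted_noise_loss(
    eps_pred: Sequence[float],
    eps_true: Sequence[float],
    sigma_t: float,
    mode: str = "simplified",
) -> float:
    """w_t times the mean squared error between predicted and true noise.

    Assumption: the model is trained to predict the noise added at time t.
    """
    if len(eps_pred) != len(eps_true) or len(eps_true) == 0:
        raise ValueError("eps_pred and eps_true must be non-empty and equal length")
    mse = sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true)) / len(eps_true)
    return loss_weight(sigma_t, mode) * mse


def step_schedule(n_steps: int) -> List[float]:
    """Uniform partition of [0, 1], ordered from t = 1 down to t = 0.

    Assumption: t = 1 is the noisiest time, so sampling walks the list in order.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return [1.0 - i / n_steps for i in range(n_steps + 1)]
```

For instance, `step_schedule(50)` yields 51 time points (50 denoising steps), and switching `mode` from `"simplified"` to `"sigma"` only rescales the same squared error by \(σ_t^2\).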

Mustango: Toward Controllable Text-to-Music Generation

Table of Contents

Fundamentals of Music

What is Music Made Of?

1. Beats and Downbeats - The Musical Heartbeat

Beat Types Explained

Genre Emphasis

Output of Beat Timings:

2. Chords - Musical Building Blocks

Basic Chord Types

3. Keys - The Musical Home Base

4. Tempo - The Speed of Music

How All Four Features Work Together

Why This Combination is Powerful

MusicBench

Feature Extraction

1. Beat and Downbeat Extraction

2. Tempo Extraction

3. Chord Extraction

4. Key Extraction

Description Enrichment

Augmentation and Music Diversification

Why Standard Audio Augmentation Fails for Music?

The Three-Dimensional Augmentation Strategy

More on Text Descriptions

MusicBench

Dataset Splitting

Data Filtering (Quality Control)

Audio Augmentation (Total ~37k samples):

Final Training Set Construction

Test Set Details

Mustango

Latent Diffusion Model (LDM)

Objective

MuNet

Reverse Diffusion Process

Training Details

Inference

1. Beat Predictor — Using DeBERTa Large

2. Chord Predictor — Using FLAN-T5 Large

How Good Are the Beat & Chord Predictors?

Final Output

Classifier-Free Guidance at Inference

Experiments

Models Compared

Training & Evaluation Dataset

Inference Setup

Objective Evaluation

How Did They Measure Quality?

Objective Results

Subjective Evaluation

Subjective Results

More

Noise2Music: Text-conditioned Music Generation with Diffusion Models

1 Summary

2 Related Works

How models ingest “what I want”

3 Methods

3.1 Diffusion models in a nutshell

Choosing the loss weight \(w_t\):

Noise‑schedule variants

Sampling: knobs you can turn

Classifier‑free guidance (CFG)

3.2 Model Architecture — “Efficient U‑Net 1‑D”

3.3 Cascaded Diffusion: three‑stage pipeline

3.3.1 Waveform Model

3.3.2 Spectrogram Model

3.3.3 SUPER-RESOLUTION CASCADER

3.4 Text Understanding

3.5 Pseudo‑Labeling for Music Data [DATA Creation]

3.5.1 Why pseudo‑labels are needed

3.5.2 Models Used

3.5.3 Building three caption vocabularies

3.5.4 Assigning captions to an unlabeled clip

3.5.5 Warm‑up experiment: MuLaMCap

3.6 Training‑Data Mining at Scale [DATA]

4 Experiments and Results

4.1 Model training details

Loss Weighting

3.2 Model Architecture — “Efficient U‑Net 1‑D”

3.3 Cascaded Diffusion: three‑stage pipeline

3.6 Training‑Data Mining at Scale [DATA]