<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://aayush9753.in/feed.xml" rel="self" type="application/atom+xml" /><link href="https://aayush9753.in/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-06-20T13:39:54+00:00</updated><id>https://aayush9753.in/feed.xml</id><title type="html">blank</title><subtitle>Aayush Sharma — machine learning researcher (speech &amp; audio, generative models). Public research notebook and academic profile.
</subtitle><entry><title type="html">Mustango: Toward Controllable Text-to-Music Generation</title><link href="https://aayush9753.in/blog/2025/mustango-toward-controllable-text-to-music-generation/" rel="alternate" type="text/html" title="Mustango: Toward Controllable Text-to-Music Generation" /><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/mustango-toward-controllable-text-to-music-generation</id><content type="html" xml:base="https://aayush9753.in/blog/2025/mustango-toward-controllable-text-to-music-generation/"><![CDATA[<blockquote>
  <p>Melechovsky, Jan, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. “Mustango: Toward Controllable Text-to-Music Generation.” arXiv:2311.08355. Preprint, arXiv, June 3, 2024. https://doi.org/10.48550/arXiv.2311.08355.</p>
</blockquote>

<p><a href="https://github.com/AMAAI-Lab/mustango">Code</a></p>

<h2 id="table-of-contents">Table of Contents</h2>
<ul>
  <li><a href="#fundamentals-of-music">Fundamentals of Music</a></li>
  <li><a href="#musicbench">MusicBench</a>
    <ul>
      <li><a href="#feature-extraction">Feature Extraction</a></li>
      <li><a href="#description-enrichment">Description Enrichment</a></li>
      <li><a href="#augmentation-and-music-diversification">Augmentation and Music Diversification</a></li>
    </ul>
  </li>
  <li><a href="#mustango">Mustango</a>
    <ul>
      <li><a href="#latent-diffusion-model-ldm">Latent Diffusion Model (LDM)</a></li>
      <li><a href="#munet">MuNet</a></li>
      <li><a href="#reverse-diffusion-process">Reverse Diffusion Process</a></li>
      <li><a href="#training-details">Training Details</a></li>
    </ul>
  </li>
  <li><a href="#inference">Inference</a></li>
  <li><a href="#experiments">Experiments</a></li>
</ul>

<p><strong>Overview:</strong></p>

<p>Mustango is a diffusion-based text-to-music generation system that goes beyond general text conditioning. It introduces fine-grained control over <strong>musical attributes</strong> such as <strong>chords, beats, tempo,</strong> and <strong>key</strong>, enabling more structured and musically meaningful audio synthesis from natural language prompts.</p>

<p><strong>MuNet: Music-Aware UNet Denoiser</strong></p>

<p>At the core of Mustango is <strong>MuNet</strong>, a music-domain-informed UNet module that guides the reverse diffusion process. It integrates:</p>

<ul>
  <li>General <strong>text embeddings</strong> (from FLAN-T5), and</li>
  <li>
    <p>Predicted <strong>musical features</strong> (chords, beats, etc.) via hierarchical cross-attention layers.</p>

    <p>This enables Mustango to generate music that faithfully follows the structural elements described in the input text.</p>
  </li>
</ul>

<p><strong>MusicBench: A Richly Augmented Dataset</strong></p>

<p>To train controllable models, the authors introduce <strong>MusicBench</strong>, a dataset built on top of MusicCaps with:</p>

<ul>
  <li>Over <strong>52,000</strong> samples</li>
  <li>Text captions enhanced with <strong>automatically extracted</strong> and <strong>paraphrased</strong> descriptions of chords, tempo, key, and rhythm</li>
  <li><strong>Audio augmentations</strong> including pitch shifts, tempo changes, and volume variations</li>
</ul>

<p><strong>Challenges in Diffusion-based Music Generation</strong></p>

<ul>
  <li><strong>Musical structure enforcement</strong>: Music must obey formal rules like key signatures and chord progressions, which are difficult to evaluate and condition on.</li>
  <li><strong>Data scarcity</strong>: High-quality paired text-music datasets are limited and often lack rich musical annotations.</li>
  <li><strong>Representational depth</strong>: Most captions lack structural and harmonic detail. Mustango tackles this by learning to infer and control these aspects from text.</li>
</ul>

<h1 id="fundamentals-of-music">Fundamentals of Music</h1>
<div style="border: 2px solid #ccc; border-radius: 10px; padding: 20px; background-color: #f9f9f9; font-family: sans-serif; line-height: 1.6">

<p>To understand musical features, we will use <strong>"We Will Rock You"</strong> by Queen as our example, assuming we have no prior music knowledge.</p>

<h2>What is Music Made Of?</h2>
<p>Think of music like a recipe with four main ingredients:</p>
<ul>
  <li><strong>Beats</strong> = The steady pulse (like your heartbeat)</li>
  <li><strong>Chords</strong> = Multiple notes played together (like harmony)</li>
  <li><strong>Key</strong> = The musical "home base"</li>
  <li><strong>Tempo</strong> = How fast or slow the music moves</li>
</ul>

<h2>1. Beats and Downbeats - The Musical Heartbeat</h2>
<p>A <strong>beat</strong> is like the steady tick of a clock in music. It's the pulse you naturally tap your foot to.</p>
<p>The most common time signature is <strong>4/4</strong>, which means:</p>
<ul>
  <li>4 beats per measure</li>
  <li>Each beat gets a <strong>quarter note</strong> duration</li>
</ul>

<blockquote>
  A <strong>measure</strong> is a recurring pattern of beats that creates the rhythmic structure of music. Think of it as a rhythmic "sentence" that repeats throughout the song.
</blockquote>

<pre>Beat Type 1 | Beat Type 2 | Beat Type 3 | Beat Type 4
     ↓           ↓           ↓           ↓
   STRONG       weak      medium       weak
  (downbeat)
</pre>

<pre>Beat:     1    2    3    4  |  1    2    3    4
Pattern: STOMP STOMP CLAP -- | STOMP STOMP CLAP --
Sound:   "WE"  "WILL" "ROCK" | "YOU"  (rest) (clap)
Type:     1     2     3    4 |   1     2     3    4
</pre>

<pre>STOMP - STOMP - CLAP - (silence)
  1   -   2   -   3   -    4
</pre>

<h3>Beat Types Explained</h3>
<ul>
  <li><strong>Beat 1 (Downbeat)</strong>: Strongest, first stomp, marching step</li>
  <li><strong>Beat 2</strong>: Weaker, second stomp</li>
  <li><strong>Beat 3</strong>: Medium strength, the clap</li>
  <li><strong>Beat 4</strong>: Weakest, often silence</li>
</ul>

<pre>
Measure 1: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)
Measure 2: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)
Measure 3: STOMP(1) - STOMP(2) - CLAP(3) - silence(4)
</pre>

<h4>Genre Emphasis</h4>
<ul>
  <li><strong>Rock/Pop</strong>: Emphasis on beats 2 and 4</li>
  <li><strong>Classical/Folk</strong>: Emphasis on beats 1 and 3</li>
  <li><strong>Reggae</strong>: Off-beat emphasis (skank)</li>
</ul>

<h4>Output of Beat Timings:</h4>
<pre>Beat Type | Time (seconds)
    1     |     0.0
    2     |     0.5
    3     |     1.0
    4     |     1.5
    1     |     2.0
    2     |     2.5
</pre>
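
<p>The table above matches a steady 120 BPM in 4/4 time. As a minimal sketch, assuming a perfectly constant tempo, the same rows can be generated as follows (<code class="language-plaintext highlighter-rouge">beat_table</code> is a hypothetical helper name, not part of any library):</p>

```python
def beat_table(bpm, beats_per_measure=4, n_beats=6):
    """Return (beat_type, time_in_seconds) pairs for a steady tempo."""
    seconds_per_beat = 60.0 / bpm
    return [(i % beats_per_measure + 1, i * seconds_per_beat) for i in range(n_beats)]


# Print the beat type and onset time of the first six beats at 120 BPM.
for beat_type, t in beat_table(120):
    print(beat_type, t)
```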

<h2>2. Chords - Musical Building Blocks</h2>
<p>A <strong>chord</strong> is when you play <strong>multiple notes at the same time</strong>.</p>
<ul>
  <li>Single note = one voice</li>
  <li>Chord = choir</li>
</ul>

<blockquote>
  The Musical Alphabet: A, B, C, D, E, F, G (then repeats)
</blockquote>

<h3>Basic Chord Types</h3>
<p><strong>Major Chords</strong>: Happy, bright (e.g., C Major = C, E, G)</p>
<p><strong>Minor Chords</strong>: Sad, emotional (e.g., A Minor = A, C, E)</p>

<pre>
Piano Keys: C  D  E  F  G  A  B  C  D  E  F  G
            |     |     |        |     |
C Major:    C     E     G        
G Major:                   G     B     D
</pre>

<pre>
"Twinkle, Twinkle, Little Star"
Twinkle, twinkle    = C major
little star         = G major
How I wonder        = C major
what you are        = G major
</pre>

<p><strong>Chord Progression</strong>: Most songs change chords to create tension and release. "We Will Rock You" mostly stays on one chord (E minor).</p>

<p><strong>Chord Inversion</strong>: Rearranging note order (same notes, different stacking)</p>

<h2>3. Keys - The Musical Home Base</h2>

<p><strong>Key</strong> = Home note everything gravitates toward</p>
<p><strong>We Will Rock You</strong> key = <strong>E minor</strong></p>

<ul>
  <li>E feels stable and complete</li>
  <li>Dark, serious tone (minor)</li>
  <li>You can hear the key by humming the final note of the song</li>
</ul>

<h2>4. Tempo - The Speed of Music</h2>

<p><strong>Tempo</strong> = Beats Per Minute (BPM)</p>
<p>"We Will Rock You" tempo = <strong>114 BPM</strong></p>

<ul>
  <li>~2 beats per second</li>
  <li>Moderate tempo (80–120 BPM)</li>
  <li>Easy for crowd participation</li>
</ul>

<h2>How All Four Features Work Together</h2>

<pre>
Time:     0s    1s    2s    3s    4s    5s    6s    7s
Beats:    1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4
Pattern:  STOMP STOMP CLAP - | STOMP STOMP CLAP -
Chord:    [---- E minor ----][---- E minor ----]
Key:      [------------- E minor throughout -------------]
Tempo:    [------------- 114 BPM steady ----------------]
</pre>

<h3>Why This Combination is Powerful</h3>
<ol>
  <li>Simple beat pattern = Easy for crowds</li>
  <li>Single chord = Hypnotic repetition</li>
  <li>Minor key = Powerful emotion</li>
  <li>Moderate tempo = Accessible and energetic</li>
</ol>

</div>

<h1 id="musicbench">MusicBench</h1>

<h2 id="feature-extraction">Feature Extraction</h2>

<p>This step extracts musical features from audio and converts them into text-based control information to improve music generation systems. Four kinds of features are extracted:</p>

<ul>
  <li>Beats and downbeats</li>
  <li>Chords</li>
  <li>Keys</li>
  <li>Tempo</li>
</ul>

<p>These features serve a dual purpose: they enhance text prompts with specific musical information and guide the music generation process during the reverse diffusion phase.</p>

<h3 id="1-beat-and-downbeat-extraction">1. Beat and Downbeat Extraction</h3>

<p>Uses <a href="https://arxiv.org/abs/2108.03576"><strong>BeatNet</strong></a> which outputs:</p>

\[b \in \mathbb{R}^{(L_{beats} \times 2)}\]

<p>This mathematical notation means: \(b\) is a matrix with \(L_{beats}\) rows and 2 columns</p>

<ul>
  <li>\(L_{beats}\) represents the total number of beats detected in the audio → Each row is one beat event.</li>
  <li>The \(2\) columns hold the beat type \(\in \{1, 2, 3, 4\}\) and the beat time</li>
</ul>

<p><strong>Data Structure:</strong></p>

<ul>
  <li><strong>First dimension (column 1):</strong> Beat type according to meter</li>
  <li><strong>Second dimension (column 2):</strong> Precise timing in seconds when each beat occurs</li>
</ul>

<h3 id="2-tempo-extraction">2. Tempo Extraction</h3>

<p><strong>Calculation Method:</strong> Averaging the reciprocal of time intervals between beats</p>

<p><strong>Mathematical Process:</strong></p>

<ol>
  <li>Measure time intervals between consecutive beats</li>
  <li>Take the reciprocal (1/interval) to get instantaneous tempo</li>
  <li>Average these values across the entire piece</li>
  <li>Convert to <strong>BPM (beats per minute)</strong></li>
</ol>

<p>This approach accounts for tempo variations within a song rather than assuming constant tempo.</p>
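
<p>The four steps above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' code; <code class="language-plaintext highlighter-rouge">estimate_tempo_bpm</code> is a hypothetical name, and the input is assumed to be a strictly increasing list of beat times in seconds (the second column of \(b\)):</p>

```python
def estimate_tempo_bpm(beat_times):
    """Average the reciprocal of consecutive beat intervals and convert to BPM."""
    # Step 1: time intervals between consecutive beats.
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if not intervals:
        raise ValueError("need at least two beats to estimate tempo")
    # Steps 2 and 3: reciprocal of each interval (beats per second), then the mean.
    beats_per_second = sum(1.0 / dt for dt in intervals) / len(intervals)
    # Step 4: convert beats per second to beats per minute.
    return 60.0 * beats_per_second


print(estimate_tempo_bpm([0.0, 0.5, 1.0, 1.5, 2.0]))
```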

<h3 id="3-chord-extraction">3. Chord Extraction</h3>

<p>Uses <strong><a href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;type=pdf&amp;doi=d6d65865b60877c2a49c9d80b6a9194033a26381">Chordino</a></strong> which outputs:</p>

\[c \in \mathbb{R}^{(L_{chords} \times 3)}\]

<p>This means:</p>

<ul>
  <li>\(c\) is a matrix with \(L_{chords}\) rows and 3 columns</li>
  <li>\(L_{chords}\) represents the number of chord segments identified</li>
</ul>

<p><strong>Data Structure:</strong></p>

<ul>
  <li><strong>First dimension (column 1):</strong> Chord roots
    <ul>
      <li>Examples: C, D♭, F#, etc. (the fundamental note of each chord)</li>
    </ul>
  </li>
  <li><strong>Second dimension (column 2):</strong> Chord quality/type
    <ul>
      <li>Examples: major, minor, maj7, dim, sus4, etc.</li>
    </ul>
  </li>
  <li><strong>Third dimension (column 3):</strong> Inversion information
    <ul>
      <li>Indicates whether the chord is in root position or inverted</li>
      <li>Example: C major could be C-E-G (root), E-G-C (first inversion), or G-C-E (second inversion)</li>
    </ul>
  </li>
</ul>

<h3 id="4-key-extraction">4. Key Extraction</h3>

<p><strong>Tool Used:</strong> <a href="https://essentia.upf.edu/">Essentia</a>’s <a href="https://essentia.upf.edu/reference/std_KeyExtractor.html">KeyExtractor</a> algorithm</p>

<p><strong>Purpose:</strong> Identifies the overall tonal center or key signature of the piece</p>

<ul>
  <li>Examples: C major, A minor, F# major, etc.</li>
</ul>

<h2 id="description-enrichment">Description Enrichment</h2>

<p>The extracted numerical features are converted into <strong>natural language descriptions</strong> using predefined templates.</p>

<p><strong>Example Control Sentences:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The song is in the key of A minor. The tempo of this song is Adagio.
The beat counts to 4. The chord progression is Am, Cmaj7, G."

</code></pre></div></div>
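
<p>A template-based conversion of this kind might look like the sketch below. The sentence templates are copied from the example above; the function name and its arguments are hypothetical, and the tempo term is passed in directly because the mapping from BPM to a tempo marking is not given here:</p>

```python
def control_sentences(key, tempo_term, beat_count, chords):
    """Fill fixed templates with extracted features (illustrative only)."""
    return [
        f"The song is in the key of {key}.",
        f"The tempo of this song is {tempo_term}.",
        f"The beat counts to {beat_count}.",
        f"The chord progression is {', '.join(chords)}.",
    ]


print(" ".join(control_sentences("A minor", "Adagio", 4, ["Am", "Cmaj7", "G"])))
```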

<p><img src="/assets/images/2025-07-25/a.png" alt="420" /></p>

<h2 id="augmentation-and-music-diversification">Augmentation and Music Diversification</h2>

<p>The augmentation strategy described in this section results in an 11-fold increase in the number of training samples.</p>

<h3 id="why-standard-audio-augmentation-fails-for-music"><strong>Why Standard Audio Augmentation Fails for Music?</strong></h3>

<p><strong>Traditional Approach (e.g., Tango model):</strong> Take two audio samples → Normalize them to similar audio levels → <strong>Superimpose</strong> (layer) the audio tracks → <strong>Concatenate</strong> their text descriptions</p>

<p><strong>Why This Fails for Music:</strong></p>

<ul>
  <li><strong>Overlapping rhythms:</strong> Two different rhythmic patterns create chaotic, unmusical results</li>
  <li><strong>Harmonic dissonance:</strong> Combining different chord progressions creates unpleasant harmonic clashes</li>
  <li><strong>Conceptual mismatch:</strong> Mixing a “sad piano ballad” with an “upbeat rock song” creates conceptually incoherent training examples</li>
</ul>

<h3 id="the-three-dimensional-augmentation-strategy">The Three-Dimensional Augmentation Strategy</h3>

<p>Instead of combining multiple audio sources, they modify <strong>individual music samples</strong> along three fundamental musical dimensions:</p>

<ol>
  <li><strong>Pitch Augmentation (Melodic Dimension)</strong>
    <ol>
      <li>Tool: <a href="https://github.com/bmcfee/pyrubberband">PyRubberband</a>, Range: ±3 semitones, Distribution: Uniform</li>
      <li>Technical Details:
        <ol>
          <li><strong>Semitone range rationale:</strong> Keeps instrument <strong>timbre relatively untouched</strong></li>
          <li>Larger pitch shifts would cause <strong>unnatural timbre changes</strong> (e.g., making a piano sound artificially high or low)</li>
          <li><strong>3 semitones</strong> = a minor third interval (musically significant but not timbre-destroying)</li>
        </ol>
      </li>
      <li><strong>Musical Impact:</strong>
        <ol>
          <li>Changes the perceived pitch/key of the music</li>
          <li>Maintains the relative intervals between notes (preserving melody shape)</li>
          <li>Creates training examples in different keys from the same source material</li>
        </ol>
      </li>
    </ol>
  </li>
  <li><strong>Speed Augmentation (Rhythmic Dimension)</strong>
    <ol>
      <li><strong>Range:</strong> ±(5% to 25%) speed change, <strong>Distribution:</strong> Uniform</li>
      <li><strong>Technical Implications:</strong>
        <ol>
          <li><strong>5-25% range</strong> represents musically meaningful tempo variations</li>
          <li>Slower speeds (−25%) create more relaxed, contemplative versions</li>
          <li>Faster speeds (+25%) create more energetic, urgent versions</li>
          <li>Maintains pitch relationships while altering rhythmic feel</li>
        </ol>
      </li>
      <li><strong>Musical Impact:</strong>
        <ol>
          <li>Changes the perceived energy and mood</li>
          <li>Affects the rhythmic groove and feel</li>
          <li>Creates training examples spanning different tempo ranges from single sources</li>
        </ol>
      </li>
    </ol>
  </li>
  <li><strong>Volume Augmentation (Dynamic Dimension)</strong>
    <ol>
      <li><strong>Method:</strong> Gradual volume changes (crescendo and decrescendo), <strong>Minimum volume:</strong> 0.1 to 0.5 times original amplitude (uniform distribution), <strong>Maximum volume:</strong> Kept untouched</li>
      <li><strong>Technical Design:</strong>
        <ol>
          <li><strong>Gradual changes</strong> rather than sudden volume jumps (more musically natural)</li>
          <li><strong>Crescendo:</strong> Gradual volume increase</li>
          <li><strong>Decrescendo:</strong> Gradual volume decrease</li>
          <li><strong>Amplitude range:</strong> 10-50% of original volume for minimum, preserving dynamic range</li>
        </ol>
      </li>
      <li><strong>Musical Impact:</strong>
        <ol>
          <li>Simulates different recording conditions and mixing styles</li>
          <li>Creates variations in perceived intensity and drama</li>
          <li>Maintains musical expressiveness while varying dynamic profiles</li>
        </ol>
      </li>
    </ol>
  </li>
</ol>
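
<p>As an illustration of the volume augmentation, the sketch below applies a gradual gain ramp to a list of audio samples. The linear ramp shape and the helper name <code class="language-plaintext highlighter-rouge">apply_volume_ramp</code> are assumptions; only the 0.1 to 0.5 minimum gain range and the untouched maximum come from the description above:</p>

```python
import random


def apply_volume_ramp(samples, min_gain, crescendo=True):
    """Scale samples with a linear gain ramp between min_gain and 1.0."""
    n = len(samples)
    if n < 2:
        return list(samples)
    out = []
    for i, x in enumerate(samples):
        frac = i / (n - 1)          # 0.0 at the first sample, 1.0 at the last
        if not crescendo:
            frac = 1.0 - frac       # decrescendo: start loud, end quiet
        gain = min_gain + (1.0 - min_gain) * frac
        out.append(x * gain)
    return out


rng = random.Random(0)
min_gain = rng.uniform(0.1, 0.5)    # minimum volume drawn uniformly, as described
print(apply_volume_ramp([1.0] * 5, min_gain, crescendo=True))
```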

<h3 id="more-on-text-descriptions">More on Text Descriptions</h3>

<p>The text descriptions are enhanced and modified in tandem with audio alterations.</p>

<p><strong>Robustness Through Strategic Omission</strong></p>

<ul>
  <li><strong>Method:</strong> Randomly discard 1-4 sentences describing music features</li>
  <li><strong>Purpose:</strong> Enhance model robustness</li>
</ul>

<p><strong>Finally, the authors used ChatGPT to rephrase the text prompts and add variety.</strong></p>

<h2 id="musicbench-1">MusicBench</h2>

<p><strong>Derived from MusicCaps</strong>, which contains:</p>

<ul>
  <li><strong>5,521</strong> music audio clips (10 seconds each)</li>
  <li>Clips include ~4-sentence English captions</li>
  <li>Sourced from <strong>AudioSet</strong></li>
  <li>Due to missing audio, the usable dataset was reduced to <strong>5,479</strong> samples</li>
</ul>

<p><img src="/assets/images/2025-07-25/b.png" alt="421" /></p>

<h3 id="dataset-splitting"><strong>Dataset Splitting</strong></h3>

<ol>
  <li><strong>Initial Split:</strong>
    <ol>
      <li>Divided into <strong><code class="language-plaintext highlighter-rouge">TrainA</code></strong> and <strong><code class="language-plaintext highlighter-rouge">TestA</code></strong></li>
    </ol>
  </li>
  <li><strong>Control Prompts Addition: TrainB - TestB</strong>
    <ol>
      <li>Appended 0–4 control sentences (describing music features) to form:
        <ol>
          <li><strong><code class="language-plaintext highlighter-rouge">TrainB</code></strong> from TrainA</li>
          <li><strong><code class="language-plaintext highlighter-rouge">TestB</code></strong> from TestA</li>
        </ol>
      </li>
      <li>Control prompt count probabilities: 0 → 25%, 1 → 30%, 2 → 20%, 3 → 15%, 4 → 10%</li>
    </ol>
  </li>
  <li><strong>Paraphrasing (Text Robustness):</strong>
    <ol>
      <li><strong><code class="language-plaintext highlighter-rouge">TrainC</code></strong> created by paraphrasing TrainB captions using ChatGPT</li>
      <li>Later, all captions (original &amp; augmented) were also paraphrased using ChatGPT</li>
      <li>Final training prompts use <strong>85% paraphrased / 15% original</strong></li>
    </ol>
  </li>
</ol>
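
<p>The control prompt count distribution can be sampled as in the sketch below. The probabilities are the ones listed above; choosing <em>which</em> sentences to append uniformly at random is an assumption made for illustration, and the helper name is hypothetical:</p>

```python
import random

# Probability of appending 0, 1, 2, 3, or 4 control sentences.
CONTROL_COUNT_PROBS = {0: 0.25, 1: 0.30, 2: 0.20, 3: 0.15, 4: 0.10}


def sample_control_sentences(control_sentences, rng):
    """Pick how many control sentences to append, then pick that many at random."""
    counts = list(CONTROL_COUNT_PROBS)
    weights = list(CONTROL_COUNT_PROBS.values())
    k = rng.choices(counts, weights=weights, k=1)[0]
    k = min(k, len(control_sentences))   # never ask for more than are available
    return rng.sample(control_sentences, k)


rng = random.Random(0)
print(sample_control_sentences(["key", "tempo", "beats", "chords"], rng))
```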

<h3 id="data-filtering-quality-control"><strong>Data Filtering (Quality Control)</strong></h3>

<p><strong>Low-quality samples</strong> filtered out using keyword filter:</p>

<ul>
  <li>Removed any sample whose caption contains: “quality” (often refers to “poor quality”) and “low fidelity”.</li>
  <li>Remaining high-quality subset: <strong>3,413</strong> samples</li>
</ul>
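
<p>A keyword filter of this kind is simple to express. The sketch below uses the two keywords named above; matching case-insensitively is an assumption, and <code class="language-plaintext highlighter-rouge">is_high_quality</code> is a hypothetical helper:</p>

```python
LOW_QUALITY_KEYWORDS = ("quality", "low fidelity")


def is_high_quality(caption):
    """Keep a sample only if its caption mentions none of the keywords."""
    text = caption.lower()
    return not any(keyword in text for keyword in LOW_QUALITY_KEYWORDS)


captions = ["A calm piano melody.", "The recording is of poor quality."]
print([c for c in captions if is_high_quality(c)])
```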

<h3 id="audio-augmentation-total-37k-samples"><strong>Audio Augmentation (Total ~37k samples):</strong></h3>

<ul>
  <li>Each high-quality sample augmented into 11 variants:
    <ul>
      <li>Pitch shifts (6 total): ±1, ±2, ±3 semitones (excl. 0)</li>
      <li>Tempo alterations (4 total)</li>
      <li>Volume alteration (1 total)</li>
    </ul>
  </li>
  <li>These augmented samples form a dataset of <strong>37,543 samples</strong> (3,413 × 11)</li>
</ul>

<h3 id="final-training-set-construction"><strong>Final Training Set Construction</strong></h3>

<ul>
  <li>Final set created by <strong>combining</strong>:
    <ul>
      <li>TrainA + TrainB + TrainC</li>
      <li>All augmented audio samples (with paraphrased and original captions)</li>
    </ul>
  </li>
  <li>Resulting in a total of <strong>52,768</strong> samples → This final training set is called <strong>MusicBench</strong></li>
</ul>

<h3 id="test-set-details"><strong>Test Set Details</strong></h3>

<ul>
  <li><strong>TestA/TestB</strong> contain:
    <ul>
      <li>200 <strong>low-quality</strong> samples</li>
      <li>200 <strong>high-quality</strong> samples</li>
    </ul>
  </li>
  <li><strong>Purpose</strong>: Designed to be <strong>challenging</strong> for evaluating controllability of the <strong>Mustango</strong> model</li>
</ul>

<h1 id="mustango">Mustango</h1>

<p>Mustango is a <strong>text-to-music generation model</strong> composed of two main components:</p>

<ol>
  <li><strong>Latent Diffusion Model (LDM)</strong></li>
  <li><strong>MuNet (Music-informed UNet Denoiser)</strong></li>
</ol>

<p>We have a <strong>latent audio representation</strong> — basically, a compressed version of music (\(z_0\)) created using a VAE (a type of encoder).</p>

<p>The model corrupts this clean music into <strong>random noise</strong> step by step (forward process), and then learns to <strong>reverse</strong> that and <strong>rebuild the music</strong> from noise — step by step — using a smart denoiser.</p>

<h2 id="latent-diffusion-model-ldm">Latent Diffusion Model (LDM)</h2>

<p>The goal is to reduce computation while retaining expressivity by operating in a <strong>compressed latent space</strong> instead of on raw audio.</p>

<ul>
  <li><strong>VAE</strong> (Variational Autoencoder):
    <ul>
      <li>Encodes raw waveform to latent representation \(z_0\)</li>
      <li>Pretrained VAE from <strong>AudioLDM</strong> is used</li>
    </ul>
  </li>
  <li><strong>Diffusion Process</strong>:
    <ul>
      <li>
        <p><strong>Forward</strong>: Corrupts \(z_0\) into noise \(z_N \sim \mathcal{N}(0, I)\) over \(N\) steps using a Gaussian noise schedule</p>

\[q(z_n | z_{n-1}) = \mathcal{N}(\sqrt{1 - \beta_n} z_{n-1}, \beta_n I)\]
      </li>
      <li>
        <p><strong>Reverse</strong>: Reconstructs \(z_0\) from \(z_n\) using <strong>MuNet</strong> (denoiser) conditioned on music and text</p>
      </li>
    </ul>
  </li>
</ul>
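
<p>The forward transition \(q(z_n | z_{n-1})\) can be sketched on a toy latent vector as follows. The linear schedule and its endpoint values are assumptions borrowed from common diffusion setups, not values stated for Mustango, and a short list of floats stands in for the real VAE latent:</p>

```python
import math
import random


def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the endpoints are illustrative defaults."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def forward_step(z_prev, beta_n, rng):
    """Sample z_n ~ N(sqrt(1 - beta_n) * z_prev, beta_n * I)."""
    scale = math.sqrt(1.0 - beta_n)
    sigma = math.sqrt(beta_n)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z_prev]


rng = random.Random(0)
z = [1.0, -0.5, 0.25]                 # stand-in for a flattened latent z_0
for beta in linear_beta_schedule(1000):
    z = forward_step(z, beta, rng)    # after all steps, z is close to pure noise
print(z)
```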

<h3 id="objective">Objective</h3>

<p>Train the denoiser \(\hat{\epsilon}_\theta(z_n, C)\) to estimate noise via a noise prediction loss:</p>

\[\mathcal{L}_{\text{LDM}} = \sum_{n=1}^N \gamma_n \mathbb{E}_{\epsilon_n, z_0} \left[ \| \epsilon_n - \hat{\epsilon}_\theta^{(n)}(z_n, C) \|^2 \right]\]

<h2 id="munet">MuNet</h2>

<p><img src="/assets/images/2025-07-25/c.png" alt="422" /></p>

<p>Acts as the core denoiser in reverse diffusion. It’s designed to:</p>

<ul>
  <li>Integrate <strong>music-domain knowledge</strong></li>
  <li>
    <p>Accept <strong>joint conditions</strong>:</p>

\[C := \{\tau \text{ (text)}, b \text{ (beats)}, c \text{ (chords)}\}\]
  </li>
</ul>

\[\begin{aligned}&amp;\textbf{Input: } z_n \\&amp;\Downarrow \\&amp;\text{Apply Cross Attention:} \\&amp;\quad \text{Text } \tau \rightarrow \text{FLAN-T5} \rightarrow A_\tau \\&amp;\quad \text{Beats } b \rightarrow \text{Enc}^b(b) \rightarrow A_b \\&amp;\quad \text{Chords } c \rightarrow \text{Enc}^c(c) \rightarrow A_c \\&amp;\Downarrow \\&amp;\text{UNet}(A_c) \rightarrow \text{Output: } \hat{\epsilon}_\theta(z_n, C)\end{aligned}\]

<p><img src="/assets/images/2025-07-25/d.png" alt="423" /></p>

<ul>
  <li>MHA is the multi-headed attention block for the cross attentions, where Q, K, and V are query, key, and value, respectively</li>
  <li>FLAN-T5 is the text encoder model adopted from Tango.</li>
  <li>Cross-attention is applied to the beats first, since a consistent rhythm is a fundamental basis for the generated music.</li>
  <li>MuNet consists of a <strong>UNet</strong> with a total of \(L\) downsampling, middle, and upsampling blocks, plus multiple conditioning cross-attention blocks.</li>
  <li>Both \(Enc^b\) and \(Enc^c\) leverage the state-of-the-art Fundamental Music Embedding (FME)</li>
  <li><strong>Beat Encoder</strong> \(Enc^b\):
    <ul>
      <li>One-hot beat type: \(\text{OH}_b(b[:, 0])\)</li>
      <li>Music Positional Encoding on beat time: \(\text{MPE}(b[:, 1])\)</li>
      <li>Combined and passed through linear projection \(W_b\)</li>
    </ul>

\[\text{Enc}^b(b) = W_b(\text{OH}_b(b[:,0]) \oplus \text{MPE}(b[:,1]))\]
  </li>
  <li><strong>Chord Encoder</strong> \(Enc^c\):
    <ul>
      <li>\(\text{FME}(c[:,0])\): Fundamental Music Embedding of chord root</li>
      <li>\(\text{OH}_t(c[:,1])\): One-hot chord type</li>
      <li>\(\text{OH}_i(c[:,2])\): One-hot chord inversion</li>
      <li>\(\text{MPE}(c[:,3])\): Positional encoding of chord timing</li>
      <li>
        <p>Combined then projected: \(W_c\)</p>

\[\text{Enc}^c(c) = W_c(\text{FME}(c[:,0]) \oplus \text{OH}_t(c[:,1]) \oplus \text{OH}_i(c[:,2]) \oplus \text{MPE}(c[:,3]))\]
      </li>
    </ul>
  </li>
</ul>
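
<p>To make the beat encoder formula concrete, here is a rough sketch. It reads \(\oplus\) as concatenation and substitutes a standard sinusoidal encoding for MPE; both are assumptions, since the exact MPE definition is not reproduced here. The projection \(W_b\) is passed in as a plain matrix:</p>

```python
import math


def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def sinusoidal_encoding(t, dim):
    """Standard sinusoidal encoding of a time value (stand-in for MPE)."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * k / dim))
        enc.extend([math.sin(t * freq), math.cos(t * freq)])
    return enc


def encode_beats(beats, w_b, time_dim=4):
    """beats: rows of (beat_type in 1..4, time); w_b: one row per output dimension."""
    encoded = []
    for beat_type, t in beats:
        # OH_b(b[:, 0]) concatenated with the time encoding of b[:, 1].
        features = one_hot(int(beat_type) - 1, 4) + sinusoidal_encoding(t, time_dim)
        # Linear projection W_b applied to the concatenated features.
        encoded.append([sum(w * f for w, f in zip(row, features)) for row in w_b])
    return encoded


identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
print(encode_beats([(1, 0.0), (3, 1.0)], identity))
```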

<h2 id="reverse-diffusion-process"><strong>Reverse Diffusion Process</strong></h2>

<p>The reverse diffusion process reconstructs the latent audio prior \(z_0\) from pure noise \(z_N \sim \mathcal{N}(0, I)\), step by step, using a <strong>parametrized denoiser</strong> \(\hat{\epsilon}_\theta^{(n)}(z_n, C)\).</p>

<ol>
  <li><strong>Reverse Transition Distribution</strong></li>
</ol>

\[p_\theta^{\text{mus}}(z_{n-1} \mid z_n, C) = \mathcal{N}(\mu_\theta^{(n)}(z_n, C), \tilde{\beta}_n)\]

<p>At each step n, the model samples \(z_{n-1}\) from a Gaussian with:</p>

<ul>
  <li>Mean \(\mu_\theta^{(n)}\)</li>
  <li>Variance \(\tilde{\beta}_n\)</li>
</ul>

<ol start="2">
  <li><strong>Mean for Reverse Step</strong></li>
</ol>

\[\mu_\theta^{(n)}(z_n, C) = \frac{1}{\sqrt{\alpha_n}} \left[z_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}} \hat{\epsilon}_\theta^{(n)}(z_n, C)\right]\]

<p><strong>Where:</strong></p>

<ul>
  <li>
\[\alpha_n = 1 - \beta_n\]
  </li>
  <li>
\[\bar{\alpha}_n = \prod_{i=1}^n \alpha_i\]
  </li>
  <li>\(\hat{\epsilon}_\theta^{(n)}\): the noise at step \(n\), as predicted by the model</li>
  <li><strong>This formula adjusts \(z_n\) by subtracting the estimated noise and rescales it to predict \(z_{n-1}\)</strong></li>
</ul>

<ol start="3">
  <li><strong>Variance of Reverse Step</strong></li>
</ol>

\[\tilde{\beta}_n = \frac{1 - \bar{\alpha}_{n-1}}{1 - \bar{\alpha}_n} \beta_n\]

<ul>
  <li>This scales the forward diffusion variance \(\beta_n\) into an appropriate reverse-time variance.</li>
</ul>

<ol start="4">
  <li><strong>Diffusion Coefficients</strong></li>
</ol>

\[\alpha_n = 1 - \beta_n,\quad \qquad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i\]

<ul>
  <li>\(\alpha_n\) controls how much signal is preserved at each step.</li>
  <li>\(\bar{\alpha}_n\) is the cumulative product of all previous \(\alpha\)’s and is used in rescaling.</li>
</ul>
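
<p>Putting the mean, variance, and coefficient formulas together, a single reverse step can be sketched as below. The denoiser output <code class="language-plaintext highlighter-rouge">eps_hat</code> is simply passed in as a list, since MuNet itself is out of scope here; the function name and the toy schedule are illustrative assumptions:</p>

```python
import math
import random


def reverse_step(z_n, eps_hat, n, betas, rng):
    """Sample z_{n-1} from N(mu, beta_tilde * I); n is 1-indexed."""
    alphas = [1.0 - b for b in betas]
    alpha_bar = math.prod(alphas[:n])             # \bar{alpha}_n
    alpha_bar_prev = math.prod(alphas[:n - 1])    # \bar{alpha}_{n-1}; equals 1 when n == 1
    alpha_n, beta_n = alphas[n - 1], betas[n - 1]
    # Mean: subtract the scaled noise estimate, then rescale by 1 / sqrt(alpha_n).
    coef = (1.0 - alpha_n) / math.sqrt(1.0 - alpha_bar)
    mean = [(z - coef * e) / math.sqrt(alpha_n) for z, e in zip(z_n, eps_hat)]
    # Variance: forward variance beta_n rescaled into the reverse-time variance.
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * beta_n
    sigma = math.sqrt(beta_tilde)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


rng = random.Random(0)
print(reverse_step([0.3, -1.2], [0.1, -0.4], 2, [0.1, 0.2], rng))
```

<p>Note that at \(n = 1\) the variance \(\tilde{\beta}_1\) is zero, so the final step is deterministic.</p>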

<h2 id="training-details">Training Details</h2>

<p><strong>The authors used three types of input dropout during training:</strong></p>

<ol>
  <li>Dropout #1 – Drop <strong>everything</strong> (5% chance)
    <ol>
      <li>The model sees <strong>no text</strong>, <strong>no beat</strong>, and <strong>no chord</strong> input.</li>
      <li>Why? So it learns to generate music even with <strong>zero guidance</strong> (like unconditional generation).</li>
    </ol>
  </li>
  <li>Dropout #2 – Drop <strong>one input at a time</strong> (5% chance)
    <ol>
      <li>It might drop just text, or just beats, or just chords.</li>
      <li>Helps the model handle <strong>incomplete input</strong> (e.g., caption but no beat info).</li>
    </ol>
  </li>
  <li>Dropout #3 – Mask <strong>parts of a prompt</strong>
    <ol>
      <li>
        <p>The longer the prompt, the more likely it is to be <strong>partially masked</strong>.</p>

\[\text{Mask chance} = \min\left(100, \frac{10N}{M}\right)\%\]

        <p>where:</p>

        <ul>
          <li>N = number of sentences in the current prompt</li>
          <li>M = average sentences across prompts</li>
        </ul>
      </li>
      <li>Then, remove <strong>20–50% of sentences randomly</strong></li>
      <li>This teaches the model to deal with <strong>short or incomplete captions</strong>.</li>
    </ol>
  </li>
</ol>
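
<p>Dropout #3 can be sketched as follows. The mask chance and the 20–50% removal range come from the description above; rounding the number of removed sentences to the nearest integer is an assumption, as are the helper names:</p>

```python
import random


def mask_probability(n_sentences, avg_sentences):
    """Chance (0 to 1) of partially masking a prompt: min(100, 10N/M) percent."""
    return min(100.0, 10.0 * n_sentences / avg_sentences) / 100.0


def maybe_mask_prompt(sentences, avg_sentences, rng):
    """With the mask chance, remove a random 20-50% of the sentences."""
    if rng.random() >= mask_probability(len(sentences), avg_sentences):
        return list(sentences)
    drop_fraction = rng.uniform(0.2, 0.5)
    n_drop = round(drop_fraction * len(sentences))   # rounding is an assumption
    dropped = set(rng.sample(range(len(sentences)), n_drop))
    return [s for i, s in enumerate(sentences) if i not in dropped]


rng = random.Random(0)
prompt = ["A mellow jazz tune.", "The key is C major.", "The tempo is Adagio.", "The beat counts to 4."]
print(maybe_mask_prompt(prompt, avg_sentences=4, rng=rng))
```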

<p><strong>Hardware</strong></p>

<ul>
  <li>Training used 4× Tesla V100 GPUs and 8× RTX 8000s.</li>
  <li>Took <strong>5–10 days</strong></li>
  <li>Effective batch size: <strong>32</strong></li>
</ul>

<h1 id="inference">Inference</h1>

<blockquote>
  <p>During training, the model is given the actual (ground-truth) beats and chords.</p>

  <p>But during <strong>inference</strong> (real-world use), it must <strong>predict those</strong> from just the text description.</p>

</blockquote>

<p>Given only a <strong>text caption</strong> like:</p>

<blockquote>
  <p>“Soothing techno music with 120 bpm in the key of D major.”</p>

</blockquote>

<p>The model must <strong>automatically figure out</strong>:</p>

<ol>
  <li>Where the <strong>beats</strong> are (when each beat happens)</li>
  <li>What the <strong>chords</strong> are (and when they change)</li>
</ol>

<p>To do this, Mustango uses <strong>two separate pre-trained transformer models</strong>:</p>

<h2 id="1-beat-predictor--using-deberta-large">1. <strong>Beat Predictor</strong> — Using <strong>DeBERTa Large</strong></h2>

<ul>
  <li><strong>Input</strong>: Text caption</li>
  <li><strong>Output</strong>:
    <ol>
      <li><strong>Beat Count</strong> (1 to 4): The beat count (meter) of corresponding music
        <ul>
          <li>Predicted using classification on the <strong>first token</strong> (4-class classification)</li>
        </ul>
      </li>
      <li><strong>Beat Timings</strong> (the rhythm): The sequence of interval durations between consecutive beats (their <strong>timing</strong>)
        <ul>
          <li>Predicted as float values from the <strong>next tokens</strong> (durations between beats)</li>
        </ul>
      </li>
    </ol>
  </li>
</ul>

<p><strong>Example:</strong> Suppose the model predicts:</p>

<ul>
  <li><strong>Beat Count</strong>: 2</li>
  <li><strong>Intervals</strong>: t1, t2, t3, t4, …</li>
</ul>

<p>Then the beat positions will be:</p>

<ul>
  <li>Beat 1 at <code class="language-plaintext highlighter-rouge">t1</code></li>
  <li>Beat 2 at <code class="language-plaintext highlighter-rouge">t1 + t2</code></li>
  <li>Beat 1 again at <code class="language-plaintext highlighter-rouge">t1 + t2 + t3</code></li>
  <li>Beat 2 again at <code class="language-plaintext highlighter-rouge">t1 + t2 + t3 + t4</code></li>
  <li>And so on…</li>
</ul>

<p>(Repeated in alternating fashion for 10 seconds)</p>
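
<p>The conversion from predicted intervals to beat positions is a running sum with a cycling beat type. A minimal sketch, assuming the sequence is cut off at the 10-second clip length (<code class="language-plaintext highlighter-rouge">beats_from_intervals</code> is a hypothetical helper):</p>

```python
from itertools import accumulate


def beats_from_intervals(beat_count, intervals, max_seconds=10.0):
    """Turn predicted intervals into (beat_type, time) rows, cycling the beat type."""
    rows = []
    for i, t in enumerate(accumulate(intervals)):
        if t > max_seconds:
            break                               # stop at the end of the clip
        rows.append((i % beat_count + 1, t))    # beat types run 1..beat_count, then repeat
    return rows


print(beats_from_intervals(2, [0.5, 0.5, 0.5, 0.5]))
```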

<h2 id="2-chord-predictor--using-flan-t5-large">2. <strong>Chord Predictor</strong> — Using <strong>FLAN-T5 Large</strong></h2>

<p>This model predicts the <strong>chords</strong> used in the music and when each chord occurs.</p>

<ul>
  <li><strong>Input</strong>:
    <ul>
      <li>The <strong>text caption</strong></li>
      <li>
        <p><strong>Verbalized beat sequence</strong> from DeBERTa:</p>

        <blockquote>
          <p>Timestamps: t1, t1 + t2, t1 + t2 + t3 . . . , Max Beat: 2</p>

        </blockquote>
      </li>
    </ul>
  </li>
  <li><strong>Output</strong>:
    <ul>
      <li>Chord progression over time in <strong>natural language</strong></li>
      <li>
        <p>For example:</p>

        <blockquote>
          <p>“Am at 1.11; E at 4.14; C#maj7 at 7.18”</p>

        </blockquote>
      </li>
    </ul>
  </li>
</ul>

<p>This is a <strong>sequence-to-sequence generation task</strong>, where the model outputs something that looks like music sheet annotations.</p>

<ul>
  <li>Any chords predicted <strong>after 10 seconds</strong> are <strong>ignored</strong> (since all music samples are only 10 seconds long)</li>
</ul>
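<p>A minimal sketch of how such an output string could be turned back into timed chords, including the 10-second cutoff. The parser below is hypothetical (the name <code class="language-plaintext highlighter-rouge">parse_chords</code> and the exact string handling are assumptions), and it expects well-formed <code class="language-plaintext highlighter-rouge">chord at time</code> items separated by semicolons.</p>

```python
def parse_chords(text, max_time=10.0):
    """Parse 'Am at 1.11; E at 4.14' into [(chord, start_time), ...]."""
    chords = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        # split on the last " at " so the chord name is kept whole
        name, _, time_str = item.rpartition(" at ")
        start = float(time_str)
        if start > max_time:
            continue  # predictions past the 10 s clip are ignored
        chords.append((name, start))
    return chords

print(parse_chords("Am at 1.11; E at 4.14; C#maj7 at 7.18; G at 11.02"))
# [('Am', 1.11), ('E', 4.14), ('C#maj7', 7.18)]
```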

<h3 id="how-good-are-the-beat--chord-predictors">How Good Are the Beat &amp; Chord Predictors?</h3>

<p>During inference, Mustango <strong>predicts beats &amp; chords from text</strong>. But do these predicted features work well?</p>

<ul>
  <li>When control sentences <em>are</em> present (TestB), the predictors do very well: <strong>94.5% accuracy</strong></li>
  <li>When control sentences <em>are missing</em> (TestA), performance dips but is still <strong>better than Tango</strong>
    <ul>
      <li>This means the <strong>predictors don’t hurt Mustango’s quality</strong> when control is missing</li>
    </ul>
  </li>
</ul>

<h2 id="final-output">Final Output</h2>

<p>Now you have:</p>

<ul>
  <li><strong>Beat sequence</strong>: when each beat hits</li>
  <li><strong>Chord sequence</strong>: when chords start and change</li>
</ul>

<p>These are passed into the <strong>MuNet denoiser</strong>, and the final <strong>music is generated</strong> using reverse diffusion.</p>

<h2 id="classifier-free-guidance-at-inference"><strong>Classifier-Free Guidance at Inference</strong></h2>

\[\hat{\epsilon}_\theta^{(n)}(z_n, C) = w \cdot \epsilon_\theta^{(n)}(z_n, C) + (1 - w) \cdot \epsilon_\theta^{(n)}(z_n)\]

<ul>
  <li><strong>Purpose:</strong> Improves generation quality and controllability</li>
  <li><strong>Explanation:</strong>
    <ul>
      <li>The model is trained to predict noise both with and without the conditions \(C\)</li>
      <li>At inference, the two predictions are combined using a guidance scale \(w\)
        <ul>
          <li>\(w &gt; 1\): extrapolates past the conditional prediction, so the output is more faithful to the condition</li>
          <li>\(w = 0\): unconditional generation</li>
          <li>\(w = 1\): plain conditional prediction, i.e. no extra guidance</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
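<p>The formula is easy to check numerically. The helper below is a sketch with a made-up name (<code class="language-plaintext highlighter-rouge">cfg_noise</code>); it uses scalars for readability, but the same expression works element-wise on arrays of noise predictions.</p>

```python
def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: w * eps(z, C) + (1 - w) * eps(z)."""
    return w * eps_cond + (1.0 - w) * eps_uncond

# Toy values: conditional prediction 2.0, unconditional prediction 1.0
print(cfg_noise(2.0, 1.0, 0.0))  # 1.0 -> the unconditional prediction
print(cfg_noise(2.0, 1.0, 1.0))  # 2.0 -> the conditional prediction
print(cfg_noise(2.0, 1.0, 3.0))  # 4.0 -> pushed past the conditional prediction
```

<p>With \(w = 3\) (the value used in the inference setup) the result moves beyond the conditional prediction, in the direction that separates it from the unconditional one.</p>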

<h1 id="experiments">Experiments</h1>

<p>Questions explored:</p>

<ol>
  <li>How good is the music quality produced by Mustango?</li>
  <li>Is Mustango better than other models like Tango, MusicGen, AudioLDM2?</li>
  <li>Can Mustango follow control instructions well (like beat, chord, key)?</li>
  <li>Is their dataset (MusicBench) strong enough to train a model from scratch?</li>
</ol>

<h2 id="models-compared">Models Compared</h2>

<p><strong>Mustango Variants:</strong></p>

<table>
  <thead>
    <tr>
      <th>Model Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Tango on MusicCaps</strong></td>
      <td>Simple baseline, no MuNet, small dataset</td>
    </tr>
    <tr>
      <td><strong>Tango on MusicBench</strong></td>
      <td>Same architecture, better data</td>
    </tr>
    <tr>
      <td><strong>Mustango on MusicBench</strong></td>
      <td>Adds MuNet + good data</td>
    </tr>
    <tr>
      <td><strong>Pretrained Tango → AudioCaps → MusicBench</strong></td>
      <td>Transfer learning version</td>
    </tr>
    <tr>
      <td><strong>Pretrained Mustango → MusicBench</strong></td>
      <td>Strongest variant (MuNet + pretrained)</td>
    </tr>
  </tbody>
</table>

<p><strong>Other State-of-the-Art Models:</strong></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>MusicGen (small, medium)</strong></td>
      <td>Text-to-music model</td>
    </tr>
    <tr>
      <td><strong>AudioLDM2 (music version)</strong></td>
      <td>Text-to-audio model trained on music</td>
    </tr>
  </tbody>
</table>

<h2 id="training--evaluation-dataset"><strong>Training &amp; Evaluation Dataset</strong></h2>

<ul>
  <li>All models were trained using the <strong>AdamW</strong> optimizer with a learning rate of <strong>4.5e−5</strong></li>
  <li>The beat and chord predictors were trained <strong>separately</strong></li>
  <li>Because some models already saw MusicCaps during pretraining, the authors created a <strong>new fair test set</strong> called <strong>FMACaps</strong> (1,000 music clips from the Free Music Archive with AI-generated captions)</li>
</ul>

<h2 id="inference-setup"><strong>Inference Setup</strong></h2>

<ul>
  <li>All models generated <strong>10-second audio clips</strong></li>
  <li>Used <strong>200 diffusion steps</strong></li>
  <li>Classifier-free guidance scale = <strong>3</strong></li>
  <li>
    <p>Inference times on V100 GPU:</p>

    <table>
      <thead>
        <tr>
          <th>Model</th>
          <th>Time</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Tango</td>
          <td>34 sec</td>
        </tr>
        <tr>
          <td>MusicGen-M</td>
          <td>51 sec</td>
        </tr>
        <tr>
          <td>Mustango</td>
          <td>76 sec</td>
        </tr>
      </tbody>
    </table>
  </li>
</ul>

<h2 id="objective-evaluation">Objective Evaluation</h2>

<h3 id="how-did-they-measure-quality">How Did They Measure Quality?</h3>

<p>They used <strong>2 types of metrics</strong>:</p>

<p><strong>Audio Quality Metrics</strong></p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>What it tells us</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>FD</strong> (Fréchet Distance)</td>
      <td>Statistical similarity to real music</td>
    </tr>
    <tr>
      <td><strong>FAD</strong> (Fréchet Audio Distance)</td>
      <td>Fréchet distance on audio-classifier embeddings (VGGish in the original definition), designed to track perceived quality</td>
    </tr>
    <tr>
      <td><strong>KL</strong> (Kullback-Leibler)</td>
      <td>Divergence between feature distributions</td>
    </tr>
  </tbody>
</table>

<p><strong>Controllability Metrics</strong></p>

<p>Measured <strong>how well the generated music follows the prompt</strong> (especially for beats, chords, tempo, key):</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Tempo Bin (TB)</strong></td>
      <td>Tempo (bpm) falls in correct bin</td>
    </tr>
    <tr>
      <td><strong>TBT</strong></td>
      <td>Tempo within bin or neighbor bin</td>
    </tr>
    <tr>
      <td><strong>CK / CKD</strong></td>
      <td>Correct key (exact or equivalent)</td>
    </tr>
    <tr>
      <td><strong>PCM / ECM / CMO / CMOT</strong></td>
      <td>Chord match (with various leniencies)</td>
    </tr>
    <tr>
      <td><strong>BM</strong></td>
      <td>Correct beat count</td>
    </tr>
  </tbody>
</table>

<h3 id="objective-results">Objective Results</h3>

<p>Audio Quality Findings:</p>

<ul>
  <li>Mustango (even when trained from scratch) performed as well or better than large pretrained models</li>
  <li>Mustango had the best FAD, which means better musicality</li>
  <li>The augmentation strategy (MusicBench) really works — it’s a solid alternative to large-scale pretraining</li>
</ul>

<p><strong>Controllability:</strong></p>

<table>
  <thead>
    <tr>
      <th>Control Type</th>
      <th>Who won</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Tempo</strong></td>
      <td>MusicGen slightly better</td>
    </tr>
    <tr>
      <td><strong>Beats</strong></td>
      <td>Similar across all models</td>
    </tr>
    <tr>
      <td><strong>Key</strong></td>
      <td>Mustango (trained on MusicBench) best</td>
    </tr>
    <tr>
      <td><strong>Chords</strong></td>
      <td>Mustango wins by a large margin (especially on FMACaps)</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Mustango excels in Key and Chord control</strong>, which are musically important.</li>
</ul>

<h2 id="subjective-evaluation">Subjective Evaluation</h2>

<p><strong>Two Groups Evaluated:</strong></p>

<ol>
  <li><strong>General audience</strong> — 48 people in Round 1, 17 in Round 2</li>
  <li><strong>Experts</strong> — 4 trained musicians per round</li>
</ol>

<p>They listened to samples and rated:</p>

<table>
  <thead>
    <tr>
      <th>Metric Name</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AQ</td>
      <td>Audio quality</td>
    </tr>
    <tr>
      <td>REL</td>
      <td>Relevance to caption</td>
    </tr>
    <tr>
      <td>OMQ</td>
      <td>Overall musical quality</td>
    </tr>
    <tr>
      <td>RC</td>
      <td>Rhythm consistency</td>
    </tr>
    <tr>
      <td>HC</td>
      <td>Harmony and consonance</td>
    </tr>
    <tr>
      <td>MCM</td>
      <td>Musical Chord Match</td>
    </tr>
    <tr>
      <td>MTM</td>
      <td>Musical Tempo Match</td>
    </tr>
  </tbody>
</table>

<p>All ratings used a <strong>7-point scale</strong>.</p>

<h3 id="subjective-results">Subjective Results</h3>

<p><strong>Round 1:</strong></p>

<ul>
  <li>Mustango from scratch had the best ratings overall</li>
  <li>Experts confirmed: Mustango had the best chord match (MCM)</li>
  <li>Conclusion: MusicBench works, MuNet helps, Mustango is very controllable</li>
</ul>

<p><strong>Round 2:</strong></p>

<ul>
  <li>Mustango beat MusicGen and AudioLDM2 in REL (relevance to text)</li>
  <li>Similar performance in OMQ, HC, MTM</li>
  <li>MusicGen won in RC (rhythm) — slightly better rhythm matching</li>
  <li>Mustango won in Chord Matching (MCM)</li>
</ul>

<h2 id="more">More</h2>

<p><strong>Is Pretraining Mustango Necessary?</strong></p>

<ul>
  <li><strong>What they tried:</strong>
    <ul>
      <li>Used a <strong>Tango model</strong> that was <strong>pre-trained on 1.2 million audio-text pairs</strong> (from AudioCaps etc.) and then fine-tuned it on Mustango’s data</li>
    </ul>
  </li>
  <li><strong>What they found:</strong>
    <ul>
      <li>It <strong>didn’t help</strong> Mustango generate better music because the pretraining was on <strong>general audio</strong>, not music specifically.</li>
      <li>However: It might help for <strong>music + environmental sounds</strong>, like: “Jazz with thunder in the background”</li>
    </ul>
  </li>
</ul>

<p><strong>How well does Mustango really do?</strong></p>

<ul>
  <li><strong>Strengths:</strong>
    <ul>
      <li><strong>Great controllability</strong> — far better than previous models</li>
      <li><strong>Very good music quality</strong>, even though it was trained only on a <strong>small-ish public dataset</strong> while competing models (like MusicGen) used <strong>huge private datasets</strong></li>
    </ul>
  </li>
  <li><strong>Still, other models have some advantages:</strong>
    <ul>
      <li><strong>MusicGen</strong> produces <strong>higher audio quality</strong> in some cases and <strong>longer musical structure</strong> (beyond 10 seconds)</li>
    </ul>
  </li>
</ul>

<p><strong>Limitations:</strong></p>

<ul>
  <li>Model works mainly on Western music styles — control info like “chord” and “key” might not apply to Indian or Chinese music</li>
  <li>Can only generate <strong>10 seconds</strong> of music due to compute limits</li>
  <li>Not yet optimized for long-form pieces (verse-chorus etc.)</li>
</ul>]]></content><author><name></name></author><category term="speech" /><summary type="html"><![CDATA[Melechovsky, Jan, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. “Mustango: Toward Controllable Text-to-Music Generation.” arXiv:2311.08355. Preprint, arXiv, June 3, 2024. https://doi.org/10.48550/arXiv.2311.08355.]]></summary></entry><entry><title type="html">Noise2Music: Text-conditioned Music Generation with Diffusion Models</title><link href="https://aayush9753.in/blog/2025/noise2music-text-conditioned-music-generation-with-diffusion-models/" rel="alternate" type="text/html" title="Noise2Music: Text-conditioned Music Generation with Diffusion Models" /><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/noise2music-text-conditioned-music-generation-with-diffusion-models</id><content type="html" xml:base="https://aayush9753.in/blog/2025/noise2music-text-conditioned-music-generation-with-diffusion-models/"><![CDATA[<p><em>6 Mar 2023 - <a href="https://arxiv.org/abs/2302.03917">Link</a>  - <a href="https://google-research.github.io/noise2music">Website</a></em></p>

<h1 id="1-summary">1 Summary</h1>

<p><strong>Goal:</strong> Turn a plain‑language prompt (“a slow lo‑fi guitar ballad for a rainy afternoon”) into a <strong>30‑second, 24 kHz stereo clip</strong>.</p>

<p><strong>Approach</strong> – Train <em>several</em> diffusion models that run one after another (a <em>cascade</em>). The early stages sketch a coarse spectral “layout”; later stages fill in detail so the final waveform sounds clean and full‑bandwidth.</p>

<p><strong>Why a <em>cascade</em> of diffusion models?</strong></p>

<p>A single diffusion model that jumps straight from noise→high‑fidelity audio would need huge compute and might blur fine structure. Splitting the job lets each stage specialise:</p>

<ol>
  <li><strong>Generator</strong> – predicts a low‑resolution latent audio representation conditioned on the text.</li>
  <li><strong>Cascader(s)</strong> – progressively upsample and refine that latent into the final waveform (16kHz), optionally re‑checking the text each step.</li>
  <li><strong>Super‑resolution:</strong> A final superresolution cascader is used to generate the 24kHz audio from the 16kHz waveform.
    <ul>
      <li>All models are based on 1D U-Nets</li>
    </ul>
  </li>
</ol>

<p>Two options for the intermediate representation:</p>

<ol>
  <li>Spectrogram (log-mel)</li>
  <li>Audio with lower fidelity (3.2kHz waveform)</li>
</ol>

<p>Results:</p>

<ul>
  <li>Generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.</li>
  <li>The models also ground fine-grained semantics of the prompt.</li>
</ul>

<p><strong>Data</strong></p>

<p>Text labels for the audio are generated by employing a pair of pretrained deep models:</p>

<ol>
  <li>Use a large language model to generate a large set of generic music descriptive sentences as caption candidates;</li>
  <li>A pre-trained music-text joint embedding model scores each unlabeled music clip against all the caption candidates, and the captions with the highest similarity scores become pseudo labels for that clip.</li>
  <li>In total, this pipeline annotates O(150K) hours of audio sources.</li>
</ol>

<h1 id="2-related-works">2 Related Works</h1>

<ul>
  <li>Over the past five years, the recipe for dramatic jumps in sample quality has been simple: <strong>bigger datasets + bigger models</strong>.</li>
</ul>

<h3 id="how-models-ingest-what-i-want">How models ingest “what I want”</h3>

<ol>
  <li><strong>Fixed, human‑interpretable vocabularies</strong>
    <ul>
      <li><em>Jukebox</em> encodes each clip as one of ~8 k artist/genre labels extracted from its metadata.</li>
      <li><em>Mubert</em> maps a user prompt onto a hand‑curated tag set (e.g., “chill”, “EDM”, “focus‑music”).</li>
      <li><strong>Pros:</strong> Easy to reason about.</li>
      <li><strong>Cons:</strong> Can’t express “dreamy underwater lo‑fi.”</li>
    </ul>
  </li>
  <li><strong>Free‑form natural language embeddings</strong>
    <ul>
      <li><em>AudioGen</em>, <em>MusicLM</em> and <strong>Noise2Music</strong> feed the raw prompt through a frozen text encoder (e.g., MuLan or a language‑model encoder).</li>
      <li><strong>Pros:</strong> Unlimited expressiveness; prompts can describe mood, setting, instrumentation, era, etc.</li>
      <li><strong>Cons:</strong> The mapping from prose → sound is learned, not predefined, so training data must be rich.</li>
    </ul>
  </li>
</ol>

<h1 id="3-methods">3 Methods</h1>

<h2 id="31-diffusion-models-in-a-nutshell">3.1 Diffusion models in a nutshell</h2>

<p>Diffusion models turn pure noise into a data sample by <strong>iterative denoising</strong>. Two ingredients go in at each step:</p>

<ol>
  <li>Conditioning signal \(c\) – here, the text‑prompt embedding.</li>
  <li>Noisy input \(x_t\) – a corruption of the target waveform at “time” t, where t ∈ [0, 1]. Noise magnitude is set by a schedule \(σ_t\).</li>
</ol>

<p>During <strong>training</strong> the model \(θ\) learns to predict the exact noise vector \(ϵ\) that was added:</p>

\[\mathcal{L} \;=\; \mathbb{E}_{x,c,\epsilon,t}\!\bigl[ w_t \,\lVert \theta(x_t, c, t) \;-\; \epsilon \rVert^2 \bigr],\tag{1}\]

<p>where \(w_t\) is a hand‑chosen weight (details below).</p>

<h3 id="choosing-the-loss-weight-w_t"><strong>Choosing the loss weight \(w_t\):</strong></h3>

<table>
  <thead>
    <tr>
      <th>Option</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Simplified</strong> (\(w_t\) = 1)</td>
      <td>Easiest to implement, works well for many tasks.</td>
    </tr>
    <tr>
      <td><strong>Sigma‑scaled</strong> (\(w_t\) = \(σ_t^2\))</td>
      <td>Weights each timestep by its noise variance, so high‑noise timesteps (large \(σ_t\)) count more.</td>
    </tr>
  </tbody>
</table>
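<p>To make the two weightings concrete, here is a small per-example version of Eq. (1). It is a sketch only: the function name, the <code class="language-plaintext highlighter-rouge">weighting</code> switch, and the toy numbers are assumptions for illustration, and a real training loop would average this over a batch.</p>

```python
def weighted_noise_loss(pred_noise, true_noise, sigma_t, weighting="simplified"):
    """Per-example loss w_t * ||pred - eps||^2 from Eq. (1)."""
    sq_err = sum((p - e) ** 2 for p, e in zip(pred_noise, true_noise))
    # simplified: w_t = 1, sigma-scaled: w_t = sigma_t^2
    w_t = 1.0 if weighting == "simplified" else sigma_t ** 2
    return w_t * sq_err

pred, eps = [0.5, -1.0], [0.0, 0.0]  # same prediction error in every call
print(weighted_noise_loss(pred, eps, sigma_t=0.1))                     # 1.25
print(weighted_noise_loss(pred, eps, sigma_t=0.1, weighting="sigma"))  # about 0.0125
print(weighted_noise_loss(pred, eps, sigma_t=0.9, weighting="sigma"))  # about 1.0125
```

<p>With the simplified weight the error costs the same at every timestep, while with the sigma-scaled weight the identical error costs more when \(σ_t\) is larger.</p>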

<h3 id="noiseschedule-variants">Noise‑schedule variants</h3>

<table>
  <thead>
    <tr>
      <th>Schedule</th>
      <th>Shape</th>
      <th>Typical use‑case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Linear</strong></td>
      <td>\(σ_t\) grows linearly with t</td>
      <td>Classic DDPM baseline.</td>
    </tr>
    <tr>
      <td><strong>Cosine</strong></td>
      <td>Slower rise near t = 0, faster near t = 1</td>
      <td>Often yields crisper samples with fewer steps.</td>
    </tr>
  </tbody>
</table>

<h3 id="sampling-knobs-you-can-turn">Sampling: knobs you can turn</h3>

<p>At inference we start with pure noise at <em>t = 1</em> and march back to <em>t = 0</em> (“ancestral” or DDPM sampling). Two important dials:</p>

<table>
  <thead>
    <tr>
      <th>Dial</th>
      <th>Symbol</th>
      <th>Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Stochasticity</strong></td>
      <td>\(γ ∈ \{0, 1\}\)</td>
      <td>\(γ = 0\) gives deterministic DDIM‑like steps; \(γ = 1\) keeps full randomness.</td>
    </tr>
    <tr>
      <td><strong>Denoising Step schedule</strong></td>
      <td>\(\{t_0, …, t_n\}\)</td>
      <td>Any partition of [0, 1] works—e.g., 50, 100, or 1 000 steps.</td>
    </tr>
  </tbody>
</table>

<p>\(γ\) : Sets how much fresh Gaussian noise is re‑added at every reverse‑diffusion step.</p>

<h3 id="classifierfree-guidance-cfg">Classifier‑free guidance (CFG)</h3>

<p>To better align outputs with the text prompt, the authors adopt <strong>CFG</strong>:</p>

<ol>
  <li><strong>During training</strong>: Randomly drop the prompt for a subset of samples → model learns both conditional \(θ(x_t, c)\) and unconditional \(θ(x_t, ·)\).</li>
  <li><strong>During sampling</strong>: Blend the two predictions:</li>
</ol>

\[\hat\epsilon \;=\; w\;\theta(x_t, c) \;+\; (1-w)\;\theta(x_t, \cdot), \quad w&gt;1.\]

<p>Larger <em>w</em> tightens adherence to the prompt but can push sample values out of range; the paper counters this with <strong>dynamic clipping</strong> that scales intermediate values to a safe range.</p>
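<p>One way to read the guidance formula is to regroup its terms. This is just algebra on the expression above, using the same symbols:</p>

\[\hat\epsilon \;=\; w\,\theta(x_t, c) + (1-w)\,\theta(x_t, \cdot) \;=\; \theta(x_t, \cdot) \;+\; w\,\bigl[\theta(x_t, c) - \theta(x_t, \cdot)\bigr]\]

<p>So sampling starts from the unconditional prediction and moves \(w\) times along the direction that the prompt adds. At \(w = 1\) this lands exactly on the conditional prediction, and any larger \(w\) overshoots it, which is where the stronger prompt adherence comes from.</p>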

<h2 id="32-model-architecture--efficientunet1d">3.2 Model Architecture — “Efficient U‑Net 1‑D”</h2>

<p><strong>Backbone.</strong> A 1‑D adaptation of the Efficient U‑Net:</p>

<ul>
  <li><strong>Down / Up blocks:</strong> They shrink the audio signal to a smaller size (down‑sampling) or stretch it back up (up‑sampling). Inside each block, the model mixes basic convolutions with attention layers to learn both local details and long‑range relationships.</li>
  <li><strong>Combine layer:</strong> The combine layer enables a single vector to interact with a sequence of vectors, where the single vector is used to produce a channel-wise scaling and bias.
    <ul>
      <li>A single vector (like the “time‑step” embedding) can turn channels up or down and add a bias, letting the model adapt its behaviour at each diffusion step.</li>
    </ul>
  </li>
  <li>More on Combine Layers
    <ol>
      <li><strong>Inputs</strong>
        <ul>
          <li><strong>Feature map</strong> A sequence of vectors coming from the convolution/attention stack. Shape: (length, channels).</li>
          <li><strong>Condition vector z</strong> A single 1‑D vector (e.g., the diffusion‑time embedding, or any global conditioning info). Shape: (channels).</li>
        </ul>
      </li>
      <li><strong>Learned transform</strong>
        <ul>
          <li>The layer passes \(z\) through two tiny neural nets (often single linear layers) to produce
            <ul>
              <li><strong>Scale s -</strong> one value per channel</li>
              <li><strong>Bias b -</strong> one value per channel</li>
            </ul>
          </li>
        </ul>
      </li>
      <li><strong>Channel‑wise modulation</strong>
        <ul>
          <li>
            <p>For every position \(i\) in the sequence and every channel \(c\):</p>

\[\text{output}_{i,c}=s_c \times \text{feature}_{i,c}+b_c\]
          </li>
          <li>
            <p>This is just an <strong>affine transform</strong> (scale + shift), but the scales/biases change with \(z\).</p>
          </li>
        </ul>
      </li>
      <li><strong>Why it matters</strong>
        <ul>
          <li>Lets a global signal (time step, overall prompt embedding, etc.) instantly tweak the local activations without extra convolutions.</li>
          <li>Makes conditioning cheap and expressive—similar in spirit to FiLM layers used in vision models.</li>
        </ul>
      </li>
    </ol>
  </li>
</ul>
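<p>The combine layer described above can be sketched in NumPy as follows. This assumes the two "tiny neural nets" are single linear layers, and all names (<code class="language-plaintext highlighter-rouge">combine</code>, <code class="language-plaintext highlighter-rouge">w_scale</code>, <code class="language-plaintext highlighter-rouge">w_bias</code>, and so on) are made up for the example rather than taken from the paper.</p>

```python
import numpy as np

def combine(features, z, w_scale, b_scale, w_bias, b_bias):
    """Channel-wise affine modulation of a (length, channels) map by a vector z."""
    s = z @ w_scale + b_scale  # scale: one value per channel
    b = z @ w_bias + b_bias    # bias: one value per channel
    return features * s + b    # broadcasts over the length axis

rng = np.random.default_rng(0)
length, channels = 5, 4
features = rng.normal(size=(length, channels))  # output of the conv/attention stack
z = rng.normal(size=channels)                   # e.g. the diffusion-time embedding
w_scale, w_bias = rng.normal(size=(2, channels, channels))
out = combine(features, z, w_scale, np.ones(channels), w_bias, np.zeros(channels))
print(out.shape)  # (5, 4)
```

<p>Every position in the sequence gets the same per-channel scale and shift, so the cost of conditioning does not grow with any extra convolutions.</p>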

<p><strong>Four conditioning routes</strong></p>

<ol>
  <li><strong>Noise input</strong> \(x_t\) (always left‑most in the stack).</li>
  <li><strong>Diffusion‑time embedding</strong> fed via <em>Combine</em> layers.</li>
  <li><strong>Text prompt</strong> sequence enters through cross‑attention.</li>
  <li>
    <p><strong>Low‑resolution audio or spectrogram</strong> (aligned) can be injected at the U‑Net bottleneck.</p>

    <p><img src="/assets/images/2025-07-25/1.png" alt="420" /></p>
  </li>
</ol>

<h2 id="33cascaded-diffusion-threestage-pipeline">3.3 Cascaded Diffusion: three‑stage pipeline</h2>

<p>Noise2Music follows the <em>Generator → Cascader → Super‑Resolution</em> recipe.</p>

<h3 id="331-waveform-model">3.3.1 Waveform Model</h3>

<p><strong>Generator</strong></p>

<ul>
  <li>Input: Text prompt</li>
  <li>A sequence of vectors derived from the text input is produced and fed into the network as a cross-attention sequence</li>
  <li>Outputs: 3.2 kHz <strong>waveform</strong></li>
</ul>

<p><strong>Cascader</strong></p>

<ul>
  <li>Inputs: Conditioned on both the text prompt and the low-fidelity audio generated by the generator model based on the text prompt.</li>
  <li>Outputs: 16 kHz waveform</li>
  <li>Method:
    <ul>
      <li>The text conditioning takes place via cross attention.</li>
      <li>Low-fidelity audio is upsampled and stacked with \(x_t\) and fed into the model.</li>
      <li>The upsampling is done by applying fast Fourier transform (FFT) to the low-fi audio sequence and then applying inverse FFT to obtain the high-fi audio from the low-fi Fourier coefficients.</li>
    </ul>
  </li>
</ul>
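<p>The FFT-based upsampling step can be illustrated with NumPy. This is a sketch of standard zero-padded (band-limited) interpolation, which matches the description above but is not taken from the authors' code; <code class="language-plaintext highlighter-rouge">fft_upsample</code> and the test signal are made up for the example.</p>

```python
import numpy as np

def fft_upsample(x, factor):
    """Upsample a 1-D signal by zero-padding its spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    # irfft with a larger n pads the missing high-frequency bins with zeros
    y = np.fft.irfft(spectrum, n=n * factor)
    return y * factor  # undo the 1/(n * factor) scaling of the inverse FFT

n, factor = 32, 5  # 3.2 kHz -> 16 kHz is a factor of 5
t = np.arange(n) / n
x = np.sin(2 * np.pi * 2 * t)  # a low-frequency test tone
y = fft_upsample(x, factor)
print(len(y))                       # 160
print(np.allclose(y[::factor], x))  # True: the original samples are preserved
```

<p>The upsampled signal contains no new high-frequency content, which is why the cascader still has to generate the missing detail.</p>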

<h3 id="332-spectrogram-model">3.3.2 Spectrogram Model</h3>

<p><strong>Generator</strong></p>

<ul>
  <li>Outputs: log‑mel spectrogram with 80 channels at 100 frames per second</li>
  <li>Pixel values of the log-mel spectrogram are normalized to lie within [−1, 1]</li>
</ul>

<p><strong>Vocoder</strong></p>

<ul>
  <li>Output: 16kHz audio that is conditioned only on the spectrogram</li>
</ul>

<h3 id="333-super-resolution-cascader">3.3.3 Super-Resolution Cascader</h3>

<ul>
  <li>Generate 24kHz audio from the 16kHz waveform produced by either model.</li>
  <li>The 16kHz audio is up-sampled and stacked with \(x_t\) as input to the model.</li>
  <li>Text conditioning is not used for this model.</li>
</ul>

<h2 id="34-text-understanding">3.4 Text Understanding</h2>

<p><strong>T5 encoder:</strong></p>

<ul>
  <li>The prompt’s token‑level embeddings, without pooling, are fed into cross‑attention layers throughout the U‑Net</li>
</ul>

<h2 id="35-pseudolabeling-for-music-data-data-creation">3.5 Pseudo‑Labeling for Music Data [DATA Creation]</h2>

<h3 id="351-why-pseudolabels-are-needed">3.5.1 Why pseudo‑labels are needed</h3>

<ul>
  <li>High‑quality <strong>music + free‑form caption</strong> pairs are rare.</li>
  <li>Without them, a text‑to‑music model can’t learn subtle descriptors like “laid‑back highway‑driving synthwave.”</li>
  <li>Solution: auto‑generate rich captions for millions of unlabeled tracks instead of hand‑annotating them.</li>
</ul>

<h3 id="352-models-used">3.5.2 Models Used</h3>

<p><strong>MuLan:</strong> A contrastive model with audio and text encoders that share an embedding space.</p>

<ul>
  <li>Lets you measure “text–audio similarity” with cosine distance (zero‑shot classification).</li>
</ul>

<p><strong>LaMDA:</strong> LLM trained for dialogue.</p>

<ul>
  <li>Used here to write human‑style music descriptions.</li>
</ul>

<h3 id="353-building-three-caption-vocabularies">3.5.3 Building three caption vocabularies</h3>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Size</th>
      <th>How it’s made</th>
      <th>Purpose / style</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>LaMDA‑LF</strong></td>
      <td>4 M long‑form sentences</td>
      <td>Prompt LaMDA with <em>title + artist</em> of 150 000 popular songs → clean &amp; deduplicate.</td>
      <td>Conversational, user‑prompt‑like prose.</td>
    </tr>
    <tr>
      <td><strong>Rater‑LF</strong></td>
      <td>35 333 sentences</td>
      <td>Split 10 028 expert captions from MusicCaps into single sentences.</td>
      <td>Human‑written, descriptive.</td>
    </tr>
    <tr>
      <td><strong>Rater‑SF</strong></td>
      <td>23 906 short tags</td>
      <td>Collect all short aspect tags from the same raters (mood, genre, instrument, etc.).</td>
      <td>Compact, label‑like keywords.</td>
    </tr>
  </tbody>
</table>

<p><img src="/assets/images/2025-07-25/2.png" alt="420" /></p>

<h3 id="354-assigning-captions-to-an-unlabeled-clip">3.5.4 Assigning captions to an unlabeled clip</h3>

<ol>
  <li><strong>Segment</strong> clip into 10‑s windows → feed each window to MuLan’s audio encoder.</li>
  <li><strong>Average</strong> those embeddings → one vector per clip.</li>
  <li><strong>Encode</strong> every caption in the vocabularies with MuLan’s text encoder.</li>
  <li><strong>Retrieve</strong> the <em>K = 10</em> closest captions (cosine similarity).</li>
  <li><strong>Sample</strong> <em>K′ = 3</em> of those 10, with probability \(∝\) 1 / global_frequency (rare captions get a boost).
    <ul>
      <li>Balances the label distribution and increases diversity.</li>
    </ul>
  </li>
  <li><strong>Store</strong> the selected captions as pseudo‑labels for that clip.</li>
</ol>

<p><em>Net effect</em>: each 30‑s clip can receive up to <strong>12 pseudo‑labels</strong> (3 from LaMDA‑LF, 3 from Rater‑LF, 6 from Rater‑SF) in addition to any inherent metadata.</p>
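<p>Steps 2 to 5 of the assignment procedure can be sketched with NumPy. Random vectors stand in for MuLan embeddings here, and the function name, the <code class="language-plaintext highlighter-rouge">caption_freq</code> array (a stand-in for the global frequency counts), and the toy sizes are all assumptions for illustration.</p>

```python
import numpy as np

def assign_pseudo_labels(window_embs, caption_embs, caption_freq,
                         k=10, k_prime=3, rng=None):
    """Return indices of k_prime captions sampled from the k nearest to the clip."""
    if rng is None:
        rng = np.random.default_rng()
    clip = window_embs.mean(axis=0)        # average the window embeddings
    clip = clip / np.linalg.norm(clip)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ clip                     # cosine similarity to every caption
    top_k = np.argsort(-sims)[:k]          # the K closest captions
    weights = 1.0 / caption_freq[top_k]    # rare captions get a boost
    probs = weights / weights.sum()
    return rng.choice(top_k, size=k_prime, replace=False, p=probs)

rng = np.random.default_rng(0)
window_embs = rng.normal(size=(3, 16))        # three 10 s windows of one clip
caption_embs = rng.normal(size=(100, 16))     # a toy caption vocabulary
caption_freq = rng.integers(1, 50, size=100)  # toy global frequency counts
picked = assign_pseudo_labels(window_embs, caption_embs, caption_freq, rng=rng)
print(len(picked))  # 3
```

<p>Running this once per vocabulary, with the appropriate number of samples, gives the per-clip pseudo-labels.</p>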

<h3 id="355-warmup-experiment-mulamcap">3.5.5 Warm‑up experiment: MuLaMCap</h3>

<ul>
  <li><strong>Source:</strong> AudioSet’s <em>music</em> subtree — 388 262 train clips + 4 497 test clips (each 10 s).</li>
  <li><strong>Labels per clip:</strong> 3 from LaMDA‑LF + 3 from Rater‑LF + 6 from Rater‑SF.</li>
  <li>Purpose: sanity‑check the pipeline before scaling to millions of tracks.</li>
</ul>

<h2 id="36trainingdata-mining-at-scale-data">3.6 Training‑Data Mining at Scale [DATA]</h2>

<ol>
  <li><strong>Raw audio pool</strong>
    <ul>
      <li><strong>6.8 million</strong> full‑length music tracks are collected.</li>
      <li>Each track is chopped into <strong>six non‑overlapping 30‑second clips</strong> → ~340 000 hours total.</li>
    </ul>
  </li>
  <li><strong>Sample rates</strong>
    <ul>
      <li><strong>24 kHz</strong> clips train the <em>super‑resolution</em> stage (it must output 24 kHz).</li>
      <li><strong>16 kHz</strong> versions of the same clips train every other model stage (saves compute).</li>
    </ul>
  </li>
  <li>
    <p><strong>Text labels attached to every clip</strong></p>

    <table>
      <thead>
        <tr>
          <th>Label source</th>
          <th>Count per clip</th>
          <th>What it adds</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td><strong>Song title</strong></td>
          <td>1</td>
          <td>“Hotel California”</td>
        </tr>
        <tr>
          <td><strong>Named‑entity tags</strong></td>
          <td>variable</td>
          <td>Genre, artist, instrument, year, etc.</td>
        </tr>
        <tr>
          <td><strong>LaMDA‑LF pseudo‑labels</strong></td>
          <td><strong>3</strong></td>
          <td>Rich sentences like “slow acoustic ballad for a summer evening.”</td>
        </tr>
        <tr>
          <td><strong>Rater‑SF pseudo‑labels</strong></td>
          <td><strong>6</strong></td>
          <td>Compact tags such as “laid‑back,” “highway‑driving,” “lo‑fi beats.”</td>
        </tr>
      </tbody>
    </table>

    <p><strong><em>Why skip Rater‑LF?</em></strong></p>

    <ul>
      <li>Those captions appear in the MusicCaps evaluation set; excluding them avoids train‑test leakage.</li>
    </ul>
  </li>
  <li><strong>Why mix “objective” and “subjective” labels?</strong>
    <ul>
      <li>Objective tags (genre, artist) nail down obvious metadata.</li>
      <li>Pseudo‑labels add nuances—mood, activity, fine‑grained compositional hints.</li>
      <li>Together they give the model both <em>facts</em> and <em>feelings</em> to learn from.</li>
    </ul>
  </li>
  <li><strong>Quality anchor inside the noisy sea</strong>
    <ul>
      <li>The authors add ≈ <strong>300 hours</strong> of internally curated, attribution‑free tracks.</li>
      <li>Each track’s rich metadata is concatenated into a single prompt string.</li>
      <li>Acts as a clean, high‑signal subset to stabilize training.</li>
    </ul>
  </li>
</ol>

<p><strong>Net result:</strong> a gigantic, diverse corpus where every 30‑s clip carries 10+ text descriptors that range from “objective metadata” to “subjective vibes,” providing the breadth of supervision a text‑to‑music diffusion model needs.</p>
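<p>As a quick check, the ~340 000 hours quoted for the raw audio pool follows directly from the track and clip counts:</p>

\[6.8 \times 10^{6}\ \text{tracks} \times 6\ \text{clips} \times 30\ \text{s} \;=\; 1.224 \times 10^{9}\ \text{s} \;=\; 340{,}000\ \text{hours}\]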

<h1 id="4-experiments-and-results">4 Experiments and Results</h1>

<h2 id="41-model-training-details">4.1 Model training details</h2>

<p><strong>Models trained:</strong> 4 separate 1‑D U‑Nets</p>

<ol>
  <li>Waveform <strong>Generator</strong> (3.2 kHz)</li>
  <li>Waveform <strong>Cascader</strong> (16 kHz)</li>
  <li>Spectrogram <strong>Generator</strong> (80 × 100 fps log‑mel)</li>
  <li>Spectrogram <strong>Vocoder</strong> (16 kHz)</li>
</ol>

<blockquote>
  <p>Final 24 kHz “super‑res” U‑Net is a light extension of the cascader.</p>
</blockquote>

<h3 id="loss-weighting"><strong>Loss Weighting</strong></h3>

<ul>
  <li><em>σ²</em>‑weighted MSE for spectrogram generator (critical for convergence)
    <ul>
      <li>Weighs the loss more heavily on the “back end” of the denoising schedule, i.e. the high‑noise timesteps where \(σ_t\) is large.</li>
    </ul>
  </li>
  <li>Either <em>σ²</em> or constant 1 for others</li>
</ul>

<blockquote>
  <p><strong>Note:</strong> All the models, with the exception of the vocoder, are trained on audio-text pairs, while the vocoder is only trained on audio.</p>
</blockquote>

<h3 id="text-batch-per-clip"><strong>Text batch per clip</strong></h3>

<ol>
  <li>Long‑form descriptions (3 items) come from the LaMDA‑LF vocabulary and are stored as three separate strings.</li>
  <li>Short tags and metadata - mashed together, then chopped to size
    <ol>
      <li>All of them are concatenated into one line</li>
      <li>If this string exceeds the token limit fixed in Table 2 (say 64 tokens), it is split into equal‑length chunks so that each chunk fits the limit.</li>
      <li>Every chunk counts as an additional candidate caption.</li>
    </ol>
  </li>
  <li>Total = 3 long sentences + 1 – 2 short chunks (depending on length).</li>
</ol>

<p>During training the loader <strong>randomly picks one element</strong> from that list and feeds it to the U‑Net as the text conditioning for this audio example.</p>

<ul>
  <li>So across epochs the network sees the same audio paired sometimes with a rich prose description, other times with a terse tag bundle—helping it learn both broad language and concise labels.</li>
</ul>
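
<p>The candidate list and the random pick can be sketched as follows. Whitespace splitting stands in for the real tokenizer, and all function names are hypothetical; this is an illustration of the steps above, not the authors' loader.</p>

```python
import random

def chunk_tokens(tokens, limit):
    # Split into the fewest near-equal chunks such that each one fits the limit.
    if not tokens:
        return []
    n_chunks = -(-len(tokens) // limit)   # ceiling division
    size = -(-len(tokens) // n_chunks)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def caption_candidates(long_forms, short_tags, limit=64):
    # Concatenate tags and metadata into one line, then chop to the token limit.
    tokens = " ".join(short_tags).split()
    chunks = [" ".join(c) for c in chunk_tokens(tokens, limit)]
    return list(long_forms) + chunks

def pick_caption(candidates, rng=random):
    # One caption is drawn uniformly each time the clip is loaded.
    return rng.choice(candidates)
```

<p>With three long‑form strings and 100 tag tokens at a limit of 64, this yields five candidates: the three sentences plus two 50‑token chunks.</p>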

<h3 id="more-details">More Details</h3>

<p><strong>Optimizer:</strong></p>

<ul>
  <li>Adam, \(β_1\) = 0.9,  \(β_2\) = 0.999</li>
</ul>

<p><strong>LR schedule:</strong></p>

<ul>
  <li>Cosine LR Scheduler, Max LR: 1 × 10⁻⁴</li>
  <li>End Point: Step 2.5 M, Warm-Up steps: 10 k</li>
</ul>
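
<p>A small sketch of the schedule with the numbers above. The linear warm‑up shape, the decay to exactly zero, and counting the 2.5 M end point from step 0 are assumptions; <code>learning_rate</code> is a hypothetical helper.</p>

```python
import math

def learning_rate(step, max_lr=1e-4, warmup_steps=10_000, end_step=2_500_000):
    # Linear warm-up from 0 to max_lr (assumed shape).
    if warmup_steps > step:
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr at the end of warm-up down to 0 at end_step.
    progress = min(1.0, (step - warmup_steps) / (end_step - warmup_steps))
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```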

<p><strong>Exponential Moving Average (EMA)</strong></p>

<p>Individual parameter updates from each mini‑batch are noisy. Averaging them over time gives a smoother, typically better‑generalising set of weights for inference.</p>

\[\theta^{\text{EMA}}_{t} = (1-\alpha)\,\theta^{\text{EMA}}_{t-1} + \alpha\,\theta_{t}\]

<ul>
  <li>Decay factor d = 1 − α = 0.9999 (so α = 10⁻⁴); the EMA weights are the ones used at inference time.</li>
  <li>Keep a second weight copy while training (EMA)</li>
  <li>Snapshot those EMA weights to disk</li>
  <li>At inference time → Load only the EMA copy (ignore the noisy “online” weights).</li>
  <li><strong>Why this works</strong>
    <ul>
      <li>Averages out the step‑to‑step noise of mini‑batch updates.</li>
      <li>EMA weights have seen every past setting of the model, so outliers cancel out.</li>
      <li>Empirically they yield crisper audio and fewer artifacts, especially for diffusion and GAN‑style generators.</li>
    </ul>
  </li>
</ul>
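
<p>The update itself is one line per parameter. A minimal sketch, assuming parameters are kept in a plain dict of arrays (<code>ema_update</code> is a hypothetical name):</p>

```python
import numpy as np

def ema_update(ema_params, online_params, decay=0.9999):
    # ema = decay * ema + (1 - decay) * online, applied to every parameter tensor.
    return {
        name: decay * ema_params[name] + (1.0 - decay) * online_params[name]
        for name in ema_params
    }
```

<p>With a decay of 0.9999 the average has an effective horizon of roughly 1 / (1 − d) = 10 000 steps, so the EMA copy only becomes representative after the first few tens of thousands of updates.</p>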

<p><strong>Batch Size</strong></p>

<ul>
  <li>4096 for the super-res cascader (since it’s lightweight)</li>
  <li>2048 for rest</li>
</ul>

<p><strong>CFG During Training</strong></p>

<ul>
  <li>Hide prompt for 10 % of samples (cross‑attention outputs zeroed)</li>
  <li>Teaches model to handle both conditional and unconditional cases, enabling CFG at inference.</li>
</ul>
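
<p>A rough sketch of both halves of CFG. As a simplification it zeroes the text embedding itself rather than the cross‑attention outputs, and the combination rule is the common “unconditional plus w times the difference” form, which is an assumption here. Function names are hypothetical.</p>

```python
import numpy as np

def drop_text_conditioning(text_emb, drop_prob=0.1, rng=None):
    # Training side: blank the conditioning for a random ~10 % of the batch.
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(text_emb.shape[0]) >= drop_prob
    mask = keep.reshape((-1,) + (1,) * (text_emb.ndim - 1))
    return text_emb * mask, keep

def cfg_combine(cond_pred, uncond_pred, w):
    # Inference side: w = 1 is the plain conditional prediction,
    # larger w pushes further away from the unconditional one.
    return uncond_pred + w * (cond_pred - uncond_pred)
```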

<p><strong>Sequence length seen by each model</strong></p>

<ul>
  <li>Generators: full 30 s clip</li>
  <li>Cascader &amp; vocoder: random 3–4 s windows
    <ul>
      <li>Cascader/vocoder don’t use self‑attention → can train on snippets, saving memory.</li>
    </ul>
  </li>
</ul>

<h3 id="data-augmentations-for-cascader--vocoder"><strong>Data augmentations (for cascader &amp; vocoder)</strong></h3>

<p>Randomly corrupt the conditioning low-fidelity audio or the spectrogram input by applying diffusion noise:</p>

<ul>
  <li>Random diffusion time is chosen within [0, \(t_{max}\)] and applied to the intermediate representation of the audio, i.e., the upsampled low-fi audio or the spectrogram.</li>
  <li>Cascader \(t_{max}\): 0.5</li>
  <li>Vocoder and super-res \(t_{max}\): 1.0</li>
</ul>

<p><strong>Blur Augmentation of conditioning input</strong></p>

<ul>
  <li>For the cascader model, a 1D Gaussian blur kernel of size 10 is used, with a standard deviation ranging from 0.1 to 5.0.</li>
  <li>For the vocoder model, a 2D 5x5 blur kernel is applied with the standard deviation ranging from 0.2 to 1.0.</li>
</ul>
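
<p>Both augmentations for the cascader can be sketched together. The order (blur first, then noise), applying both on every example, uniform sampling of the ranges, and the cosine noise schedule are assumptions; <code>augment_conditioning</code> is a hypothetical name.</p>

```python
import numpy as np

def gaussian_kernel_1d(size, std):
    # Normalised Gaussian taps centred on the middle of the kernel.
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / std) ** 2)
    return k / k.sum()

def augment_conditioning(lowfi, t_max=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Blur: size-10 Gaussian kernel with std drawn from [0.1, 5.0].
    std = rng.uniform(0.1, 5.0)
    blurred = np.convolve(lowfi, gaussian_kernel_1d(10, std), mode="same")
    # Diffusion noise at a random time in [0, t_max] (cosine schedule assumed).
    t = rng.uniform(0.0, t_max)
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    noisy = alpha * blurred + sigma * rng.standard_normal(lowfi.shape)
    return noisy, t
```

<p>The idea is that at inference the cascader receives imperfect output from the previous stage, so training it on deliberately degraded conditioning makes it less brittle.</p>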

<p><img src="/assets/images/2025-07-25/3.png" alt="420" /></p>

<h2 id="42-model-inference-and-serving">4.2 Model inference and serving</h2>

<h3 id="421-model-inference">4.2.1 Model Inference</h3>

<ol>
  <li><strong>Three knobs you can turn</strong>
    <ul>
      <li><strong>Denoising schedule</strong> – how you spread the diffusion steps along time t∈[0, 1].</li>
      <li><strong>Stochasticity γ</strong> – 0 = deterministic (DDIM‑style), 1 = full randomness (DDPM‑style).</li>
      <li><strong>CFG scale w</strong> – how strongly the result must obey the text prompt (larger w → tighter match, but riskier artefacts).</li>
    </ul>
  </li>
  <li><strong>What “denoising schedule” really means</strong>
    <ul>
      <li>Imagine you have N small time jumps \(δ₁…δ_N\)  that must add up to 1.</li>
      <li><strong>Front‑heavy</strong>: many tiny steps right at the start (when audio is still noisy).</li>
      <li><strong>Uniform</strong>: equal spacing throughout.</li>
      <li><strong>Back‑heavy</strong>: more steps near the end (when audio is already fairly clean).</li>
      <li>Given a fixed budget of steps, choosing where to spend them is a trade‑off between global structure (benefits from early steps) and fine detail (benefits from late steps).</li>
    </ul>
  </li>
  <li><strong>Hyper‑parameter sets actually used</strong>
    <ul>
      <li>Each of the four U‑Nets (generator, cascader, spectrogram generator, vocoder) gets its own trio of settings (schedule type, γ, CFG scale).</li>
      <li>Those exact numbers live in Table 3 of the paper; the principle is the same: early models lean slightly “front‑heavy” and higher γ for creativity, while later refiners go “back‑heavy” and lower γ for polish.</li>
    </ul>
  </li>
</ol>
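
<p>To make the schedule idea concrete, here is a toy generator for N step sizes that sum to 1, following the front‑heavy / uniform / back‑heavy descriptions above. The geometric profile and the <code>ratio</code> knob are made up for illustration; the paper's actual schedules are the ones in its Table 3.</p>

```python
import numpy as np

def step_sizes(n_steps, kind="uniform", ratio=4.0):
    # Returns n_steps positive step sizes, in sampling order, that sum to 1.
    # front-heavy: small steps first, so sampling is densest early on.
    # back-heavy: small steps last, so sampling is densest near the end.
    if kind == "uniform":
        raw = np.ones(n_steps)
    elif kind == "front-heavy":
        raw = np.geomspace(1.0, ratio, n_steps)
    elif kind == "back-heavy":
        raw = np.geomspace(ratio, 1.0, n_steps)
    else:
        raise ValueError(kind)
    return raw / raw.sum()

def time_grid(deltas):
    # Sampling starts at t = 1 (pure noise) and ends at t = 0 (clean audio).
    return 1.0 - np.concatenate([[0.0], np.cumsum(deltas)])
```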

<p><img src="/assets/images/2025-07-25/4.png" alt="420" /></p>

<p><img src="/assets/images/2025-07-25/5.png" alt="420" /></p>

<h2 id="43-evaluation">4.3 Evaluation</h2>

<ol>
  <li><strong>Parameter Selection for the Models</strong>
    <ul>
      <li>The team used a handful of private “dev prompts,” listened, and chose the versions that subjectively sounded best within their compute budget.</li>
      <li>All metrics are computed on the <strong>16 kHz</strong> outputs straight from the cascader/vocoder — the 24 kHz super-resolution stage is skipped during evaluation.</li>
    </ul>
  </li>
  <li><strong>Evaluation Metrics</strong>
    <ol>
      <li><strong>Fréchet Audio Distance (FAD)</strong>: same idea as FID for images. Three encoders give three flavours:
        <ul>
          <li><strong>VGGish</strong> → general sonic quality.</li>
          <li><strong>Trill</strong> → vocal-centric quality.</li>
          <li><strong>MuLan audio encoder</strong> → high-level musical semantics.</li>
        </ul>
      </li>
      <li><strong>MuLan similarity</strong>: cosine similarity in the MuLan embedding space. Used two ways:
        <ul>
          <li><em>Text ↔ generated audio</em> (how well the clip matches its prompt).</li>
          <li><em>Ground-truth audio ↔ generated audio</em>.</li>
          <li>Randomly shuffled pairs give a “chance level” baseline.</li>
        </ul>
      </li>
    </ol>
  </li>
  <li><strong>Evaluation datasets</strong>
    <ul>
      <li><strong>MagnaTagATune (MTAT)</strong> — 21 638 clips with up to 188 tag labels concatenated into a single prompt; model generates a full 29-s clip.</li>
      <li><strong>AudioSet-Music-Eval</strong> — 1 482 ten-second clips; tags concatenated; model generates 30 s, middle 10 s are scored.</li>
      <li><strong>MusicCaps</strong> — 5.5 K ten-second clips with rater-written free-form captions; model generates 30 s, middle 10 s are scored.</li>
    </ul>
  </li>
</ol>
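
<p>Both metrics reduce to a few lines once the embeddings exist. The sketch below takes precomputed embedding matrices (rows are clips) as given; extracting them with VGGish, Trill or MuLan is out of scope, and the function names are hypothetical.</p>

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    # Fit a Gaussian to each set of embeddings and compare the two Gaussians.
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr(sqrt(cov_a cov_b)) from the eigenvalues of the product, which are real
    # and non-negative in exact arithmetic; numerical noise is clipped away.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def mean_cosine_similarity(x, y):
    # Row-wise cosine similarity between paired embeddings, averaged.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return float(np.mean(np.sum(x * y, axis=1)))
```

<p>The “chance level” baseline mentioned above corresponds to calling <code>mean_cosine_similarity</code> after shuffling the rows of one of the two inputs.</p>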

<h2 id="44-results">4.4 Results</h2>

<p><img src="/assets/images/2025-07-25/6.png" alt="420" /></p>

<h2 id="45-inference-time-ablations">4.5 Inference-time ablations</h2>

<ul>
  <li><strong>Classifier-free guidance (CFG) scale</strong>
    <ul>
      <li>Sweet spot around <strong>5–10</strong>; beyond that, FAD rises and audio gets over-compressed or distorted.</li>
      <li>Generator’s CFG weight matters more than its denoising schedule; for the cascader it’s the opposite.</li>
    </ul>
  </li>
  <li><strong>Denoising schedule shape</strong>
    <ul>
      <li>Cascader is very sensitive: front-heavy schedules hurt quality; back-heavy gives best FAD &amp; similarity.</li>
      <li>Generator is less sensitive; uniform vs. mildly front-heavy are both acceptable.</li>
    </ul>
  </li>
  <li><strong>Step count vs. quality (cost curve)</strong>
    <ul>
      <li>More steps in the <em>cascader/vocoder</em> nearly always help; extra steps in the <em>generator</em> give diminishing returns after a point.</li>
      <li>Plot shows the elbow where doubling steps adds little perceptual gain — useful for setting latency targets.</li>
    </ul>
  </li>
</ul>

<h1 id="5-more">5 More</h1>

<p><strong>Spectrogram vs. waveform cascades</strong></p>

<ul>
  <li><em>Spectrogram path</em>
    <ul>
      <li>Much cheaper to train and serve because the input sequence is short.</li>
      <li>Naturally keeps high‑frequency detail that a 3.2 kHz low‑fi waveform cannot contain.</li>
      <li>Down‑side: intermediate representations are hard for engineers to interpret/debug.</li>
    </ul>
  </li>
  <li><em>Waveform path</em>
    <ul>
      <li>Every intermediate output is an actual audio snippet, which makes debugging and hyper‑parameter tuning easier.</li>
      <li>Training/serving is costlier and sequence length limits scalability to very long clips.</li>
    </ul>
  </li>
</ul>

<p><strong>Open research directions</strong></p>

<ol>
  <li>Better interpretability and controllability.</li>
  <li>Stronger text–audio alignment (fewer “off‑prompt” generations).</li>
  <li>Lower training and inference cost.</li>
  <li>Longer outputs, plus tasks such as music in‑painting or style transfer—analogous to image editing with diffusion “paint‑over” techniques.</li>
</ol>]]></content><author><name></name></author><category term="speech" /><summary type="html"><![CDATA[6 Mar 2023 - Link - Website]]></summary></entry><entry><title type="html">Text-to-Audio-Models</title><link href="https://aayush9753.in/blog/2025/text-to-audio-models/" rel="alternate" type="text/html" title="Text-to-Audio-Models" /><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/text-to-audio-models</id><content type="html" xml:base="https://aayush9753.in/blog/2025/text-to-audio-models/"><![CDATA[<h2 id="my-journey-into-text-to-audio-models">My Journey into Text to Audio Models</h2>

<p>I am studying text-to-audio models, with a particular focus on music generation.</p>

<h1 id="text-to-music-models">Text-To-Music Models</h1>

<h3 id="1-mustango-toward-controllable-text-to-music-generation">1. <a href="https://aayush9753.github.io/mustango-toward-controllable-text-to-music-generation.html">Mustango: Toward Controllable Text-to-Music Generation</a></h3>
<p>Mustango is a diffusion-based text-to-music model that enables structured control over <strong>chords, beats, tempo,</strong> and <strong>key</strong> directly from natural-language prompts.</p>

<p><strong>MusicBench — Dataset Pipeline</strong></p>
<ol>
  <li><strong>Seed corpus:</strong> 5521 MusicCaps clips (10 s + captions).</li>
  <li><strong>Control sentences:</strong> append 0–4 beat/chord/key/tempo lines</li>
  <li><strong>Paraphrase:</strong> ChatGPT rephrasing</li>
  <li><strong>Filter:</strong> drop “poor‑quality/low‑fidelity” captions</li>
  <li><strong>11× augment:</strong> ±1‑3 semitones, ±5–25 % speed, crescendo/decrescendo volume → ≈37 k new samples.</li>
</ol>

<p><strong>Mustango Model</strong></p>
<ol>
  <li><strong>Latent space:</strong> AudioLDM VAE → latent z.</li>
  <li><strong>MuNet denoiser:</strong> UNet + hierarchical cross‑attention.
    <ul>
      <li>Inputs: FLAN‑T5 text emb; beat &amp; chord encodings. (<strong>Beat encoder</strong> and <strong>Chord encoder</strong>)</li>
    </ul>
  </li>
  <li><strong>Inference helpers:</strong>
    <ul>
      <li>DeBERTa beat predictor (meter + intervals).</li>
      <li>FLAN‑T5 chord predictor (time‑stamped chords).</li>
    </ul>
  </li>
  <li><strong>Output:</strong> 10‑s waveform obeying tempo, key, chords, beats when provided; graceful fallback when not.</li>
</ol>

<h3 id="2-noise2music-text-conditioned-music-generation-with-diffusion-models">2. <a href="https://aayush9753.github.io/noise2music-text-conditioned-music-generation-with-diffusion-models.html">Noise2Music: Text-conditioned Music Generation with Diffusion Models</a></h3>
<p>Generates a 30-second, 24 kHz music clip from a plain-language prompt.</p>

<p><strong>Training‑Data Pipeline</strong></p>
<ol>
  <li><strong>Raw audio pool:</strong> 6.8 M full‑length tracks → chopped into 30 s clips (~340 k h).</li>
  <li><strong>Caption vocabularies (built offline)</strong>
    <ul>
      <li><strong>LaMDA‑LF</strong> – 4M rich sentences (LLM‑generated).</li>
      <li><strong>Rater‑LF / SF</strong> – 35k long + 24k short human sentences/tags from MusicCaps.</li>
    </ul>
  </li>
  <li><strong>Embedding space scoring:</strong> Encode every clip (MuLan‑audio) &amp; every caption (MuLan‑text).</li>
  <li><strong>Pseudo‑labelling:</strong> For each clip pick top‑10 captions by cosine sim → sample 3 low‑frequency ones from each vocab (bias toward rarer labels).</li>
  <li><strong>Extra metadata:</strong> Append title, artist, genre, year, instrument tags.</li>
  <li><strong>Quality anchor:</strong> Inject ~300 h curated, attribution‑free tracks with rich manual metadata.</li>
  <li><strong>Dual‑rate storage:</strong> Keep 24 kHz (for super‑res stage) + 16 kHz copies (for the rest).</li>
  <li><strong>Final payload:</strong> Every 30 s clip carries 10+ text descriptors spanning objective tags → subjective vibes.</li>
</ol>
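
<p>The pseudo‑labelling step (top‑10 by cosine similarity, then three picks biased toward rarer captions) can be sketched like this. The inverse‑frequency weighting is an assumed form of the bias, the embeddings are taken as precomputed, and <code>pseudo_label</code> is a hypothetical name.</p>

```python
import numpy as np

def pseudo_label(clip_emb, caption_embs, caption_counts, top_k=10, n_pick=3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Cosine similarity between one clip and every caption in the vocabulary.
    clip = clip_emb / np.linalg.norm(clip_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ clip
    top = np.argsort(-sims)[:top_k]
    # Inverse-frequency weights favour captions that were assigned less often.
    weights = 1.0 / (1.0 + caption_counts[top])
    return rng.choice(top, size=n_pick, replace=False, p=weights / weights.sum())
```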

<p><strong>Model Stack (three‑stage diffusion cascade)</strong></p>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>I/O</th>
      <th>Role</th>
      <th>Key details</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Waveform Generator</strong></td>
      <td><em>Text → 3.2 kHz audio</em></td>
      <td>Sketch global structure.</td>
      <td>1‑D Efficient‑U‑Net; text fed via cross‑attention; CFG during sampling.</td>
    </tr>
    <tr>
      <td><strong>Waveform Cascader</strong></td>
      <td><em>Text + 3.2 kHz → 16 kHz audio</em></td>
      <td>Upsample &amp; refine.</td>
      <td>Receives up‑sampled low‑fi audio + prompt; blur/noise augmentation during training.</td>
    </tr>
    <tr>
      <td><strong>Super‑Res Cascader</strong></td>
      <td><em>16 kHz → 24 kHz audio</em></td>
      <td>Restore full bandwidth.</td>
      <td>No text conditioning; lightweight U‑Net.</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Spectrogram path</strong> (alt): parallel generator + vocoder pair that works in log‑mel space; cheaper but less interpretable.</p>
</blockquote>

<h3 id="3-stable-audio---fast-timing-conditioned-latent-audio-diffusion">3. <a href="https://aayush9753.github.io/stable-audio.html">Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion</a></h3>
<ol>
  <li>A <strong>convolutional VAE</strong> that efficiently compresses and reconstructs long stereo audio.</li>
  <li>It uses <strong>latent diffusion</strong>.</li>
  <li>It adds <strong>timing embeddings</strong>.</li>
</ol>

<p><strong>Dataset Construction</strong></p>
<ol>
  <li><strong>Collect</strong> 806284 stereo tracks (≈ 19500 h) from <strong>AudioSparx</strong>.</li>
  <li><strong>Pre‑process audio</strong>
    <ul>
      <li>Resample to <strong>44.1 kHz</strong>, stereo.</li>
      <li>Slice / pad each file to a fixed <strong>95.1 s</strong> window (4 194 304 samples).</li>
    </ul>
  </li>
  <li><strong>Build text prompts</strong> from metadata <em>on‑the‑fly</em>
    <ul>
      <li>Randomly sample descriptors (genre, mood, BPM, instruments).</li>
      <li>Emit either <strong>free‑form</strong> or <strong>structured</strong> text strings.</li>
    </ul>
  </li>
  <li><strong>Final sets</strong>
    <ul>
      <li>Same corpus trains the <strong>VAE</strong>, <strong>CLAP</strong> (text encoder), and <strong>latent diffusion U‑Net</strong>.</li>
    </ul>
  </li>
</ol>
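
<p>A sketch of the slice / pad step that also returns the two timing values named in the pipeline table below. Random crop positions and right‑padding with silence are assumptions, and <code>crop_or_pad</code> is a hypothetical helper.</p>

```python
import numpy as np

SAMPLE_RATE = 44_100
WINDOW = 4_194_304  # about 95.1 s at 44.1 kHz

def crop_or_pad(audio, rng=None):
    # audio has shape (channels, samples).
    rng = np.random.default_rng() if rng is None else rng
    total = audio.shape[1]
    if total > WINDOW:
        # Long file: take a random window and remember where it starts.
        start = int(rng.integers(0, total - WINDOW + 1))
        chunk = audio[:, start:start + WINDOW]
    else:
        # Short file: keep it whole and right-pad with silence.
        start = 0
        chunk = np.pad(audio, ((0, 0), (0, WINDOW - total)))
    seconds_start = start / SAMPLE_RATE
    seconds_total = total / SAMPLE_RATE
    return chunk, seconds_start, seconds_total
```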

<p><strong>Model Pipeline</strong></p>

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Key points</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>1. VAE</strong></td>
      <td>32× compression</td>
    </tr>
    <tr>
      <td><strong>2. Text encoder (CLAP<sub>ours</sub>)</strong></td>
      <td>trained from scratch</td>
    </tr>
    <tr>
      <td><strong>3. Timing embeddings</strong></td>
      <td>seconds_start, seconds_total; concatenated with text features</td>
    </tr>
    <tr>
      <td><strong>4. Latent U‑Net diffusion</strong></td>
      <td>907 M params</td>
    </tr>
    <tr>
      <td><strong>5. Inference</strong></td>
      <td>DPM-Solver++</td>
    </tr>
  </tbody>
</table>

<p><strong>Outcome:</strong> 44.1 kHz stereo audio, up to 95 s, fast (latent) diffusion with precise duration control via timing conditioning.</p>

<hr />

<p><em>More blog posts coming soon as I continue my learning journey…</em></p>]]></content><author><name></name></author><category term="journey" /><summary type="html"><![CDATA[My Journey into Text to Audio Models]]></summary></entry><entry><title type="html">Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion</title><link href="https://aayush9753.in/blog/2025/stable-audio/" rel="alternate" type="text/html" title="Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion" /><published>2025-07-25T00:00:00+00:00</published><updated>2025-07-25T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/stable-audio</id><content type="html" xml:base="https://aayush9753.in/blog/2025/stable-audio/"><![CDATA[<blockquote>
  <p>Evans, Zach, C. J. Carr, Josiah Taylor, Scott H. Hawley, and Jordi Pons. “Fast Timing-Conditioned Latent Audio Diffusion.” arXiv:2402.04825. Preprint, arXiv, May 13, 2024. https://doi.org/10.48550/arXiv.2402.04825.</p>
</blockquote>

<p><a href="https://github.com/Stability-AI/stable-audio-tools">Model-Code</a> - <a href="https://github.com/Stability-AI/stable-audio-metrics">Metrics</a> - <a href="https://stability-ai.github.io/stable-audio-demo/">Demo</a></p>

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#1-summary">Summary</a></li>
  <li><a href="#2-related-work">Related Work</a>
    <ul>
      <li><a href="#21-autoregressive-models-great-sound-painfully-slow">2.1 Autoregressive Models: Great Sound, Painfully Slow</a></li>
      <li><a href="#22-non-autoregressive-models-faster-still-limited">2.2 Non-Autoregressive Models: Faster, Still Limited</a></li>
      <li><a href="#23-diffusion-models">2.3 Diffusion Models</a></li>
      <li><a href="#24-high-sampling-rate--stereo-audio">2.4 High Sampling Rate &amp; Stereo Audio</a></li>
      <li><a href="#25-timing-conditioning">2.5 Timing Conditioning</a></li>
      <li><a href="#26-evaluation-metrics">2.6 Evaluation Metrics</a></li>
      <li><a href="#27-multitask-generation">2.7 Multitask Generation</a></li>
    </ul>
  </li>
  <li><a href="#3-architecture">Architecture</a>
    <ul>
      <li><a href="#31-variational-autoencoder-vae-compressing-audio-for-fast-diffusion">3.1 Variational Autoencoder (VAE): Compressing Audio for Fast Diffusion</a></li>
      <li><a href="#32-conditioning-telling-the-model-what-and-how-long-to-generate">3.2 Conditioning: Telling the Model What and How Long to Generate</a></li>
      <li><a href="#33-diffusion-model-the-brain-behind-the-music">3.3 Diffusion Model: The Brain Behind the Music</a></li>
      <li><a href="#34-inference-fast-controlled-sampling">3.4 Inference: Fast, Controlled Sampling</a></li>
    </ul>
  </li>
  <li><a href="#4-training">Training</a>
    <ul>
      <li><a href="#41-dataset-the-backbone">4.1 Dataset: The Backbone</a></li>
      <li><a href="#42-training-the-vae-compressing-without-losing-musicality">4.2 Training the VAE: Compressing Without Losing Musicality</a></li>
      <li><a href="#43-training-the-text-encoder-clap-from-scratch">4.3 Training the Text Encoder: CLAP, from Scratch</a></li>
      <li><a href="#44-training-the-diffusion-model">4.4 Training the Diffusion Model</a></li>
      <li><a href="#45-prompt-preparation-how-text-prompts-were-created">4.5 Prompt Preparation: How Text Prompts Were Created</a></li>
    </ul>
  </li>
  <li><a href="#5-methodology">Methodology</a>
    <ul>
      <li><a href="#51-quantitative-metrics">5.1 Quantitative Metrics</a></li>
      <li><a href="#52-qualitative-metrics">5.2 Qualitative Metrics</a></li>
      <li><a href="#53-evaluation-data">5.3 Evaluation Data</a></li>
      <li><a href="#54-baselines">5.4 Baselines</a></li>
    </ul>
  </li>
  <li><a href="#6-experiments">Experiments</a>
    <ul>
      <li><a href="#61-how-good-is-the-autoencoder">6.1 How Good Is the Autoencoder?</a></li>
      <li><a href="#62-which-text-encoder-works-best">6.2 Which Text Encoder Works Best?</a></li>
      <li><a href="#63-how-accurate-is-the-timing-conditioning">6.3 How Accurate Is the Timing Conditioning?</a></li>
      <li><a href="#64-how-does-it-compare-to-state-of-the-art">6.4 How Does It Compare to State-of-the-Art?</a></li>
      <li><a href="#65-how-fast-is-it">6.5 How Fast Is It?</a></li>
    </ul>
  </li>
  <li><a href="#section-7-conclusions">Conclusions</a></li>
</ol>

<hr />

<h1 id="1-summary">1 Summary</h1>

<p><strong>The Problem with Audio Diffusion</strong></p>

<p><strong>Raw audio</strong> is massive in size and complexity. That means:</p>

<ul>
  <li><strong>Training</strong> is slow and memory-intensive.</li>
  <li><strong>Inference</strong> (actually generating audio) is even slower, especially for <strong>long clips</strong> or <strong>stereo</strong> output.</li>
  <li>Another practical issue: most audio diffusion models only generate <strong>fixed-length clips</strong>.
    <ul>
      <li>A model trained on 30-second chunks will always give you exactly 30 seconds — even when your prompt suggests something shorter or longer. This is unnatural, especially for:
        <ul>
          <li><strong>Music</strong>, which has structure (like intros and outros)</li>
          <li><strong>Sound effects</strong>, which can be quick or long</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h3 id="stable-audio">Stable Audio</h3>

<ol>
  <li>A <strong>convolutional VAE</strong> that efficiently compresses and reconstructs long stereo audio.</li>
  <li>It uses <strong>latent diffusion</strong> — meaning it learns to denoise in a compressed, lower-dimensional space (the latent space), not on raw audio. This is <em>much faster</em> and allows for <em>longer generation</em>.</li>
  <li>It adds <strong>timing embeddings</strong> — so you can tell the model how long the output should be.</li>
</ol>

<p>That combo allows it to:</p>

<ul>
  <li>Generate <strong>up to 95 seconds</strong> of full-quality audio in just <strong>8 seconds</strong> on an A100 GPU</li>
  <li>Offer <strong>precise control</strong> over the duration and content</li>
  <li>Render <strong>stereo</strong> audio at <strong>44.1kHz</strong> — the same sample rate used in CDs</li>
</ul>

<h3 id="rethinking-audio-evaluation">Rethinking Audio Evaluation</h3>

<ol>
  <li><strong>Fréchet Distance with OpenL3</strong>: Measures how realistic the audio <em>sounds</em> by comparing it to real-world audio using perceptual embeddings.</li>
  <li><strong>KL Divergence</strong>: Quantifies how well the <em>semantic content</em> of the generated audio matches a reference.</li>
  <li><strong>CLAP Score</strong>: Assesses how well the generated audio aligns with the <em>text prompt</em>.</li>
</ol>

<p>They also go a step further by assessing:</p>

<ul>
  <li><strong>Musical structure</strong> (does it feel like a song, or just a loop?)</li>
  <li><strong>Stereo correctness</strong> (do the left and right channels make sense?)</li>
  <li><strong>Human perception</strong> (via qualitative studies)</li>
</ul>

<hr />

<h1 id="2-related-work">2 Related Work</h1>

<h3 id="21-autoregressive-models-great-sound-painfully-slow">2.1 Autoregressive Models: Great Sound, Painfully Slow</h3>

<p><strong>What they are:</strong></p>

<p>Autoregressive models generate audio one step (or token) at a time. Think of it like writing a sentence word by word — each decision depends on what came before.</p>

<p><strong>Examples:</strong></p>

<ul>
  <li><strong>WaveNet</strong> (2016): Generated audio from scratch using raw waveform values — high-quality but painfully slow.</li>
  <li><strong>Jukebox</strong> (2020): Compressed music into multi-scale latent tokens, then used transformers to model them.</li>
  <li><strong>MusicLM / MusicGen / AudioLM</strong>: Modern versions that use text prompts instead of artist/genre metadata and work on compressed audio tokens.</li>
</ul>

<p><strong>Problem:</strong> Even with compression, these models are slow to generate audio because of their step-by-step nature.</p>

<h3 id="22-non-autoregressive-models-faster-still-limited">2.2 Non-Autoregressive Models: Faster, Still Limited</h3>

<p><strong>What they are:</strong></p>

<p>These models try to speed up generation by skipping the step-by-step process.</p>

<p><strong>Examples:</strong></p>

<ul>
  <li><strong>Parallel WaveNet</strong>, <strong>GAN-based methods</strong>: Generate all samples in parallel, trained via distillation (Parallel WaveNet) or adversarial training (GANs).</li>
  <li><strong>VampNet / MAGNeT / StemGen</strong>: Use masked modeling (like BERT) or other tricks to avoid sequential generation.</li>
  <li><strong>Flow-matching</strong> models: Try to morph noise into data in a more direct way.</li>
</ul>

<p><strong>Problem:</strong> Many are limited in the duration they can handle (up to 20 seconds) or don’t focus on structured or stereo music.</p>

<h3 id="23-diffusion-models">2.3 Diffusion Models</h3>

<ol>
  <li><strong>End-to-End Diffusion:</strong> These generate raw waveforms directly (e.g., CRASH, DAG, Noise2Music). Powerful, but training is costly and slow.</li>
  <li><strong>Spectrogram Diffusion:</strong> Generate images of sound (spectrograms) and convert them back to waveforms.
    <ol>
      <li><strong>Riffusion</strong>: Generated audio by tweaking Stable Diffusion for spectrograms.</li>
      <li>Needs a separate vocoder (like HiFi-GAN) to reconstruct audio.</li>
    </ol>
  </li>
  <li><strong>Latent Diffusion (Stable Audio’s Approach)</strong>
    <ol>
      <li><strong>Moûsai, AudioLDM, JEN-1</strong>: All use VAE-based latents to make the job easier and faster.</li>
      <li><strong>AudioLDM</strong>: Generates spectrograms first, then inverts to audio.</li>
      <li><strong>Moûsai</strong>: Diffuses latents and directly decodes into audio.</li>
      <li><strong>JEN-1</strong>: A multitask model with dimensionality-reduced latents.</li>
      <li><strong>Stable Audio’s Differentiator:</strong> Also uses latent diffusion, <strong>but focuses on 44.1kHz stereo audio</strong>, supports <strong>up to 95 seconds</strong>, and introduces <strong>timing conditioning</strong> — something none of these others do.</li>
    </ol>
  </li>
</ol>

<h3 id="24-high-sampling-rate--stereo-audio">2.4 High Sampling Rate &amp; Stereo Audio</h3>

<p>Most past models:</p>

<ul>
  <li>Work in <strong>mono</strong> or <strong>low sample rates</strong> (16–24kHz).</li>
  <li>Or generate <strong>short clips</strong>.</li>
</ul>

<p><strong>Only a few</strong> (e.g., Moûsai, JEN-1) can handle stereo and high quality — but not efficiently, and not with variable length.</p>

<p><strong>Stable Audio’s edge:</strong> One of the <strong>first</strong> models to combine:</p>

<ul>
  <li><strong>44.1kHz stereo</strong></li>
  <li><strong>Up to 95 seconds</strong></li>
  <li><strong>Variable-length control via timing conditioning</strong></li>
</ul>

<h3 id="25-timing-conditioning">2.5 Timing Conditioning</h3>

<p>Timing conditioning was introduced by <strong>Jukebox</strong>, which used it in an autoregressive setting (e.g., telling the model where in the song a chunk came from).</p>

<p><strong>Stable Audio’s innovation:</strong> Brings <strong>timing embeddings</strong> into the world of latent diffusion — a first. These embeddings help the model <strong>control duration</strong> of output precisely, which is crucial for realistic music or sound effects.</p>

<h3 id="26-evaluation-metrics">2.6 Evaluation Metrics</h3>

<p><strong>Problem:</strong> Most metrics (e.g., from Kilgour et al.) were designed for 16kHz, mono, short-form audio.</p>

<p><strong>Stable Audio introduces:</strong></p>

<ul>
  <li><strong>OpenL3 Fréchet Distance</strong>: Like FID for music — checks realism.</li>
  <li><strong>KL divergence for semantic alignment</strong>: Checks if generated audio <em>matches</em> the idea.</li>
  <li><strong>CLAP score</strong>: Measures text-to-audio alignment.</li>
  <li><strong>Qualitative assessments</strong>: Musicality, stereo image, structure.</li>
</ul>

<h3 id="27-multitask-generation">2.7 Multitask Generation</h3>

<p>Some recent models (e.g., JEN-1) try to generate <strong>speech + music + sound</strong> in one system.</p>

<p><strong>Stable Audio’s focus:</strong> Just <strong>music and sound effects</strong> — not speech — for better domain-specific quality.</p>

<hr />

<h1 id="3-architecture">3 Architecture</h1>

<p>At a high level, it consists of <strong>three core components</strong>:</p>

<ol>
  <li>A <strong>Variational Autoencoder (VAE)</strong> to compress and decompress the audio</li>
  <li>A <strong>Conditioning system</strong> using <strong>text</strong> and <strong>timing</strong> embeddings</li>
  <li>A <strong>U-Net-based diffusion model</strong> that learns how to turn noise into music — fast and controllably</li>
</ol>

<p>Let’s walk through each of them.</p>

<h2 id="31-variational-autoencoder-vae-compressing-audio-for-fast-diffusion">3.1 Variational Autoencoder (VAE): Compressing Audio for Fast Diffusion</h2>

<p>Training and sampling on raw 44.1kHz stereo audio would be painfully slow and memory-intensive. That’s why Stable Audio uses a <strong>VAE</strong> to shrink the raw audio into a <strong>learnable latent space</strong> — a compact, lossy representation that still retains musical essence.</p>

<p><strong>Key Features:</strong></p>

<ul>
  <li><strong>Input</strong>: Stereo audio (2 channels) of arbitrary length.</li>
  <li><strong>Output</strong>: Latent tensor with 64 channels and 1/1024th the original length. That’s a <strong>32× compression</strong> in size.</li>
  <li><strong>Architecture</strong>: Based on the <strong>Descript Audio Codec</strong>, but without quantization.</li>
  <li><strong>Activations</strong>: Uses <strong>Snake activations</strong>, which help better reconstruct the audio at high compression — better than more common models like <strong>EnCodec</strong>, though at the cost of using more VRAM.</li>
</ul>

<p>This design allows the model to handle <em>long-form stereo audio</em> efficiently, which would otherwise be computationally infeasible.</p>
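
<p>The 32× figure follows directly from the shapes: the channel count goes up from 2 to 64 while the length shrinks by 1024. A quick check, using the 4 194 304‑sample training window as the example input (<code>latent_shape</code> is a hypothetical helper):</p>

```python
def latent_shape(num_samples, in_channels=2, latent_channels=64, downsample=1024):
    # Length shrinks by the downsampling factor; channels grow from 2 to 64.
    latent_len = num_samples // downsample
    ratio = (in_channels * num_samples) / (latent_channels * latent_len)
    return (latent_channels, latent_len), ratio

shape, ratio = latent_shape(4_194_304)
print(shape, ratio)  # (64, 4096) 32.0
```

<p>So a 95.1 s stereo clip becomes a 64 × 4096 latent, and that much shorter sequence is what the diffusion model has to process.</p>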

<!-- 🐍 Snake Activation card -->
<div class="snake-box">
  <h3>🐍 What is Snake Activation?</h3>

  <p><strong>Snake activation</strong> is a type of activation function introduced to help neural networks
  better represent <em>periodic</em> and <em>high‑frequency</em> patterns — like those commonly found in
  <strong>audio</strong> or <strong>waveforms</strong>.</p>

  <p>The function is defined as:</p>

  <p>
    $$\text{Snake}(x) \;=\; x \;+\; \frac{1}{\alpha}\,\sin^2(\alpha x)$$
  </p>

  <ul>
    <li><code>x</code> — input value</li>
    <li><code>&alpha;</code> — learnable parameter controlling the sinusoid’s frequency</li>
  </ul>

  <p>First proposed by <a href="https://arxiv.org/abs/2006.08195" target="_blank" rel="noopener">Ziyin et al., 2020</a>,
  the layer excels on continuous signals (e.g.&nbsp;audio).</p>

  <h4>Why Use Snake?</h4>
  <ul>
    <li>Standard activations <em>don’t</em> natively capture oscillations.</li>
    <li>Audio is highly periodic and rich in high‑frequency detail.</li>
    <li>Snake helps models learn and preserve those details during encoding/decoding.</li>
  </ul>

  <h4>Intuition</h4>
  <p>
    $$x + \frac{1}{\alpha}\sin^2(\alpha x)$$
  </p>
  <ul>
    <li><strong>Linear term <code>x</code></strong> → stable gradients.</li>
    <li><strong>Sinusoid</strong> → adaptive “wiggle”.</li>
    <li><strong>&alpha;</strong> → learns the optimal frequency per neuron.</li>
  </ul>

  <h4>Comparison to Other Activations</h4>
  <table>
    <thead>
      <tr><th>Activation</th><th>Pros</th><th>Cons</th></tr>
    </thead>
    <tbody>
      <tr><td>ReLU</td><td>Simple, fast</td><td>Cannot model periodic signals</td></tr>
      <tr><td>GELU</td><td>Smooth gradients</td><td>Still not ideal for oscillations</td></tr>
      <tr><td>Sinusoidal&nbsp;(SIREN)</td><td>Excellent for periodic data</td><td>Fixed frequency, harder to train</td></tr>
      <tr><td><strong>Snake</strong></td><td>Learnable periodicity + linear term</td><td>Slightly higher compute / VRAM</td></tr>
    </tbody>
  </table>
</div>
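<p>A minimal scalar sketch of the Snake formula from the card above, using only the standard library. In a real network \(\alpha\) would be a learnable tensor (one value per channel); here it is a plain float, and <code>snake_grad</code> is a helper added for illustration. Its closed form, \(1 + \sin(2\alpha x)\), follows directly from differentiating the formula and shows that the function never decreases.</p>

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x)."""
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_grad(x, alpha=1.0):
    """Derivative with respect to x: 1 + sin(2 * alpha * x), never negative."""
    return 1.0 + math.sin(2.0 * alpha * x)


for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, round(snake(x, alpha=2.0), 4), round(snake_grad(x, alpha=2.0), 4))
```

<p>A larger \(\alpha\) gives faster, smaller ripples on top of the identity line; a smaller \(\alpha\) gives slower, larger ones.</p>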

<!-- Minimal card styling (tweak colours to match your theme) -->
<style>
.snake-box{
  font-family: system-ui, sans-serif;
  background:#f9fffa;
  border:2px solid #39b34a;
  border-radius:8px;
  padding:1.25rem 1.5rem;
  margin:1.5rem 0;
  line-height:1.6;
}
.snake-box h3{margin-top:0}
.snake-box table{
  width:100%;
  border-collapse:collapse;
  margin-top:.5rem
}
.snake-box th,
.snake-box td{
  border:1px solid #cfe8cf;
  padding:.45rem .6rem;
}
.snake-box thead{
  background:#e7f9e9;
}
@media (prefers-color-scheme: dark){
  .snake-box{
    background:#112d14;
    border-color:#27a53a;
    color:#e6ffe9;
  }
  .snake-box thead{background:#16401d}
  .snake-box th,
  .snake-box td{border-color:#216e31}
}
</style>

<h2 id="32-conditioning-telling-the-model-what-and-how-long-to-generate">3.2 Conditioning: Telling the Model What and How Long to Generate</h2>

<p>To steer the model’s output, Stable Audio uses two kinds of conditioning signals: <strong>Text prompts</strong> and <strong>Timing embeddings</strong>.</p>

<h3 id="-text-encoder-clap-to-the-rescue">📝 Text Encoder: CLAP to the Rescue</h3>

<ul>
  <li>The team uses a <strong>CLAP-based encoder</strong> — a contrastive language-audio pretraining model.</li>
  <li>It’s trained <strong>from scratch</strong> on their own dataset (not just the open-source CLAP).</li>
  <li>Instead of using the final layer (as many do), they use the <strong>next-to-last hidden layer</strong>, inspired by practices in visual-language models like CLIP and Stable Diffusion. This layer tends to preserve more useful context for generation.</li>
  <li>These <strong>text embeddings</strong> are passed to the <strong>U-Net via cross-attention layers</strong>.</li>
</ul>

<blockquote>
  <p><strong>Why not T5 or MuLan?</strong></p>
  <ul>
    <li>Because CLAP learns <strong>audio-text relationships</strong>, making it more suitable for describing sound-rich prompts like “ambient rainforest with tribal drums”.</li>
  </ul>
</blockquote>

<h3 id="-timing-embeddings-fine-grained-control-over-duration">🕒 Timing Embeddings: Fine-Grained Control Over Duration</h3>

<p>Stable Audio pioneers the idea of <strong>timing-aware diffusion for audio</strong>. Here’s how it works:</p>

<ul>
  <li>From each training clip, two timing values are recorded:
    <ul>
      <li>seconds_start: Where the chunk begins in the original audio</li>
      <li>seconds_total: The full duration of the original audio</li>
    </ul>
  </li>
</ul>

<p>📌 Example:</p>

<p>If you sample a 95-sec chunk from a 180-sec track starting at 14s:</p>

<ul>
  <li>seconds_start = 14</li>
  <li>seconds_total = 180</li>
</ul>

<p>These are then turned into <strong>learned per-second embeddings</strong>, and <strong>concatenated with the text features</strong>. They are fed into the model via cross-attention.</p>

<p>During <strong>inference</strong>, you can set:</p>

<ul>
  <li>seconds_start = 0, seconds_total = 30 to get a <strong>30-sec output</strong></li>
  <li>The rest of the 95-sec window (here 65 sec) is generated as <strong>silence</strong></li>
</ul>
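<p>The bookkeeping behind the two examples above can be sketched as follows. The helper names are hypothetical, and treating the audible content as <code>seconds_total - seconds_start</code> is an assumption made for this illustration.</p>

```python
WINDOW_SECONDS = 95.0  # generation window described in the post


def training_timing(track_seconds, chunk_start_seconds):
    """Timing values recorded for a chunk cut from a longer track."""
    return {"seconds_start": chunk_start_seconds, "seconds_total": track_seconds}


def trailing_silence(seconds_start, seconds_total, window=WINDOW_SECONDS):
    """Seconds of the window left over after the requested content ends."""
    content = max(0.0, seconds_total - seconds_start)
    return max(0.0, window - content)


print(training_timing(180, 14))   # {'seconds_start': 14, 'seconds_total': 180}
print(trailing_silence(0, 30))    # 65.0
```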

<p>💡 Why this matters:</p>

<ul>
  <li>Supports <strong>variable-length generation</strong></li>
  <li>Eliminates hardcoded clip lengths</li>
  <li>Allows users to request specific durations</li>
</ul>

<p>And yes — silence padding is easy to trim afterward.</p>

<h2 id="33-diffusion-model-the-brain-behind-the-music">3.3 Diffusion Model: The Brain Behind the Music</h2>

<p>The actual denoising (i.e. generation) happens in a <strong>U-Net diffusion model</strong> with <strong>907M parameters</strong>. It’s inspired by <strong>Moûsai</strong> and tailored to scale up with long latents.</p>

<h3 id="u-net-design">U-Net Design</h3>

<ul>
  <li><strong>4 Levels</strong> of encoder-decoder blocks</li>
  <li>Downsampling factors: <strong>1×, 2×, 2×, 4×</strong> (i.e. progressively compress along length)</li>
  <li>Channel sizes: <strong>1024, 1024, 1024, 1280</strong></li>
  <li><strong>Skip connections</strong> between encoder and decoder layers maintain resolution-specific features</li>
</ul>
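<p>A quick sketch of what those downsampling factors do to the sequence length. It assumes a 4,096-frame latent (one 4,194,304-sample window divided by 1,024) and that each factor is applied once per level; <code>level_lengths</code> is a name made up here.</p>

```python
def level_lengths(latent_frames, factors=(1, 2, 2, 4)):
    """Sequence length after each encoder level's downsampling."""
    lengths = []
    current = latent_frames
    for factor in factors:
        current //= factor
        lengths.append(current)
    return lengths


channels = (1024, 1024, 1024, 1280)
for width, length in zip(channels, level_lengths(4096)):
    print(f"{width} channels x {length} frames")
```

<p>Under these assumptions the deepest level works on 256 frames, which is where attention is cheapest.</p>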

<h3 id="inside-each-block">Inside Each Block</h3>

<ul>
  <li><strong>2 Conv residual layers</strong></li>
  <li><strong>1 to 3 attention layers</strong>:
    <ul>
      <li><strong>Self-attention</strong></li>
      <li><strong>Cross-attention</strong> for text + timing</li>
    </ul>
  </li>
  <li><strong>Bottleneck block</strong> between encoder and decoder with 1280 channels</li>
  <li><strong>Fast attention kernels</strong> (from Dao et al., 2022) to optimize memory and speed</li>
</ul>

<h3 id="conditioning-layers">Conditioning Layers</h3>

<ul>
  <li><strong>FiLM (Feature-wise Linear Modulation)</strong> layers inject <strong>timestep noise level info</strong> (i.e. how noisy the latent currently is)</li>
  <li><strong>Cross-attention</strong> layers inject <strong>text + timing information</strong></li>
</ul>

<!-- 🎞️ FiLM (Feature‑wise Linear Modulation) card -->
<div class="film-box">
  <h3>🎞️ FiLM (Feature‑wise Linear Modulation)</h3>

  <p><strong>FiLM</strong>, introduced by
    <a href="https://arxiv.org/abs/1709.07871" target="_blank" rel="noopener">Perez et al., 2017</a>,
    lets a neural network <em>adapt its internal features</em> using an external input
    — e.g.&nbsp;text, labels, or (for diffusion models) the <strong>timestep</strong>.</p>

  <h4>The Core Idea</h4>
  <p>
    Given a feature map \(F \in \mathbb{R}^{C \times H \times W}\) and a conditioning vector \(c\),
    FiLM learns per‑channel scale &amp; shift:
  </p>

  <p>
    $$\text{FiLM}(F;\gamma,\beta) \;=\; \gamma(c)\,F \;+\; \beta(c)$$
  </p>

  <ul>
    <li><strong>\(\gamma(c)\)</strong> — MLP outputs channel‑wise <em>scales</em></li>
    <li><strong>\(\beta(c)\)</strong> — MLP outputs channel‑wise <em>shifts</em></li>
  </ul>

  <h4>In Diffusion Models</h4>
  <p>
    The timestep \(t\) is embedded, passed through MLPs to get \(\gamma(t)\) and \(\beta(t)\),
    then applied:
  </p>

  <p>
    $$\text{FiLM}(x) = \gamma(t)\,x + \beta(t)$$
  </p>

  <p class="callout">
    <em>Effect:</em> The network “knows” how noisy the input is and modulates its
    features accordingly: coarse structure while the latent is still very noisy, fine detail once most of the noise is gone.
  </p>

  <h4>Why Not Just Concatenate the Timestep?</h4>
  <ul>
    <li><strong>More expressive</strong> — can amplify or suppress specific channels per step.</li>
    <li><strong>Modular</strong> — injects conditioning exactly where needed.</li>
    <li><strong>Widely used</strong>: scale-and-shift conditioning on the timestep embedding is common in diffusion U-Nets.</li>
  </ul>
</div>
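<p>Here is a minimal pure-Python sketch of the FiLM operation itself. In a real model \(\gamma\) and \(\beta\) come out of an MLP applied to the timestep embedding; the fixed values below are placeholders chosen only to show the effect.</p>

```python
def film(features, gamma, beta):
    """Apply per-channel scale and shift.

    features: list of channels, each a list of values
    gamma, beta: one scale and one shift per channel
    """
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 channels, 3 time steps
gamma = [2.0, 0.0]   # amplify channel 0, suppress channel 1
beta = [0.5, -1.0]
print(film(features, gamma, beta))  # [[2.5, 4.5, 6.5], [-1.0, -1.0, -1.0]]
```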

<!-- Minimal styling; tweak to match your theme -->
<style>
.film-box{
  font-family: system-ui, sans-serif;
  background:#f3f7ff;
  border:2px solid #4d7cff;
  border-radius:8px;
  padding:1.25rem 1.5rem;
  margin:1.5rem 0;
  line-height:1.6;
}
.film-box h3{margin-top:0}
.film-box ul{margin:0 0 1rem 1rem;padding:0}
.film-box .callout{
  background:#e6edff;
  border-left:4px solid #4d7cff;
  padding:.5rem .75rem;
  border-radius:4px;
}
@media (prefers-color-scheme: dark){
  .film-box{
    background:#0e162d;
    border-color:#6f8dff;
    color:#eaf0ff;
  }
  .film-box .callout{
    background:#1b2647;
    border-left-color:#6f8dff;
  }
}
</style>

<h2 id="34-inference-fast-controlled-sampling">3.4 Inference: Fast, Controlled Sampling</h2>

<p>During inference, Stable Audio uses:</p>

<ul>
  <li><strong>DPMSolver++</strong>: A fast, high-quality diffusion sampler</li>
  <li><strong>Classifier-free guidance (CFG)</strong>: Amplifies the conditioning signal (scale = 6)</li>
  <li><strong>100 diffusion steps</strong>: Chosen as a balance between speed and audio quality (details in Appendix A)</li>
</ul>

<p>💡 The final audio:</p>

<ul>
  <li>Can be <strong>up to 95 sec</strong></li>
  <li>Will contain silence after your specified seconds_total</li>
  <li>
    <p>Silence can be <strong>trimmed post-hoc</strong> — works reliably due to strong timing embeddings (as validated in Section 6.3)</p>

    <p><img src="/assets/images/2025-07-29/a.png" alt="420" /></p>
  </li>
</ul>
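<p>The guidance step can be sketched with the standard classifier-free guidance formula. The post only gives the scale (6); the exact form used inside Stable Audio is assumed here to be the usual one, and <code>cfg_combine</code> is a hypothetical helper.</p>

```python
def cfg_combine(uncond, cond, scale=6.0):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]
cond = [1.0, 1.0]
print(cfg_combine(uncond, cond))  # [6.0, 1.0]
```

<p>With a scale of 1 this returns the conditional prediction unchanged; larger scales exaggerate whatever the prompt changed.</p>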

<!-- ⚡ DPMSolver++ card -->
<div class="dpm-box">
  <h3>⚡ DPMSolver++ (Fast Diffusion Sampler)</h3>

  <p><strong>DPMSolver++</strong> (DPM-Solver++, where DPM stands for <em>Diffusion Probabilistic Model</em>) is
  a <em>fast &amp; accurate</em> sampler for diffusion models, introduced by
  <a href="https://arxiv.org/abs/2211.01095" target="_blank" rel="noopener">Lu et al., 2022</a>.</p>

  <h4>Why Sampling Matters</h4>
  <ul>
    <li>Diffusion starts with pure noise and denoises over <code>T</code> steps.</li>
    <li>Each step = one forward pass → <strong>speed bottleneck</strong>.</li>
    <li>Vanilla DDPM sampling typically uses around 1,000 steps; DPMSolver++ can deliver high‑quality
        samples in <strong>≈ 15 – 100 steps</strong>.</li>
  </ul>

  <h4>What Makes DPMSolver++ Special?</h4>
  <ul>
    <li><strong>ODE‑based formulation</strong>: sampling is treated as numerically solving the diffusion ODE instead of simulating every small stochastic step.</li>
    <li><strong>Higher‑order solvers</strong>: 2nd / 3rd‑order integration keeps accuracy
        at large step sizes.</li>
    <li><strong>Data‑prediction updates</strong>: the solver works with the predicted clean sample, which keeps guided sampling (e.g. with CFG) stable.</li>
    <li>Reported to outperform DDIM, PLMS, etc., at similar step counts.</li>
  </ul>

  <h4>Practical Upshot</h4>
  <p class="callout">
    Going from a 1,000‑step sampler to 100 DPMSolver++ steps means roughly <strong>10× fewer network evaluations</strong>, with little or no
    loss in perceptual quality.
  </p>
</div>

<!-- Minimal styling; tweak to fit your theme -->
<style>
.dpm-box{
  font-family: system-ui,sans-serif;
  background:#fff7f3;
  border:2px solid #ff7d47;
  border-radius:8px;
  padding:1.25rem 1.5rem;
  margin:1.5rem 0;
  line-height:1.6;
}
.dpm-box h3{margin-top:0}
.dpm-box ul{margin:0 0 1rem 1rem;padding:0}
.dpm-box .callout{
  background:#ffe9df;
  border-left:4px solid #ff7d47;
  padding:.5rem .75rem;
  border-radius:4px;
}
@media (prefers-color-scheme: dark){
  .dpm-box{
    background:#2b160a;
    border-color:#ff9467;
    color:#ffece6;
  }
  .dpm-box .callout{
    background:#472416;
    border-left-color:#ff9467;
  }
}
</style>

<h1 id="4-training">4 Training</h1>

<h2 id="41-dataset-the-backbone">4.1 Dataset: The Backbone</h2>

<p>Stable Audio is trained on a <strong>massive dataset</strong> of <strong>806,284 audio files</strong> totaling <strong>19,500 hours</strong> from <a href="https://www.audiosparx.com/">AudioSparx</a>, a stock music provider.</p>

<h3 id="dataset-breakdown">Dataset Breakdown:</h3>

<ul>
  <li><strong>Music</strong>: 66% of the files (or 94% of total audio hours)</li>
  <li><strong>Sound effects</strong>: 25% of files (5% of hours)</li>
  <li><strong>Instrument stems</strong>: 9% of files (1% of hours)</li>
</ul>

<p>Each file comes with rich <strong>text metadata</strong>, including:</p>

<ul>
  <li>Descriptions (e.g., “epic orchestral cinematic rise”)</li>
  <li>BPM</li>
  <li>Genre</li>
  <li>Mood</li>
  <li>Instrument labels</li>
</ul>

<p>📌 <strong>The dataset is public</strong> for consultation — a win for transparency and reproducibility.</p>

<h2 id="42-training-the-vae-compressing-without-losing-musicality">4.2 Training the VAE: Compressing Without Losing Musicality</h2>

<p>The <strong>VAE</strong> (used to compress audio into latents) was trained on <strong>16 A100 GPUs</strong> using <strong>automatic mixed precision</strong> (AMP) for <strong>1.1 million steps</strong>.</p>

<h3 id="amp">AMP</h3>

<p><strong>What is Automatic Mixed Precision (AMP)?</strong></p>

<p><strong>AMP</strong> is a technique that allows deep learning models to <strong>use both 16-bit (float16) and 32-bit (float32)</strong> floating-point numbers during training — <strong>automatically</strong>.</p>

<p>Traditionally, models are trained in <strong>float32</strong> precision (a.k.a. FP32), which is precise but:</p>

<ul>
  <li>Slower to compute</li>
  <li>Uses more GPU memory</li>
</ul>

<p>With AMP:</p>

<ul>
  <li>Some operations (like matrix multiplications) are done in <strong>float16 (FP16)</strong> — faster and smaller</li>
  <li>Others (like loss computation or gradient updates) stay in <strong>float32</strong> — more stable and accurate</li>
</ul>

<p>The “automatic” part means <strong>you don’t need to manually specify which ops use which precision</strong> — your framework (like PyTorch or TensorFlow) figures it out for you.</p>

<p><strong>Pros:</strong></p>

<ul>
  <li><strong>Faster training</strong>: On GPUs with tensor cores, such as NVIDIA A100s or V100s, FP16 operations can be <strong>2–8× faster</strong> than FP32.</li>
  <li><strong>Lower memory usage</strong>: FP16 uses <strong>half the memory</strong>, so you can train <strong>larger models or bigger batches</strong>.</li>
  <li><strong>Same or similar accuracy</strong>: Thanks to dynamic loss scaling and smart casting, AMP usually retains almost all the performance of full-precision training.</li>
</ul>

<p><strong>Challenges:</strong></p>

<ul>
  <li>FP16 has a <strong>narrower range</strong> of values (can underflow or overflow), which may cause instability if used naively.</li>
  <li>That’s why AMP keeps sensitive operations in <strong>FP32</strong>, like:
    <ul>
      <li>Loss calculation</li>
      <li>Gradient accumulation</li>
      <li>Batch norm updates</li>
    </ul>
  </li>
</ul>
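<p>The underflow and overflow problem, and why loss scaling helps, can be seen without a GPU. This is only an illustration using Python's <code>struct</code> half-precision format, not how a framework implements AMP; the gradient value and the scale factor of 1024 are made up.</p>

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8          # below the smallest positive FP16 value (~6e-8)
loss_scale = 1024.0       # hypothetical scale factor

print(to_fp16(tiny_grad))                  # 0.0, the gradient is lost
scaled = to_fp16(tiny_grad * loss_scale)   # survives in FP16
print(scaled / loss_scale)                 # close to 1e-8 after unscaling

try:
    to_fp16(70000.0)                       # above the FP16 maximum (65504)
except OverflowError as err:
    print("overflow:", err)
```

<p>Scaling the loss up before the backward pass and dividing the gradients back down in full precision is the core idea behind dynamic loss scaling.</p>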

<h3 id="strategy">Strategy</h3>

<ul>
  <li><strong>Phase 1</strong>: Train both encoder and decoder for 460,000 steps.</li>
  <li><strong>Phase 2</strong>: <strong>Freeze the encoder</strong>, fine-tune the decoder for 640,000 more steps — this improves reconstruction fidelity without changing latent space.</li>
</ul>

<h3 id="loss-functions">Loss Functions</h3>

<p>They used a carefully crafted loss mix focused on <strong>stereo audio fidelity</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Loss Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>🎧 <strong>STFT Loss</strong></td>
      <td>Multi-resolution <strong>sum-and-difference STFT</strong> (to ensure left/right stereo correctness), applied after <strong>A-weighting</strong> to match human hearing</td>
    </tr>
    <tr>
      <td>🧠 <strong>Adversarial Loss</strong></td>
      <td>From a <strong>multi-scale STFT discriminator</strong> with patch-based hinge loss (encourages realism)</td>
    </tr>
    <tr>
      <td>🧪 <strong>Feature Matching</strong></td>
      <td>Matches internal features of real vs generated audio</td>
    </tr>
    <tr>
      <td>📉 <strong>KL Loss</strong></td>
      <td>Keeps the latent space well-behaved</td>
    </tr>
  </tbody>
</table>
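<p>The “sum-and-difference” part of the STFT loss refers to comparing the sum and the difference of the two channels rather than left and right directly. A tiny sketch, assuming no extra scaling factor (some implementations divide by 2 or \(\sqrt{2}\)):</p>

```python
def sum_and_difference(left, right):
    """Return the (sum, difference) signals for a stereo pair."""
    total = [a + b for a, b in zip(left, right)]
    diff = [a - b for a, b in zip(left, right)]
    return total, diff


left = [0.5, 0.25, -0.5]
right = [0.5, -0.25, 0.0]
print(sum_and_difference(left, right))  # ([1.0, 0.0, -0.5], [0.0, 0.5, -0.5])
```

<p>The difference signal is zero wherever the two channels agree, so errors in it point specifically at stereo-image mistakes.</p>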

<p><strong>Window sizes</strong> for STFT:</p>

<p>[2048, 1024, 512, 256, 128, 64, 32] (for reconstruction) and</p>

<p>[2048, 1024, 512, 256, 128] (for adversarial discriminator)</p>

<p><strong>Loss weights</strong>:</p>

<ul>
  <li>STFT loss: <strong>1.0</strong></li>
  <li>Adversarial: <strong>0.1</strong></li>
  <li>Feature matching: <strong>5.0</strong></li>
  <li>KL divergence: <strong>1e-4</strong></li>
</ul>

<p>This blend ensures high <strong>fidelity</strong>, <strong>structure</strong>, and <strong>stereo realism</strong> in reconstruction.</p>
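<p>Combining the terms is just a weighted sum. The weights below are the ones listed above; the per-term values in <code>example</code> are invented to show the arithmetic.</p>

```python
LOSS_WEIGHTS = {
    "stft": 1.0,
    "adversarial": 0.1,
    "feature_matching": 5.0,
    "kl": 1e-4,
}


def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical per-term values for one training step.
example = {"stft": 2.0, "adversarial": 1.0, "feature_matching": 0.1, "kl": 100.0}
print(total_loss(example))
```

<p>Note how the tiny KL weight lets the KL term be numerically large without dominating the reconstruction losses.</p>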

<h2 id="43-training-the-text-encoder-clap-from-scratch">4.3 Training the Text Encoder: CLAP, from Scratch</h2>

<p>They trained their <strong>CLAP model</strong> (contrastive language-audio pretraining) from scratch on the same dataset.</p>

<h3 id="setup">Setup:</h3>

<ul>
  <li><strong>100 epochs</strong></li>
  <li><strong>Batch size</strong>: 6,144</li>
  <li><strong>Hardware</strong>: 64 A100 GPUs</li>
  <li>Uses the original CLAP configuration:
    <ul>
      <li><strong>RoBERTa-based text encoder</strong> (110M parameters)</li>
      <li><strong>HTSAT-based audio encoder</strong> (31M parameters)</li>
    </ul>
  </li>
  <li><strong>Loss</strong>: Language-audio contrastive loss</li>
</ul>
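<p>For intuition, here is a small pure-Python sketch of a symmetric language-audio contrastive loss of the kind CLAP-style models use. The temperature of 0.07 and the toy 2-D embeddings are assumptions for the example, not values from the paper.</p>

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_diag(logits):
    """Mean cross-entropy where row i should pick column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy."""
    logits = [[cosine(a, t) / temperature for t in text_emb] for a in audio_emb]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(audio, matched_text))  # near zero
print(contrastive_loss(audio, swapped_text))  # large
```

<p>Matching pairs sit on the diagonal of the similarity matrix, so the loss is small only when each clip is closest to its own caption.</p>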

<p>🎯 Result: A <strong>multimodal text encoder</strong> closely aligned with their dataset. In the ablation of Section 6.2 it slightly outperforms open-source CLAP and T5.</p>

<h2 id="44-training-the-diffusion-model">4.4 Training the Diffusion Model</h2>

<p>Once the VAE and CLAP were ready, they trained the <strong>latent diffusion model</strong>.</p>

<h3 id="setup-1">Setup:</h3>

<ul>
  <li><strong>640,000 steps</strong></li>
  <li><strong>64 A100 GPUs</strong></li>
  <li><strong>Batch size</strong>: 256</li>
  <li><strong>Exponential moving average (EMA)</strong> of model weights</li>
  <li><strong>AMP enabled</strong> for memory-efficient training</li>
</ul>

<h3 id="audio-preparation">Audio Preparation:</h3>

<ul>
  <li><strong>Resample</strong> to <strong>44.1kHz</strong></li>
  <li><strong>Slice to exactly 95.1 seconds</strong> (4,194,304 samples)
    <ul>
      <li><strong>Crop</strong> long files from random point</li>
      <li><strong>Pad</strong> short ones with <strong>silence</strong></li>
    </ul>
  </li>
</ul>
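<p>The crop-or-pad step can be sketched like this. It works on a flat list of samples for simplicity (real code would handle stereo tensors), and <code>crop_or_pad</code> is a hypothetical helper, not a function from the Stable Audio codebase.</p>

```python
import random

SAMPLE_RATE = 44_100
WINDOW_SAMPLES = 4_194_304  # about 95.1 s at 44.1 kHz


def crop_or_pad(samples, target=WINDOW_SAMPLES, rng=random):
    """Return exactly `target` samples plus the crop start offset."""
    if len(samples) >= target:
        start = rng.randint(0, len(samples) - target)
        return samples[start:start + target], start
    padding = [0.0] * (target - len(samples))
    return samples + padding, 0


print(round(WINDOW_SAMPLES / SAMPLE_RATE, 1))   # 95.1
clip, start = crop_or_pad([0.1, 0.2, 0.3], target=5)
print(clip, start)                              # [0.1, 0.2, 0.3, 0.0, 0.0] 0
```

<p>The returned start offset is what you would convert to seconds to obtain <code>seconds_start</code> for the timing conditioning.</p>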

<h3 id="objective">Objective:</h3>

<ul>
  <li><strong>v-objective</strong> (Salimans &amp; Ho, 2022): A more stable variant of the denoising objective</li>
  <li><strong>Cosine noise schedule</strong> (smoothly decays noise over time)</li>
  <li><strong>Continuous timestep sampling</strong></li>
</ul>
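<p>A scalar sketch of the v-objective with a cosine schedule. The specific form \(\alpha_t = \cos(\pi t / 2)\), \(\sigma_t = \sin(\pi t / 2)\) is a common choice and is assumed here; the post does not spell it out.</p>

```python
import math


def cosine_schedule(t):
    """Signal and noise scales for a continuous timestep t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noisy_input_and_target(x0, noise, t):
    """Noised latent z_t and the v target for one value."""
    alpha, sigma = cosine_schedule(t)
    z_t = alpha * x0 + sigma * noise
    v = alpha * noise - sigma * x0
    return z_t, v


def recover_x0(z_t, v, t):
    """Invert the parameterization: x0 = alpha * z_t - sigma * v."""
    alpha, sigma = cosine_schedule(t)
    return alpha * z_t - sigma * v


z_t, v = noisy_input_and_target(x0=0.8, noise=-0.3, t=0.25)
print(recover_x0(z_t, v, t=0.25))  # approximately 0.8
```

<p>The network is trained to predict <code>v</code> from <code>z_t</code>; because \(\alpha_t^2 + \sigma_t^2 = 1\), a good <code>v</code> prediction gives back both the clean latent and the noise.</p>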

<p>💡 <strong>Dropout (10%)</strong> applied to the conditioning inputs → this enables <strong>classifier-free guidance</strong> during inference (a trick borrowed from image models).</p>
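<p>Conditioning dropout is simple to sketch. How the “empty” conditioning is represented is an implementation detail not covered in this post, so <code>None</code> below is just a stand-in.</p>

```python
import random


def drop_conditioning(conditioning, null_conditioning, p_drop=0.1, rng=random):
    """With probability p_drop, replace the conditioning by a null value."""
    return null_conditioning if rng.random() < p_drop else conditioning


rng = random.Random(0)
draws = [drop_conditioning("prompt", None, rng=rng) for _ in range(10_000)]
print(draws.count(None) / len(draws))  # close to 0.1
```

<p>Because the model sees both conditioned and unconditioned examples, it can produce both predictions at inference time, which is exactly what classifier-free guidance needs.</p>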

<p>Note: <strong>Text encoder was frozen</strong> during diffusion training — so only the U-Net learns how to use its features.</p>

<h2 id="45-prompt-preparation-how-text-prompts-were-created">4.5 Prompt Preparation: How Text Prompts Were Created</h2>

<p>Each audio file had rich metadata, but not all of it was equally useful all the time.</p>

<p>So they used <strong>dynamic prompt construction</strong> during training:</p>

<ul>
  <li>Create <strong>synthetic natural-language prompts</strong> by <strong>randomly sampling</strong> metadata fields.</li>
  <li>Two styles:
    <ol>
      <li>
        <p><strong>Structured</strong>:</p>

        <p><code class="language-plaintext highlighter-rouge">Instruments: Guitar, Drums | Moods: Uplifting, Energetic</code></p>
      </li>
      <li>
        <p><strong>Free-form</strong>:</p>

        <p><code class="language-plaintext highlighter-rouge">Guitar, Drums, Bass Guitar, Uplifting, Energetic</code></p>
      </li>
    </ol>
  </li>
  <li>Shuffle the items to prevent the model from overfitting to order.</li>
</ul>
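<p>The prompt construction above can be sketched in a few lines. The field names, the even split between the two styles, and the <code>build_prompt</code> helper are assumptions for this illustration.</p>

```python
import random


def build_prompt(metadata, rng, structured=None):
    """Build a prompt from a random subset of metadata fields.

    metadata: dict mapping a field name to a list of strings
    structured: force a style, or None to pick one at random
    """
    fields = list(metadata)
    chosen = rng.sample(fields, rng.randint(1, len(fields)))
    if structured is None:
        structured = rng.random() < 0.5
    parts = []
    for field in chosen:
        values = list(metadata[field])
        rng.shuffle(values)  # avoid a fixed item order
        if structured:
            parts.append(f"{field}: {', '.join(values)}")
        else:
            parts.extend(values)
    return " | ".join(parts) if structured else ", ".join(parts)


metadata = {
    "Instruments": ["Guitar", "Drums", "Bass Guitar"],
    "Moods": ["Uplifting", "Energetic"],
}
rng = random.Random(0)
print(build_prompt(metadata, rng, structured=True))
print(build_prompt(metadata, rng, structured=False))
```

<p>Calling it repeatedly on the same file yields a different prompt each time, which is the point: the model never gets to memorise one fixed caption per clip.</p>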

<p>This makes the model robust — it can understand both <strong>natural text</strong> and <strong>structured metadata</strong> during inference.</p>

<hr />

<h1 id="5-methodology">5 Methodology</h1>

<p>Generating high-quality, realistic, and text-aligned music or sound effects is already hard — but <strong>measuring</strong> how good that generation is? Even harder. Especially when you’re dealing with <strong>long-form, stereo, high-fidelity audio</strong>.</p>

<h2 id="51-quantitative-metrics">5.1 Quantitative Metrics</h2>

<h3 id="1-fdopenl3--realism">1. <strong>FDOpenL3</strong> — Realism</h3>

<p>The <strong>Fréchet Distance (FD)</strong> is a go-to metric in generative modeling. It checks <strong>how similar</strong> the statistics (mean, covariance) of generated content are to real content — in a learned feature space.</p>
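<p>For reference, when Gaussians are fitted to the reference embeddings (mean \(\mu_r\), covariance \(\Sigma_r\)) and to the generated embeddings (\(\mu_g\), \(\Sigma_g\)), the Fréchet distance is usually written as below. The symbols are notation introduced here for illustration.</p>

<p>
  $$\text{FD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2 \;+\; \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$
</p>

<p>Lower is better: the value is zero only when both the means and the covariances match.</p>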

<p><strong>Stable Audio’s Twist</strong></p>

<ul>
  <li>Instead of projecting audio into <strong>VGGish features</strong> (which are 16kHz and mono), they use <strong>OpenL3</strong>, which handles <strong>up to 48kHz</strong> and <strong>stereo</strong>.</li>
  <li><strong>Stereo-aware</strong>: They feed left and right channels <strong>separately</strong>, get OpenL3 features for each, and concatenate.</li>
  <li>For mono baselines, they simply <strong>copy the features</strong> to both sides.</li>
</ul>

<p>✅ <strong>FDOpenL3</strong> evaluates:</p>

<ul>
  <li>Realism of generated <strong>long-form</strong>, <strong>full-band</strong> stereo audio at <strong>44.1kHz</strong></li>
</ul>

<h3 id="2-klpasst--semantic-alignment">2. <strong>KLPaSST</strong> — Semantic Alignment</h3>

<p>How much do the generated sounds <strong>semantically match</strong> their reference content?</p>

<p>They use:</p>

<ul>
  <li><strong>PaSST</strong>: A strong audio tagging model trained on AudioSet</li>
  <li>Compute the <strong>KL divergence</strong> between the label probabilities of generated vs real audio</li>
</ul>

<p><strong>Stable Audio’s Twist:</strong></p>

<ul>
  <li>PaSST only supports up to <strong>32kHz</strong>, so they resample from 44.1kHz</li>
  <li>Audio is segmented into <strong>overlapping chunks</strong>, logits are averaged, and softmax is applied</li>
</ul>
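<p>The chunk-average-then-softmax recipe can be sketched as follows. The 3-class logits are invented, and the direction KL(reference || generated) is an assumption for the example.</p>

```python
import math


def mean_logits(chunk_logits):
    """Average per-class logits over all chunks of one clip."""
    n = len(chunk_logits)
    return [sum(column) / n for column in zip(*chunk_logits)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical 3-class logits for two chunks of each clip.
reference = softmax(mean_logits([[2.0, 0.0, -1.0], [1.0, 0.5, -1.0]]))
generated = softmax(mean_logits([[0.0, 2.0, -1.0], [0.5, 1.0, -1.0]]))
print(kl_divergence(reference, generated))
```

<p>Averaging logits before the softmax is what lets one clip of any length produce a single tag distribution.</p>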

<p>✅ <strong>KLPaSST</strong> captures:</p>

<ul>
  <li>Tag-level alignment (e.g., “rock”, “violin”, “clapping”)</li>
  <li>Works for <strong>variable-length</strong> audio, not just 10-second snippets</li>
</ul>

<h3 id="3-clapscore--prompt-adherence">3. <strong>CLAPscore</strong> — Prompt Adherence</h3>

<p>CLAP (Contrastive Language-Audio Pretraining) is used to measure how well the <strong>generated audio matches the text prompt</strong>.</p>

<p><strong>Stable Audio’s Twist:</strong></p>

<ul>
  <li>Instead of using just a single 10s crop (like prior works), they use <strong>feature fusion</strong>:
    <ul>
      <li>A <strong>global downsampled</strong> version of the full audio</li>
      <li>Plus <strong>3 random 10s crops</strong> from beginning, middle, and end</li>
    </ul>
  </li>
  <li>This fused signal is encoded using <strong>CLAP-LAION</strong> (trained on 48kHz)</li>
  <li>Both the text and audio embeddings are compared via <strong>cosine similarity</strong></li>
</ul>
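<p>Two pieces of this metric are easy to sketch: picking the three crops and scoring with cosine similarity. Splitting the clip into equal thirds is an assumption about how “beginning, middle, and end” is implemented, and the embeddings are toy vectors, not CLAP outputs.</p>

```python
import random


def fusion_crop_starts(duration, crop=10.0, rng=random):
    """Pick one random crop start inside each third of the clip."""
    third = duration / 3.0
    if third < crop:
        raise ValueError("clip too short for three non-overlapping crops")
    return [rng.uniform(i * third, (i + 1) * third - crop) for i in range(3)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


print(fusion_crop_starts(95.0, rng=random.Random(0)))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # about 0.7071
```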

<p>✅ <strong>CLAPscore</strong> tests:</p>

<ul>
  <li>How well long-form stereo audio <strong>adheres to the prompt</strong></li>
  <li>Works across full audio — intro, middle, and end</li>
</ul>

<h2 id="52-qualitative-metrics">5.2 Qualitative Metrics</h2>

<p>Beyond math and embeddings — what do humans think?</p>

<h3 id="human-evaluation-criteria">Human evaluation criteria:</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>🎧 <strong>Audio quality</strong></td>
      <td>Is it high-fidelity or noisy/low-res?</td>
    </tr>
    <tr>
      <td>✍️ <strong>Text alignment</strong></td>
      <td>Does the sound match the prompt?</td>
    </tr>
    <tr>
      <td>🎵 <strong>Musicality</strong></td>
      <td>Are melodies/harmonies coherent?</td>
    </tr>
    <tr>
      <td>🔊 <strong>Stereo correctness</strong></td>
      <td>Does the left/right channel sound appropriate?</td>
    </tr>
    <tr>
      <td>🏗️ <strong>Musical structure</strong></td>
      <td>Does the music have an intro, middle, and outro?</td>
    </tr>
  </tbody>
</table>

<h3 id="ratings-collected">Ratings Collected:</h3>

<ul>
  <li><strong>Audio quality, Text alignment, Musicality</strong>: Rated on a <strong>0–4 scale</strong> (bad → excellent)</li>
  <li><strong>Stereo correctness &amp; Musical structure</strong>: Binary (Yes/No)</li>
</ul>

<h3 id="special-rules">Special rules:</h3>

<ul>
  <li><strong>Musicality/structure</strong>: Only evaluated for <strong>music</strong></li>
  <li><strong>Stereo correctness</strong>: Only for <strong>stereo signals</strong></li>
  <li><strong>Non-music</strong>: Only quality, alignment, stereo correctness</li>
</ul>

<p>Evaluations were run using <strong>webMUSHRA</strong>, a standardized perceptual testing framework.</p>

<h2 id="53-evaluation-data">5.3 Evaluation Data</h2>

<p>They used two popular <strong>text-audio</strong> benchmarks:</p>

<h3 id="musiccaps">MusicCaps</h3>

<ul>
  <li>5,521 music clips with 1 text caption each</li>
  <li>YouTube-based, mostly stereo</li>
  <li><strong>Only 10-second clips</strong> — so Stable Audio generated longer clips (up to 95 sec)</li>
</ul>

<h3 id="audiocaps">AudioCaps</h3>

<ul>
  <li>979 test-set clips with <strong>4,875 total captions</strong></li>
  <li>Also YouTube-based, mostly stereo</li>
  <li>Focuses on <strong>environmental sounds and effects</strong></li>
</ul>

<h3 id="challenge">Challenge:</h3>

<ul>
  <li>Captions and reference audio cover only <strong>10-second</strong> clips, so reference comparisons are limited.</li>
  <li>Stable Audio still generates <strong>longer audio</strong> (up to 95 sec) from these prompts, so long-form outputs are scored against short references.</li>
</ul>

<h2 id="54-baselines">5.4 Baselines</h2>

<p>Some top models (e.g., <strong>Moûsai</strong>, <strong>JEN-1</strong>) weren’t comparable due to lack of open-source weights.</p>

<p>So they compared against <strong>open-source SOTA</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Type</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>AudioLDM2</strong></td>
      <td>Latent Diffusion</td>
      <td>48kHz mono and 16kHz variants</td>
    </tr>
    <tr>
      <td><strong>MusicGen</strong></td>
      <td>Autoregressive</td>
      <td>Small and large models, stereo version available</td>
    </tr>
    <tr>
      <td><strong>AudioGen</strong></td>
      <td>Autoregressive</td>
      <td>Medium-sized, for sound effects</td>
    </tr>
  </tbody>
</table>

<h3 id="notes">Notes:</h3>

<ul>
  <li>AudioLDM2 = best <strong>non-autoregressive</strong> open baseline</li>
  <li>MusicGen-stereo = best <strong>autoregressive stereo</strong> baseline</li>
  <li>MusicGen doesn’t model <strong>vocals</strong>, so vocal prompts were filtered in some tests</li>
</ul>

<hr />

<h2 id="6-experiments">6 Experiments</h2>

<h3 id="61-how-good-is-the-autoencoder">6.1. How Good Is the Autoencoder?</h3>

<p>To test how much audio quality is lost in the <strong>compression and decompression</strong> process (via the VAE), the authors:</p>

<ul>
  <li>Passed real training audio through the <strong>encoder → decoder pipeline</strong></li>
  <li>Compared the output to the original using <strong>FDOpenL3</strong></li>
</ul>

<p>🧠 <strong>Result</strong>: The autoencoded audio showed <strong>slightly worse FD scores</strong> than the original, but the degradation was <strong>minimal</strong>. Informal listening confirmed the fidelity is <strong>transparent</strong> — meaning humans barely notice the difference.</p>

<hr />

<h3 id="62-which-text-encoder-works-best">6.2. Which Text Encoder Works Best?</h3>

<p>They tested:</p>

<ul>
  <li><strong>CLAP-LAION</strong> (open-source)</li>
  <li><strong>CLAPours</strong> (trained from scratch on their dataset)</li>
  <li><strong>T5</strong> (text-only encoder)</li>
</ul>

<p>Each version was frozen during training, and the base model was trained for 350K steps.</p>

<p>🧠 <strong>Result</strong>: All performed comparably, but <strong>CLAPours slightly outperformed</strong> the others. Since it was trained on the same dataset as the diffusion model, it offered <strong>better vocabulary alignment</strong> and <strong>semantic grounding</strong>.</p>

<p>✔️ Final Choice: <strong>CLAPours</strong> — for consistency and performance.</p>

<hr />

<h3 id="63-how-accurate-is-the-timing-conditioning">6.3. How Accurate Is the Timing Conditioning?</h3>

<p>They tested if the model could:</p>

<ul>
  <li><strong>Generate audio of exactly the length requested</strong> via timing embeddings.</li>
  <li>Do this <strong>across many durations</strong> (from short to long).</li>
</ul>

<p>They used a <strong>simple energy-based silence detector</strong> to find where the real content ended in the generated 95s audio window.</p>
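<p>The post does not describe the detector in detail, so here is one plausible sketch of an energy-based version: split the signal into frames and report where the last frame above a threshold ends. The frame size and threshold are arbitrary values picked for the example.</p>

```python
def content_end_seconds(samples, sample_rate, frame=1024, threshold=1e-4):
    """Time at which the last frame with mean energy above threshold ends."""
    end = 0
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy > threshold:
            end = start + len(chunk)
    return end / sample_rate


# 2 s of a constant "signal" followed by 1 s of silence at a toy 1 kHz rate.
signal = [0.5] * 2000 + [0.0] * 1000
print(content_end_seconds(signal, sample_rate=1000, frame=100))  # 2.0
```

<p>A detector like this is fooled by very quiet passages or fade-outs, which matches the misreadings mentioned in the results.</p>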

<p>🧠 <strong>Result</strong>:</p>

<ul>
  <li>The model <strong>closely follows the expected duration</strong></li>
  <li>Most accurate at <strong>short (≤30s)</strong> and <strong>long (≥70s)</strong> durations</li>
  <li>Some variability around <strong>40–60 seconds</strong>, likely due to <strong>fewer training examples</strong> of this length</li>
  <li>Some misreadings caused by limitations of the silence detection method</li>
</ul>

<hr />

<h3 id="64-how-does-it-compare-to-state-of-the-art">6.4. How Does It Compare to State-of-the-Art?</h3>

<p>Benchmarks are shown in Tables 1–3 (not included here), comparing <strong>Stable Audio</strong> against:</p>

<ul>
  <li><strong>AudioLDM2</strong></li>
  <li><strong>MusicGen (small, large, stereo)</strong></li>
  <li><strong>AudioGen</strong></li>
</ul>

<h3 id="key-observations">Key Observations:</h3>

<ul>
  <li><strong>Best in audio quality</strong> and <strong>text alignment</strong> on <strong>MusicCaps</strong></li>
  <li>Slightly weaker on <strong>AudioCaps</strong> for text alignment, possibly due to fewer <strong>sound effects</strong> in its training set</li>
  <li><strong>Competitive in musicality</strong> and <strong>musical structure</strong></li>
  <li>Good at <strong>stereo rendering for music</strong> but weaker on <strong>stereo correctness for effects</strong> — possibly because some prompts don’t require spatial diversity</li>
  <li>Importantly, it’s <strong>the only model</strong> consistently capable of generating <strong>intro → development → outro</strong> — real musical structure, not just loops</li>
</ul>

<hr />

<h3 id="65-how-fast-is-it">6.5. How Fast Is It?</h3>

<p>They benchmarked <strong>inference time</strong> on a <strong>single A100 GPU</strong> (batch size = 1).</p>

<p>🧠 <strong>Result</strong>:</p>

<ul>
  <li><strong>Much faster</strong> than <strong>autoregressive models</strong> (e.g., MusicGen, AudioGen)</li>
  <li><strong>Faster than AudioLDM2</strong>, even when generating <strong>higher-quality audio</strong> (44.1kHz stereo vs. 16kHz or mono)</li>
  <li>Particularly <strong>faster than AudioLDM2-48kHz</strong>, which works at a similar bandwidth but takes longer</li>
</ul>

<p>✅ Latent diffusion + optimized architecture + DPMSolver++ = <strong>speed with quality</strong></p>

<hr />

<h2 id="section-7-conclusions">7 Conclusions</h2>

<p>Stable Audio proves that it’s possible to build a system that is:</p>

<ul>
  <li>🎵 <strong>Flexible</strong> (supports music and sound effects)</li>
  <li>⏱️ <strong>Fast</strong> (generates up to 95s in just 8s on an A100 GPU)</li>
  <li>🎧 <strong>High-fidelity</strong> (44.1kHz stereo)</li>
  <li>🧠 <strong>Controllable</strong> (via text + timing conditioning)</li>
  <li>🧪 <strong>Well-evaluated</strong> (with new long-form-aware metrics)</li>
</ul>

<p>It pushes the frontier in multiple areas:</p>

<ul>
  <li>One of the first systems to consistently generate <strong>structured music</strong></li>
  <li>Among the few to generate <strong>stereo sound effects</strong></li>
  <li>Introduces <strong>new metrics</strong> for evaluating long-form, full-band, stereo generation</li>
  <li>Outperforms or competes with the state of the art on multiple benchmarks</li>
</ul>

<hr />]]></content><author><name></name></author><category term="speech" /><summary type="html"><![CDATA[Evans, Zach, C. J. Carr, Josiah Taylor, Scott H. Hawley, and Jordi Pons. “Fast Timing-Conditioned Latent Audio Diffusion.” arXiv:2402.04825. Preprint, arXiv, May 13, 2024. https://doi.org/10.48550/arXiv.2402.04825.]]></summary></entry><entry><title type="html">Denoising Diffusion Probabilistic Models</title><link href="https://aayush9753.in/blog/2025/denoising-diffusion-probabilistic-models/" rel="alternate" type="text/html" title="Denoising Diffusion Probabilistic Models" /><published>2025-07-23T00:00:00+00:00</published><updated>2025-07-23T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/denoising-diffusion-probabilistic-models</id><content type="html" xml:base="https://aayush9753.in/blog/2025/denoising-diffusion-probabilistic-models/"><![CDATA[<blockquote>
  <p>Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models.” arXiv:2006.11239. Preprint, arXiv, December 16, 2020. https://doi.org/10.48550/arXiv.2006.11239.</p>
</blockquote>

<p><a href="https://github.com/hojonathanho/diffusion">Code</a></p>

<h2 id="table-of-contents">Table of Contents</h2>

<ul>
  <li><a href="#abstract">Abstract</a></li>
  <li><a href="#diffusion-models">Diffusion Models</a>
    <ul>
      <li><a href="#the-forward-process-data--noise">The Forward Process (Data → Noise)</a></li>
      <li><a href="#the-reverse-process-noise--data">The Reverse Process (Noise → Data)</a></li>
    </ul>
  </li>
  <li><a href="#training-objective">Training Objective</a>
    <ul>
      <li><a href="#variational-lower-bound-loss">Variational Lower Bound Loss</a></li>
      <li><a href="#expanding-the-variational-bound">Expanding the Variational Bound</a></li>
      <li><a href="#rewriting-the-loss">Rewriting the Loss</a></li>
      <li><a href="#parameterisation-trick-predicting-noise-instead-of-image">Parameterisation Trick</a></li>
      <li><a href="#the-simplified-objective">The Simplified Objective</a></li>
    </ul>
  </li>
  <li><a href="#training-algorithm">Training Algorithm</a>
    <ul>
      <li><a href="#connection-to-score-matching">Connection to Score Matching</a></li>
      <li><a href="#more">More</a></li>
    </ul>
  </li>
  <li><a href="#experiments">Experiments</a></li>
</ul>

<h1 id="abstract">Abstract</h1>

<ul>
  <li>High quality image synthesis results using diffusion probabilistic models.
    <ul>
      <li><strong>Latent‑variable model</strong> – The model assumes there’s an unobserved variable <em>z</em> that, after some transformation, produces your image <em>x</em>.</li>
    </ul>
  </li>
  <li>Trained on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics.</li>
</ul>

<p><img src="/assets/images/2025-07-23/a.png" alt="420" /></p>

<h1 id="diffusion-models">Diffusion Models</h1>

<p>Diffusion probabilistic models (aka diffusion models) are a class of generative models that learn to create high-quality samples by reversing a gradual corruption process.</p>

<p>More Details on Diffusion models →  <a href="https://aayush9753.github.io/diffusion-models.html">diffusion-models.html</a></p>

<h3 id="the-forward-process-data--noise">The Forward Process (Data → Noise)</h3>

<p>The forward process is a fixed Markov chain that gradually corrupts the data by adding Gaussian noise according to a variance schedule \(β_1, \ldots, β_T\):</p>

\[q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})\]

\[q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\; \sqrt{1-β_t}\, x_{t-1},\; β_t I\bigr)\]

<p><strong>Key aspects:</strong></p>

<ul>
  <li><strong>\(x_0\)</strong>: Original clean data</li>
  <li><strong>\(x_1, x_2, \ldots, x_T\)</strong>: Progressively noisier versions</li>
  <li><strong>\(β_t\)</strong>: Variance schedule controlling how much noise is added at each step</li>
  <li>This process is <strong>fixed</strong> and doesn’t require learning</li>
</ul>

<h3 id="the-reverse-process-noise--data">The Reverse Process (Noise → Data)</h3>

<p>The reverse process learns to undo the forward corruption:</p>

\[p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^T p_θ(x_{t-1}|x_t)\]

\[p_\theta(x_{t-1} \mid x_t) \;=\; \mathcal{N}\!\Bigl(  x_{t-1} \;;\;  \mu_\theta(x_t, t),  \;\Sigma_\theta(x_t, t)\Bigr)\]

<p><strong>Key aspects:</strong></p>

<ul>
  <li>Starts from pure noise: <strong>\(p(x_T) = N(x_T; 0, I)\)</strong></li>
  <li>Each step is a learned Gaussian transition</li>
  <li><strong>\(μ_θ\)</strong> and <strong>\(Σ_θ\)</strong> are neural network predictions</li>
</ul>
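<p>As a rough illustration of how the reverse chain is run at sampling time, here is a minimal sketch for a scalar \(x\). The names <code>reverse_sample</code>, <code>mean_fn</code> and <code>sigmas</code> are hypothetical: <code>mean_fn(x, t)</code> stands in for the network's \(μ_θ(x_t, t)\), and fixed standard deviations stand in for \(Σ_θ\). It is not the paper's code.</p>

```python
import random

def reverse_sample(mean_fn, sigmas, T, rng):
    """Run x_T -> x_{T-1} -> ... -> x_0 for a scalar x.

    mean_fn(x, t) is a hypothetical stand-in for mu_theta(x_t, t);
    sigmas[t] is the standard deviation used at step t (index 0 is unused).
    """
    x = rng.gauss(0.0, 1.0)           # x_T ~ N(0, 1): start from pure noise
    for t in range(T, 0, -1):         # t = T, T-1, ..., 1
        mean = mean_fn(x, t)
        # The sampling algorithm in the DDPM paper adds no noise on the last step.
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        x = mean + sigmas[t] * noise  # x_{t-1} ~ N(mean, sigma_t^2)
    return x

# Toy usage: a "denoiser" that simply shrinks x toward 0 at every step.
T = 10
sigmas = [0.0] + [0.1] * T
sample = reverse_sample(lambda x, t: 0.5 * x, sigmas, T, random.Random(0))
```

<p>In a real model the only change is that <code>mean_fn</code> is a trained neural network and \(x\) is an image tensor.</p>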

<hr />

<h1 id="training-objective">Training Objective</h1>

<p>The model is trained by optimising the variational bound:</p>

\[\mathcal{L} \;=\; \mathbb{E}_q \left[  -\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]\]

<p><strong>Efficient sampling property:</strong> The forward process allows sampling at any timestep <strong>t</strong> directly:</p>

\[q(x_t \mid x_0) \;=\;\mathcal{N}\!\Bigl(  x_t \;;\;  \sqrt{\bar{\alpha}_t} \, x_0,\;  (1 - \bar{\alpha}_t) I\Bigr)\]

<p>where <strong>\(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\)</strong> and <strong>\(\alpha_t = 1 - \beta_t\)</strong></p>
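<p>The closed form can be sanity-checked numerically. The sketch below assumes a scalar \(x\) and the linear schedule quoted later in this post; the helper names are made up for illustration. It pushes the mean coefficient and variance through the chain one step at a time and compares the result with \(\sqrt{\bar{\alpha}_T}\) and \(1 - \bar{\alpha}_T\).</p>

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_1..beta_T (assumed values)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod over s up to t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def iterate_forward_moments(betas):
    """Track x_t = c * x_0 + N(0, v) through the chain q(x_t | x_{t-1})."""
    c, v = 1.0, 0.0
    for b in betas:
        c = math.sqrt(1.0 - b) * c   # the mean is scaled by sqrt(1 - beta_t)
        v = (1.0 - b) * v + b        # old variance is scaled, then beta_t is added
    return c, v

betas = linear_betas(1000)
abar_T = alpha_bars(betas)[-1]
c, v = iterate_forward_moments(betas)
# c matches sqrt(alpha_bar_T) and v matches 1 - alpha_bar_T up to rounding.
```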

<h2 id="variational-lower-bound-loss">Variational Lower Bound Loss</h2>

<p>We want to model realistic image data. Our goal is to maximize the likelihood of real images under a generative model:</p>

\[\max\; p_\theta(x_0) = \max \int p_\theta(x_{0:T})\; dx_{1:T}\]

<p>But this marginalization over all possible noise trajectories is intractable. So, we approximate it using <strong>variational inference</strong>.</p>

<p>Starting from the log-likelihood:</p>

\[\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\; dx_{1:T}\]

<p>We reformulate the intractable log-likelihood using a known forward process (diffusion) \(q(x_{1:T} \mid x_0)\):</p>

\[\log p_\theta(x_0) = \log \int \frac{p_\theta(x_{0:T})\; q(x_{1:T} \mid x_0)}{q(x_{1:T} \mid x_0)} dx_{1:T}\]

\[= \log \mathbb{E}_q\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]\]

\[\geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \quad \text{[Jensen's inequality]}\]

<p>This gives us the <strong>variational lower bound</strong> on the log-likelihood. Negating it gives the loss we minimize during training, an upper bound on \(-\log p_\theta(x_0)\):</p>

\[\mathcal{L} = \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = \mathbb{E}_q\left[-\log p_\theta(x_{0:T}) + \log q(x_{1:T} \mid x_0)\right]\]

<p><strong>In context of image generation:</strong> We’re finding the best denoising path to generate realistic images from noise.</p>

<div style="border: 1px solid #ccc; border-radius: 8px; padding: 16px; background: #f9f9f9; font-family: sans-serif; line-height: 1.6;">
  <h3>🧮 Derivation of Above Loss</h3>

  <p><strong>How we got to the above loss:</strong></p>

  <ol>
    <li><strong>Original Integral:</strong><br />
      $$I = \int f(x)\, dx \quad \text{where } f(x) = p_\theta(x_{0:T})$$
    </li>

    <li><strong>Multiply and Divide by \( g(x) \):</strong><br />
      $$I = \int f(x) \cdot \frac{g(x)}{g(x)}\, dx \quad \text{where } g(x) = q(x_{1:T} \mid x_0)$$
    </li>

    <li><strong>Rearranged Form:</strong><br />
      $$I = \int \frac{f(x)}{g(x)} \cdot g(x)\, dx$$
    </li>

    <li><strong>Recognize as Expectation:</strong><br />
      $$I = \mathbb{E}_g\left[\frac{f(x)}{g(x)}\right]$$
    </li>

    <li><strong>Apply Jensen's Inequality:</strong><br />
      For a convex function \( \varphi \) and a random variable \( X \):<br />
      $$\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$$<br />
      For a concave function like \( \log \), the inequality flips:<br />
      $$\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$$<br />
      Therefore:<br />
      $$\log \mathbb{E}_q\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \quad \text{[Jensen's inequality]}$$
    </li>
  </ol>
</div>
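<p>Jensen's inequality for the logarithm is easy to check on a small example. The sketch below uses a made-up discrete random variable; <code>jensen_gap</code> is a hypothetical helper, not something from the paper.</p>

```python
import math

def jensen_gap(values, probs):
    """Return log E[X] - E[log X] for a positive discrete random variable X."""
    mean = sum(p * v for v, p in zip(values, probs))
    mean_log = sum(p * math.log(v) for v, p in zip(values, probs))
    return math.log(mean) - mean_log

# Hypothetical toy distribution: X takes the values 0.5, 1 and 4.
gap = jensen_gap([0.5, 1.0, 4.0], [0.2, 0.5, 0.3])
# The gap is positive here, and it vanishes when X is constant.
constant_gap = jensen_gap([2.0, 2.0], [0.5, 0.5])
```

<p>The gap between the two sides is exactly what the variational bound gives up relative to the true log-likelihood.</p>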

<h2 id="expanding-the-variational-bound">Expanding the Variational Bound</h2>

<p><strong>Reverse Process (learned):</strong></p>

\[p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\]

<p>This defines how we turn noise into an image.</p>

<p><strong>Forward Process (known):</strong></p>

\[q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})\]

<p>This adds noise step by step to a clean image.</p>

<p><strong>The Full Loss Function after using p and q from above:</strong></p>

<p>Breaking the bound into individual terms:</p>

\[\mathcal{L} = \mathbb{E}_q\left[-\log p(x_T) - \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t) + \sum_{t=1}^{T} \log q(x_t \mid x_{t-1})\right]\]

\[= \mathbb{E}_q\left[-\log p(x_T) - \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} - \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)}\right]\]

<p>This measures how well our reverse process undoes the forward noise corruption.</p>

<h2 id="rewriting-the-loss">Rewriting the Loss</h2>

<p>By Bayes’ Rule:</p>

\[q(x_t \mid x_{t-1}) \cdot q(x_{t-1} \mid x_0) = q(x_t \mid x_0) \cdot q(x_{t-1} \mid x_t, x_0)\]

<p>This allows us to rewrite the loss as:</p>

\[\mathcal{L} = \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))\right] + \sum_{t=2}^{T} \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right] + \mathbb{E}_q\left[-\log p_\theta(x_0 \mid x_1)\right]\]

<div style="border: 1px solid #ccc; border-radius: 8px; padding: 16px; background: #f9f9f9; font-family: sans-serif; line-height: 1.6;">
  <h3>How?</h3>

  <p><strong>1. Rewriting each log term as a KL divergence</strong></p>
  <p>We use the identity:</p>
  <p>
    $$\mathrm{D_{KL}}(q(z) \| p(z)) = \mathbb{E}_{q(z)}[\log q(z) - \log p(z)]$$
  </p>
  <p>This allows us to convert pairs of log terms into KL divergences, wherever we can express a tractable pair of distributions.</p>

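  <p><strong>2. KL terms for t = 2 to T</strong></p>
  <p>For steps t = 2 to T, the Markov property gives \( q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) \), so Bayes' rule lets us swap the direction of each forward factor:</p>
  <p>
    $$q(x_t \mid x_{t-1}) = q(x_{t-1} \mid x_t, x_0) \cdot \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
  </p>
  <p>Taking logs and pairing each factor with the matching reverse step:</p>
  <p>
    $$\log q(x_t \mid x_{t-1}) - \log p_\theta(x_{t-1} \mid x_t) = \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} + \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
  </p>
  <p>In expectation under \( q \), the first term on the right is a KL divergence:</p>
  <p>
    $$\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t))$$
  </p>

  <p><strong>3. Final timestep prior matching</strong></p>
  <p>The leftover ratios telescope when summed over t = 2 to T:</p>
  <p>
    $$\sum_{t=2}^{T} \log \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} = \log q(x_T \mid x_0) - \log q(x_1 \mid x_0)$$
  </p>
  <p>The \( -\log q(x_1 \mid x_0) \) cancels against the t = 1 forward term \( +\log q(x_1 \mid x_0) \). What remains combines with \( -\log p(x_T) \), and since \( q(x_T \mid x_0) \) is tractable this is again a KL divergence:</p>
  <p>
    $$\mathbb{E}_q\left[\log q(x_T \mid x_0) - \log p(x_T)\right] = \mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))$$
  </p>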

  <p><strong>4. Final step t = 1</strong></p>
  <p>There's no posterior \( q(x_0 \mid x_1, x_0) \), so we leave the log term as-is:</p>
  <p>
    $$-\log p_\theta(x_0 \mid x_1)$$
  </p>
</div>

<p><strong>Interpretation for image generation:</strong> We encourage our model to align with the known noise process and accurately reconstruct the original image.</p>

<p><strong>Named Components:</strong></p>

\[\mathcal{L} = \mathcal{L}_T + \sum_{t=2}^{T} \mathcal{L}_{t-1} + \mathcal{L}_0\]

<p>Each term in the loss plays a specific role:</p>

<h3 id="prior-matching-l_t"><strong>Prior Matching: \(L_T\)</strong></h3>

<p><strong>KL between final noisy state and prior</strong></p>

\[\mathcal{L}_T = \mathrm{D_{KL}}(q(x_T \mid x_0) \parallel p(x_T))\]

<ul>
  <li>Pushes noisy images to align with Gaussian noise</li>
  <li>Ensures the final forward process state:<br />
\(q(x_T \mid x_0)\) matches the prior: \(p(x_T) = \mathcal{N}(0, I)\)</li>
  <li>When \(\beta_t\) are fixed (not learned), this becomes a <strong>constant</strong> and can be ignored during training</li>
  <li>No parameters to optimize here!</li>
</ul>

<h3 id="denoising-kl-terms-l_t-1"><strong>Denoising KL Terms: \(L_{t-1}\)</strong></h3>

<p><strong>KL between forward and reverse process at each step</strong></p>

\[\mathcal{L}_{t-1} = \mathbb{E}_q\left[\mathrm{D_{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))\right]\]

<ul>
  <li>Forward posterior is tractable. We can compute it exactly:</li>
</ul>

\[q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\right)\]

<ul>
  <li>With:</li>
</ul>

\[\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t\]

\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t\]

<ul>
  <li>Ensures reverse denoising steps are accurate.</li>
</ul>

<p><strong>What it does:</strong></p>

<ul>
  <li><strong>Ground truth target</strong>:<br />
\(q(x_{t-1} \mid x_t, x_0)\) — the “true” way to denoise \(x_t\) when we know \(x_0\)</li>
  <li><strong>Model prediction</strong>:<br />
\(p_\theta(x_{t-1} \mid x_t)\) — what our model thinks is the right way to denoise</li>
  <li><strong>Training signal</strong>:<br />
Make the model’s denoising match the ground truth denoising</li>
</ul>

<h3 id="reconstruction-loss-for-final-denoising-step---l_0">Reconstruction loss for final denoising step - \(L_0\)</h3>

\[\mathcal{L}_0 = \mathbb{E}_q\left[-\log p_\theta(x_0 \mid x_1)\right]\]

<ul>
  <li>Handles the final step from slightly noisy image to clean discrete pixels</li>
  <li>Uses a discrete decoder to ensure proper pixel values {0,1,…,255}</li>
</ul>

<h2 id="parameterisation-trick-predicting-noise-instead-of-image"><strong>Parameterisation Trick: Predicting Noise instead of Image</strong></h2>

<h3 id="traditional-approach-directly-predict-μ_θx_t-t"><strong>Traditional approach: Directly predict \(μ_θ(x_t, t)\)</strong></h3>

<p><strong>Loss:</strong></p>

\[\mathcal{L}_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \|^2\right] + C\]
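<p>The squared-error form follows from the closed-form KL divergence between two Gaussians: the only part that depends on the means is \(\|\mu_q - \mu_p\|^2 / (2\sigma^2)\), and everything else is absorbed into \(C\). A one-dimensional sketch, with made-up numbers standing in for \(\tilde{\beta}_t\) and \(\sigma_t^2\):</p>

```python
import math

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for one-dimensional Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Hypothetical stand-ins: var_q plays the role of beta_tilde_t, var_p of sigma_t^2.
var_q, var_p = 0.03, 0.04
kl = kl_gaussian(0.3, var_q, 0.1, var_p)
mse_term = (0.3 - 0.1) ** 2 / (2.0 * var_p)     # the part that depends on the means
constant = kl_gaussian(0.0, var_q, 0.0, var_p)  # the part that does not
# kl equals mse_term + constant.
```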

<h3 id="noise-prediction">Noise Prediction</h3>

<p><strong>Instead of predicting the posterior mean directly, we reparameterize \(x_t\) in terms of the noise:</strong></p>

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)\]

<p>We then train the model to predict the noise \(\varepsilon\), and the loss becomes:</p>

\[\mathcal{L}_{t-1} = \mathbb{E}_{x_0, \varepsilon}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2\right]\]

<p><strong>What this means:</strong></p>

<ul>
  <li>Instead of predicting the denoising mean directly, <strong>predict the noise</strong></li>
  <li>The model learns: “Given a noisy image, what noise was added?”</li>
  <li>Much more stable and effective training signal!</li>
</ul>
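<p>To see where the weight in the loss above comes from, one can substitute \(x_0 = \bigl(x_t - \sqrt{1 - \bar{\alpha}_t}\,\varepsilon\bigr) / \sqrt{\bar{\alpha}_t}\) into the posterior mean \(\tilde{\mu}_t\). The following is a sketch of that algebra, using \(\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}\) and \(\beta_t = 1 - \alpha_t\). The form of \(\mu_\theta\) in the second line is the parameterisation chosen so that the model mean mirrors the true one:</p>

\[\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\varepsilon\right)\]

\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right)\]

\[\frac{1}{2\sigma_t^2}\,\| \tilde{\mu}_t - \mu_\theta \|^2 = \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\,\| \varepsilon - \varepsilon_\theta(x_t, t) \|^2\]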

<h2 id="the-simplified-objective">The Simplified Objective</h2>

<p>The full variational bound has complex weighting terms: \(\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\)</p>

<p>The paper proposes ignoring these weights:</p>

\[\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \varepsilon}\left[ \| \varepsilon - \varepsilon_\theta(x_t, t) \|^2 \right]\]

<p>Where:</p>

\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \varepsilon\]

\[t \sim \text{Uniform}(\{1, \ldots, T\})\]

<p>This balances all noise levels equally and avoids overfitting to low-noise timesteps.</p>

<p><strong>Problem with original weighting:</strong></p>

<ul>
  <li>Small t (little noise): Large weight, easy task</li>
  <li>Large t (lots of noise): Small weight, hard task</li>
  <li>Model focuses on easy denoising tasks!</li>
</ul>
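<p>The imbalance can be checked numerically. The sketch below assumes the linear schedule with \(β_1 = 10^{-4}\), \(β_T = 0.02\), \(T = 1000\) and the variance choice \(\sigma_t^2 = \beta_t\); <code>vlb_weights</code> is a hypothetical helper name.</p>

```python
def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_1..beta_T (assumed values)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def vlb_weights(betas):
    """Weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t)) for each t."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        sigma2 = beta  # assumption: the sigma_t^2 = beta_t variance choice
        weights.append(beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - alpha_bar)))
    return weights

w = vlb_weights(linear_betas(1000))
first, last = w[0], w[-1]  # weights at t = 1 and t = T
```

<p>Under these assumptions the weight is about 0.5 at \(t = 1\) and about 0.01 at \(t = T\), so the least noisy step counts roughly fifty times more than the noisiest one.</p>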

<p><strong>Solution with uniform weighting:</strong></p>

<ul>
  <li>Equal attention to all noise levels</li>
  <li>Model learns difficult denoising better</li>
  <li>Better sample quality in practice</li>
</ul>

<hr />

<h1 id="training-algorithm">Training Algorithm</h1>

<p><img src="/assets/images/2025-07-23/b.png" alt="Training Algorithm" /></p>

<h3 id="training">Training</h3>

<ol>
  <li><strong>Sample real data</strong>: Get a training image from your dataset</li>
  <li><strong>Random timestep</strong>: Choose how much noise to add (uniform across all levels)</li>
  <li><strong>Sample noise</strong>: Generate the specific noise to add</li>
  <li><strong>Create noisy version</strong>: Apply the forward process in one step</li>
  <li><strong>Predict noise</strong>: Ask the model “what noise was added?”</li>
  <li><strong>Update</strong>: Make the model better at noise prediction</li>
</ol>

<ul>
  <li><strong>Single-step training</strong>: Instead of running the full T-step forward process, we can jump directly to any timestep t using the closed-form formula.</li>
  <li><strong>Stochastic training</strong>: Each batch sees different noise levels, so the model learns to denoise across all levels simultaneously.</li>
  <li><strong>Simple objective</strong>: Just predict noise - no complex distributions or adversarial training.</li>
</ul>
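<p>Steps 1 to 5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "image" is a short list of floats, <code>model</code> is any callable returning a noise estimate, and the names <code>simple_loss</code> and <code>linear_alpha_bars</code> are made up here. Step 6 (the gradient update) is left to whichever autodiff framework is used.</p>

```python
import math
import random

def linear_alpha_bars(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_1..alpha_bar_T for a linear beta schedule (assumed values)."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

def simple_loss(model, x0, alpha_bars, rng):
    """One Monte Carlo sample of L_simple for a data point x0 (list of floats)."""
    T = len(alpha_bars)
    t = rng.randint(1, T)                      # step 2: t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # step 3: eps ~ N(0, I)
    a = alpha_bars[t - 1]
    # Step 4: jump straight to x_t with the closed-form forward process.
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = model(x_t, t)                    # step 5: predict the noise
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))

# Step 1: a toy "image" with three pixels already scaled to [-1, 1].
x0 = [0.2, -0.5, 0.9]
alpha_bars = linear_alpha_bars(1000)
zero_model = lambda x_t, t: [0.0] * len(x_t)   # hypothetical untrained predictor
loss = simple_loss(zero_model, x0, alpha_bars, random.Random(0))
```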

<h2 id="connection-to-score-matching">Connection to Score Matching</h2>

<p>The noise prediction objective is equivalent to <strong>denoising score matching</strong>:</p>

\[\nabla_{x_t} \log p(x_t) \approx -\frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}}\]

<p><strong>What this means:</strong></p>

<ul>
  <li>Predicting noise \(\varepsilon\) is equivalent to predicting the gradient of the log-probability</li>
  <li>The model learns the “gradient field” pointing toward high-probability regions</li>
  <li>Sampling follows these gradients to find realistic images</li>
</ul>
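<p>The relation is exact for the conditional density \(q(x_t \mid x_0)\), which is what the sketch below checks with a finite difference on a single scalar; the marginal score \(\nabla_{x_t} \log p(x_t)\) is what the trained network approximates. All numbers and function names are made up for illustration.</p>

```python
import math

def log_q(x_t, x0, alpha_bar):
    """log N(x_t; sqrt(alpha_bar) * x0, 1 - alpha_bar) for scalars."""
    var = 1.0 - alpha_bar
    mean = math.sqrt(alpha_bar) * x0
    return -0.5 * math.log(2.0 * math.pi * var) - (x_t - mean) ** 2 / (2.0 * var)

def score_from_noise(eps, alpha_bar):
    """Score of q(x_t | x0) written in terms of the noise that produced x_t."""
    return -eps / math.sqrt(1.0 - alpha_bar)

# Hypothetical numbers: one pixel, one noise draw, one noise level.
x0, eps, alpha_bar = 0.7, -1.3, 0.6
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

h = 1e-5  # central finite difference of log q with respect to x_t
numeric_score = (log_q(x_t + h, x0, alpha_bar) - log_q(x_t - h, x0, alpha_bar)) / (2 * h)
analytic_score = score_from_noise(eps, alpha_bar)
```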

<p>Why does this matter?</p>

<ul>
  <li><strong>Theoretical foundation</strong>: Connects diffusion models to the rich theory of score-based generative models.</li>
  <li><strong>Sampling interpretation</strong>: The reverse process becomes Langevin dynamics following learned gradients.</li>
  <li><strong>Stability</strong>: Score matching is known to be more stable than adversarial training.</li>
</ul>

<h2 id="more">More</h2>

<h3 id="variance-schedule-choice"><strong>Variance Schedule Choice</strong></h3>

<p>The paper fixes the forward process variances <strong>\(β_t\)</strong> to constants rather than learning them.</p>

<ul>
  <li>This simplification means the forward process <strong>q</strong> has no learnable parameters</li>
  <li>The term <strong>\(L_T\)</strong> becomes constant and can be ignored during training
    <ul>
      <li><strong>Linear schedule</strong>: \(β_t\) increases linearly from \(β_1\) to \(β_T\)</li>
      <li><strong>Typical values</strong>: \(β_1 = 0.0001\), \(β_T = 0.02\)</li>
      <li><strong>T steps</strong>: Usually T = 1000 for training</li>
    </ul>
  </li>
</ul>

<h3 id="covariance-choice"><strong>Covariance Choice:</strong></h3>

<p>The model sets \(\Sigma_\theta(x_t,t) = \sigma_t^2 I\) (diagonal, time-dependent constants):</p>

<ul>
  <li><strong>\(\sigma_t^2 = \beta_t\)</strong>: Optimal when \(x_0 \sim N(0,I)\)</li>
  <li><strong>\(\sigma_t^2 = \tilde{\beta}_t\)</strong>: Optimal when \(x_0\) is deterministic</li>
  <li>Both choices gave similar empirical results</li>
</ul>

<h3 id="image-preprocessing"><strong>Image Preprocessing:</strong></h3>

<ul>
  <li>Images are scaled from {0,1,…,255} to [-1,1]</li>
  <li>Ensures consistent neural network input scaling</li>
  <li>Starting point is standard normal prior \(p(x_T)\)</li>
</ul>
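<p>One natural way to implement this scaling is a linear map that sends 0 to -1 and 255 to 1. The exact formula and the function names below are assumptions for illustration, not taken from the paper's code.</p>

```python
def to_model_range(pixel):
    """Map an integer pixel in {0, ..., 255} linearly to [-1, 1]."""
    return pixel / 127.5 - 1.0

def to_pixel(value):
    """Inverse map, rounding and clipping back to {0, ..., 255}."""
    return min(255, max(0, round((value + 1.0) * 127.5)))

scaled = [to_model_range(p) for p in (0, 128, 255)]
restored = [to_pixel(v) for v in scaled]
```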

<h3 id="practical-implementation-details">Practical Implementation Details</h3>

<ul>
  <li><strong>Training Tips:</strong>
    <ul>
      <li>EMA for sampling</li>
      <li>Gradient clipping</li>
      <li>Cosine learning rate schedule</li>
      <li>Data augmentation</li>
    </ul>
  </li>
</ul>

<h1 id="experiments">Experiments</h1>

<h3 id="experiment-setup"><strong>Experiment Setup</strong></h3>

<ul>
  <li><strong>T = 1000</strong>: Number of diffusion steps, matching prior work to keep neural network evaluations comparable.</li>
  <li><strong>Noise schedule</strong>: Linearly increasing variances from \(β_1 = 10^{-4}\) to \(β_T = 0.02\), which keeps added noise small but enough to reach near-complete destruction of the original signal by the end.</li>
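  <li><strong>Signal-to-noise control</strong>: With this schedule, the final KL divergence \(L_T = \mathrm{D_{KL}}(q(x_T \mid x_0) \parallel \mathcal{N}(0, I))\) is \(\approx 10^{-5}\) bits/dim, so \(x_T\) is almost indistinguishable from the standard normal prior that sampling starts from.</li>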
</ul>
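<p>The order of magnitude of that KL figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a single dimension and a fixed pixel value \(x_0\), whereas the paper's number is an average over real data, so only the rough scale should be compared. The helper names are made up.</p>

```python
import math

def alpha_bar_T(T=1000, beta_1=1e-4, beta_T=0.02):
    """Product of (1 - beta_t) over a linear schedule."""
    prod = 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
    return prod

def prior_kl_bits(x0, abar):
    """KL( N(sqrt(abar) * x0, 1 - abar) || N(0, 1) ) for one dimension, in bits."""
    nats = 0.5 * (abar * (x0 ** 2 - 1.0) - math.log1p(-abar))
    return nats / math.log(2.0)

abar = alpha_bar_T()                   # roughly 4e-5: almost no signal is left
worst_case = prior_kl_bits(1.0, abar)  # a pixel at the edge of the [-1, 1] range
```

<p>Even for a pixel at the edge of the range, the result stays on the order of \(10^{-5}\) bits.</p>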

<h3 id="model-architecture"><strong>Model Architecture</strong></h3>

<ul>
  <li><strong>U-Net</strong>: Based on unmasked PixelCNN++ with <strong>group normalization</strong> and <strong>shared weights across time</strong>.</li>
  <li><strong>Time embeddings</strong>: Injected using <strong>Transformer sinusoidal embeddings</strong>.</li>
  <li><strong>Self-attention</strong>: Added at 16×16 feature resolution.</li>
</ul>

<h3 id="training-objective-ablation"><strong>Training Objective Ablation</strong></h3>

<ul>
  <li><strong>True Variational Bound</strong>: Best for compression (lossless codelength).</li>
  <li><strong>Simplified Objective</strong>: Best for sample quality.</li>
  <li><strong>Predicting mean \(\tilde{\mu}\)</strong>:
    <ul>
      <li>Works well with variational bound.</li>
      <li>Performs worse with simple MSE objective.</li>
    </ul>
  </li>
  <li><strong>Learned variance</strong>: Leads to instability and poor quality.</li>
  <li><strong>Fixed variance</strong>: More stable.</li>
</ul>

<h3 id="progressive-generation"><strong>Progressive Generation</strong></h3>

<ul>
  <li>Generate images progressively from random bits (reverse process).</li>
  <li>Large-scale features appear early; details come later.</li>
  <li>Shows that Gaussian diffusion allows coarse-to-fine image generation.</li>
</ul>

<h3 id="interpolation-in-latent-space"><strong>Interpolation in Latent Space</strong></h3>

<ul>
  <li>Interpolate two images in latent space (at same timestep <code class="language-plaintext highlighter-rouge">t</code>), then decode via reverse process.</li>
  <li><strong>Results</strong>:
    <ul>
      <li>Smooth and meaningful transitions in pose, hair, background, etc.</li>
      <li><strong>Eyewear does not interpolate smoothly</strong>, unlike the other attributes.</li>
      <li>Larger <code class="language-plaintext highlighter-rouge">t</code> → coarser but more varied interpolations.</li>
    </ul>
  </li>
</ul>

<h3 id="connection-to-autoregressive-models"><strong>Connection to Autoregressive Models</strong></h3>

<ul>
  <li>Rewriting the variational bound shows <strong>diffusion is like autoregressive decoding</strong> with a continuous and generalized bit ordering.</li>
  <li>Gaussian noise acts like masking but may be more natural and effective.</li>
  <li>Unlike true autoregressive models, <strong>diffusion can use T &lt; data dimension</strong>, allowing flexibility in sampling speed or model power.</li>
</ul>

<h3 id="key-takeaways"><strong>Key Takeaways</strong></h3>

<ul>
  <li>Diffusion models:
    <ul>
      <li>Achieve <strong>high-quality image synthesis</strong> even without conditioning.</li>
      <li>Show strong <strong>lossy compression ability</strong> (good perceptual reconstructions).</li>
      <li>Can act as a <strong>generalization of autoregressive models</strong>.</li>
    </ul>
  </li>
  <li>Model architecture and training objective <strong>greatly affect</strong> performance.</li>
  <li><strong>Progressive generation, interpolation, and decoding</strong> are all efficient and visually plausible.</li>
</ul>]]></content><author><name></name></author><category term="diffusion" /><summary type="html"><![CDATA[Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models.” arXiv:2006.11239. Preprint, arXiv, December 16, 2020. https://doi.org/10.48550/arXiv.2006.11239.]]></summary></entry><entry><title type="html">Diffusion Models</title><link href="https://aayush9753.in/blog/2025/diffusion-models/" rel="alternate" type="text/html" title="Diffusion Models" /><published>2025-07-17T00:00:00+00:00</published><updated>2025-07-17T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/diffusion-models</id><content type="html" xml:base="https://aayush9753.in/blog/2025/diffusion-models/"><![CDATA[<h2 id="my-journey-into-diffusion-models">My Journey into Diffusion Models</h2>

<p>I am studying diffusion models from scratch, diving deep into the mathematical foundations and practical implementations. This blog serves as a central hub that summarizes all my readings, notes, and reference blogs that I’m writing on diffusion models. As I explore this fascinating field, I’ll be documenting my learnings through detailed posts that break down complex concepts into digestible explanations.</p>

<h2 id="blog-posts-on-diffusion-models">Blog Posts on Diffusion Models</h2>

<h3 id="1-step-by-step-diffusion-an-elementary-tutorial">1. <a href="https://aayush9753.github.io/step-by-step-diffusion-an-elementary-tutorial.html">Step-by-Step Diffusion: An Elementary Tutorial</a></h3>

<p>This covers the fundamentals of diffusion models including:</p>
<ul>
  <li>How diffusion models work by gradually adding and removing noise</li>
  <li>DDPM (stochastic sampling) and DDIM (deterministic sampling) algorithms</li>
  <li>Flow matching as a generalization beyond Gaussian noise</li>
  <li>Practical implementation details and best practices</li>
</ul>

<h3 id="2-denoising-diffusion-probabilistic-models">2. <a href="https://aayush9753.github.io/denoising-diffusion-probabilistic-models.html">Denoising Diffusion Probabilistic Models</a></h3>

<ul>
  <li>Covers an image generation model built using diffusion.</li>
  <li>Focuses on the training loss and its mathematical derivation.</li>
</ul>

<hr />

<p><em>More blog posts on diffusion models coming soon as I continue my learning journey…</em></p>]]></content><author><name></name></author><category term="journey" /><summary type="html"><![CDATA[My Journey into Diffusion Models]]></summary></entry><entry><title type="html">Step-by-Step Diffusion: An Elementary Tutorial</title><link href="https://aayush9753.in/blog/2025/step-by-step-diffusion-an-elementary-tutorial/" rel="alternate" type="text/html" title="Step-by-Step Diffusion: An Elementary Tutorial" /><published>2025-07-17T00:00:00+00:00</published><updated>2025-07-17T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/step-by-step-diffusion-an-elementary-tutorial</id><content type="html" xml:base="https://aayush9753.in/blog/2025/step-by-step-diffusion-an-elementary-tutorial/"><![CDATA[<blockquote>
  <p>Nakkiran, Preetum, Arwen Bradley, Hattie Zhou, and Madhu Advani. “Step-by-Step Diffusion: An Elementary Tutorial.” arXiv, June 23, 2024. <a href="https://arxiv.org/abs/2406.08929">https://doi.org/10.48550/arXiv.2406.08929</a>.</p>
</blockquote>

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><strong><a href="#1-fundamental-of-diffusion">Fundamentals of Diffusion</a></strong>
    <ul>
      <li><a href="#11-gaussian-diffusion">1.1 Gaussian Diffusion</a></li>
      <li><a href="#12-diffusions-in-the-abstract">1.2 Diffusions in the Abstract</a></li>
      <li><a href="#13-discretisation">1.3 Discretisation</a></li>
    </ul>
  </li>
  <li><strong><a href="#2-stochastic-sampling-ddpm">Stochastic Sampling: DDPM</a></strong>
    <ul>
      <li><a href="#21-correctness-of-ddpm-look-in-paper-for-the-proof">2.1 Correctness of DDPM</a></li>
      <li><a href="#22-algorithms">2.2 Algorithms</a></li>
      <li><a href="#23-variance-reduction-predicting-x_0">2.3 Variance Reduction: Predicting x₀</a></li>
    </ul>
  </li>
  <li><strong><a href="#3-deterministic-sampling-ddim">Deterministic Sampling: DDIM</a></strong>
    <ul>
      <li><a href="#algorithm-2-deterministic-reverse-sampler-ddim-like">Algorithm 2: Deterministic Reverse Sampler</a></li>
      <li><a href="#31-case-1-single-point">3.1 Case 1: Single Point</a></li>
      <li><a href="#32-velocity-fields-and-gases">3.2 Velocity Fields and Gases</a></li>
      <li><a href="#36-discussion-ddpm-vs-ddim">3.6 Discussion: DDPM vs DDIM</a></li>
      <li><a href="#37-remarks-on-generalization">3.7 Remarks on Generalization</a></li>
    </ul>
  </li>
  <li><strong><a href="#4-flow-matching">Flow Matching</a></strong>
    <ul>
      <li><a href="#the-two-step-construction-from-ddim">The Two-Step Construction from DDIM</a></li>
      <li><a href="#why-this-matters">Why This Matters</a></li>
      <li><a href="#41-flows">4.1 Flows</a></li>
      <li><a href="#42-pointwise-flows">4.2 Pointwise Flows</a></li>
      <li><a href="#43-marginal-flows">4.3 Marginal Flows</a></li>
      <li><a href="#44-a-simple-choice-of-pointwise-flow">4.4 A Simple Choice of Pointwise Flow</a></li>
      <li><a href="#45-flow-matching">4.5 Flow Matching</a></li>
    </ul>
  </li>
  <li><strong><a href="#5-diffusion-in-practice">Diffusion in Practice</a></strong>
    <ul>
      <li><a href="#samplers-in-practice">Samplers in Practice</a></li>
      <li><a href="#noise-schedules">Noise Schedules</a></li>
      <li><a href="#likelihood-interpretations-and-vaes">Likelihood Interpretations and VAEs</a></li>
      <li><a href="#parametrization-the-x_0--ε--v-prediction-wars">Parametrization: The x₀ / ε / v-Prediction Wars</a></li>
      <li><a href="#the-error-landscape-what-actually-goes-wrong">The Error Landscape: What Actually Goes Wrong</a></li>
    </ul>
  </li>
  <li><strong><a href="#further-reading-and-resources">Further Reading and Resources</a></strong></li>
</ol>

<hr />

<h1 id="1-fundamental-of-diffusion">1. Fundamentals of Diffusion</h1>

<p><strong>Goal of Generative Modelling</strong>: Given i.i.d. samples from an unknown distribution \(p^*\), we create a method that can generate new samples by sampling from an approximation of \(p^*(x)\).</p>

<p>i.i.d. samples: Independent and identically distributed samples</p>

<ul>
  <li>Each sample was drawn independently and all samples come from the same underlying distribution \(p^*\).</li>
</ul>

<p><strong>Example:</strong> We have a training set of 10,000 dog photos:</p>

<ul>
  <li>These photos represent samples from some true distribution \(p_{dog}(x)\) over all possible dog images and we don’t know the mathematical form of \(p_{dog}(x)\)</li>
  <li>Our goal is to create a system that can generate new, realistic dog images that look like they could have come from the same distribution</li>
</ul>

<p>Idea: Learn a transformation from some easy-to-sample distribution (such as Gaussian noise) to our target distribution \(p^*\).</p>

<ul>
  <li>Diffusion models offer a general framework for learning such transformations.</li>
  <li>The clever trick of diffusion is to reduce the problem of sampling from distribution \(p^{*}(x)\) into a sequence of easier sampling problems.</li>
</ul>

<h2 id="11-gaussian-diffusion">1.1 Gaussian Diffusion</h2>

<h3 id="forward-pass"><strong>Forward Pass</strong></h3>

<p>Systematically transforms target data (like images of dogs) into pure noise through a series of small, random steps.</p>

<p><strong>Starting point</strong>: We have some data \(x_0\) sampled from target distribution \(p^*\) (e.g., real dog images).</p>

<p><strong>The forward process</strong>: We create a sequence
\(x_0 \rightarrow x_1 \rightarrow x_2 \rightarrow \ldots \rightarrow x_T\)
by repeatedly adding small amounts of Gaussian noise:</p>

\[x_{t+1} = x_t + \eta_t, \quad \text{where } \eta_t \sim \mathcal{N}(0, \sigma^2)\]

<p>This means each step adds independent Gaussian noise with variance \(\sigma^2\).</p>

<p><strong>Final state</strong>: After \(T\) steps, the distribution \(p_T\) is approximately Gaussian, \(\mathcal{N}(0, T\sigma^2)\), once the accumulated noise dominates the data.</p>

<ul>
  <li>This happens because the added noise terms are independent Gaussians, so their sum is exactly Gaussian with variance growing linearly in the number of steps: \(x_T = x_0 + \mathcal{N}(0, T\sigma^2)\). For large \(T\) the noise swamps the contribution of \(x_0\).</li>
</ul>

<p><img src="/assets/images/2025-07-17/a.png" alt="420" /></p>

<ul>
  <li>So we can approximately sample from \(p_T\) by just sampling a Gaussian.</li>
  <li>We can directly sample \(x_t\) given \(x_0\) without computing all intermediate steps. (Sum of Gaussians is Gaussian)</li>
</ul>
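<p>A small simulation makes the "sum of Gaussians" shortcut concrete. The sketch below uses a scalar \(x_0\) and made-up values for \(t\) and \(\sigma\); the function names are hypothetical. It compares adding noise step by step with drawing \(x_t\) in one shot from \(\mathcal{N}(x_0, t\sigma^2)\).</p>

```python
import math
import random

def forward_stepwise(x0, t, sigma, rng):
    """Apply x_{s+1} = x_s + eta_s, eta_s ~ N(0, sigma^2), for t steps."""
    x = x0
    for _ in range(t):
        x += rng.gauss(0.0, sigma)
    return x

def forward_direct(x0, t, sigma, rng):
    """Sample x_t in one shot: x_t = x_0 + N(0, t * sigma^2)."""
    return x0 + rng.gauss(0.0, sigma * math.sqrt(t))

def sample_variance(samples):
    """Unbiased sample variance."""
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / (len(samples) - 1)

rng = random.Random(0)
x0, t, sigma = 2.0, 50, 0.1  # hypothetical toy values
stepwise = [forward_stepwise(x0, t, sigma, rng) for _ in range(2000)]
direct = [forward_direct(x0, t, sigma, rng) for _ in range(2000)]
# Both sample variances should be close to t * sigma^2 = 0.5.
```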

<h3 id="reverse-sampling"><strong>Reverse Sampling</strong></h3>

<p><strong>Strategy</strong></p>

<p>The authors propose to solve generative modelling by decomposing it into many simpler “reverse sampling” steps:</p>

<ul>
  <li><strong>Instead of</strong>: Learning to generate samples from \(p^*\) directly (very hard)</li>
  <li><strong>Do this</strong>: Learn to go backwards one step at a time: 
\(p_T \rightarrow p_{T-1} \rightarrow p_{T-2} \rightarrow \ldots \rightarrow p_0 = p^*\)</li>
</ul>

<p><strong>Why This Decomposition Helps</strong></p>

<p>The key insight is that adjacent distributions (\(p_{t-1}, p_t\)) are very similar because we only add a small amount of noise \(\sigma\) at each step. This makes the reverse step much easier to learn than the full generative problem.</p>

<p>Think of it like this:</p>

<ul>
  <li>Hard: Transform pure noise into a realistic dog image in one step</li>
  <li>Easy: Remove a tiny bit of noise from an almost-clean dog image</li>
</ul>

<h3 id="the-ddpm-reverse-sampler">The DDPM Reverse Sampler</h3>

<p>DDPM: Denoising Diffusion Probabilistic Models</p>

<p>The “obvious” approach is to learn the conditional distribution \(p(x_{t-1} \mid x_t)\) for each step. Given a noisy sample \(x_t\), we want to predict what the slightly less noisy version \(x_{t-1}\) should be.</p>

<p><strong>Fact 1</strong>: <strong>When \(\sigma\) is small, the conditional distribution \(p(x_{t-1} \mid x_t)\) is approximately Gaussian.</strong></p>

<p>This means:</p>

\[p(x_{t-1} \mid x_t = z) \approx \mathcal{N}(\mu_{t-1}(z), \sigma^2)\]

<p>So instead of learning an arbitrary complex distribution, we only need to learn the <strong>mean function \(\mu_{t-1}(z)\)</strong>.</p>

<p>See the figure below.</p>

<p><img src="/assets/images/2025-07-17/b.png" alt="421" /></p>

<p><strong>The Regression Formulation</strong></p>

<p>Since we know the distribution is Gaussian with known variance \(\sigma^2\), learning the mean is equivalent to solving a regression problem:</p>

\[\mu_{t-1} = \arg\min_{f} \mathbb{E}\left[\|f(x_t) - x_{t-1}\|^2\right]\]

<p>Since \(x_t = x_{t-1} + \eta_{t-1}\) in the forward process, this can be rewritten as:</p>

\[\mu_{t-1} = \arg\min_{f} \mathbb{E}\left[\|f(x_{t-1} + \eta_{t-1}) - x_{t-1}\|^2\right]\]

<p>where \(\eta_{t-1} \sim \mathcal{N}(0, \sigma^2)\) is the noise we added.</p>

<p><strong>Theorem:</strong> For any joint distribution over random variables \((X, Y)\), the conditional expectation \(\mathbb{E}[Y \mid X]\) is the function that minimizes the mean squared error:</p>

\[\mathbb{E}[Y \mid X] = \arg\min_{f} \mathbb{E}\left[(f(X) - Y)^2\right]\]
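<p>A tiny exact example may make the theorem concrete. The joint distribution below is made up purely for illustration, and the helper names are assumptions. The code computes \(\mathbb{E}[Y \mid X]\) for a finite distribution and compares its mean squared error with that of a shifted predictor.</p>

```python
# Hypothetical joint distribution over (X, Y), given as {(x, y): probability}.
JOINT = {(0, 0.0): 0.2, (0, 2.0): 0.2, (1, 1.0): 0.3, (1, 5.0): 0.3}

def conditional_mean(joint):
    """Return {x: E[Y | X = x]} for a finite joint distribution."""
    mass, weighted = {}, {}
    for (x, y), p in joint.items():
        mass[x] = mass.get(x, 0.0) + p
        weighted[x] = weighted.get(x, 0.0) + p * y
    return {x: weighted[x] / mass[x] for x in mass}

def mse(joint, f):
    """Mean squared error E[(f(X) - Y)^2], where f is a dict x -> prediction."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in joint.items())

best = conditional_mean(JOINT)                       # the conditional means
shifted = {x: m + 0.5 for x, m in best.items()}      # any other predictor

print(mse(JOINT, best), mse(JOINT, shifted))         # the first value is the smaller one
```

<p>Any other choice of predictor adds its squared deviation from the conditional mean on top of the irreducible error, so the conditional mean wins.</p>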

<p><strong>The Beautiful Connection to Denoising</strong></p>

<p>Notice what this regression objective is asking: given a clean signal \(x_{t-1}\) plus some noise \(\eta_t\), predict the original clean signal.</p>

<p>This is exactly the <strong>image denoising problem</strong>! We can use standard denoising techniques (like convolutional neural networks) to solve it.</p>

<ul>
  <li>The authors have reduced the complex problem of generative modeling to the well-understood problem of regression/denoising.</li>
</ul>

<p>Instead of learning to generate realistic images from scratch, we learn to remove small amounts of noise—doing this many times in sequence to gradually transform pure noise into realistic samples.</p>

<p>This is why diffusion models work so well: they break down an impossibly hard problem into many manageable denoising steps that neural networks are already good at solving.</p>

<h2 id="12-diffusions-in-the-abstract">1.2 Diffusions in the Abstract</h2>

<p>Diffusion models follow a universal pattern that works across many different settings—not just Gaussian noise, but also discrete domains, deterministic processes, and more.</p>

<ul>
  <li><strong>Discrete Domains:</strong> Instead of working with continuous values (like pixel intensities 0.0 to 1.0), we work with discrete, finite sets of possibilities. For example, text generation where each position can be one of a finite vocabulary.</li>
  <li><strong>Deterministic Processes:</strong> The reverse sampler produces the same output every time you give it the same input—there’s no randomness involved.</li>
</ul>

<h3 id="the-abstract-recipe">The Abstract Recipe</h3>

<p><strong>Step 1: Choose your endpoints</strong></p>

<ul>
  <li>Start with target distribution \(p^*\) (what you want to generate)</li>
  <li>Choose a base distribution \(q\) that’s easy to sample from (e.g., Gaussian noise, random bits)</li>
</ul>

<p><strong>Step 2: Create an interpolating sequence</strong></p>

<ul>
  <li>Build a sequence of distributions that smoothly connects these endpoints:</li>
</ul>

\[p_0 = p^* \rightarrow p_1 \rightarrow p_2 \rightarrow \ldots \rightarrow p_T = q\]

<ul>
  <li>The key requirement is that adjacent distributions (\(p_{t-1}, p_t\)) are “close” in some meaningful sense.</li>
</ul>

<p><strong>Step 3: Learn reverse samplers</strong></p>

<ul>
  <li>For each step \(t\), learn a function \(F_t\) that can transform samples from \(p_t\) back to \(p_{t-1}\).</li>
</ul>

<h3 id="the-reverse-sampler-definition">The Reverse Sampler Definition</h3>

<p>This is the formal definition of what we need to learn:</p>

<p><strong>Definition:</strong> A reverse sampler \(F_t\) is a function such that if you:</p>

<ol>
  <li>Take a sample \(x_t\) from distribution \(p_t\)</li>
  <li>Apply \(F_t\) to get \(F_t(x_t)\)</li>
  <li>The result is distributed according to \(p_{t-1}\)</li>
</ol>

<p>Mathematically:</p>

\[F_t(z) : z \sim p_t \implies F_t(z) \sim p_{t-1}\]

<h3 id="why-this-abstraction-is-powerful">Why This Abstraction is Powerful</h3>

<p><strong>Flexibility:</strong> This framework works for:</p>

<ul>
  <li>Continuous domains (images with Gaussian noise)</li>
  <li>Discrete domains (text, categorical data)</li>
  <li>Deterministic processes (no randomness in the reverse step)</li>
  <li>Stochastic processes (with randomness)</li>
</ul>

<p><strong>Multiple implementations:</strong> The same abstract framework gives us:</p>

<ul>
  <li>DDPM (stochastic, Gaussian-based)</li>
  <li>DDIM (deterministic version)</li>
  <li>Flow-matching (continuous-time generalization)</li>
</ul>

<h3 id="the-key-insight-about-closeness">The Key Insight About “Closeness”</h3>

<p>The magic happens because adjacent distributions are “close.” This means:</p>

<ul>
  <li>The reverse sampling step \(F_t\) doesn’t need to do much work</li>
  <li>Learning becomes feasible because we’re making small adjustments rather than dramatic transformations</li>
</ul>

<h3 id="the-coupling-perspective">The Coupling Perspective</h3>

<p>Given the marginal distributions \(\{p_t\}\), there are many possible ways to define the joint relationships between consecutive steps. These are called “couplings” in probability theory.</p>

<p>This means we have <strong>freedom in how we design the reverse sampler</strong>—we can choose whichever coupling is most convenient for learning or sampling.</p>

<p><strong>Why This Matters</strong></p>

<p>This abstraction shows that diffusion models aren’t just about “adding noise”—they’re about:</p>

<ol>
  <li><strong>Interpolation:</strong> Creating smooth paths between complex and simple distributions</li>
  <li><strong>Decomposition:</strong> Breaking hard problems into many easier steps</li>
  <li><strong>Flexibility:</strong> Adapting the same core idea to many different domains and applications</li>
</ol>

<h2 id="13-discretisation">1.3 Discretisation</h2>

<p>We need to be more precise about what we mean by adjacent distributions \(p_t\), \(p_{t-1}\) being “close”.</p>

<h3 id="the-continuous-time-perspective">The Continuous-Time Perspective</h3>

<p>The authors are shifting from thinking about discrete steps (\(x_0\), \(x_1\), \(x_2\), …) to a <strong>continuous-time process</strong> \(p(x,t)\) where:</p>

<ul>
  <li>\(t = 0\): We have our target distribution \(p^*\)</li>
  <li>\(t = 1\): We have our base distribution (noise)</li>
  <li>\(t \in [0,1]\): We have intermediate distributions</li>
</ul>

<p>The discrete steps are just a <strong>discretisation</strong> of this continuous process:</p>

\[p_k(x) = p(x, k \cdot \Delta t) \qquad \text{where} \; \Delta t = 1/T\]

<p><strong>Finer discretisation = closer adjacent distributions</strong>:</p>

<ul>
  <li>Large \(T \rightarrow\) small \(\Delta t \rightarrow\) many small steps \(\rightarrow\) adjacent distributions are very close</li>
  <li>Small \(T \rightarrow\) large \(\Delta t \rightarrow\) few big steps \(\rightarrow\) adjacent distributions are farther apart</li>
</ul>

<p>This explains why diffusion models work better with more steps!</p>

<h3 id="the-variance-scaling-problem-and-sqrtdelta-t-scaling">The Variance Scaling Problem and \(\sqrt{\Delta t}\) Scaling</h3>

<p>Here’s a subtle but crucial issue: If we naively add noise of variance \(\sigma^2\) at each step, then after \(T\) steps we’d have total variance \(T \cdot \sigma^2\). This means:</p>

<ul>
  <li>More steps \(\rightarrow\) higher final variance</li>
  <li>Fewer steps \(\rightarrow\) lower final variance</li>
</ul>

<p>But we want the final distribution to be the same regardless of how many steps we take.</p>

<p><strong>Solution</strong></p>

<p>To fix this, they scale the noise variance by \(\Delta t\):</p>

\[\sigma = \sigma_q \sqrt{\Delta t} = \sigma_q \sqrt{1/T}\]

<p><strong>Why this works</strong>: After \(T\) steps, the total variance becomes:</p>

\[\text{Total variance} = T \times \sigma_q^2 \Delta t = T \times \sigma_q^2 \times (1/T) = \sigma_q^2\]

<p>So regardless of \(T\), the final variance is always \(\sigma_q^2\)!</p>
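<p>The scaling argument can be checked with a short simulation. This is a sketch under assumed parameter values (the function name, sample count, and seed are not from the paper). It adds \(T\) noise steps of variance \(\sigma_q^2 \Delta t\) and measures the total variance for different \(T\).</p>

```python
import math
import random
import statistics

def final_noise_variance(num_steps, sigma_q, n=4000, seed=0):
    """Empirical variance of x_1 - x_0 when each of the T steps adds N(0, sigma_q^2 * dt)."""
    rng = random.Random(seed)
    dt = 1.0 / num_steps
    step_std = sigma_q * math.sqrt(dt)   # per-step standard deviation sigma_q * sqrt(dt)
    totals = []
    for _ in range(n):
        noise = 0.0
        for _ in range(num_steps):
            noise += rng.gauss(0.0, step_std)
        totals.append(noise)
    return statistics.pvariance(totals)

for num_steps in (5, 50):
    # Each value should be close to sigma_q^2 = 4, whatever the number of steps.
    print(num_steps, final_noise_variance(num_steps, sigma_q=2.0))
```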

<h3 id="the-new-notation">The New Notation</h3>

<p>This scaling ensures that as \(T \rightarrow \infty\) (continuous limit), the process converges to a well-defined continuous-time stochastic process.</p>

<p>From now on:</p>

<ul>
  <li><strong>\(t\)</strong> represents continuous time in \([0,1]\), not discrete steps</li>
  <li><strong>\(\Delta t = 1/T\)</strong> is the step size</li>
  <li><strong>\(x_t\)</strong> means “x at time t” (not “x at step t”)</li>
</ul>

<p>The forward process becomes:</p>

\[x_{t+\Delta t} = x_t + \eta_t, \qquad \text{where} \; \eta_t \sim N(0, \sigma_q^2 \Delta t)\]

<p><strong>The Cumulative Effect</strong></p>

\[x_t \sim N(x_0, \sigma_t^2) \qquad \text{where} \; \sigma_t := \sigma_q \sqrt{t}\]

<p>This beautiful formula shows that:</p>

<ul>
  <li>At \(t = 0\): \(\sigma_0 = 0\) (no noise, original data)</li>
  <li>At \(t = 1\): \(\sigma_1 = \sigma_q\) (full noise level)</li>
  <li>At \(t = 0.5\): \(\sigma_{0.5} = \sigma_q \sqrt{0.5}\) (intermediate noise)</li>
</ul>
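<p>As a short worked check of this formula, using only the forward update \(x_{t+\Delta t} = x_t + \eta_t\) and the independence of the noise terms: unrolling the process from time \(0\) to time \(t\) takes \(t/\Delta t\) steps, so</p>

\[x_t = x_0 + \sum_{k=0}^{t/\Delta t - 1} \eta_{k \Delta t}, \qquad \mathrm{Var}\left[x_t \mid x_0\right] = \frac{t}{\Delta t} \cdot \sigma_q^2 \Delta t = \sigma_q^2 t = \sigma_t^2\]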

<p><strong>This discretization framework:</strong></p>

<ol>
  <li><strong>Unifies</strong> discrete and continuous views of diffusion</li>
  <li><strong>Ensures consistency</strong> across different numbers of steps</li>
  <li><strong>Enables</strong> theoretical analysis of the continuous limit</li>
  <li><strong>Connects</strong> to stochastic differential equations (SDEs)</li>
</ol>

<h1 id="2-stochastic-sampling-ddpm">2. Stochastic Sampling: DDPM</h1>

<p>This section introduces the DDPM (Denoising Diffusion Probabilistic Models) sampler, the classic stochastic approach to diffusion sampling.</p>

<p>The DDPM sampler learns to predict what the previous (less noisy) timestep looked like given the current (more noisy) timestep. Specifically, it learns:</p>

\[\mu_t(z) := E[x_t \mid x_{t+\Delta t} = z]\]

<p>This means: “Given that we observe value \(z\) at time \(t+\Delta t\), what was the expected value at the previous time \(t\)?”</p>

<h3 id="the-training-process">The Training Process</h3>

<p><strong>Objective</strong>: Learn the conditional expectation functions \(\{\mu_t\}\) by solving a regression problem:</p>

\[\mu_t = \arg\min_{f} \; E\left[\|f(x_{t+\Delta t}) - x_t\|^2\right]\]

<p><strong>What this means</strong>:</p>

<ul>
  <li>Take pairs of (\(x_t\), \(x_{t+\Delta t}\)) from the forward diffusion process</li>
  <li>Train a neural network to predict the cleaner version \(x_t\) given the noisier version \(x_{t+\Delta t}\)</li>
  <li>This is literally a <strong>denoising</strong> problem!</li>
</ul>

<p><strong>Practical implementation</strong>: Instead of learning separate functions for each timestep, we typically train a single neural network \(f_\theta(x, t)\) that takes both the noisy sample and the time \(t\) as input.</p>

<h3 id="sampling-algorithm-1-stochastic-reverse-sampler-ddpm-like-sampler"><strong>Sampling Algorithm 1: Stochastic Reverse Sampler (DDPM-like Sampler)</strong></h3>

<p>Once trained, the reverse sampler works as follows:</p>

<p>For input sample \(x_t\), and timestep \(t\), output:</p>

\[\hat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}(x_t) + N(0, \sigma_q^2 \Delta t)\]

<p><strong>Breaking this down:</strong></p>

<ol>
  <li><strong>\(\mu_{t-\Delta t}(x_t)\)</strong>: Use the learned function to predict the mean of the previous timestep</li>
  <li><strong>\(+ N(0, \sigma_q^2 \Delta t)\)</strong>: Add Gaussian noise with the same variance as the forward process</li>
  <li>The result is a sample from the previous timestep</li>
</ol>

<p><strong>The Full Generation Process</strong></p>

<p><strong>Step 1</strong>: Start with pure noise: \(x_1 \sim N(0, \sigma_q^2)\)<br />
<strong>Step 2</strong>: Apply Algorithm 1 repeatedly:</p>

<ul>
  <li>
\[x_1 \rightarrow x_{1-\Delta t} \rightarrow x_{1-2\Delta t} \rightarrow ... \rightarrow x_0\]
  </li>
</ul>

<p><strong>Step 3</strong>: The final \(x_0\) is your generated sample</p>

<h3 id="more">More</h3>

<p><strong>Why This Works (Conceptually)</strong></p>

<ul>
  <li>The magic relies on <strong>Fact 1</strong>: that the true conditional distribution \(p(x_{t-\Delta t} \mid x_t)\) is approximately Gaussian when \(\Delta t\) is small.</li>
  <li>If this is true, then:
    <ul>
      <li>We only need to learn the mean \(\mu_{t-\Delta t}(x_t)\) (since we know the variance is \(\sigma_q^2 \Delta t\))</li>
      <li>We can sample from this conditional by taking the predicted mean plus Gaussian noise</li>
      <li>Each step undoes a small amount of the forward corruption</li>
    </ul>
  </li>
</ul>

<p><strong>The Stochastic Nature</strong></p>

<ul>
  <li>Notice that this sampler is <strong>stochastic</strong> - even if you start with the same noise \(x_1\), you’ll get different samples \(x_0\) because of the added noise at each step. This is different from deterministic samplers like DDIM.</li>
</ul>

<h2 id="21-correctness-of-ddpm-look-in-paper-for-the-proof">2.1 Correctness of DDPM: Look in paper for the proof</h2>

<p><strong>The Problem</strong>: We need to show that DDPM’s reverse sampler actually works, i.e. that it generates samples from our target distribution.</p>

<p><strong>The Key Question</strong>: Why is the reverse process (going from noisy to clean) approximately Gaussian?</p>

<p><strong>The Answer</strong>:</p>

<ol>
  <li><strong>Used Bayes’ rule</strong> to express the reverse conditional probability \(p(x_{t-\Delta t} \mid x_t)\)</li>
  <li><strong>Applied Taylor expansion</strong> around the current point</li>
  <li><strong>Completed the square</strong> to show it has Gaussian form</li>
</ol>

<p><strong>The Result</strong>:</p>

\[p(x_{t-\Delta t} \mid x_t) \approx N(\text{mean}, \sigma_q^2 \Delta t)\]

<p>where the mean involves the “score” (gradient of log probability).</p>
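<p>A rough sketch of how the three steps fit together, writing \(x' = x_{t-\Delta t}\) and dropping terms that are higher order in \(\Delta t\). This is an informal outline under those approximations, not the paper's full argument:</p>

\[\begin{aligned}
\log p(x' \mid x_t) &= \log p(x_t \mid x') + \log p_{t-\Delta t}(x') - \log p_t(x_t) \\
&\approx -\frac{\|x' - x_t\|^2}{2\sigma_q^2 \Delta t} + \left\langle \nabla \log p_t(x_t),\; x' - x_t \right\rangle + \text{const} \\
&= -\frac{\left\|x' - x_t - \sigma_q^2 \Delta t \, \nabla \log p_t(x_t)\right\|^2}{2\sigma_q^2 \Delta t} + \text{const}
\end{aligned}\]

<p>Under these approximations the mean is \(x_t + \sigma_q^2 \Delta t \, \nabla \log p_t(x_t)\): the current point nudged a small distance along the score.</p>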

<p><strong>Why This Matters</strong>:</p>

<ul>
  <li>Since the reverse process is Gaussian, we only need to learn its mean</li>
  <li>Learning the mean is just a regression problem (predicting clean from noisy)</li>
  <li>This justifies why DDPM works: each reverse step is a simple denoising operation</li>
</ul>

<p><strong>The Bottom Line</strong>: DDPM works because when you add small amounts of noise, reversing that process is approximately Gaussian, which makes it learnable through standard regression techniques.</p>

<h2 id="22-algorithms">2.2 Algorithms</h2>

<h3 id="pseudocode-1-ddpm-training">Pseudocode 1: DDPM Training</h3>

<p><strong>What it does</strong>: Trains the neural network to do denoising regression.</p>

<p><strong>Step by step</strong>:</p>

<ol>
  <li><strong>Get clean data</strong>: Sample \(x_0\) from target distribution (e.g., real images)</li>
  <li><strong>Pick random time</strong>: Sample \(t\) uniformly from \([0,1]\)</li>
  <li><strong>Add noise up to time t</strong>: Create \(x_t = x_0 + N(0, \sigma_q^2 t)\)</li>
  <li><strong>Add one more step of noise</strong>: Create \(x_{t+\Delta t} = x_t + N(0, \sigma_q^2 \Delta t)\)</li>
  <li><strong>Train to denoise:</strong>  \(\text{Loss} = \left\| f_\theta(x_{t+\Delta t}, t+\Delta t) - x_t \right\|^2\)</li>
</ol>
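<p>The steps above can be sketched on a toy problem. The example below makes several simplifying assumptions that are not in the paper: the data is a 1-D Gaussian, the time \(t\) is held fixed instead of being sampled uniformly, and a least-squares linear fit stands in for the neural network \(f_\theta\). For Gaussian data the optimal denoiser is linear, so the fitted slope can be compared with a closed form.</p>

```python
import math
import random

def make_training_pair(rng, t, dt, sigma_q, data_mean, data_std):
    """Steps 1, 3 and 4 of the pseudocode for a fixed time t (toy 1-D Gaussian data)."""
    x0 = rng.gauss(data_mean, data_std)                      # clean sample
    xt = x0 + rng.gauss(0.0, sigma_q * math.sqrt(t))         # noise up to time t
    xt_next = xt + rng.gauss(0.0, sigma_q * math.sqrt(dt))   # one more step of noise
    return xt_next, xt

def fit_linear_denoiser(pairs):
    """Least-squares fit of f(x) = a * x + b, minimising the mean of (f(x_{t+dt}) - x_t)^2."""
    n = len(pairs)
    mean_in = sum(p[0] for p in pairs) / n
    mean_out = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean_in) * (p[1] - mean_out) for p in pairs) / n
    var = sum((p[0] - mean_in) ** 2 for p in pairs) / n
    a = cov / var
    return a, mean_out - a * mean_in

def closed_form_slope(t, dt, sigma_q, data_std):
    """Slope of E[x_t | x_{t+dt}] when x_0 is Gaussian: Var[x_t] / Var[x_{t+dt}]."""
    var_t = data_std ** 2 + sigma_q ** 2 * t
    return var_t / (var_t + sigma_q ** 2 * dt)

rng = random.Random(0)
pairs = [make_training_pair(rng, t=0.5, dt=0.1, sigma_q=1.0, data_mean=2.0, data_std=0.5)
         for _ in range(20000)]
slope, intercept = fit_linear_denoiser(pairs)
print(slope, closed_form_slope(0.5, 0.1, 1.0, 0.5))  # the two slopes should nearly agree
```

<p>In a real model the same squared-error loss is minimised by gradient descent over a network that also receives the time as input.</p>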

<p><strong>Key insight</strong>: The network learns to predict the cleaner version \(x_t\) given the noisier version \(x_{t+\Delta t}\) and the time \(t+\Delta t\).</p>

<h3 id="pseudocode-2-ddpm-sampling">Pseudocode 2: DDPM Sampling</h3>

<p><strong>What it does</strong>: Generates new samples using the trained model.</p>

<p><strong>Step by step</strong>:</p>

<ol>
  <li>Start with pure noise: \(x_1 \sim N(0, \sigma_q^2)\)</li>
  <li>Go backwards in time: For \(t = 1, 1-\Delta t, 1-2\Delta t, ..., \Delta t\)</li>
  <li>Predict + add noise: \(x_{t-\Delta t} = f_\theta(x_t, t) + N(0, \sigma_q^2 \Delta t)\)</li>
  <li>Return final result: \(x_0\) is your generated sample</li>
</ol>
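<p>The sampling loop above can be sketched end to end on a toy target. The example assumes a 1-D Gaussian target, for which \(E[x_{t-\Delta t} \mid x_t]\) has a closed form, and uses that closed form in place of a trained \(f_\theta\). The target parameters, noise scale, and step count are illustrative choices only.</p>

```python
import math
import random
import statistics

DATA_MEAN, DATA_STD, SIGMA_Q = 1.0, 0.5, 3.0   # toy target N(1, 0.5^2); all values are assumptions

def ideal_denoiser(x, t, dt):
    """Closed-form E[x_{t-dt} | x_t = x] for the Gaussian toy target (stands in for f_theta)."""
    var_t = DATA_STD ** 2 + SIGMA_Q ** 2 * t
    var_prev = DATA_STD ** 2 + SIGMA_Q ** 2 * (t - dt)
    return DATA_MEAN + (var_prev / var_t) * (x - DATA_MEAN)

def ddpm_sample(rng, num_steps=200):
    """Run the stochastic reverse sampler from t = 1 down to t = 0."""
    dt = 1.0 / num_steps
    x = rng.gauss(0.0, SIGMA_Q)                      # step 1: x_1 ~ N(0, sigma_q^2)
    for k in range(num_steps, 0, -1):                # step 2: t = 1, 1 - dt, ..., dt
        t = k * dt
        noise = rng.gauss(0.0, SIGMA_Q * math.sqrt(dt))
        x = ideal_denoiser(x, t, dt) + noise         # step 3: predict, then add noise
    return x                                         # step 4: generated x_0

rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(2000)]
sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)  # should land near DATA_MEAN and DATA_STD
```

<p>The match is only approximate: the sampler starts from \(N(0, \sigma_q^2)\) instead of the exact \(p_1\), and each step uses the small-\(\Delta t\) Gaussian approximation, so a finite number of steps leaves a small discretisation error.</p>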

<p><strong>Key insight</strong>: Each step predicts the cleaner version, then adds noise to account for uncertainty (this is the stochastic part).</p>

<h3 id="pseudocode-3-ddim-sampling-preview">Pseudocode 3: DDIM Sampling (Preview)</h3>

<p><strong>What it does</strong>: Deterministic version of sampling (no added noise).</p>

<p><strong>Key difference</strong>: Instead of adding random noise, it uses a deterministic update rule with a mixing coefficient \(\lambda\).</p>

<h3 id="important-notes">Important Notes</h3>

<ul>
  <li>Training is simultaneous: The network learns to denoise at ALL timesteps at once.</li>
  <li>Sampling goes backwards: We go from \(t=1\) (pure noise) to \(t=0\) (clean data)</li>
  <li>Same network for all steps: \(f_\theta(x,t)\) handles all timesteps using the time input \(t\)</li>
</ul>

<h2 id="23-variance-reduction-predicting-x_0">2.3 Variance Reduction: Predicting \(x_0\)</h2>

<p>This section explains an important practical trick used in diffusion models: training the network to predict \(x_0\) instead of the previous timestep.</p>

<h3 id="the-two-training-approaches">The Two Training Approaches</h3>

<p><strong>Original approach</strong>: Train the network to predict \(E[x_{t-\Delta t} \mid x_t]\) - the previous timestep<br />
<strong>Alternative approach</strong>: Train the network to predict \(E[x_0 \mid x_t]\) - the original clean data</p>

<h3 id="why-this-works-claim-2">Why This Works (Claim 2):</h3>

<p>We have:</p>

\[E[(x_{t-\Delta t} - x_t) \mid x_t] = \frac{\Delta t}{t} E[(x_0 - x_t) \mid x_t]\]

<p>which is equivalent to:</p>

\[E[x_{t-\Delta t} \mid x_t] = \left(\frac{\Delta t}{t}\right) E[x_0 \mid x_t] + \left(1 - \frac{\Delta t}{t}\right) x_t\]

<p>This means: if you can predict the clean image \(x_0\), you can easily compute what the previous timestep \(x_{t-\Delta t}\) should be.</p>
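<p>The conversion in Claim 2 is a one-line computation. The sketch below checks it on an assumed 1-D Gaussian target, where both \(E[x_0 \mid x_t]\) and \(E[x_{t-\Delta t} \mid x_t]\) have closed forms. The function names and test values are illustrative, not from the paper.</p>

```python
def predict_x0(x_t, t, sigma_q, data_mean, data_std):
    """Closed-form E[x_0 | x_t] for a toy Gaussian target N(data_mean, data_std^2)."""
    var_t = data_std ** 2 + sigma_q ** 2 * t
    return data_mean + (data_std ** 2 / var_t) * (x_t - data_mean)

def previous_mean_from_x0(x_t, x0_hat, t, dt):
    """Claim 2: E[x_{t-dt} | x_t] = (dt/t) * E[x_0 | x_t] + (1 - dt/t) * x_t."""
    w = dt / t
    return w * x0_hat + (1.0 - w) * x_t

def previous_mean_direct(x_t, t, dt, sigma_q, data_mean, data_std):
    """Closed-form E[x_{t-dt} | x_t] for the same Gaussian target, for comparison."""
    var_t = data_std ** 2 + sigma_q ** 2 * t
    var_prev = data_std ** 2 + sigma_q ** 2 * (t - dt)
    return data_mean + (var_prev / var_t) * (x_t - data_mean)

x_t, t, dt = 2.3, 0.6, 0.05
x0_hat = predict_x0(x_t, t, sigma_q=1.5, data_mean=1.0, data_std=0.5)
via_claim = previous_mean_from_x0(x_t, x0_hat, t, dt)
direct = previous_mean_direct(x_t, t, dt, sigma_q=1.5, data_mean=1.0, data_std=0.5)
print(via_claim, direct)  # the two routes agree up to floating-point rounding
```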

<h3 id="the-intuitive-explanation">The Intuitive Explanation</h3>

<p><strong>The noise symmetry argument</strong>:</p>

<ul>
  <li>When you observe \(x_t\), it’s the sum \(x_0 + \eta_0 + \eta_{\Delta t} + \ldots + \eta_{t-\Delta t}\) of the clean sample and all \(t/\Delta t\) noise steps</li>
  <li>You can’t tell which noise came from which step—they all “look the same”</li>
  <li>So instead of predicting one noise step \(\eta_{t-\Delta t}\), you can predict the average of all noise steps</li>
  <li>The average has much lower variance than individual steps!</li>
</ul>

<h3 id="why-this-is-better-variance-reduction">Why This is Better (Variance Reduction)</h3>

<p><strong>Problem with predicting \(x_{t-\Delta t}\)</strong>: You’re trying to estimate one noisy step from another noisy observation—high variance.</p>

<p><strong>Solution with predicting \(x_0\)</strong>: You’re averaging over all the noise steps, which reduces variance significantly.</p>

<p>Think of it like this:</p>

<ul>
  <li><strong>High variance</strong>: “Given this noisy image, what did the slightly less noisy version look like?”</li>
  <li><strong>Low variance</strong>: “Given this noisy image, what did the original clean image look like?”</li>
</ul>

<p>The second question is easier because you’re not trying to distinguish between very similar noise levels.</p>

<h3 id="important-warning">Important Warning</h3>

<p><strong>Critical point</strong>: The model predicts \(E[x_0 \mid x_t]\), which is the <strong>expected value</strong>, not a sample!</p>

<p><strong>What this means</strong>:</p>

<ul>
  <li>If you’re generating faces, \(E[x_0 \mid x_t]\) might be a blurry average of all faces consistent with the noisy input \(x_t\)</li>
  <li>It won’t look like a real face—it’s a mathematical expectation</li>
  <li>This is normal and expected!</li>
</ul>

<p><strong>Common misconception</strong>: People think “predicting \(x_0\)” means the model outputs something that looks like a real sample. It doesn’t: it outputs the average of all clean samples consistent with the noisy input.</p>

<h3 id="practical-implementation">Practical Implementation</h3>

<p>In practice:</p>

<ol>
  <li><strong>Train</strong> the model to predict \(E[x_0 \mid x_t]\) (better variance)</li>
  <li><strong>During sampling</strong>, use the relationship in Claim 2 to convert this back to \(E[x_{t-\Delta t} \mid x_t]\)</li>
  <li><strong>Apply the sampling algorithm</strong> as usual</li>
</ol>

<h3 id="the-mathematical-relationship">The Mathematical Relationship</h3>

<p>The factor \(\frac{\Delta t}{t}\) in Claim 2 is one over \(\frac{t}{\Delta t}\), the number of noise steps accumulated up to time \(t\). Dividing the total predicted noise \(x_t - E[x_0 \mid x_t]\) by this count gives the average noise per step, which is the expected size of the single step we want to undo.</p>

<h1 id="3-deterministic-sampling-ddim">3. Deterministic Sampling: DDIM</h1>

<p>DDIM: Denoising Diffusion Implicit Model → A deterministic alternative to the stochastic DDPM sampler.</p>

<h2 id="algorithm-2-deterministic-reverse-sampler-ddim-like">Algorithm 2: Deterministic Reverse Sampler (DDIM-like)</h2>

<p>Instead of using the stochastic sampler that adds random noise at each step, DDIM uses a <strong>deterministic function</strong> that always produces the same output for the same input.</p>

<p>For input sample \(x_t\) and time \(t\), output:</p>

\[\hat{x}_{t-\Delta t} = x_t + \lambda \left( \mu_{t-\Delta t}(x_t) - x_t \right)\]

<p>Where:</p>

\[\lambda = \frac{\sigma_t}{\sigma_{t-\Delta t} + \sigma_t}\]

<p>and</p>

\[\sigma_t = \sigma_q \sqrt{t}\]

<ul>
  <li>\(\mu_{t-\Delta t}(x_t) = E[x_{t-\Delta t} \mid x_t]\) is the conditional expectation (what we’d predict on average)</li>
  <li>\(\lambda = \frac{\sigma_t}{\sigma_{t-\Delta t} + \sigma_t}\) is a scaling factor</li>
  <li>\(\sigma_t = \sigma_q \sqrt{t}\) from the noise schedule</li>
</ul>
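<p>Algorithm 2 can be sketched in a few lines. As an assumption for this example, the target is a 1-D Gaussian so that \(\mu_{t-\Delta t}\) is available in closed form and no trained network is needed. The constants and function names are illustrative only.</p>

```python
import math
import random
import statistics

DATA_MEAN, DATA_STD, SIGMA_Q = 1.0, 0.5, 3.0   # toy target and noise scale (assumptions)

def ideal_mean(x, t, dt):
    """Closed-form mu_{t-dt}(x) = E[x_{t-dt} | x_t = x] for the Gaussian toy target."""
    var_t = DATA_STD ** 2 + SIGMA_Q ** 2 * t
    var_prev = DATA_STD ** 2 + SIGMA_Q ** 2 * (t - dt)
    return DATA_MEAN + (var_prev / var_t) * (x - DATA_MEAN)

def ddim_step(x, t, dt):
    """One deterministic update: x + lambda * (mu_{t-dt}(x) - x)."""
    sigma_t = SIGMA_Q * math.sqrt(t)
    sigma_prev = SIGMA_Q * math.sqrt(max(t - dt, 0.0))   # guard against rounding below zero
    lam = sigma_t / (sigma_prev + sigma_t)
    return x + lam * (ideal_mean(x, t, dt) - x)

def ddim_sample(x1, num_steps=500):
    """Map a starting point x_1 to x_0 by applying ddim_step from t = 1 down to t = dt."""
    dt = 1.0 / num_steps
    x = x1
    for k in range(num_steps, 0, -1):
        x = ddim_step(x, k * dt, dt)
    return x

rng = random.Random(0)
outputs = [ddim_sample(rng.gauss(0.0, SIGMA_Q)) for _ in range(1000)]
print(statistics.mean(outputs), statistics.pstdev(outputs))
```

<p>No random numbers are drawn inside <code>ddim_sample</code>, so the same \(x_1\) always maps to the same \(x_0\). The outputs concentrate around the target but are slightly off, partly because the starting points come from \(N(0, \sigma_q^2)\) instead of the exact \(p_1\) and partly because of the finite step count.</p>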

<h3 id="understanding-the-formula">Understanding the Formula</h3>

<p>Let’s interpret what this update is doing:</p>

<p><strong>Step 1</strong>: \(\mu_{t-\Delta t}(x_t) - x_t\)</p>

<ul>
  <li>This is the “direction” we need to move to get from the current noisy sample to the predicted less-noisy sample.</li>
</ul>

<p><strong>Step 2</strong>: \(\lambda (\mu_{t-\Delta t}(x_t) - x_t)\)</p>

<ul>
  <li>We scale this direction by factor \(\lambda\). This determines how far we actually move.</li>
</ul>

<p><strong>Step 3</strong>: \(x_t + \lambda (\mu_{t-\Delta t}(x_t) - x_t)\)</p>

<ul>
  <li>We take a step in that direction from our current position.</li>
</ul>

<h3 id="why-this-scaling-factor-lambda">Why This Scaling Factor \(\lambda\)?</h3>

<p>The scaling factor \(\lambda\) has a nice interpretation:</p>

<ul>
  <li>When \(\sigma_{t-\Delta t} \approx \sigma_t\) (small time step), then \(\lambda \approx \frac{1}{2}\) (take a moderate step)</li>
  <li>When \(\sigma_{t-\Delta t} \ll \sigma_t\) (large time step), then \(\lambda \approx 1\) (take the full predicted step)</li>
  <li>When \(\sigma_{t-\Delta t} \gg \sigma_t\) (this shouldn’t happen in forward process), then \(\lambda \approx 0\)</li>
</ul>

<h3 id="deterministic-vs-stochastic">Deterministic vs Stochastic</h3>

<p><strong>DDPM (Stochastic)</strong>:</p>

<ul>
  <li>Samples from \(p(x_{t-\Delta t} \mid x_t)\)</li>
  <li>Same input can give different outputs</li>
  <li>Adds randomness at each step</li>
</ul>

<p><strong>DDIM (Deterministic)</strong>:</p>

<ul>
  <li>Uses a fixed function \(F_t(x_t)\)</li>
  <li>Same input always gives same output</li>
  <li>No randomness in the reverse process</li>
</ul>

<h3 id="the-transport-map-perspective">The Transport Map Perspective</h3>

<p>Instead of thinking about sampling from conditional distributions, DDIM thinks about <strong>transport maps</strong>—functions that transform one distribution into another.</p>

<p>The goal is to show that the function \(F_t\) defined by the DDIM update “pushes” the distribution \(p_t\) to \(p_{t-\Delta t}\):</p>

\[F_t \,\sharp\, p_t \approx p_{t-\Delta t}\]

<p>The notation \(F\,\sharp\,p\) means “the distribution you get when you apply function \(F\) to samples from distribution \(p\)”.</p>

<h3 id="advantages-of-ddim"><strong>Advantages of DDIM</strong>:</h3>

<ol>
  <li><strong>Faster sampling</strong>: In practice the deterministic update often tolerates larger step sizes</li>
  <li><strong>Reproducible</strong>: Same starting noise always gives same result</li>
  <li><strong>Interpolation</strong>: Can smoothly interpolate between samples</li>
  <li><strong>Fewer steps</strong>: Often works well with far fewer steps than DDPM</li>
</ol>

<p><strong>Connection to other methods</strong>: This deterministic approach connects to flow-matching and other continuous-time methods.</p>

<h3 id="we-need-to-prove-that-ddim-is-correct-and-works">We need to Prove that DDIM is correct and works:</h3>

<p>The authors prove this in two stages:</p>

<ol>
  <li><strong>Point-mass case</strong>: Show it works for the simplest distributions (single points)</li>
  <li><strong>Marginalization</strong>: Extend to full distributions by considering all possible points</li>
</ol>

<p>This is similar to how flow-matching methods are analyzed—by showing the transport map works pointwise and then extending to distributions.</p>

<p>The key insight is that even though we’re not sampling from \(p(x_{t-\Delta t} \mid x_t)\), we can still achieve the same marginal distribution \(p_{t-\Delta t}\) through this deterministic transport.</p>

<h2 id="31-case-1-single-point">3.1 Case 1: Single Point</h2>

<p><strong>Avoiding complicated math: Refer to paper</strong></p>

<p><strong>What are we trying to prove?</strong></p>

<ul>
  <li>We want to show that DDIM (the deterministic sampler) actually works. But proving it for complicated distributions is hard, so we start with the <strong>simplest possible case</strong>.</li>
</ul>

<p><strong>The simplest case: One dot</strong></p>

<ul>
  <li>Imagine our target is just a single dot at position 0. That’s it—we want to generate samples that are exactly at position 0.</li>
</ul>

<p><strong>What happens when we add noise?</strong></p>

<ul>
  <li>Start: We have a dot at position 0</li>
  <li>After some time: The dot has moved randomly and is now somewhere else (due to noise)</li>
  <li>Our job: Figure out how to move it back toward 0</li>
</ul>

<p><strong>The obvious solution</strong></p>

<ul>
  <li>If we know the dot started at 0, and now it’s at some noisy position, the obvious thing to do is <strong>shrink it back toward 0</strong>.</li>
  <li>If the dot is currently at position 10, and we know it should be closer to 0, we should move it to maybe position 7 or 5 (somewhere closer to 0).</li>
</ul>

<p><strong>The key insight</strong></p>

<ul>
  <li><strong>The fancy DDIM formula is actually just doing this simple shrinking!</strong></li>
</ul>

\[\text{New position} = \text{Old position} + \lambda (\text{Predicted position} - \text{Old position})\]

<ul>
  <li>But in the simple case, this reduces to:</li>
</ul>

\[\text{New position} = (\text{shrink\_factor}) \times \text{Old position}\]

<ul>
  <li>Where \(\text{shrink\_factor} = \sigma_{t-\Delta t} / \sigma_t\) is less than 1, so we’re moving the dot closer to 0.</li>
</ul>
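<p>The reduction to a pure shrink can be verified numerically. The sketch below assumes the point-mass target at 0, so that \(E[x_0 \mid x_t] = 0\) and Claim 2 gives \(E[x_{t-\Delta t} \mid x_t] = (1 - \Delta t / t)\, x_t\). The function names are hypothetical. It checks that one DDIM step equals multiplication by \(\sigma_{t-\Delta t} / \sigma_t = \sqrt{(t - \Delta t)/t}\).</p>

```python
import math

def ddim_step_point_mass(x_t, t, dt, sigma_q=1.0):
    """DDIM update when the target is a point mass at 0, so E[x_0 | x_t] = 0."""
    mean_prev = (1.0 - dt / t) * x_t          # E[x_{t-dt} | x_t] via Claim 2 with x_0 = 0
    sigma_t = sigma_q * math.sqrt(t)
    sigma_prev = sigma_q * math.sqrt(t - dt)
    lam = sigma_t / (sigma_prev + sigma_t)
    return x_t + lam * (mean_prev - x_t)

def shrink_factor(t, dt):
    """Ratio sigma_{t-dt} / sigma_t = sqrt((t - dt) / t); sigma_q cancels out."""
    return math.sqrt((t - dt) / t)

new_position = ddim_step_point_mass(10.0, t=0.8, dt=0.1)
print(new_position, shrink_factor(0.8, 0.1) * 10.0)  # the two values coincide
```

<p>Because the factor is exactly the ratio of noise levels, a cloud of particles with spread \(\sigma_t\) is rescaled to spread \(\sigma_{t-\Delta t}\), which is what the marginal at the earlier time requires.</p>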

<p><strong>Why this matters</strong></p>

<ul>
  <li>This proves that DDIM works correctly in the simplest case. It’s doing exactly what we’d expect—gradually shrinking the noise to bring samples back to the target.</li>
</ul>

<p><strong>The bigger picture</strong></p>

<ul>
  <li><strong>DDIM looks complicated</strong> with all its formulas and Greek letters</li>
  <li><strong>But in the simplest case</strong>, it’s just gradually shrinking noisy samples back toward the target</li>
  <li><strong>This gives us confidence</strong> that it’s doing something sensible in more complex cases too</li>
</ul>

<p>Think of it like this: if you wanted to guide a lost person back to their house, you’d tell them to walk in the direction of their house. DDIM is doing the same thing—it’s figuring out which direction to move to get closer to the target, then taking a step in that direction.</p>

<h2 id="32-velocity-fields-and-gases">3.2 Velocity Fields and Gases</h2>

<p>Instead of thinking about DDIM as a mathematical formula, we can think of it as a <strong>velocity field</strong>—like wind patterns that tell particles which way to move.</p>

<p>The DDIM update can be rewritten as:</p>

\[\hat{x}_{t-\Delta t} = x_t + v_t(x_t) \cdot \Delta t\]

<p>Where:</p>

\[v_t(x_t) = \frac{\lambda}{\Delta t} \left( E[x_{t-\Delta t} \mid x_t] - x_t \right)\]

<p>This looks just like <strong>physics</strong>: position = old position + velocity × time!</p>
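<p>Combining this with Claim 2 from Section 2.3 gives a form that depends only on the \(x_0\)-prediction. This is a short derivation from the formulas already stated, using the fact that \(\lambda \to \frac{1}{2}\) as \(\Delta t \to 0\) for fixed \(t\):</p>

\[v_t(x_t) = \frac{\lambda}{\Delta t} \cdot \frac{\Delta t}{t} \left( E[x_0 \mid x_t] - x_t \right) = \frac{\lambda}{t} \left( E[x_0 \mid x_t] - x_t \right) \;\xrightarrow{\Delta t \to 0}\; \frac{1}{2t} \left( E[x_0 \mid x_t] - x_t \right)\]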

<h3 id="the-gas-analogy">The Gas Analogy</h3>

<p>Imagine a <strong>gas made of particles</strong>:</p>

<ul>
  <li>Each particle represents a possible sample</li>
  <li>The density of particles at any location represents the probability of that sample</li>
  <li>The gas starts with density pattern \(p_t\) (more spread out/noisy)</li>
  <li>We want it to end up with density pattern \(p_{t-\Delta t}\) (less spread out/noisy)</li>
</ul>

<h3 id="how-the-velocity-field-works">How the Velocity Field Works</h3>

<p>The velocity field \(v_t(x)\) tells each particle at position \(x\) which direction to move:</p>

<ul>
  <li><strong>Direction</strong>: Toward where that particle “should” be (based on \(E[x_{t-\Delta t} \mid x_t]\))</li>
  <li><strong>Speed</strong>: Proportional to how far it needs to move</li>
</ul>

<p>When all particles move according to this velocity field, the overall gas density transforms from \(p_t\) to \(p_{t-\Delta t}\).</p>

<h2 id="note-skipping-proofs">Note: Skipping Proofs</h2>

<p>3.3 Case 2: Two Points</p>

<p>3.4 Case 3: Arbitrary Distributions</p>

<p>3.5 The Probability Flow ODE [Optional]</p>

<h2 id="36-discussion-ddpm-vs-ddim">3.6 Discussion: DDPM vs DDIM</h2>

<p><strong>DDPM (Stochastic)</strong>:</p>

<ul>
  <li>Takes a sample and produces a <strong>random output</strong> from \(p(x_{t-\Delta t} \mid x_t)\)</li>
  <li>Same input can give different outputs each time</li>
</ul>

<p><strong>DDIM (Deterministic)</strong>:</p>

<ul>
  <li>Takes a sample and produces the <strong>same output</strong> every time</li>
  <li>Creates a fixed mapping from input to output</li>
</ul>

<h3 id="the-iteration-behaviour">The Iteration Behaviour</h3>

<p>When you run these algorithms from start to finish, they behave very differently:</p>

<p><strong>DDPM: Independence from Starting Point</strong></p>

<ul>
  <li><strong>Key insight</strong>: If you start DDPM from different initial noise samples \(x_1\), you’ll get samples that are essentially independent of where you started.</li>
  <li><strong>Why</strong>: The forward process “mixes” well—it scrambles the original data so much that the final noise \(x_1\) contains almost no information about the original \(x_0\).</li>
  <li><strong>Result</strong>: \(p(x_0 \mid x_1) \approx p(x_0)\)—the output doesn’t depend on the starting noise!</li>
  <li><strong>Analogy</strong>: Like shuffling a deck of cards so thoroughly that the final order tells you nothing about the original order.</li>
</ul>

<p><strong>DDIM: Strong Dependence on Starting Point</strong></p>

<ul>
  <li><strong>Key insight</strong>: DDIM creates a deterministic function from noise to data.</li>
  <li><strong>Why</strong>: Since it’s deterministic, the same starting noise \(x_1\) always produces the same final output \(x_0\).</li>
  <li><strong>Result</strong>: Different starting points lead to different, but predictable outputs.</li>
  <li><strong>Analogy</strong>: Like having a specific recipe—same ingredients always give the same dish.</li>
</ul>

<h3 id="the-mapping-perspective">The Mapping Perspective</h3>

<p>This reveals something profound about DDIM:</p>

<p><strong>DDIM as a Special Map</strong></p>

<ul>
  <li><strong>What it does</strong>: Creates a deterministic function from Gaussian noise \(\rightarrow\) target distribution</li>
  <li><strong>Sounds familiar</strong>: This is similar to GANs and normalizing flows, which also map noise to data. <strong>The key difference is a constraint</strong> on which map gets learned:
    <ul>
      <li><strong>GANs</strong>: Can learn <strong>any</strong> mapping that works—complete freedom</li>
      <li><strong>DDIM</strong>: Must learn the <strong>specific</strong> mapping determined by the target distribution</li>
    </ul>
  </li>
  <li><strong>Why this matters</strong>:
    <ul>
      <li><strong>Supervised vs Unsupervised</strong>: DDIM has a “correct answer” to learn toward</li>
      <li><strong>Smoothness</strong>: The DDIM map inherits smoothness from the target distribution</li>
      <li><strong>Structure</strong>: The mapping respects the geometry of the data</li>
    </ul>
  </li>
</ul>

<h3 id="practical-implications">Practical Implications</h3>

<p><strong>DDPM Advantages:</strong></p>

<ul>
  <li><strong>Sample diversity</strong>: Randomness can help explore different modes</li>
  <li><strong>Robustness</strong>: Less sensitive to the exact starting point</li>
</ul>

<p><strong>DDIM Advantages:</strong></p>

<ul>
  <li><strong>Reproducibility</strong>: Same noise always gives same result</li>
  <li><strong>Interpolation</strong>: Can smoothly interpolate between samples</li>
  <li><strong>Speed</strong>: Often works with fewer steps</li>
  <li><strong>Control</strong>: Deterministic nature enables better control</li>
</ul>

<h3 id="the-learning-trade-off">The Learning Trade-off</h3>

<p><strong>Easier aspects of DDIM</strong>:</p>

<ul>
  <li>Has a “ground truth” target function to learn</li>
  <li>Inherits nice properties from the target distribution</li>
  <li>Supervised learning setup</li>
</ul>

<p><strong>Harder aspects of DDIM</strong>:</p>

<ul>
  <li>Must learn the specific “correct” mapping</li>
  <li>Less flexibility than arbitrary mappings</li>
  <li>May miss easier-to-learn alternatives</li>
</ul>

<h3 id="visual-intuition">Visual Intuition</h3>

<ul>
  <li><strong>DDPM</strong>: Like a skilled artist who can paint many different dogs from the same reference photo—each painting is different but all are valid dogs.</li>
  <li><strong>DDIM</strong>: Like a precise photocopier that always produces the exact same copy from the same input: deterministic, and therefore perfectly reproducible.</li>
</ul>

<h3 id="the-philosophical-difference">The Philosophical Difference</h3>

<ul>
  <li><strong>DDPM</strong>: “Generate samples that look like they came from the target distribution”</li>
  <li><strong>DDIM</strong>: “Learn the specific transformation that the diffusion process implies”</li>
</ul>

<p>This fundamental difference in philosophy leads to all the practical differences we observe in how these methods behave!</p>

<h2 id="37-remarks-on-generalization">3.7 Remarks on Generalization</h2>

<p>This section addresses a crucial practical issue that often gets overlooked in theoretical discussions of diffusion models: <strong>How do we actually learn these models from real data without just memorizing the training set?</strong></p>

<h3 id="the-core-problem">The Core Problem</h3>

<p><strong>What we want</strong>: A model that learns the underlying distribution and can generate new, similar samples.</p>

<p><strong>What we might get</strong>: A model that just memorizes the training data and can only reproduce exact copies of what it saw.</p>

<h3 id="the-empirical-risk-minimization-trap">The Empirical Risk Minimization Trap</h3>

<p><strong>Standard approach</strong>: Train by minimizing prediction error on the training set.</p>

<p><strong>The problem</strong>: If we minimize this error perfectly, we get a model that:</p>

<ul>
  <li>Perfectly predicts the training data</li>
  <li>Only generates samples that are exactly from the training set</li>
  <li>Never creates anything genuinely new</li>
</ul>

<p><strong>Why this fails</strong>: Perfect memorization of finite training data doesn’t help us learn the true underlying distribution.</p>

<p>Imagine learning to draw dogs:</p>

<ul>
  <li><strong>Bad approach</strong>: Memorize every pixel of 1000 dog photos and only reproduce those exact photos</li>
  <li><strong>Good approach</strong>: Learn what makes something “dog-like” and generate new dog images</li>
</ul>
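<p>To make the memorization trap concrete, here is a small sketch of what the perfect empirical-risk minimizer looks like. It assumes the simple forward process \(x_t = x_0 + \sigma_q \sqrt{t}\,\varepsilon\) and a tiny made-up training set; the function name <code>empirical_denoiser</code> is hypothetical. With a finite training set, the optimal \(E[x_0 \mid x_t]\) is a softmax-weighted average of the training points, and it collapses onto the nearest training point as \(t \to 0\).</p>

```python
import numpy as np

def empirical_denoiser(x_t, t, train_set, sigma_q=1.0):
    """E[x_0 | x_t] when x_0 is uniform over a finite training set.

    Assumes the forward process x_t = x_0 + sigma_q * sqrt(t) * noise.
    """
    sq_dist = np.sum((train_set - x_t) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma_q**2 * t)
    weights = np.exp(logits - logits.max())      # stable softmax over training points
    weights /= weights.sum()
    return weights @ train_set

train_set = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])  # toy "dataset"
x_t = np.array([3.0, 0.5])

for t in (1.0, 0.1, 0.001):
    print(t, empirical_denoiser(x_t, t, train_set))
```

<p>At small \(t\) the output is essentially the training point \((4, 0)\): a sampler driven by this exact minimizer can only land on training examples, which is why some form of regularization is needed.</p>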

<h3 id="the-regularization-solution">The Regularization Solution</h3>

<p><strong>The key insight</strong>: We need to prevent perfect memorization through regularization.</p>

<p><strong>Explicit regularization</strong>: Add penalties to prevent overfitting<br />
<strong>Implicit regularization</strong>: Natural limitations prevent memorization:</p>

<ul>
  <li><strong>Finite model capacity</strong>: The neural network can’t memorize everything</li>
  <li><strong>Optimization randomness</strong>: SGD doesn’t find the perfect memorizing solution</li>
  <li><strong>Early stopping</strong>: We don’t train to perfect convergence</li>
</ul>

<p><strong>Why This Matters</strong></p>

<ul>
  <li><strong>For researchers</strong>: Understanding that perfect optimization isn’t the goal—we want controlled generalization.</li>
  <li><strong>For practitioners</strong>:
    <ul>
      <li>Larger datasets help prevent memorization</li>
      <li>Some “imperfection” in training is actually beneficial</li>
      <li>Need to balance fitting the data vs. generalizing</li>
    </ul>
  </li>
</ul>

<p><strong>The Security/Copyright Issue</strong></p>

<ul>
  <li><strong>Real concern</strong>: Models trained on copyrighted or private data might reproduce it exactly.</li>
  <li><strong>Evidence</strong>: Researchers have shown they can extract training images from models like Stable Diffusion with carefully crafted prompts.</li>
</ul>

<h3 id="practical-takeaways">Practical Takeaways</h3>

<ol>
  <li><strong>Don’t aim for zero training loss</strong>: a perfect fit to a finite training set means memorization, so some residual training error is expected and healthy</li>
  <li><strong>Use larger datasets</strong> when possible to reduce memorization</li>
  <li><strong>Implicit regularization</strong> from neural network training often helps naturally</li>
  <li><strong>Be aware of privacy/copyright implications</strong> of potential memorization</li>
</ol>

<h1 id="4-flow-matching">4 Flow Matching</h1>

<p>Flow matching is a generalization of DDIM that provides much more flexibility in designing generative models.</p>

<p>The core ideas behind DDIM don’t actually require:</p>

<ul>
  <li>Gaussian noise</li>
  <li>The specific Gaussian forward process</li>
  <li>Any particular base distribution</li>
</ul>

<p>Instead, the fundamental concept is about <strong>transporting distributions</strong> using <strong>vector fields</strong>.</p>

<h2 id="the-two-step-construction-from-ddim">The Two-Step Construction from DDIM</h2>

<p>Looking back at how DDIM worked, there were really two key steps:</p>

<h3 id="step-1-point-to-point-transport">Step 1: Point-to-Point Transport</h3>

<p>For any single target point \(a\), we can construct a vector field \(v[a]_t\) that transports a sample from the base distribution (like standard Gaussian) to exactly that point \(a\).</p>

<p>Think of this as: “How do I move a particle from random noise to land exactly at point \(a\)?”</p>

<p><strong>Example</strong>:</p>

<ul>
  <li>Target point \(a\) = “golden retriever sitting”</li>
  <li>Vector field \(v[a]_t\) = instructions for how to move a noise sample to become exactly this image</li>
</ul>

<h3 id="step-2-combining-vector-fields">Step 2: Combining Vector Fields</h3>

<p>When we have multiple target points (or a whole distribution), we combine the individual vector fields into a single effective vector field.</p>

<p>This is like: “If I want to transport noise to match a complex distribution, I combine the ‘instructions’ for reaching each individual point.”</p>

<p>Example: If we have many target points, we need to combine all these individual vector fields into <strong>one unified vector field</strong> that can generate the entire distribution.</p>

<ul>
  <li>\(v[a_1]_t\) → path to “golden retriever sitting”</li>
  <li>\(v[a_2]_t\) → path to “beagle running”</li>
  <li>\(v[a_3]_t\) → path to “poodle sleeping”</li>
  <li>etc.</li>
</ul>

\[v_t(x) = \int v[a]_t(x) \, p(a \mid x_t = x) \, da\]

<p>Or in discrete terms:</p>

\[v_t(x) = \sum_a v[a]_t(x) \cdot P(\text{target} = a \mid x_t = x)\]

<p><strong>What This Means Intuitively</strong></p>

<p>At any point \(x\) and time \(t\), the combined vector field tells you:</p>

<ul>
  <li>“Move in the direction that’s the average of all individual directions”</li>
  <li>“Weight each direction by how likely that target is, given that the particle is currently at \(x\)”</li>
</ul>

<p>When \(x\) is almost pure noise, these weights are close to the dataset frequencies \(p^*(a)\); as the noise level drops, they concentrate on the targets that are consistent with \(x\).</p>

<p>Suppose at some point \(x\) during the denoising process:</p>

<ul>
  <li>\(v[a_1]_t(x)\) says “move right” (toward golden retriever)</li>
  <li>\(v[a_2]_t(x)\) says “move left” (toward beagle)</li>
  <li>\(v[a_3]_t(x)\) says “move up” (toward poodle)</li>
</ul>

<p>And, given the current point \(x\), the targets have probabilities:</p>

<ul>
  <li>50% golden retriever</li>
  <li>30% beagle</li>
  <li>20% poodle</li>
</ul>

<p>Then the combined vector field would be:</p>

\[v_t(x) = 0.5 \times \text{"right"} + 0.3 \times \text{"left"} + 0.2 \times \text{"up"}\]
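<p>As a quick arithmetic check, the weighted average above can be computed directly. The 2D unit vectors standing in for “right”, “left”, and “up” are an assumption made purely for illustration.</p>

```python
import numpy as np

directions = {
    "golden retriever": np.array([1.0, 0.0]),   # "move right"
    "beagle": np.array([-1.0, 0.0]),            # "move left"
    "poodle": np.array([0.0, 1.0]),             # "move up"
}
weights = {"golden retriever": 0.5, "beagle": 0.3, "poodle": 0.2}

# Weighted average of the individual directions.
combined = sum(weights[name] * directions[name] for name in directions)
print(combined)
```

<p>The left and right contributions partly cancel, so the combined direction is a modest step up and to the right, \((0.2, 0.2)\).</p>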

<p><strong>The Learning Process</strong></p>

<p>In practice, we don’t know all the individual vector fields \(v[a]_t\) ahead of time. Instead:</p>

<ol>
  <li><strong>Sample pairs</strong>: Take samples \((x_1, x_0)\) where \(x_1\) is from the base distribution and \(x_0\) is from the target distribution</li>
  <li><strong>Construct path</strong>: For each pair, define a path from \(x_1\) to \(x_0\) (like a straight line)</li>
  <li><strong>Learn average</strong>: Train a neural network to predict the average velocity along all these paths</li>
</ol>

<p><strong>Connection to DDIM</strong></p>

<p>In DDIM, this combination happens implicitly:</p>

<ul>
  <li>The conditional expectation \(E[x_{t-\Delta t} \mid x_t]\) is already the result of combining all possible paths</li>
  <li>The Gaussian assumptions make this combination mathematically tractable</li>
  <li>The vector field emerges from the denoising objective</li>
</ul>

<h3 id="the-generalization">The Generalization</h3>

<p>Flow matching asks: <strong>What if we drop all the Gaussian assumptions?</strong></p>

<p>Instead of being limited to:</p>

<ul>
  <li>Gaussian base distributions</li>
  <li>Gaussian forward processes</li>
  <li>Specific noise schedules</li>
</ul>

<p>We can now think about:</p>

<ul>
  <li><strong>Any two points</strong> \(x_0\) and \(x_1\)</li>
  <li><strong>Any two distributions</strong> \(p\) (data) and \(q\) (base)</li>
  <li><strong>Any smooth path</strong> connecting them</li>
</ul>

<h2 id="why-this-matters">Why This Matters</h2>

<blockquote>
  <p>In traditional diffusion models (DDPM/DDIM), the paths are <strong>curved</strong> because of how Gaussian noise is added and removed.</p>

  <ul>
    <li><strong>Why curved?</strong> The forward process adds noise gradually: clean \(\rightarrow\) slightly noisy \(\rightarrow\) more noisy \(\rightarrow\) pure noise. The reverse process follows the same curved trajectory backwards.</li>
    <li>Imagine a ball rolling down a curved hill—it doesn’t go straight down, it follows the curved surface.</li>
  </ul>
</blockquote>

<p><strong>More flexible paths</strong>: Instead of the specific curved paths that Gaussian diffusion creates, we can design:</p>

<h3 id="1-straight-lines-rectified-flows"><strong>1. Straight lines</strong> (rectified flows)</h3>

<p>Instead of curved paths, we connect each noise sample to its corresponding data sample with a straight line.</p>

<p>If you start at noise point \(x_1\) (at \(t = 1\)) and want to reach data point \(x_0\) (at \(t = 0\)):</p>

\[x_t = t \cdot x_1 + (1-t) \cdot x_0\]

<p><strong>Why this helps</strong>:</p>

<ul>
  <li><strong>Fewer steps needed</strong>: The straighter a trajectory is, the less error each discretized ODE step introduces, so the solver can take bigger steps</li>
  <li><strong>Faster sampling</strong>: Fewer steps means fewer neural network evaluations per sample</li>
  <li><strong>More predictable</strong>: Easier to control and understand</li>
  <li><strong>Simpler targets</strong>: The pointwise velocity is constant in time, which keeps the regression target simple</li>
</ul>

<p><strong>Used in Stable Diffusion 3</strong>: SD3 is trained with a rectified-flow formulation of this kind.</p>

<h3 id="2-custom-trajectories"><strong>2. Custom trajectories</strong></h3>

<p>Design paths that are optimized for your specific data type or use case.</p>

<p><strong>For images</strong>:</p>

<ul>
  <li>Paths that preserve image structure early in generation</li>
  <li>Trajectories that handle different frequency components separately</li>
  <li>Paths optimized for specific image types (faces, landscapes, etc.)</li>
</ul>

<p><strong>For text</strong>:</p>

<ul>
  <li>Paths that maintain syntactic structure while changing semantics</li>
  <li>Trajectories that respect language hierarchies (words \(\rightarrow\) sentences \(\rightarrow\) paragraphs)</li>
</ul>

<p><strong>For 3D shapes</strong>:</p>

<ul>
  <li>Paths that preserve geometric constraints</li>
  <li>Trajectories that respect physical laws (like gravity for fluid simulations)</li>
</ul>

<p><strong>For audio</strong>:</p>

<ul>
  <li>Paths that preserve harmonic structure</li>
  <li>Trajectories optimized for different types of sounds (speech, music, etc.)</li>
</ul>

<h3 id="3-paths-that-avoid-low-probability-regions"><strong>3. Paths that avoid low-probability regions</strong></h3>

<p>This is a sophisticated optimization that’s really powerful:</p>

<p><strong>The problem</strong>: In high-dimensional spaces, there are regions where data almost never appears. Traditional diffusion might accidentally pass through these “impossible” regions.</p>

<p><strong>Example with faces</strong>:</p>

<ul>
  <li>Low-probability region: Images with eyes in impossible positions, or faces that morph unnaturally</li>
  <li>Good path: Stays in regions that look like plausible faces throughout the generation process</li>
</ul>

<p><strong>Visual analogy</strong>: Imagine you’re hiking from point A to point B. You could:</p>

<ul>
  <li>Take a straight line (might go through dangerous cliffs)</li>
  <li>Take a curved path that stays on safe, well-traveled trails</li>
</ul>

<p><strong>How it works</strong>:</p>

<ul>
  <li>Instead of: noise \(\rightarrow\) weird intermediate states \(\rightarrow\) final image</li>
  <li>Design: noise \(\rightarrow\) always plausible-looking states \(\rightarrow\) final image</li>
</ul>

<p><strong>Benefits</strong>:</p>

<ul>
  <li><strong>Better intermediate results</strong>: Every step looks reasonable</li>
  <li><strong>More stable training</strong>: Less likely to get stuck in impossible configurations</li>
  <li><strong>Higher quality</strong>: Final results are more realistic</li>
  <li><strong>Conditional generation</strong>: Better control over the generation process</li>
</ul>

<h3 id="different-base-distributions"><strong>Different base distributions</strong>:</h3>

<p>We’re not limited to Gaussian noise. We could use:</p>

<ul>
  <li>Uniform distributions</li>
  <li>Other structured noise patterns</li>
  <li>Even data-dependent base distributions</li>
</ul>

<h3 id="broader-applications"><strong>Broader applications</strong>:</h3>

<p>This framework works for:</p>

<ul>
  <li>Continuous data (images, audio)</li>
  <li>Discrete data (with appropriate metrics)</li>
  <li>Structured data (graphs, molecules)</li>
  <li>Any domain where you can define smooth interpolation</li>
</ul>

<h3 id="the-mathematical-framework">The Mathematical Framework</h3>

<p>The core mathematical object is a <strong>vector field</strong> \(v_t(x)\) that tells you:</p>

<ul>
  <li>At time \(t\)</li>
  <li>At position \(x\)</li>
  <li>Which direction and how fast to move</li>
</ul>

<p>The flow is generated by solving the ODE (up to the sign convention for the direction of time, which Section 4.1 fixes):</p>

\[\frac{dx}{dt} = v_t(x)\]

<h3 id="modern-applications">Modern Applications</h3>

<p><strong>Conditional flows</strong>: Generate samples conditioned on additional information (text, class labels, etc.)</p>

<p>This framework has become the foundation for many state-of-the-art generative models because of its flexibility and mathematical elegance.</p>

<h2 id="41-flows">4.1 Flows</h2>

<p>This section formalizes the mathematical foundation of flows.</p>

<p><strong>What is a Flow?</strong></p>

<p>A <strong>flow</strong> is a collection of time-indexed vector fields:</p>

\[v = \{ v_t \}_{t \in [0,1]}\]

<p>Think of it as a <strong>velocity field</strong> that tells particles how to move at each point in space and time.</p>

<p><strong>Physical analogy</strong>: Imagine a river with currents. At each location \((x, y)\) and time \(t\), the current has a specific velocity and direction. The flow tells you: “If you’re at position \(x\) at time \(t\), move in direction \(v_t(x)\).”</p>

<h3 id="the-flow-ode">The Flow ODE</h3>

<p>Any flow defines how particles move via the differential equation:</p>

\[\frac{dx_t}{dt} = -v_t(x_t)\]

<p><strong>Starting condition</strong>: Begin at \(x_1\) at time \(t = 1\)<br />
<strong>Ending condition</strong>: End at \(x_0\) at time \(t = 0\)</p>

<p><strong>Note the negative sign</strong>: This is because time runs backwards from 1 to 0 (following diffusion convention where \(t=0\) is clean data).</p>

<h3 id="runflow-function">RunFlow Function</h3>

<p>The \(\text{RunFlow}(v, x_1, t)\) function solves the ODE and tells you:</p>

<ul>
  <li><strong>Input</strong>: Starting point \(x_1\), flow \(v\), target time \(t\)</li>
  <li><strong>Output</strong>: Where the particle ends up at time \(t\)</li>
</ul>

<p><strong>Intuitive meaning</strong>: “If I start at \(x_1\) and follow the flow \(v\), where will I be at time \(t\)?”</p>

<p>Flows don’t just move individual points—they transport <strong>entire distributions</strong>:</p>

<ul>
  <li><strong>Individual point</strong>: \(x_1 \rightarrow \text{RunFlow}(v, x_1, 0) = x_0\)</li>
  <li><strong>Entire distribution</strong>: \(p_1 \rightarrow p_0\)</li>
</ul>
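<p>In practice RunFlow has to be approximated numerically. The following is a minimal sketch using plain Euler steps; the function name <code>run_flow</code>, the step count, and the toy flow \(v_t(x) = x\) are all assumptions chosen for illustration, not part of the original text.</p>

```python
import numpy as np

def run_flow(v, x1, t_end, n_steps=1000):
    """Euler approximation of RunFlow(v, x1, t_end).

    Integrates dx/dt = -v(x, t) from t = 1 down to t = t_end.
    """
    x = np.asarray(x1, dtype=float)
    dt = (1.0 - t_end) / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x + v(x, t) * dt      # stepping t -> t - dt flips the minus sign
    return x

# Toy flow (an assumption for illustration): v_t(x) = x, whose exact
# trajectory is x_t = x_1 * exp(1 - t).
x1 = np.array([1.0, -2.0])
x0 = run_flow(lambda x, t: x, x1, t_end=0.0)
print(x0, x1 * np.e)
```

<p>The two printed vectors agree to about three digits, and the agreement improves as <code>n_steps</code> grows.</p>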

<h3 id="the-ultimate-goal">The Ultimate Goal</h3>

<p>We want to learn a flow \(v^*\) such that:</p>

\[q \xrightarrow{v^*} p\]

<p>Where:</p>

<ul>
  <li><strong>q</strong>: Easy-to-sample base distribution (like Gaussian noise)</li>
  <li><strong>p</strong>: Target distribution (like dog images)</li>
  <li><strong>\(v^*\)</strong>: The optimal flow that connects them</li>
</ul>

<h3 id="generation-process">Generation Process</h3>

<p>Once we have \(v^*\), generating samples is simple:</p>

<ol>
  <li><strong>Sample</strong>: \(x_1 \sim q\) (sample from base distribution)</li>
  <li><strong>Transport</strong>: \(x_0 = \text{RunFlow}(v^*, x_1, 0)\) (follow the flow)</li>
  <li><strong>Output</strong>: \(x_0\) (this is your generated sample)</li>
</ol>

<h3 id="connection-to-ddim">Connection to DDIM</h3>

<p><strong>DDIM is actually a special case</strong> of flow matching!</p>

<p><strong>DDIM’s flow</strong>: The continuous-time limit of DDIM corresponds to the flow:</p>

\[v_t(x_t) = \frac{1}{2t} E[x_0 - x_t \mid x_t]\]

<p><strong>Components</strong>:</p>

<ul>
  <li><strong>Base distribution</strong>: Gaussian</li>
  <li><strong>DDIM sampling</strong>: Discretized method for evaluating RunFlow</li>
  <li><strong>DDPM training</strong>: Method for learning \(v^*\) (but relies on Gaussian structure)</li>
</ul>

<h2 id="42-pointwise-flows">4.2 Pointwise Flows</h2>

<p><strong>Core idea</strong>: A pointwise flow connects <strong>one specific point</strong> \(x_1\) to <strong>one specific point</strong> \(x_0\).</p>

<p><strong>What it does</strong>: Given any path from \(x_1\) to \(x_0\), the pointwise flow describes the <strong>velocity at each point</strong> along that path.</p>

<p><strong>Mathematical definition</strong>: \(v^{[x_1, x_0]}\) is a flow that satisfies the ODE with boundary conditions:</p>

<ul>
  <li>Starts at \(x_1\) when \(t = 1\)</li>
  <li>Ends at \(x_0\) when \(t = 0\)</li>
</ul>

<p><strong>Key insight</strong>: Pointwise flows are <strong>not unique</strong>. You can choose different paths between the same two points: straight line, curved path, any smooth trajectory.</p>

<h2 id="43-marginal-flows">4.3 Marginal Flows</h2>

<p><strong>The problem</strong>: We have many individual pointwise flows, but we need <strong>one unified flow</strong> that handles the entire distribution.</p>

<p><strong>The setup</strong>:</p>

<ol>
  <li>Pick a <strong>coupling</strong> \(\Pi_{q,p}\) (way to pair noise samples with data samples)</li>
  <li>For each pair \((x_1, x_0)\), use pointwise flow \(v^{[x_1, x_0]}\)</li>
  <li>This gives us a “collection of particle trajectories”</li>
</ol>

<p><strong>The solution</strong>: Combine all pointwise flows into one marginal flow \(v^*\) using <strong>weighted averaging</strong>:</p>

\[v^*_t(x_t) = E[ v^{[x_1, x_0]}_t(x_t) \mid x_t ]\]

<p><strong>Intuitive meaning</strong>: At any point \(x_t\) and time \(t\), the marginal flow velocity is the <strong>average velocity of all particles</strong> that happen to be at \(x_t\) at that time.</p>

<p><strong>Why this works</strong>:</p>

<ul>
  <li>Individual particles follow their own pointwise flows</li>
  <li>The bulk behavior emerges from averaging all individual behaviors</li>
  <li>Result: one flow that transports \(q \rightarrow p\)</li>
</ul>

<p><strong>Gas analogy</strong>: Instead of tracking every individual gas particle, we describe the <strong>bulk fluid motion</strong>—the average velocity at each location.</p>
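<p>For a finite set of targets the conditional average can be written out exactly. The sketch below assumes a standard Gaussian base, independent coupling, and the straight-line path \(x_t = t x_1 + (1-t) x_0\) (one possible pointwise flow); the function name <code>marginal_velocity</code> and the toy targets are hypothetical.</p>

```python
import numpy as np

def marginal_velocity(x, t, targets, probs):
    """v*_t(x) = E[x_0 - x_1 | x_t = x] for a finite set of targets.

    Assumes a standard Gaussian base, independent coupling, and the linear
    path x_t = t * x_1 + (1 - t) * x_0, so x_t | x_0 = a ~ N((1 - t) a, t^2 I).
    """
    sq_dist = np.sum((x - (1.0 - t) * targets) ** 2, axis=1)
    logits = np.log(probs) - sq_dist / (2.0 * t**2)
    post = np.exp(logits - logits.max())
    post /= post.sum()                          # posterior P(x_0 = a | x_t = x)
    pointwise = (targets - x) / t               # x_0 - x_1 for each candidate a
    return post @ pointwise

targets = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
probs = np.array([0.5, 0.3, 0.2])
x = np.array([0.5, 0.2])

print(marginal_velocity(x, 1.0, targets, probs))   # weights equal the prior
print(marginal_velocity(x, 0.3, targets, probs))   # weights favour nearby targets
```

<p>At \(t = 1\) the particle is pure noise, so every target is weighted by its prior probability. Later in the flow, the average is dominated by the targets whose paths plausibly pass through \(x\).</p>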

<p><strong>Remaining challenges</strong>:</p>

<ol>
  <li><strong>Which pointwise flow to choose?</strong> (straight lines? curves?)</li>
  <li><strong>How to compute \(v^*\) in practice?</strong></li>
</ol>

<p>These questions drive the practical algorithms we’ll see next.</p>

<h2 id="44-a-simple-choice-of-pointwise-flow">4.4 A Simple Choice of Pointwise Flow</h2>

<p><strong>The Three Design Choices</strong></p>

<p>To build a flow matching model, we need to choose:</p>

<ol>
  <li><strong>Base distribution \(q\)</strong>: What we sample from initially
    <ol>
      <li>Gaussian (most common)</li>
      <li>Uniform</li>
      <li>Annular (ring-shaped)</li>
    </ol>
  </li>
  <li><strong>Coupling \(\Pi_{q,p}\)</strong>: How we pair base samples with target samples. The simplest choice is the independent coupling: sample from \(q\) and \(p\) separately and pair the samples at random.</li>
  <li><strong>Pointwise flow</strong>: How we connect each pair</li>
</ol>

<h3 id="linear-pointwise-flow">Linear Pointwise Flow</h3>

<p>The simplest pointwise flow is <strong>straight-line interpolation</strong>:</p>

\[v^{[x_1, x_0]}_t(x_t) = x_0 - x_1\]

<p>This gives a <strong>constant velocity</strong> pointing from \(x_1\) to \(x_0\).</p>

<p><strong>The resulting trajectory</strong>:</p>

\[\text{RunFlow}(v^{[x_1, x_0]}, x_1, t) = t x_1 + (1-t) x_0\]

<p>This is just <strong>linear interpolation</strong> between the two points!</p>

<p>At different times \(t\):</p>

<ul>
  <li><strong>\(t = 1\)</strong>: Position is \(x_1\) (base distribution sample)</li>
  <li><strong>\(t = 0.5\)</strong>: Position is \(0.5 x_1 + 0.5 x_0\) (halfway between)</li>
  <li><strong>\(t = 0\)</strong>: Position is \(x_0\) (target distribution sample)</li>
</ul>

<p><strong>Physical interpretation</strong>: A particle moves at constant speed from \(x_1\) to \(x_0\), taking exactly 1 time unit to complete the journey.</p>
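<p>A short numerical check of the linear pointwise flow, written as a sketch with hypothetical helper names (<code>linear_path</code>, <code>linear_velocity</code>) and made-up endpoints. It verifies by finite differences that the straight-line trajectory satisfies the flow ODE \(\frac{dx_t}{dt} = -v_t(x_t)\) with the constant velocity \(x_0 - x_1\).</p>

```python
import numpy as np

def linear_path(x1, x0, t):
    """Position at time t on the straight line from x1 (t = 1) to x0 (t = 0)."""
    return t * x1 + (1.0 - t) * x0

def linear_velocity(x1, x0):
    """Pointwise flow velocity: constant in both t and x_t."""
    return x0 - x1

x1 = np.array([2.0, -1.0])   # toy base sample
x0 = np.array([0.0, 3.0])    # toy target sample

# Finite-difference check of the flow ODE dx_t/dt = -v_t(x_t) at t = 0.4.
t, eps = 0.4, 1e-6
dx_dt = (linear_path(x1, x0, t + eps) - linear_path(x1, x0, t - eps)) / (2 * eps)
print(dx_dt, -linear_velocity(x1, x0))
```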

<h2 id="45-flow-matching">4.5 Flow Matching</h2>

<p>We want to compute the optimal vector field \(v^*_t(x_t)\), but naively this requires sampling from \(p(x_0 \mid x_t)\)—which is exactly the hard problem we’re trying to solve! It’s circular reasoning.</p>

<h3 id="the-ddpm-trick-applied-to-flow-matching">The DDPM Trick Applied to Flow Matching</h3>

<p>Just like in DDPM, we can avoid this circular problem by using <strong>regression</strong>:</p>

<p>Instead of trying to sample from \(p(x_0 \mid x_t)\), we:</p>

<ol>
  <li>Sample from the <strong>joint distribution</strong> \((x_0, x_1)\)—this is easy!</li>
  <li>Compute \(x_t\) deterministically using our chosen flow</li>
  <li>Set up a regression problem to learn the expected vector field</li>
</ol>

<p>The key insight is that:</p>

\[v^*_t(x_t) = E[ v^{[x_1, x_0]}_t(x_t) \mid x_t ]\]

<p>And since the conditional expectation is the minimizer of the mean squared error over all functions of \(x_t\):</p>

\[v^*_t = \arg\min_f E\left[ \| f(x_t) - v^{[x_1, x_0]}_t(x_t) \|^2 \right]\]

<p>This means we can learn \(v^*_t\) by minimizing squared error!</p>
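<p>One way to see why the regression recovers \(v^*_t\) is the standard decomposition of the squared error, sketched here for a fixed \(t\) and any candidate \(f\), writing \(v = v^{[x_1, x_0]}_t(x_t)\) for brevity:</p>

\[E\left[ \| f(x_t) - v \|^2 \right] = E\left[ \| f(x_t) - v^*_t(x_t) \|^2 \right] + E\left[ \| v^*_t(x_t) - v \|^2 \right]\]

<p>The cross term vanishes because \(E[v - v^*_t(x_t) \mid x_t] = 0\) by the definition of \(v^*_t\). The second term does not depend on \(f\), so the minimum is attained at \(f = v^*_t\), even though the loss itself generally stays above zero.</p>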

<h3 id="the-training-process-1">The Training Process</h3>

<p><strong>Pseudocode 4:</strong> Flow-matching train loss, generic pointwise flow [or linear flow]</p>

<p>(See the figure below.)</p>

<p><img src="/assets/images/2025-07-17/c.png" alt="420" /></p>

<p>Let me walk through each step:</p>

<p><strong>Step 1</strong>: \((x_1, x_0) \leftarrow \text{Sample}(\Pi_{q,p})\)</p>

<ul>
  <li>Sample a source point \(x_1\) from base distribution \(q\) (e.g., Gaussian noise)</li>
  <li>Sample a target point \(x_0\) from data distribution \(p\) (e.g., real image)</li>
  <li>These form a training pair</li>
</ul>

<p><strong>Step 2</strong>: \(t \leftarrow \text{Unif}[0, 1]\)</p>

<ul>
  <li>Pick a random time point during the flow</li>
</ul>

<p><strong>Step 3</strong>: \(x_t \leftarrow \text{RunFlow}(v^{[x_1, x_0]}, x_1, t)\)</p>

<ul>
  <li>Starting from \(x_1\), run the pointwise flow for time \(t\) to get \(x_t\)</li>
  <li>For linear flows: \(x_t = t \cdot x_1 + (1-t) \cdot x_0\)</li>
</ul>

<p><strong>Step 4</strong>: \(L \leftarrow \| f_\theta(x_t, t) - v^{[x_1, x_0]}_t(x_t) \|^2\)</p>

<ul>
  <li>\(f_\theta(x_t, t)\): What our neural network predicts the velocity should be</li>
  <li>\(v^{[x_1, x_0]}_t(x_t)\): What the true velocity should be for this specific flow</li>
  <li>For linear flows: \(v^{[x_1, x_0]}_t(x_t) = x_0 - x_1\)</li>
</ul>
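<p>The four steps can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the reference implementation: it hard-codes the linear flow, uses a single-point toy target so the exact marginal flow is known, and replaces the network \(f_\theta\) with plain Python callables (<code>oracle</code> and <code>zero_model</code> are hypothetical names).</p>

```python
import numpy as np

def flow_matching_loss(f, x1, x0, t):
    """Monte Carlo estimate of the flow-matching loss for the linear flow.

    f(x_t, t) is any callable standing in for the network f_theta.
    x1, x0 have shape (batch, dim); t has shape (batch, 1).
    """
    x_t = t * x1 + (1.0 - t) * x0          # Step 3: point on the straight path
    target = x0 - x1                        # Step 4: pointwise velocity
    return np.mean(np.sum((f(x_t, t) - target) ** 2, axis=1))

rng = np.random.default_rng(0)
batch, dim = 256, 2
a = np.array([1.0, -2.0])                   # toy target: every x0 equals a

x1 = rng.standard_normal((batch, dim))      # Step 1: base samples from q
x0 = np.tile(a, (batch, 1))                 # Step 1: target samples from p
t = rng.uniform(0.05, 1.0, size=(batch, 1)) # Step 2 (kept away from t = 0 here)

def oracle(x_t, t):
    # Exact marginal flow when the target is the single point a.
    return (a - x_t) / t

def zero_model(x_t, t):
    # A deliberately bad predictor for comparison.
    return np.zeros_like(x_t)

print(flow_matching_loss(oracle, x1, x0, t))      # essentially zero
print(flow_matching_loss(zero_model, x1, x0, t))  # clearly positive
```

<p>In a real setup <code>f</code> would be a trainable network and this loss would be minimized by gradient descent. The single-point target is the special case where the loss can reach zero, because every path through \(x_t\) shares the same velocity.</p>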

<h3 id="the-sampling-process">The Sampling Process</h3>

<p><strong>Pseudocode 5:</strong> Flow-matching sampling</p>

<p>(See the figure below.)</p>

<p><img src="/assets/images/2025-07-17/d.png" alt="420" /></p>

<p><strong>Step 1</strong>: \(x_1 \leftarrow \text{Sample}(q)\)</p>

<ul>
  <li>Start with a random sample from the base distribution (noise)</li>
</ul>

<p><strong>Steps 2-4</strong>: Iterative integration</p>

<ul>
  <li>For each time step, update: \(x_{t-\Delta t} \leftarrow x_t + f_\theta(x_t, t) \Delta t\)</li>
  <li>This is <strong>Euler integration</strong> of the ODE \(\frac{dx_t}{dt} = -f_\theta(x_t, t)\), stepped backwards in time from \(t = 1\) to \(t = 0\)</li>
  <li>We’re following the learned vector field from noise to data</li>
</ul>
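<p>The sampling loop can be tried end to end on a toy problem where the marginal flow is known in closed form, so no training is needed. The setup is an assumption for illustration: base \(q = N(0, 1)\), target \(p = N(\mu, s^2)\), independent coupling, and the linear flow. The closed-form <code>exact_field</code> plays the role of a trained \(f_\theta\).</p>

```python
import numpy as np

MU, S = 2.0, 0.5   # assumed toy target: p = N(MU, S^2); base q = N(0, 1)

def exact_field(x, t):
    """Closed-form marginal flow E[x_0 - x_1 | x_t] for this toy pair,
    standing in for a trained f_theta (linear flow, independent coupling)."""
    var_t = t**2 + (1.0 - t) ** 2 * S**2
    cov = (1.0 - t) * S**2 - t
    return MU + cov / var_t * (x - (1.0 - t) * MU)

def sample(f, n_samples, n_steps, rng):
    x = rng.standard_normal(n_samples)     # Step 1: x_1 ~ q
    dt = 1.0 / n_steps
    for i in range(n_steps):               # Steps 2-4: Euler updates
        t = 1.0 - i * dt
        x = x + f(x, t) * dt               # x_{t - dt} = x_t + f(x_t, t) * dt
    return x

samples = sample(exact_field, n_samples=20000, n_steps=200,
                 rng=np.random.default_rng(0))
print(samples.mean(), samples.std())       # should be close to MU and S
```

<p>The generated samples have mean and standard deviation close to those of the target, which is what transporting \(q\) to \(p\) should produce.</p>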

<p><strong>The Beautiful Simplicity:</strong> This framework is elegant because:</p>

<ul>
  <li><strong>No complex probability calculations</strong>—just regression</li>
  <li><strong>Flexible path design</strong>—choose any pointwise flow you want</li>
  <li><strong>Efficient sampling</strong>—straightforward ODE integration</li>
  <li><strong>Scalable training</strong>—standard neural network optimization</li>
</ul>

<p>The key insight is that by breaking the problem into pointwise flows and then learning their average, we can solve generative modeling using simple, well-understood techniques.</p>

<p>Further reading: <a href="https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html">https://mlg.eng.cam.ac.uk/blog/2024/01/20/flow-matching.html</a> has helpful visualizations of flows and uses notation more consistent with the current literature.</p>

<h1 id="5-diffusion-in-practice">5 Diffusion in Practice</h1>

<h2 id="samplers-in-practice">Samplers in Practice</h2>

<h3 id="the-speed-problem"><strong>The Speed Problem</strong></h3>

<ul>
  <li>DDPM and DDIM samplers are essentially the “Model T” of diffusion sampling. Each sampling step requires an expensive neural network forward pass, and even today’s best samplers need around 10 steps minimum.</li>
  <li><strong>This is a massive bottleneck.</strong> Imagine waiting 10+ seconds for a single image generation when users expect near-instantaneous results.</li>
</ul>

<h3 id="the-sdeode-connection-unlocks-better-samplers"><strong>The SDE/ODE Connection Unlocks Better Samplers</strong></h3>

<p>Since DDPM and DDIM are discretizations of the reverse SDE and Probability Flow ODE respectively, we can leverage decades of numerical methods research.</p>

<p>Any ODE/SDE solver becomes a potential diffusion sampler:</p>

<ul>
  <li>Euler methods</li>
  <li>Heun’s method</li>
  <li>Runge-Kutta variants</li>
  <li>Custom solvers designed for diffusion’s specific structure</li>
</ul>

<p>This perspective transformed sampler development from ad-hoc tweaking to principled numerical analysis.</p>
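<p>To illustrate why the choice of solver matters, the sketch below compares Euler and Heun steps on a toy flow whose exact endpoint is known. The setup is assumed for illustration only: a 1D Gaussian-to-Gaussian linear flow with a closed-form velocity field in place of a trained network. Note that Heun uses two field evaluations per step.</p>

```python
import numpy as np

MU, S = 2.0, 0.5   # assumed toy problem: q = N(0, 1), p = N(MU, S^2)

def field(x, t):
    """Closed-form marginal flow for the toy Gaussian pair (linear flow)."""
    var_t = t**2 + (1.0 - t) ** 2 * S**2
    return MU + ((1.0 - t) * S**2 - t) / var_t * (x - (1.0 - t) * MU)

def euler(x, n_steps):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x + field(x, t) * dt
    return x

def heun(x, n_steps):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        k1 = field(x, t)
        k2 = field(x + k1 * dt, t - dt)     # slope at the Euler-predicted point
        x = x + 0.5 * (k1 + k2) * dt        # average the two slopes
    return x

x1 = np.linspace(-2.0, 2.0, 9)
exact = MU + S * x1                          # exact endpoint of this toy flow
for n in (5, 10, 20):
    print(n, np.abs(euler(x1, n) - exact).max(), np.abs(heun(x1, n) - exact).max())
```

<p>On this example Heun reaches a given accuracy with far fewer steps than Euler, which is the kind of gain that motivates higher-order samplers when every step costs a network forward pass.</p>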

<h3 id="the-distillation-revolution"><strong>The Distillation Revolution</strong></h3>

<p><strong>Distillation methods</strong> train a student model to match a multi-step diffusion teacher in just one step (or a handful of steps):</p>

<ul>
  <li><strong>Consistency Models</strong></li>
  <li><strong>Adversarial Distillation</strong></li>
</ul>

<p>⚠️ <strong>Important caveat</strong>: These distilled models aren’t technically diffusion models anymore—they’re neural networks trained to mimic diffusion output, but they’ve abandoned the iterative denoising process entirely.</p>

<h2 id="noise-schedules">Noise Schedules</h2>

<h3 id="why-schedules-matter"><strong>Why Schedules Matter</strong></h3>

<p>The noise schedule (\(\sigma_t\)) determines how much noise gets added at each timestep. This seemingly simple choice has profound implications for training stability, sample quality, and convergence speed.</p>

<h3 id="variance-exploding-vs-variance-preserving"><strong>Variance Exploding vs. Variance Preserving</strong></h3>

<p>Simple diffusion has \(p(x_t \mid x_0) = N(x_0, \sigma_t^2)\) with \(\sigma_t \propto \sqrt{t}\), meaning the <strong>variance explodes</strong> over time. This is one of two major paradigms:</p>

<ul>
  <li><strong>Variance Exploding (VE)</strong>: Noise variance grows unboundedly</li>
  <li><strong>Variance Preserving (VP)</strong>: Noise variance stays controlled</li>
</ul>

<h3 id="the-ho-et-al-schedule-still-industry-standard">The Ho et al. Schedule (Still Industry Standard)</h3>

<p>The most popular schedule comes from the original DDPM paper:</p>

\[x_t = \sqrt{1 - \beta(t)} \cdot x_{t-1} + \sqrt{\beta(t)} \cdot \varepsilon_t\]

<p>Where \(\beta(t)\) is carefully chosen so that:</p>

<ul>
  <li>\(t = 1\): Nearly clean data</li>
  <li>\(t = T\) (the final step): Pure noise</li>
  <li>Variance remains bounded throughout</li>
</ul>
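<p>These three properties can be checked numerically. The sketch below assumes the linear \(\beta\) schedule from the DDPM paper (\(10^{-4}\) to \(0.02\) over \(T = 1000\) steps) and unit-variance toy data; the variable names are illustrative.</p>

```python
import numpy as np

T = 1000
# Linear beta schedule with the endpoints used in the DDPM paper (1e-4 to 0.02).
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # fraction of signal variance left at step t

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)               # toy data with unit variance
for beta in betas:                           # x_t = sqrt(1 - beta) x_{t-1} + sqrt(beta) eps
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(alpha_bar[0], alpha_bar[-1])           # near 1 at t = 1, near 0 at t = T
print(x.var())                               # stays close to 1 throughout
```

<p>The cumulative product shows how much of the original signal survives: almost all of it after the first step and almost none after the last, while the total variance stays near 1.</p>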

<h3 id="the-karras-reparameterization">The Karras Reparameterization</h3>

<p>Karras et al. [2022] introduced a more intuitive way to think about schedules using:</p>

<ul>
  <li><strong>Overall scaling</strong>: \(s(t)\)</li>
  <li><strong>Noise level (standard deviation)</strong>: \(\sigma(t)\)</li>
</ul>

<p>Their suggested schedule: \(s(t) = 1, \sigma(t) = t\)</p>

<p>This framework makes it much easier to reason about and experiment with different noise schedules.</p>

<h3 id="the-sde-framework-maximum-flexibility">The SDE Framework: Maximum Flexibility</h3>

<p>The general SDE formulation gives us incredible flexibility:</p>

\[dx_t = f(x_t, t)dt + g(t)dw_t\]

<p><strong>Examples of what this enables:</strong></p>

<ul>
  <li>Our simple diffusion: \(f = 0\),  \(g = \sigma_q\)</li>
  <li>Ho et al. schedule: \(f = -\frac{1}{2}\beta(t)\, x_t\), \(g = \sqrt{\beta(t)}\)</li>
  <li>Karras schedule: \(f = 0\), \(g = \sqrt{2t}\)</li>
</ul>
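<p>As a sanity check on two of these choices, the forward SDE can be simulated with the Euler-Maruyama method. This is a rough sketch: the helper name <code>euler_maruyama</code>, the time range \([0, 1]\), and the value of \(\sigma_q\) are assumptions for illustration.</p>

```python
import numpy as np

def euler_maruyama(f, g, x0, n_steps, rng):
    """Simulate dx = f(x, t) dt + g(t) dw from t = 0 to t = 1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(20000)                 # start every path at the same clean point
sigma_q = 0.7                        # assumed value for illustration

simple = euler_maruyama(lambda x, t: 0.0 * x, lambda t: sigma_q, x0, 500, rng)
karras = euler_maruyama(lambda x, t: 0.0 * x, lambda t: np.sqrt(2.0 * t), x0, 500, rng)

print(simple.std(), sigma_q)         # std grows like sigma_q * sqrt(t)
print(karras.std(), 1.0)             # std matches sigma(t) = t at t = 1
```

<p>With \(g = \sqrt{2t}\) the accumulated variance is \(\int_0^t 2s \, ds = t^2\), which is where the Karras choice \(\sigma(t) = t\) comes from.</p>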

<h2 id="likelihood-interpretations-and-vaes">Likelihood Interpretations and VAEs</h2>

<h3 id="diffusion-as-hierarchical-vae">Diffusion as Hierarchical VAE</h3>

<p>Here’s a perspective that fundamentally changed how we think about diffusion models: <strong>they’re actually a special case of deep hierarchical VAEs</strong>. This isn’t just theoretical elegance—it has profound practical implications.</p>

<p><strong>The key insight</strong>: Each diffusion timestep corresponds to one “layer” of a VAE decoder, with the forward diffusion process acting as a fixed (non-learned) encoder that produces the sequence of noisy latents \(\{x_t\}\).</p>

<h3 id="why-this-perspective-revolutionized-training">Why This Perspective Revolutionized Training</h3>

<p>Traditional deep VAEs suffer from notorious training instability because gradients must flow through all layers. <strong>Diffusion’s Markovian structure breaks this dependency</strong>—each layer can be trained in isolation without forward/backward passing through previous layers.</p>

<p>This is why diffusion models train so much more stably than traditional deep generative models.</p>

<h3 id="the-likelihood-advantage">The Likelihood Advantage</h3>

<p>The VAE interpretation gives us something incredibly valuable: <strong>actual likelihood estimates</strong> via the Evidence Lower Bound (ELBO). This means we can train diffusion models with principled maximum-likelihood objectives.</p>

<p><strong>Plot twist</strong>: The ELBO for diffusion VAEs reduces to exactly the L2 regression loss we’ve been using, but with specific time-weighting that treats regression errors differently at different timesteps.</p>

<p>⚠️ <strong>The practical dilemma</strong>: The “principled” VAE-derived time-weighting doesn’t always produce the best samples. Ho et al. [2020] famously just dropped the time-weighting and uniformly weighted all timesteps—sometimes theory and practice diverge!</p>

<h2 id="parametrization-the-x_0--varepsilon--v-prediction-wars">Parametrization: The \(x_0\) / \(\varepsilon\) / \(v\)-Prediction Wars</h2>

<h3 id="what-should-your-network-actually-predict">What Should Your Network Actually Predict?</h3>

<p>This is one of the most important practical decisions you’ll make, and it’s not obvious. You have four main options:</p>

<p><strong>1. Direct Prediction (What We’ve Been Doing)</strong></p>

\[\min \| f_\theta(x_t, t) - x_{t-\Delta t} \|^2\]

<p>Network predicts the partially-denoised data.</p>

<p><strong>2. \(x_0\)-Prediction</strong></p>

\[\min \| f_\theta(x_t, t) - x_0 \|^2\]

<p>Network predicts the fully-denoised original data. This is <em>nearly</em> equivalent to direct prediction, differing only by a time-weighting factor of \(1/t\).</p>

<p><strong>3. \(\varepsilon\)-Prediction</strong></p>

\[\min \| f_\theta(x_t, t) - \varepsilon_t \|^2\]

<p>Network predicts the noise that was added, where \(\varepsilon_t = (1/\sigma_t)\, E[x_t - x_0 \mid x_t]\) is the expected noise given the current noisy sample.</p>

<p><strong>4. \(v\)-Prediction</strong><br />
Network predicts \(v = \alpha_t \varepsilon - \sigma_t x_0\)—essentially predicting data at high noise levels and noise at low noise levels.</p>
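
<p>To see how these targets relate, here is a small sketch that converts between them. It assumes the variance-preserving convention \(x_t = \alpha_t x_0 + \sigma_t \varepsilon\) with \(\alpha_t^2 + \sigma_t^2 = 1\), which is what the \(v\) definition above relies on. The helper names are hypothetical.</p>

```python
import numpy as np

def to_v(x0, eps, alpha, sigma):
    """v-target from clean data and noise: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0

def x0_from_eps(xt, eps_pred, alpha, sigma):
    """Recover x0 from a noise prediction by inverting x_t = alpha * x0 + sigma * eps."""
    return (xt - sigma * eps_pred) / alpha

def x0_from_v(xt, v_pred, alpha, sigma):
    """Recover x0 from a v prediction (uses alpha^2 + sigma^2 = 1)."""
    return alpha * xt - sigma * v_pred

def eps_from_v(xt, v_pred, alpha, sigma):
    """Recover the noise from a v prediction (uses alpha^2 + sigma^2 = 1)."""
    return sigma * xt + alpha * v_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
alpha, sigma = np.cos(0.3), np.sin(0.3)   # any pair with alpha^2 + sigma^2 = 1
xt = alpha * x0 + sigma * eps
v = to_v(x0, eps, alpha, sigma)

print(np.allclose(x0_from_eps(xt, eps, alpha, sigma), x0))
print(np.allclose(x0_from_v(xt, v, alpha, sigma), x0))
print(np.allclose(eps_from_v(xt, v, alpha, sigma), eps))
```

<p>With exact targets all three checks pass, so a perfect predictor of any one quantity determines the others. The parametrizations only start to differ once the network's predictions carry error.</p>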

<h3 id="why-this-choice-matters-enormously">Why This Choice Matters Enormously</h3>

<p><strong>Mathematically</strong>, these are equivalent—they differ only by time-weightings. <strong>In practice</strong>, they behave very differently because:</p>

<ol>
  <li><strong>Learning is imperfect</strong>—certain objectives may be more robust to errors</li>
  <li><strong>Different parametrizations have different failure modes</strong></li>
  <li><strong>Some combinations are fundamentally problematic</strong></li>
</ol>

<p><strong>Example failure case</strong>: \(x_0\)-prediction with schedules that heavily weight low noise levels often fails because the identity function achieves low loss but produces terrible samples.</p>

<h2 id="the-error-landscape-what-actually-goes-wrong">The Error Landscape: What Actually Goes Wrong</h2>

<h3 id="training-time-errors">Training-Time Errors</h3>

<p>These are standard statistical learning errors in approximating the population-optimal regression function:</p>

<ul>
  <li><strong>Approximation error</strong>: Your network architecture isn’t expressive enough</li>
  <li><strong>Estimation error</strong>: You don’t have enough training data</li>
  <li><strong>Optimization error</strong>: Your training procedure doesn’t find the global optimum</li>
</ul>

<h3 id="sampling-time-errors">Sampling-Time Errors</h3>

<p>These are discretization errors from using finite step-sizes \(\Delta t\):</p>

<ul>
  <li><strong>For DDPM</strong>: Error in the Gaussian approximation of the reverse process</li>
  <li><strong>For DDIM/Flow Matching</strong>: Error in simulating continuous-time flows discretely</li>
</ul>
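
<p>The discretization error of a deterministic sampler can be seen on a toy case where the continuous-time flow has a closed form. The sketch assumes one-dimensional Gaussian data with standard deviation <code class="language-plaintext highlighter-rouge">data_std</code>, the schedule \(s(t) = 1, \sigma(t) = t\), and an exact score, so the flow is \(dx/dt = t\,x / (\text{data\_std}^2 + t^2)\). The function names are hypothetical, and the only error left is the one from taking finite Euler steps.</p>

```python
import numpy as np

def euler_flow_to_zero(x_T, T, n_steps, data_std=1.0):
    """Euler-integrate dx/dt = t * x / (data_std^2 + t^2) from t = T down to t = 0."""
    ts = np.linspace(T, 0.0, n_steps + 1)
    x = float(x_T)
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        velocity = t_now * x / (data_std**2 + t_now**2)
        x = x + (t_next - t_now) * velocity
    return x

def exact_flow_to_zero(x_T, T, data_std=1.0):
    """Closed-form solution of the same ODE evaluated at t = 0."""
    return x_T * data_std / np.sqrt(data_std**2 + T**2)

x_T, T = 1.0, 3.0
exact = exact_flow_to_zero(x_T, T)
for n_steps in (10, 100, 1000):
    err = abs(euler_flow_to_zero(x_T, T, n_steps) - exact)
    print(f"{n_steps:5d} steps  abs error = {err:.5f}")
```

<p>The error shrinks as the number of steps grows, which is the speed versus quality trade-off behind choosing \(\Delta t\).</p>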

<h3 id="the-interaction-problem">The Interaction Problem</h3>

<p>Here’s what makes this challenging: <strong>these errors interact and compound in complex, poorly understood ways</strong>. We don’t fully understand how regression errors translate into distributional errors of the final generative model.</p>

<p><strong>Surprising twist</strong>: These “errors” can actually be beneficial on small datasets, acting as regularization that prevents the model from just memorizing training samples.</p>

<h3 id="key-practical-takeaways">Key Practical Takeaways</h3>

<ol>
  <li><strong>VAE Perspective Guides Training Strategy:</strong> Understanding diffusion as hierarchical VAE explains why they train so stably and provides principled likelihood-based objectives (even if you sometimes ignore the principled weighting).</li>
  <li><strong>Parametrization Choice Is Critical:</strong> The \(x_0\)/\(\varepsilon\)/\(v\)-prediction choice significantly impacts training dynamics and sample quality. There’s no universal best choice—it depends on your specific use case and schedule.</li>
  <li><strong>Error Sources Are Inevitable But Manageable:</strong> Both training-time and sampling-time errors are unavoidable, but understanding their sources helps you make informed trade-offs between speed, quality, and robustness.</li>
  <li><strong>Theory vs. Practice Tension:</strong> The “principled” choices from theory don’t always win in practice. Be prepared to empirically validate theoretical insights rather than blindly following them.</li>
</ol>

<p><img src="/assets/images/2025-07-17/e.png" alt="420" /></p>

<h2 id="further-reading-and-resources">Further Reading and Resources</h2>

<ul>
  <li><a href="https://www.tonyduan.com/diffusion/index.html">Tony Duan’s Diffusion Tutorial</a></li>
  <li><a href="https://cvpr2023-tutorial-diffusion-models.github.io/">CVPR 2023 Diffusion Models Tutorial</a></li>
  <li><a href="https://sander.ai/2023/07/20/perspectives.html">Sander Dieleman’s Perspectives on Diffusion Models</a></li>
  <li><a href="https://iclr-blogposts.github.io/2024/blog/diffusion-theory-from-scratch/">ICLR Blog: Diffusion Theory from Scratch (2024)</a></li>
</ul>]]></content><author><name></name></author><category term="diffusion" /><summary type="html"><![CDATA[Nakkiran, Preetum, Arwen Bradley, Hattie Zhou, and Madhu Advani. “Step-by-Step Diffusion: An Elementary Tutorial.” arXiv, June 23, 2024. https://doi.org/10.48550/arXiv.2406.08929.]]></summary></entry><entry><title type="html">Structure, Layout and Markdown for maintaining this self-notes website</title><link href="https://aayush9753.in/blog/2025/structure-layout-markdown/" rel="alternate" type="text/html" title="Structure, Layout and Markdown for maintaining this self-notes website" /><published>2025-07-04T00:00:00+00:00</published><updated>2025-07-04T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2025/structure-layout-markdown</id><content type="html" xml:base="https://aayush9753.in/blog/2025/structure-layout-markdown/"><![CDATA[<h1 id="markdown">Markdown</h1>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#basic-markdown-elements">Basic Markdown Elements</a></li>
  <li><a href="#extended-markdown-features">Extended Markdown Features</a></li>
  <li><a href="#advanced-formatting">Advanced Formatting</a></li>
  <li><a href="#quick-reference">Quick Reference</a></li>
</ol>

<hr />

<h2 id="introduction">Introduction</h2>

<p>Markdown is a lightweight markup language that transforms plain text into beautifully formatted documents. This guide covers everything from basic syntax to advanced features.</p>

<blockquote>
  <p><strong>Note</strong>: This guide follows the <a href="https://blog.webdevsimplified.com/2023-06/markdown-crash-course/">Markdown Crash Course</a> methodology with enhanced formatting and organization.</p>
</blockquote>

<hr />

<h2 id="basic-markdown-elements">Basic Markdown Elements</h2>

<h3 id="headings-creating-document-structure">Headings: Creating Document Structure</h3>

<p>Markdown provides six levels of headings, each serving a specific purpose in document hierarchy:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Primary Title (H1)</span>
<span class="gu">## Section Headers (H2)</span>
<span class="gu">### Subsection Headers (H3)</span>
<span class="gu">#### Minor Headers (H4)</span>
<span class="gu">##### Small Headers (H5)</span>
<span class="gu">###### Smallest Headers (H6)</span>
</code></pre></div></div>

<p><strong>Output:</strong></p>

<h1 id="primary-title-h1">Primary Title (H1)</h1>
<h2 id="section-headers-h2">Section Headers (H2)</h2>
<h3 id="subsection-headers-h3">Subsection Headers (H3)</h3>
<h4 id="minor-headers-h4">Minor Headers (H4)</h4>
<h5 id="small-headers-h5">Small Headers (H5)</h5>
<h6 id="smallest-headers-h6">Smallest Headers (H6)</h6>

<h3 id="paragraphs-and-line-breaks">Paragraphs and Line Breaks</h3>

<p>Understanding paragraph formatting is crucial for readable content:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is a standard paragraph. Text flows naturally within paragraph boundaries.

A blank line separates paragraphs, creating distinct content blocks.

For line breaks within paragraphs,  
add two spaces at the end of a line  
to create soft breaks without paragraph separation.
</code></pre></div></div>

<p><strong>Output:</strong></p>

<p>This is a standard paragraph. Text flows naturally within paragraph boundaries.</p>

<p>A blank line separates paragraphs, creating distinct content blocks.</p>

<p>For line breaks within paragraphs,<br />
add two spaces at the end of a line<br />
to create soft breaks without paragraph separation.</p>

<hr />

<h2 id="extended-markdown-features">Extended Markdown Features</h2>

<h3 id="text-styling-and-emphasis">Text Styling and Emphasis</h3>

<p>Create visual hierarchy and emphasis with various text formatting options:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="ge">*Italic text*</span> or _italic text_
<span class="gs">**Bold text**</span> or __bold text__
<span class="ges">***Bold and italic***</span> or ___bold and italic___
~~Strikethrough text~~
<span class="nt">&lt;mark&gt;</span>Highlighted text<span class="nt">&lt;/mark&gt;</span>
Regular text with <span class="nt">&lt;sup&gt;</span>superscript<span class="nt">&lt;/sup&gt;</span> and <span class="nt">&lt;sub&gt;</span>subscript<span class="nt">&lt;/sub&gt;</span>
</code></pre></div></div>

<p><strong>Output:</strong></p>

<p><em>Italic text</em> or <em>italic text</em><br />
<strong>Bold text</strong> or <strong>bold text</strong><br />
<strong><em>Bold and italic</em></strong> or <strong><em>bold and italic</em></strong><br />
<del>Strikethrough text</del><br />
<mark>Highlighted text</mark><br />
Regular text with <sup>superscript</sup> and <sub>subscript</sub></p>

<h3 id="code-display">Code Display</h3>

<h4 id="inline-code">Inline Code</h4>

<p>Use backticks for <code class="language-plaintext highlighter-rouge">inline code</code> within sentences:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Use the <span class="sb">`console.log()`</span> function for debugging JavaScript applications.
</code></pre></div></div>

<p><strong>Output:</strong> Use the <code class="language-plaintext highlighter-rouge">console.log()</code> function for debugging JavaScript applications.</p>

<h4 id="code-blocks">Code Blocks</h4>
<p>To display a larger block of code you can wrap your code in three <code class="language-plaintext highlighter-rouge">`</code> characters.</p>
<ul>
  <li>You can also specify the language of your code block by adding the language name after the three <code class="language-plaintext highlighter-rouge">`</code> characters.</li>
</ul>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// JavaScript example with syntax highlighting</span>
<span class="kd">function</span> <span class="nf">greetUser</span><span class="p">(</span><span class="nx">name</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="s2">`Hello, </span><span class="p">${</span><span class="nx">name</span><span class="p">}</span><span class="s2">! Welcome to Markdown.`</span><span class="p">;</span>
<span class="p">}</span>

<span class="kd">const</span> <span class="nx">message</span> <span class="o">=</span> <span class="nf">greetUser</span><span class="p">(</span><span class="dl">"</span><span class="s2">Developer</span><span class="dl">"</span><span class="p">);</span>
<span class="nx">console</span><span class="p">.</span><span class="nf">log</span><span class="p">(</span><span class="nx">message</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Python example
</span><span class="k">def</span> <span class="nf">calculate_area</span><span class="p">(</span><span class="n">radius</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Calculate the area of a circle.</span><span class="sh">"""</span>
    <span class="kn">import</span> <span class="n">math</span>
    <span class="k">return</span> <span class="n">math</span><span class="p">.</span><span class="n">pi</span> <span class="o">*</span> <span class="n">radius</span> <span class="o">**</span> <span class="mi">2</span>

<span class="n">area</span> <span class="o">=</span> <span class="nf">calculate_area</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Circle area: </span><span class="si">{</span><span class="n">area</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h2 id="advanced-formatting">Advanced Formatting</h2>

<h3 id="links-and-navigation">Links and Navigation</h3>

<p>Create various types of links for enhanced navigation:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nv">External link</span><span class="p">](</span><span class="sx">https://blog.webdevsimplified.com</span><span class="p">)</span>
<span class="p">[</span><span class="nv">Relative link</span><span class="p">](</span><span class="sx">/2023-06/markdown-crash-course</span><span class="p">)</span>
<span class="p">[</span><span class="nv">Reference link</span><span class="p">][</span><span class="ss">1</span><span class="p">]</span>
<span class="nv">&lt;https://direct-url-display.com&gt;</span>

<span class="p">[</span><span class="ss">1</span><span class="p">]:</span> <span class="sx">https://example.com</span> <span class="nn">"Reference link tooltip"</span>
</code></pre></div></div>

<p><strong>Output:</strong></p>

<p><a href="https://blog.webdevsimplified.com">External link</a><br />
<a href="/2023-06/markdown-crash-course">Relative link</a><br />
<a href="https://example.com" title="Reference link tooltip">Reference link</a><br />
<a href="https://direct-url-display.com">https://direct-url-display.com</a></p>

<h3 id="images-and-media">Images and Media</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">![</span><span class="nv">Descriptive alt text</span><span class="p">](</span><span class="sx">/assets/images/google.png</span> <span class="nn">"The Google Logo"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/google.png" alt="Descriptive alt text" title="The Google Logo" /></p>

<h3 id="blockquotes-and-citations">Blockquotes and Citations</h3>

<p>Create elegant quotations and nested content:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gt">&gt; "The best way to predict the future is to create it."</span>
<span class="gt">&gt; — Peter Drucker</span>
<span class="gt">
&gt; Primary quotation with important information</span>
<span class="gt">&gt;&gt; Nested quotation for additional context</span>
<span class="gt">&gt;&gt;&gt; Deep nesting for complex hierarchies</span>
</code></pre></div></div>

<p><strong>Output:</strong></p>

<blockquote>
  <p>“The best way to predict the future is to create it.”<br />
— Peter Drucker</p>
</blockquote>

<blockquote>
  <p>Primary quotation with important information</p>
  <blockquote>
    <p>Nested quotation for additional context</p>
    <blockquote>
      <p>Deep nesting for complex hierarchies</p>
    </blockquote>
  </blockquote>
</blockquote>

<h3 id="lists-and-organization">Lists and Organization</h3>

<h4 id="unordered-lists">Unordered Lists</h4>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">-</span> <span class="gs">**Primary item**</span> with emphasis
<span class="p">-</span> Secondary item
<span class="p">  -</span> Nested sub-item
<span class="p">  -</span> Another sub-item
<span class="p">    -</span> Deep nesting example
<span class="p">-</span> Final primary item
</code></pre></div></div>

<p><strong>Output:</strong></p>

<ul>
  <li><strong>Primary item</strong> with emphasis</li>
  <li>Secondary item
    <ul>
      <li>Nested sub-item</li>
      <li>Another sub-item
        <ul>
          <li>Deep nesting example</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Final primary item</li>
</ul>

<h4 id="ordered-lists">Ordered Lists</h4>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">1.</span> <span class="gs">**First step**</span> (numbers auto-increment)
<span class="p">2.</span> Second step with detailed explanation
<span class="p">   1.</span> Sub-step A
<span class="p">   2.</span> Sub-step B
<span class="p">3.</span> Final step
</code></pre></div></div>

<p><strong>Output:</strong></p>

<ol>
  <li><strong>First step</strong> (numbers auto-increment)</li>
  <li>Second step with detailed explanation
    <ol>
      <li>Sub-step A</li>
      <li>Sub-step B</li>
    </ol>
  </li>
  <li>Final step</li>
</ol>

<h4 id="task-lists">Task Lists</h4>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">-</span> [x] ✅ Completed task
<span class="p">-</span> [x] ✅ Another finished item
<span class="p">-</span> [ ] ⏳ Pending task
<span class="p">-</span> [ ] ⏳ Future task
</code></pre></div></div>

<p><strong>Output:</strong></p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" checked="checked" />✅ Completed task</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" checked="checked" />✅ Another finished item</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />⏳ Pending task</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />⏳ Future task</li>
</ul>

<h3 id="tables-and-data-presentation">Tables and Data Presentation</h3>

<ul>
  <li>Start with a header row in which each column is separated by a <code class="language-plaintext highlighter-rouge">|</code> character.</li>
  <li>Below the first row you need to add a row where each column consists of at least three <code class="language-plaintext highlighter-rouge">-</code>s and optionally a <code class="language-plaintext highlighter-rouge">:</code> character on either side of the <code class="language-plaintext highlighter-rouge">-</code>s.
    <ul>
      <li>The <code class="language-plaintext highlighter-rouge">:</code> character is used to align the text in the column.</li>
      <li>If you add a <code class="language-plaintext highlighter-rouge">:</code> character on the left side of the <code class="language-plaintext highlighter-rouge">-</code>s then the text will be left aligned.</li>
      <li>If you add a <code class="language-plaintext highlighter-rouge">:</code> character on the right side of the <code class="language-plaintext highlighter-rouge">-</code>s then the text will be right aligned.</li>
      <li>If you add a <code class="language-plaintext highlighter-rouge">:</code> character on both sides of the <code class="language-plaintext highlighter-rouge">-</code>s then the text will be center aligned.</li>
    </ul>
  </li>
  <li>Finally, you can continue to add rows to your table with the same format as your first row.</li>
</ul>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>| Feature | Description | Status |
|:--------|:------------|-------:|
| <span class="gs">**Basic Syntax**</span> | Core Markdown elements | ✅ Complete |
| <span class="gs">**Extended Features**</span> | GitHub Flavored Markdown | ✅ Complete |
| <span class="gs">**Advanced Topics**</span> | Complex formatting | 🔄 In Progress |
| <span class="gs">**Best Practices**</span> | Professional guidelines | ⏳ Planned |
</code></pre></div></div>

<p><strong>Output:</strong></p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: left">Description</th>
      <th style="text-align: right">Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Basic Syntax</strong></td>
      <td style="text-align: left">Core Markdown elements</td>
      <td style="text-align: right">✅ Complete</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Extended Features</strong></td>
      <td style="text-align: left">GitHub Flavored Markdown</td>
      <td style="text-align: right">✅ Complete</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Advanced Topics</strong></td>
      <td style="text-align: left">Complex formatting</td>
      <td style="text-align: right">🔄 In Progress</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Best Practices</strong></td>
      <td style="text-align: left">Professional guidelines</td>
      <td style="text-align: right">⏳ Planned</td>
    </tr>
  </tbody>
</table>

<h3 id="horizontal-rules-and-separators">Horizontal Rules and Separators</h3>

<p>Create visual breaks in your content:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content above separator
<span class="p">
---
</span>
Content between separators
<span class="p">
***
</span>
Content below separator
</code></pre></div></div>

<p><strong>Output:</strong></p>

<p>Content above separator</p>

<hr />

<p>Content between separators</p>

<hr />

<p>Content below separator</p>

<hr />

<h2 id="quick-reference">Quick Reference</h2>

<h3 id="essential-syntax">Essential Syntax</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Headers              → # H1, ## H2, ### H3</span>
<span class="ge">*Emphasis*</span>             → <span class="ge">*italic*</span>, <span class="gs">**bold**</span>, <span class="ges">***both***</span>
<span class="sb">`Code`</span>                 → <span class="sb">`inline`</span> or <span class="sb">```block```</span>
<span class="p">[</span><span class="nv">Links</span><span class="p">](</span><span class="sx">url</span><span class="p">)</span>           → <span class="p">[</span><span class="nv">text</span><span class="p">](</span><span class="sx">url</span><span class="p">)</span>
<span class="p">![</span><span class="nv">Images</span><span class="p">](</span><span class="sx">url</span><span class="p">)</span>         → !<span class="p">[</span><span class="nv">alt</span><span class="p">](</span><span class="sx">url</span><span class="p">)</span>
<span class="gt">&gt; Blockquotes          → &gt; text</span>
<span class="p">-</span> Lists                → - item or 1. item
| Tables |             → | col1 | col2 |
---                    → Horizontal rule
</code></pre></div></div>

<h3 id="github-flavored-markdown">GitHub Flavored Markdown</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~~Strikethrough~~      → ~~text~~
<span class="p">-</span> [ ] Tasks            → - [ ] todo, - [x] done
</code></pre></div></div>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Mastering Markdown enables you to create professional, readable documentation with minimal effort. This guide provides the foundation for beautiful content creation across platforms like GitHub, documentation sites, and blogs.</p>

<p><strong>Happy writing!</strong> 📝✨</p>

<hr />

<p><em>Last updated: July 4, 2025</em><br />
<em>Version: 1.0</em></p>]]></content><author><name></name></author><category term="[&quot;random&quot;]" /><summary type="html"><![CDATA[Markdown]]></summary></entry><entry><title type="html">Google Gemini updates: Flash 1.5, Gemma 2 and Project Astra</title><link href="https://aayush9753.in/blog/2024/google-gemini-updates-flash-15-gemma-2-and-project-astra/" rel="alternate" type="text/html" title="Google Gemini updates: Flash 1.5, Gemma 2 and Project Astra" /><published>2024-05-14T00:00:00+00:00</published><updated>2024-05-14T00:00:00+00:00</updated><id>https://aayush9753.in/blog/2024/google-gemini-updates-flash-15-gemma-2-and-project-astra</id><content type="html" xml:base="https://aayush9753.in/blog/2024/google-gemini-updates-flash-15-gemma-2-and-project-astra/"><![CDATA[<p>May 14, 2024
</p>
<p>We’re introducing a series of updates across the Gemini family of models, including the new 1.5 Flash, our lightweight model for speed and efficiency, and Project Astra, our vision for the future of AI assistants.</p>
<p>In December, we launched our first natively multimodal model Gemini 1.0 in three sizes: Ultra, Pro and Nano. Just a few months later we released 1.5 Pro, with enhanced performance and a breakthrough long context window of 1 million tokens.</p>
<p>Developers and enterprise customers have been putting 1.5 Pro to use in incredible ways and finding its long context window, multimodal reasoning capabilities and impressive overall performance incredibly useful.</p>
<p>We know from user feedback that some applications need lower latency and a lower cost to serve. This inspired us to keep innovating, so today, we’re introducing Gemini 1.5 Flash: a model that’s lighter-weight than 1.5 Pro, and designed to be fast and efficient to serve at scale.</p>
<p>Both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. And now, 1.5 Pro is also available with a 2 million token context window via waitlist to developers using the API and to Google Cloud customers.</p>
<p>We’re also introducing updates across the Gemini family of models, announcing our next generation of open models, Gemma 2, and sharing progress on the future of AI assistants, with Project Astra.</p>
<p>Context lengths of leading foundation models compared with Gemini 1.5’s 2 million token capability</p>
<p>1.5 Flash is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It’s optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve and features our breakthrough long context window.</p>
<p>While it’s a lighter weight model than 1.5 Pro, it’s highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size.</p>
<p>The new Gemini 1.5 Flash model is optimized for speed and efficiency, is highly capable of multimodal reasoning and features our breakthrough long context window.</p>
<p>1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more. This is because it’s been trained by 1.5 Pro through a process called “distillation,” where the most essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.</p>
<p>Read more about 1.5 Flash in our updated Gemini 1.5 technical report, on the Gemini technology page, and learn about 1.5 Flash’s availability and pricing.</p>
<p>Over the last few months, we’ve significantly improved 1.5 Pro, our best model for general performance across a wide range of tasks.</p>
<p>Beyond extending its context window to 2 million tokens, we’ve enhanced its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances. We see strong improvements on public and internal benchmarks for each of these tasks.</p>
<p>1.5 Pro can now follow increasingly complex and nuanced instructions, including ones that specify product-level behavior involving role, format and style. We’ve improved control over the model’s responses for specific use cases, like crafting the persona and response style of a chat agent or automating workflows through multiple function calls. And we’ve enabled users to steer model behavior by setting system instructions.</p>
<p>We added audio understanding in the Gemini API and Google AI Studio, so 1.5 Pro can now reason across image and audio for videos uploaded in Google AI Studio. And we’re now integrating 1.5 Pro into Google products, including Gemini Advanced and in Workspace apps.</p>
<p>Read more about 1.5 Pro in our updated Gemini 1.5 technical report and on the Gemini technology page.</p>
<p>Gemini Nano is expanding beyond text-only inputs to include images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do: not just through text, but also through sight, sound and spoken language.</p>
<p>Read more about Gemini 1.0 Nano on Android.</p>
<p>Today, we’re also sharing a series of updates to Gemma, our family of open models built from the same research and technology used to create the Gemini models.</p>
<p>We’re announcing Gemma 2, our next generation of open models for responsible AI innovation. Gemma 2 has a new architecture designed for breakthrough performance and efficiency, and will be available in new sizes.</p>
<p>The Gemma family is also expanding with PaliGemma, our first vision-language model inspired by PaLI-3. And we’ve upgraded our Responsible Generative AI Toolkit with LLM Comparator for evaluating the quality of model responses.</p>
<p>Read more on the Developer blog.</p>
<p>As part of Google DeepMind’s mission to build AI responsibly to benefit humanity, we’ve always wanted to develop universal AI agents that can be helpful in everyday life. That’s why today, we’re sharing our progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).</p>
<p>To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do, and take in and remember what it sees and hears to understand context and take action. It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay.</p>
<p>While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge. Over the past few years, we’ve been working to improve how our models perceive, reason and converse to make the pace and quality of interaction feel more natural.</p>
<p>Building on Gemini, we’ve developed prototype agents that can process information faster by continuously encoding video frames, combining the video and speech input into a timeline of events, and caching this information for efficient recall.</p>
<p>By leveraging our leading speech models, we also enhanced how they sound, giving the agents a wider range of intonations. These agents can better understand the context they’re being used in, and respond quickly, in conversation.</p>
<p>With technology like this, it’s easy to envision a future where people could have an expert AI assistant by their side, through a phone or glasses. And some of these capabilities are coming to Google products, like the Gemini app and web experience, later this year.</p>
<p>We’ve made incredible progress so far with our family of Gemini models, and we’re always striving to advance the state-of-the-art even further. By investing in a relentless production line of innovation, we’re able to explore new ideas at the frontier, while also unlocking the possibility of new and exciting Gemini use cases.</p>
<p>Learn more about Gemini and its capabilities.
</p>

]]></content><author><name></name></author><category term="external-posts" /><category term="google" /><summary type="html"><![CDATA[We’re sharing updates across our Gemini family of models and a glimpse of Project Astra, our vision for the future of AI assistants.]]></summary></entry><entry><title type="html">Displaying External Posts on Your al-folio Blog</title><link href="https://aayush9753.in/blog/2022/displaying-external-posts-on-your-al-folio-blog/" rel="alternate" type="text/html" title="Displaying External Posts on Your al-folio Blog" /><published>2022-04-23T23:20:09+00:00</published><updated>2022-04-23T23:20:09+00:00</updated><id>https://aayush9753.in/blog/2022/displaying-external-posts-on-your-al-folio-blog</id><content type="html" xml:base="https://aayush9753.in/blog/2022/displaying-external-posts-on-your-al-folio-blog/"><![CDATA[<h3>External Posts on Your al-folio Blog</h3>
<p>If you prefer publishing blog posts on medium.com or other external sources, starting with version v0.5.0, <a href="https://github.com/alshedivat/al-folio">al-folio</a> lets you display your external posts in the blog feed of your website! 🎉🎉</p>
<p>Configuring external sources is super simple. After upgrading to v0.5.0, just add the following section to your _config.yml:</p>
<pre>external_sources:<br />  - name: medium.com  # name of the source (arbitrary string)<br />    rss_url: <a href="https://medium.com/@al-folio/feed">https://medium.com/@&lt;your-medium-username&gt;/feed</a></pre>
<p>The example above adds your medium.com blog post feed as an external source. You can also add any other RSS feed as a source.</p>
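<p>As a minimal sketch of listing more than one source, assuming a second feed at the hypothetical address https://example.com/feed.xml, the same section could look like this. It uses only the <code>name</code> and <code>rss_url</code> keys from the example above; the second entry's values are placeholders for illustration, not defaults shipped with al-folio.</p>
<pre>external_sources:<br />  - name: medium.com  # name of the source (arbitrary string)<br />    rss_url: https://medium.com/@&lt;your-medium-username&gt;/feed<br />  - name: example.com  # hypothetical second source, any label works<br />    rss_url: https://example.com/feed.xml  # placeholder URL, replace with a real RSS feed</pre>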
<p>Any questions or suggestions? 👉 Start <a href="https://github.com/alshedivat/al-folio/discussions">a discussion on GitHub</a>!</p>
]]></content><author><name></name></author><category term="jekyll" /><category term="jekyll-themes" /><category term="personal-blog" /><category term="blog" /><category term="academic" /><category term="medium" /></entry></feed>