
Commit 66c5348

sanchit-gandhi authored and kashif committed
[AudioLDM2] Blog post fixes (huggingface#1434)
* [AudioLDM2] Add diffusion tag
* update blog post
1 parent 8cab92a commit 66c5348

2 files changed, 20 insertions(+), 19 deletions(-)

2 files changed

+20
-19
lines changed

_blog.yml

Lines changed: 2 additions & 1 deletion
````diff
@@ -2752,4 +2752,5 @@
 tags:
 - guide
 - audio
-- diffusers
+- diffusers
+- diffusion
````

audioldm2.md

Lines changed: 18 additions & 18 deletions
````diff
@@ -37,7 +37,7 @@ is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio r
 
 The overall generation process is summarised as follows:
 
-1. Given a text input $\boldsymbol{x}$, two text encoder models are used to compute the text embeddings: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap), and the text-encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5)
+1. Given a text input \\(\boldsymbol{x}\\), two text encoder models are used to compute the text embeddings: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap), and the text-encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5)
 
 $$
 \boldsymbol{E}_{1} = \text{CLAP}\left(\boldsymbol{x} \right); \quad \boldsymbol{E}_{2} = \text{T5}\left(\boldsymbol{x}\right)
@@ -53,29 +53,31 @@ $$
 
 In the `diffusers` implementation, these projections are defined by the [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).
 
-3. A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) language model (LM) is used to auto-regressively generate a sequence of $N$ new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings:
+3. A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) language model (LM) is used to auto-regressively generate a sequence of \\(N\\) new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings:
 
 $$
 \boldsymbol{E}_{i} = \text{GPT2}\left(\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \boldsymbol{E}_{1:i-1}\right) \qquad \text{for } i=1,\dots,N
 $$
 
-4. The generated embedding vectors $\boldsymbol{E}_{1:N}$ and Flan-T5 text embeddings $\boldsymbol{E}_{2}$ are used as cross-attention conditioning in the LDM, which *de-noises*
-a random latent via a reverse diffusion process. The LDM is run in the reverse diffusion process for a total of $T$ inference steps:
+4. The generated embedding vectors \\(\boldsymbol{E}_{1:N}\\) and Flan-T5 text embeddings \\(\boldsymbol{E}_{2}\\) are used as cross-attention conditioning in the LDM, which *de-noises*
+a random latent via a reverse diffusion process. The LDM is run in the reverse diffusion process for a total of \\(T\\) inference steps:
 
 $$
 \boldsymbol{z}_{t} = \text{LDM}\left(\boldsymbol{z}_{t-1} | \boldsymbol{E}_{1:N}, \boldsymbol{E}_{2}\right) \qquad \text{for } t = 1, \dots, T
 $$
 
-where the initial latent variable $\boldsymbol{z}_{0}$ is drawn from a normal distribution $\mathcal{N} \left(\boldsymbol{0}, \boldsymbol{I} \right)$. The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel) of the LDM is unique in
-the sense that it takes **two** sets of cross-attention embeddings, $\boldsymbol{E}_{1:N}$ from the GPT2 langauge model, and $\boldsymbol{E}_{2}$ from Flan-T5, as opposed to one cross-attention conditioning as in most other LDMs.
+where the initial latent variable \\(\boldsymbol{z}_{0}\\) is drawn from a normal distribution \\(\mathcal{N} \left(\boldsymbol{0}, \boldsymbol{I} \right)\\).
+The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel) of the LDM is unique in
+the sense that it takes **two** sets of cross-attention embeddings, \\(\boldsymbol{E}_{1:N}\\) from the GPT2 langauge model, and \\(\boldsymbol{E}_{2}\\)
+from Flan-T5, as opposed to one cross-attention conditioning as in most other LDMs.
 
-5. The final de-noised latents $\boldsymbol{z}_{T}$ are passed to the VAE decoder to recover the Mel spectrogram $\boldsymbol{s}$:
+5. The final de-noised latents \\(\boldsymbol{z}_{T}\\) are passed to the VAE decoder to recover the Mel spectrogram \\(\boldsymbol{s}\\):
 
 $$
 \boldsymbol{s} = \text{VAE}_{\text{dec}} \left(\boldsymbol{z}_{T}\right)
 $$
 
-6. The Mel spectrogram is passed to the vocoder to obtain the output audio waveform $\mathbf{y}$:
+6. The Mel spectrogram is passed to the vocoder to obtain the output audio waveform \\(\mathbf{y}\\):
 
 $$
 \boldsymbol{y} = \text{Vocoder}\left(\boldsymbol{s}\right)
````
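For readers following along, the two hunks above describe the six-step generation process. As a companion, here is a minimal sketch that loads the pipeline and prints the sub-model behind each step. The checkpoint id `cvssp/audioldm2` and the component attribute names are assumptions based on the `diffusers` AudioLDM2 pipeline, not an excerpt from the post.

```python
# Illustrative sketch: map pipeline components to the steps listed above.
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")  # assumed checkpoint id

components = {
    "1. CLAP text branch": pipe.text_encoder,
    "1. Flan-T5 text encoder": pipe.text_encoder_2,
    "2. Shared projection": pipe.projection_model,
    "3. GPT2 language model": pipe.language_model,
    "4. LDM UNet": pipe.unet,
    "5. VAE (decoder)": pipe.vae,
    "6. Vocoder": pipe.vocoder,
}
for step, module in components.items():
    print(f"{step}: {type(module).__name__}")
```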
````diff
@@ -116,7 +118,7 @@ pipe = AudioLDM2Pipeline.from_pretrained(model_id)
 ```
 **Output:**
 ```
-Loading pipeline components...: 100%|███████████████████████████████████████████████████| 11/11 [00:01<00:00, 7.62it/s]
+Loading pipeline components...: 100%|███████████████████████████████████████████| 11/11 [00:01<00:00, 7.62it/s]
 ```
 
 The pipeline can be moved to the GPU in much the same way as a standard PyTorch nn module:
````
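The hunk above ends by noting that the pipeline moves to the GPU like any PyTorch nn module; a minimal sketch of that step follows, with the checkpoint id assumed.

```python
# Minimal sketch: device placement works as for any torch.nn.Module.
import torch
from diffusers import AudioLDM2Pipeline

model_id = "cvssp/audioldm2"  # assumed checkpoint id
pipe = AudioLDM2Pipeline.from_pretrained(model_id)

# One .to() call moves every sub-model (text encoders, GPT2, UNet, VAE, vocoder)
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipe.to(device)
```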
````diff
@@ -147,7 +149,7 @@ audio = pipe(prompt, audio_length_in_s=10.24, generator=generator).audios[0]
 
 **Output:**
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [00:13<00:00, 15.27it/s]
+100%|███████████████████████████████████████████| 200/200 [00:13<00:00, 15.27it/s]
 ```
 
 Cool! That run took about 13 seconds to generate. Let's have a listen to the output audio:
````
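The hunk header above shows the generation call (`audio_length_in_s=10.24` with a seeded generator). To make "have a listen" concrete, here is a hedged sketch of generating and then saving the waveform; the prompt string is illustrative and the 16 kHz rate matches the rate used with `Audio` elsewhere in the post.

```python
# Sketch: generate ~10 s of audio and save it to disk; prompt is illustrative.
import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2").to("cuda")  # assumed checkpoint id

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducibility

audio = pipe(prompt, audio_length_in_s=10.24, generator=generator).audios[0]

# Save as a WAV file (the post plays the vocoder output back at 16 kHz)
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

# In a notebook: from IPython.display import Audio; Audio(audio, rate=16000)
```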
````diff
@@ -177,7 +179,7 @@ audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual
 
 **Output:**
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [00:12<00:00, 16.50it/s]
+100%|███████████████████████████████████████████| 200/200 [00:12<00:00, 16.50it/s]
 ```
 
 The inference time is un-changed when using a negative prompt\\({}^1\\); we simply replace the unconditional input to the
````
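The hunk headers in this region reference generation with a `negative_prompt`. Below is a minimal sketch of that call pattern; the prompt strings are illustrative, and the re-seeding mirrors the `generator.manual_seed(...)` pattern visible in the headers.

```python
# Sketch: negative prompting steers generation away from the listed attributes.
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2").to("cuda")  # assumed checkpoint id

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"  # illustrative
negative_prompt = "Low quality, average quality"  # illustrative

generator = torch.Generator("cuda")
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    generator=generator.manual_seed(0),  # re-seed so the run is comparable to the unconditioned one
    audio_length_in_s=10.24,
).audios[0]
```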
````diff
@@ -216,7 +218,7 @@ audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual
 
 **Output:**
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [00:12<00:00, 16.60it/s]
+100%|███████████████████████████████████████████| 200/200 [00:12<00:00, 16.60it/s]
 ```
 
 For more details on the use of SDPA in `diffusers`, refer to the corresponding [documentation](https://huggingface.co/docs/diffusers/optimization/torch2.0).
````
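The hunk above closes by pointing to the SDPA documentation. With PyTorch 2.0+, `diffusers` dispatches attention to `torch.nn.functional.scaled_dot_product_attention` by default; the explicit opt-in below is a sketch that assumes the AudioLDM2 UNet exposes the same `set_attn_processor` API as the standard `UNet2DConditionModel`.

```python
# Sketch: explicit SDPA attention processor (normally not needed on torch >= 2.0).
import torch
from diffusers import AudioLDM2Pipeline
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2").to("cuda")  # assumed checkpoint id

if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
    # Assumes the AudioLDM2 UNet inherits set_attn_processor from the base UNet class
    pipe.unet.set_attn_processor(AttnProcessor2_0())
```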
````diff
@@ -248,7 +250,7 @@ Audio(audio, rate=16000)
 **Output:**
 
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [00:09<00:00, 20.94it/s]
+100%|███████████████████████████████████████████| 200/200 [00:09<00:00, 20.94it/s]
 ```
 
 <audio controls>
````
````diff
@@ -280,7 +282,7 @@ audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual
 
 **Output:**
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [01:23<00:00, 2.39it/s]
+100%|███████████████████████████████████████████| 200/200 [01:23<00:00, 2.39it/s]
 ```
 
 Great! Now that the UNet is compiled, we can now run the full diffusion process and reap the benefits of faster inference:
@@ -291,7 +293,7 @@ audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual
 
 **Output:**
 ```
-100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [00:04<00:00, 48.98it/s]
+100%|███████████████████████████████████████████| 200/200 [00:04<00:00, 48.98it/s]
 ```
 
 Only 4 seconds to generate! In practice, you will only have to compile the UNet once, and then get faster inference for
````
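The two hunks above report the one-off compilation run (~2.4 it/s) and the subsequent compiled run (~49 it/s). Here is a hedged sketch of compiling the UNet with `torch.compile`; the `mode`/`fullgraph` settings shown are one reasonable configuration, not necessarily the post's exact ones.

```python
# Sketch: compile the UNet once, then reuse it for faster denoising loops.
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2").to("cuda")  # assumed checkpoint id
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"  # illustrative
generator = torch.Generator("cuda")

# The first call pays the compilation cost; later calls run at the compiled speed
audio = pipe(prompt, generator=generator.manual_seed(0), audio_length_in_s=10.24).audios[0]
```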
````diff
@@ -451,7 +453,5 @@ saving tricks, such as half-precision and CPU offload, to reduce peak memory usa
 checkpoint sizes.
 
 Blog post by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi). Many thanks to [Vaibhav Srivastav](https://huggingface.co/reach-vb)
-and [Sayak Paul](https://huggingface.co/sayakpaul) for their constructive comments.
-
-Spectrogram image source: [Getting to Know the Mel Spectrogram](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0).
+and [Sayak Paul](https://huggingface.co/sayakpaul) for their constructive comments. Spectrogram image source: [Getting to Know the Mel Spectrogram](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0).
 Waveform image source: [Aalto Speech Processing](https://speechprocessingbook.aalto.fi/Representations/Waveform.html).
````
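The final hunk's context mentions half-precision and CPU offload as memory-saving tricks. The sketch below shows one way to apply both with `diffusers`; it assumes a CUDA device, the `accelerate` package (required for `enable_model_cpu_offload`), and the `cvssp/audioldm2` checkpoint.

```python
# Sketch: memory-saving options referenced in the closing section of the post.
import torch
from diffusers import AudioLDM2Pipeline

model_id = "cvssp/audioldm2"  # assumed checkpoint id

# Half-precision weights roughly halve GPU memory use and speed up inference
pipe = AudioLDM2Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# Keep sub-models on the CPU and move each to the GPU only while it runs
pipe.enable_model_cpu_offload()

audio = pipe("The sound of a hammer striking a wooden surface").audios[0]  # illustrative prompt
```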
