AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings.

The overall generation process is summarised as follows (a short code sketch mapping these stages onto the `diffusers` pipeline follows the list):

1. Given a text input \\(\boldsymbol{x}\\), two text encoder models are used to compute the text embeddings: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap), and the text-encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5).

2. The CLAP and Flan-T5 text embeddings are projected to a shared embedding space through individual linear projections. In the `diffusers` implementation, these projections are defined by the [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).

3. A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) language model (LM) is used to auto-regressively generate a sequence of \\(N\\) new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings.

4. The generated embedding vectors \\(\boldsymbol{E}_{1:N}\\) and Flan-T5 text embeddings \\(\boldsymbol{E}_{2}\\) are used as cross-attention conditioning in the LDM, which *de-noises* a random latent via a reverse diffusion process. The LDM is run in the reverse diffusion process for a total of \\(T\\) inference steps:

$$
\boldsymbol{z}_{t} = \text{LDM}\left(\boldsymbol{z}_{t-1} | \boldsymbol{E}_{1:N}, \boldsymbol{E}_{2}\right) \qquad \text{for } t = 1, \dots, T
$$

where the initial latent variable \\(\boldsymbol{z}_{0}\\) is drawn from a normal distribution \\(\mathcal{N} \left(\boldsymbol{0}, \boldsymbol{I} \right)\\). The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel) of the LDM is unique in the sense that it takes **two** sets of cross-attention embeddings, \\(\boldsymbol{E}_{1:N}\\) from the GPT2 language model, and \\(\boldsymbol{E}_{2}\\) from Flan-T5, as opposed to one cross-attention conditioning as in most other LDMs.

5. The final de-noised latents \\(\boldsymbol{z}_{T}\\) are passed to the VAE decoder to recover the Mel spectrogram \\(\boldsymbol{s}\\).

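
To make the mapping between these stages and the `diffusers` implementation concrete, here is a minimal sketch that loads the pipeline and inspects the component behind each step. The checkpoint name (`cvssp/audioldm2`) and the attribute names are assumptions based on the `AudioLDM2Pipeline` API, so treat this as a sketch rather than a definitive reference:

```python
from diffusers import AudioLDM2Pipeline

# Minimal sketch: load the pipeline and inspect which component implements each step
# (checkpoint and attribute names assumed from the diffusers AudioLDM2Pipeline API)
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")

print(type(pipe.text_encoder))      # step 1: CLAP text branch
print(type(pipe.text_encoder_2))    # step 1: Flan-T5 text encoder
print(type(pipe.projection_model))  # step 2: linear projections to the shared space
print(type(pipe.language_model))    # step 3: GPT2 LM generating the N new embeddings
print(type(pipe.unet))              # step 4: LDM UNet with two cross-attention inputs
print(type(pipe.vae))               # step 5: VAE decoder recovering the Mel spectrogram
print(type(pipe.vocoder))           # vocoder converting the spectrogram to a waveform
```

Calling the pipeline with a text prompt chains these components in exactly the order of the list above.
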
For more details on the use of SDPA in `diffusers`, refer to the corresponding [documentation](https://huggingface.co/docs/diffusers/optimization/torch2.0).

Only 4 seconds to generate! In practice, you will only have to compile the UNet once, and then get faster inference for all subsequent generations.
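
As a rough sketch of what that looks like in practice (the checkpoint name, prompt, and settings below are illustrative assumptions, not the exact benchmark configuration):

```python
import torch
from diffusers import AudioLDM2Pipeline

# Illustrative sketch: half-precision pipeline on the GPU with a compiled UNet
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# torch.compile (PyTorch 2.0+) traces and optimises the UNet forward pass
pipe.unet = torch.compile(pipe.unet)

prompt = "Relaxing music with the sound of ocean waves"

# The first call pays the one-off compilation cost...
audio = pipe(prompt, num_inference_steps=200).audios[0]

# ...subsequent calls re-use the compiled UNet and run faster
audio = pipe(prompt, num_inference_steps=200).audios[0]
```
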
Memory-saving tricks, such as half-precision and CPU offload, can be applied to reduce peak memory usage across the different checkpoint sizes.
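
A minimal sketch of these two tricks combined, assuming the `cvssp/audioldm2` checkpoint and a machine with a CUDA GPU (the `accelerate` package is required for CPU offload):

```python
import torch
from diffusers import AudioLDM2Pipeline

# Load the weights in half precision to roughly halve the memory footprint
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)

# Keep each sub-model on the CPU and move it to the GPU only while it is in use,
# trading a little speed for a much lower peak GPU memory usage
pipe.enable_model_cpu_offload()

audio = pipe("A dog barking in the distance", num_inference_steps=200).audios[0]
```
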
Blog post by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi). Many thanks to [Vaibhav Srivastav](https://huggingface.co/reach-vb)
and [Sayak Paul](https://huggingface.co/sayakpaul) for their constructive comments. Spectrogram image source: [Getting to Know the Mel Spectrogram](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0).