High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach1* Andreas Blattmann1* Dominik Lorenz1 Patrick Esser
Björn Ommer1
1Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany
Runway ML
https://github.com/CompVis/latent-diffusion
Abstract
By decomposing the image formation process into a se-
quential application of denoising autoencoders, diffusion
models (DMs) achieve state-of-the-art synthesis results on
image data and beyond. Additionally, their formulation al-
lows for a guiding mechanism to control the image gen-
eration process without retraining. However, since these
models typically operate directly in pixel space, optimiza-
tion of powerful DMs often consumes hundreds of GPU
days and inference is expensive due to sequential evalu-
ations. To enable DM training on limited computational
resources while retaining their quality and flexibility, we
apply them in the latent space of powerful pretrained au-
toencoders. In contrast to previous work, training diffusion
models on such a representation allows for the first time
to reach a near-optimal point between complexity reduc-
tion and detail preservation, greatly boosting visual fidelity.
By introducing cross-attention layers into the model archi-
tecture, we turn diffusion models into powerful and flexi-
ble generators for general conditioning inputs such as text
or bounding boxes and high-resolution synthesis becomes
possible in a convolutional manner. Our latent diffusion
models (LDMs) achieve new state-of-the-art scores for im-
age inpainting and class-conditional image synthesis and
highly competitive performance on various tasks, includ-
ing text-to-image synthesis, unconditional image generation
and super-resolution, while significantly reducing computa-
tional requirements compared to pixel-based DMs.
1. Introduction
Image synthesis is one of the computer vision fields with
the most spectacular recent development, but also among
those with the greatest computational demands. Espe-
cially high-resolution synthesis of complex, natural scenes
is presently dominated by scaling up likelihood-based mod-
els, potentially containing billions of parameters in autore-
gressive (AR) transformers [66,67]. In contrast, the promis-
ing results of GANs [3, 27, 40] have been revealed to be
mostly confined to data with comparably limited variability
as their adversarial learning procedure does not easily scale
to modeling complex, multi-modal distributions. Recently,
diffusion models [82], which are built from a hierarchy of
denoising autoencoders, have shown to achieve impressive
*The first two authors contributed equally to this work.
Figure 1. Boosting the upper bound on achievable quality with
less aggressive downsampling. Panels: Input | ours (f = 4): PSNR 27.4,
R-FID 0.58 | DALL-E (f = 8): PSNR 22.8, R-FID 32.01 | VQGAN (f = 16):
PSNR 19.9, R-FID 4.98. Since diffusion models offer excel-
lent inductive biases for spatial data, we do not need the heavy spa-
tial downsampling of related generative models in latent space, but
can still greatly reduce the dimensionality of the data via suitable
autoencoding models, see Sec. 3. Images are from the DIV2K [1]
validation set, evaluated at 512² px. We denote the spatial down-
sampling factor by f. Reconstruction FIDs [29] and PSNR are
calculated on ImageNet-val [12]; see also Tab. 8.
results in image synthesis [30,85] and beyond [7,45,48,57],
and define the state-of-the-art in class-conditional image
synthesis [15,31] and super-resolution [72]. Moreover, even
unconditional DMs can readily be applied to tasks such
as inpainting and colorization [85] or stroke-based syn-
thesis [53], in contrast to other types of generative mod-
els [19,46,69]. Being likelihood-based models, they do not
exhibit mode-collapse and training instabilities as GANs
and, by heavily exploiting parameter sharing, they can
model highly complex distributions of natural images with-
out involving billions of parameters as in AR models [67].
Democratizing High-Resolution Image Synthesis DMs
belong to the class of likelihood-based models, whose
mode-covering behavior makes them prone to spend ex-
cessive amounts of capacity (and thus compute resources)
on modeling imperceptible details of the data [16, 73]. Al-
though the reweighted variational objective [30] aims to ad-
dress this by undersampling the initial denoising steps, DMs
are still computationally demanding, since training and
evaluating such a model requires repeated function evalu-
ations (and gradient computations) in the high-dimensional
space of RGB images. As an example, training the most
powerful DMs often takes hundreds of GPU days ( e.g. 150 -
1000 V100 days in [15]) and repeated evaluations on a noisy
version of the input space render also inference expensive,
arXiv:2112.10752v2 [cs.CV] 13 Apr 2022
so that producing 50k samples takes approximately 5 days
[15] on a single A100 GPU. This has two consequences for
the research community and users in general: Firstly, train-
ing such a model requires massive computational resources
only available to a small fraction of the field, and leaves a
huge carbon footprint [65, 86]. Secondly, evaluating an al-
ready trained model is also expensive in time and memory,
since the same model architecture must run sequentially for
a large number of steps ( e.g. 25 - 1000 steps in [15]).
To increase the accessibility of this powerful model class
and at the same time reduce its significant resource con-
sumption, a method is needed that reduces the computa-
tional complexity for both training and sampling. Reducing
the computational demands of DMs without impairing their
performance is, therefore, key to enhance their accessibility.
Departure to Latent Space Our approach starts with
the analysis of already trained diffusion models in pixel
space: Fig. 2 shows the rate-distortion trade-off of a trained
model. As with any likelihood-based model, learning can
be roughly divided into two stages: First is a perceptual
compression stage which removes high-frequency details
but still learns little semantic variation. In the second stage,
the actual generative model learns the semantic and concep-
tual composition of the data ( semantic compression ). We
thus aim to first find a perceptually equivalent, but compu-
tationally more suitable space , in which we will train diffu-
sion models for high-resolution image synthesis.
Following common practice [11, 23, 66, 67, 96], we sep-
arate training into two distinct phases: First, we train
an autoencoder which provides a lower-dimensional (and
thereby efficient) representational space which is perceptu-
ally equivalent to the data space. Importantly, and in con-
trast to previous work [23,66], we do not need to rely on ex-
cessive spatial compression, as we train DMs in the learned
latent space, which exhibits better scaling properties with
respect to the spatial dimensionality. The reduced complex-
ity also provides efficient image generation from the latent
space with a single network pass. We dub the resulting
model class Latent Diffusion Models (LDMs).
A notable advantage of this approach is that we need to
train the universal autoencoding stage only once and can
therefore reuse it for multiple DM trainings or to explore
possibly completely different tasks [81]. This enables effi-
cient exploration of a large number of diffusion models for
various image-to-image and text-to-image tasks. For the lat-
ter, we design an architecture that connects transformers to
the DM’s UNet backbone [71] and enables arbitrary types
of token-based conditioning mechanisms, see Sec. 3.3.
In sum, our work makes the following contributions :
(i) In contrast to purely transformer-based approaches
[23, 66], our method scales more gracefully to higher dimen-
sional data and can thus (a) work on a compression level
which provides more faithful and detailed reconstructions
than previous work (see Fig. 1) and (b) can be efficiently
Figure 2. Illustrating perceptual and semantic compression: Most
bits of a digital image correspond to imperceptible details. While
DMs allow to suppress this semantically meaningless information
by minimizing the responsible loss term, gradients (during train-
ing) and the neural network backbone (training and inference) still
need to be evaluated on all pixels, leading to superfluous compu-
tations and unnecessarily expensive optimization and inference.
We propose latent diffusion models (LDMs) as an effective gener-
ative model and a separate mild compression stage that only elim-
inates imperceptible details. Data and images from [30].
applied to high-resolution synthesis of megapixel images.
(ii) We achieve competitive performance on multiple
tasks (unconditional image synthesis, inpainting, stochastic
super-resolution) and datasets while significantly lowering
computational costs. Compared to pixel-based diffusion ap-
proaches, we also significantly decrease inference costs.
(iii) We show that, in contrast to previous work [93]
which learns both an encoder/decoder architecture and a
score-based prior simultaneously, our approach does not re-
quire a delicate weighting of reconstruction and generative
abilities. This ensures extremely faithful reconstructions
and requires very little regularization of the latent space.
(iv) We find that for densely conditioned tasks such
as super-resolution, inpainting and semantic synthesis, our
model can be applied in a convolutional fashion and render
large, consistent images of 1024² px.
(v) Moreover, we design a general-purpose conditioning
mechanism based on cross-attention, enabling multi-modal
training. We use it to train class-conditional, text-to-image
and layout-to-image models.
(vi) Finally, we release pretrained latent diffusion
and autoencoding models at
https://github.com/CompVis/latent-diffusion
which might be reusable for various tasks besides the training of DMs [81].
2. Related Work
Generative Models for Image Synthesis The high di-
mensional nature of images presents distinct challenges
to generative modeling. Generative Adversarial Networks
(GAN) [27] allow for efficient sampling of high resolution
images with good perceptual quality [3, 42], but are diffi-
cult to optimize [2, 28, 54] and struggle to capture the full
data distribution [55]. In contrast, likelihood-based meth-
ods emphasize good density estimation which renders op-
timization more well-behaved. Variational autoencoders
(VAE) [46] and flow-based models [18, 19] enable efficient
synthesis of high resolution images [9, 44, 92], but sam-
ple quality is not on par with GANs. While autoregressive
models (ARM) [6, 10, 94, 95] achieve strong performance
in density estimation, computationally demanding architec-
tures [97] and a sequential sampling process limit them to
low resolution images. Because pixel based representations
of images contain barely perceptible, high-frequency de-
tails [16,73], maximum-likelihood training spends a dispro-
portionate amount of capacity on modeling them, resulting
in long training times. To scale to higher resolutions, several
two-stage approaches [23,67,101,103] use ARMs to model
a compressed latent image space instead of raw pixels.
Recently, Diffusion Probabilistic Models (DM) [82],
have achieved state-of-the-art results in density estimation
[45] as well as in sample quality [15]. The generative power
of these models stems from a natural fit to the inductive bi-
ases of image-like data when their underlying neural back-
bone is implemented as a UNet [15, 30, 71, 85]. The best
synthesis quality is usually achieved when a reweighted ob-
jective [30] is used for training. In this case, the DM corre-
sponds to a lossy compressor and allows trading image qual-
ity for compression capabilities. Evaluating and optimizing
these models in pixel space, however, has the downside of
low inference speed and very high training costs. While
the former can be partially addressed by advanced sampling
strategies [47, 75, 84] and hierarchical approaches [31, 93],
training on high-resolution image data always requires
calculating expensive gradients. We address both drawbacks
with our proposed LDMs, which work on a compressed la-
tent space of lower dimensionality. This renders training
computationally cheaper and speeds up inference with al-
most no reduction in synthesis quality (see Fig. 1).
Two-Stage Image Synthesis To mitigate the shortcom-
ings of individual generative approaches, a lot of research
[11, 23, 67, 70, 101, 103] has gone into combining the
strengths of different methods into more efficient and per-
formant models via a two-stage approach. VQ-VAEs [67,
101] use autoregressive models to learn an expressive prior
over a discretized latent space. [66] extend this approach to
text-to-image generation by learning a joint distribution
over discretized image and text representations. More gen-
erally, [70] uses conditionally invertible networks to pro-
vide a generic transfer between latent spaces of diverse do-
mains. Different from VQ-VAEs, VQGANs [23, 103] em-
ploy a first stage with an adversarial and perceptual objec-
tive to scale autoregressive transformers to larger images.
However, the high compression rates required for feasible
ARM training, which introduces billions of trainable pa-
rameters [23, 66], limit the overall performance of such approaches, and less compression comes at the price of high
computational cost [23, 66]. Our work prevents such trade-
offs, as our proposed LDMs scale more gently to higher
dimensional latent spaces due to their convolutional back-
bone. Thus, we are free to choose the level of compression
which optimally mediates between learning a powerful first
stage, without leaving too much perceptual compression up
to the generative diffusion model while guaranteeing high-
fidelity reconstructions (see Fig. 1).
While approaches to jointly [93] or separately [80] learn
an encoding/decoding model together with a score-based
prior exist, the former still require a difficult weighting be-
tween reconstruction and generative capabilities [11] and
are outperformed by our approach (Sec. 4), and the latter
focus on highly structured images such as human faces.
3. Method
To lower the computational demands of training diffu-
sion models towards high-resolution image synthesis, we
observe that although diffusion models allow ignoring
perceptually irrelevant details by undersampling the corre-
sponding loss terms [30], they still require costly function
evaluations in pixel space, which causes huge demands in
computation time and energy resources.
We propose to circumvent this drawback by introducing
an explicit separation of the compressive from the genera-
tive learning phase (see Fig. 2). To achieve this, we utilize
an autoencoding model which learns a space that is percep-
tually equivalent to the image space, but offers significantly
reduced computational complexity.
Such an approach offers several advantages: (i) By leav-
ing the high-dimensional image space, we obtain DMs
which are computationally much more efficient because
sampling is performed on a low-dimensional space. (ii) We
exploit the inductive bias of DMs inherited from their UNet
architecture [71], which makes them particularly effective
for data with spatial structure and therefore alleviates the
need for aggressive, quality-reducing compression levels as
required by previous approaches [23, 66]. (iii) Finally, we
obtain general-purpose compression models whose latent
space can be used to train multiple generative models and
which can also be utilized for other downstream applica-
tions such as single-image CLIP-guided synthesis [25].
3.1. Perceptual Image Compression
Our perceptual compression model is based on previous
work [23] and consists of an autoencoder trained by a com-
bination of a perceptual loss [106] and a patch-based [33]
adversarial objective [20, 23, 103]. This ensures that the re-
constructions are confined to the image manifold by enforc-
ing local realism and avoids blurriness introduced by relying
solely on pixel-space losses such as $L_2$ or $L_1$ objectives.
More precisely, given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB
space, the encoder $\mathcal{E}$ encodes $x$ into a latent representa-
tion $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the im-
age from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where
$z \in \mathbb{R}^{h \times w \times c}$. Importantly, the encoder downsamples the
image by a factor $f = H/h = W/w$, and we investigate
different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.
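As a concrete illustration of the factor $f$ (illustrative code, not from the paper; the latent channel count c = 4 is an assumed example value, and the helper name is hypothetical):

```python
# Hypothetical helper: latent shape for a downsampling factor f = 2^m,
# where f = H/h = W/w as defined above.

def latent_shape(H, W, f, c):
    """Spatial shape of the latent z for an H x W RGB image and factor f."""
    assert H % f == 0 and W % f == 0, "f must divide H and W"
    return (H // f, W // f, c)

# A 256x256 image with f = 8 (c = 4 latent channels chosen for illustration):
shape = latent_shape(256, 256, 8, 4)
print(shape)                                   # (32, 32, 4)
# Reduction in the number of values the diffusion model must process,
# relative to the 256*256*3 raw pixel values:
reduction = (256 * 256 * 3) / (shape[0] * shape[1] * shape[2])
print(reduction)                               # 48.0
```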
In order to avoid arbitrarily high-variance latent spaces,
we experiment with two different kinds of regularizations.
The first variant, KL-reg., imposes a slight KL-penalty to-
wards a standard normal on the learned latent, similar to a
VAE [46, 69], whereas VQ-reg. uses a vector quantization
layer [96] within the decoder. This model can be interpreted
as a VQGAN [23] but with the quantization layer absorbed
by the decoder. Because our subsequent DM is designed
to work with the two-dimensional structure of our learned
latent space $z = \mathcal{E}(x)$, we can use relatively mild compres-
sion rates and achieve very good reconstructions. This is
in contrast to previous works [23, 66], which relied on an
arbitrary 1D ordering of the learned space $z$ to model its
distribution autoregressively and thereby ignored much of
the inherent structure of $z$. Hence, our compression model
preserves details of $x$ better (see Tab. 8). The full objective
and training details can be found in the supplement.
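A minimal sketch of the nearest-codebook lookup behind such a vector quantization layer (in the spirit of [96], not the paper's implementation; codebook size and dimension are arbitrary stand-ins):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector in z (N x d) to its nearest codebook entry (K x d)."""
    # Squared Euclidean distances between every latent and every code.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
    idx = d2.argmin(axis=1)            # index of the nearest code per latent
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))    # K = 16 codes of dimension d = 4
z = rng.normal(size=(5, 4))            # 5 latent vectors
z_q, idx = vector_quantize(z, codebook)
print(z_q.shape, idx.shape)            # (5, 4) (5,)
```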
3.2. Latent Diffusion Models
Diffusion Models [82] are probabilistic models designed to
learn a data distribution $p(x)$ by gradually denoising a nor-
mally distributed variable, which corresponds to learning
the reverse process of a fixed Markov Chain of length $T$.
For image synthesis, the most successful models [15,30,72]
rely on a reweighted variant of the variational lower bound
on $p(x)$, which mirrors denoising score-matching [85].
These models can be interpreted as an equally weighted
sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1 \dots T$,
which are trained to predict a denoised variant of their input
$x_t$, where $x_t$ is a noisy version of the input $x$. The corre-
sponding objective can be simplified to (Sec. B)

$L_{DM} = \mathbb{E}_{x, \epsilon \sim \mathcal{N}(0,1), t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \right]$,   (1)

with $t$ uniformly sampled from $\{1, \dots, T\}$.
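A toy sketch of one training step under Eq. (1), assuming a standard variance-preserving forward process with a linear beta schedule (the schedule and the zero-predictor stand-in for $\epsilon_\theta$ are illustrative, not the paper's choices):

```python
import numpy as np

# Forward process: x_t = sqrt(abar_t) * x + sqrt(1 - abar_t) * eps,
# with abar_t the cumulative product of (1 - beta_t). The model is
# trained to regress the noise eps from x_t and t.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative linear schedule
abar = np.cumprod(1.0 - betas)         # \bar{alpha}_t

def dm_loss(x, eps_model):
    t = rng.integers(0, T)             # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x.shape) # eps ~ N(0, I)
    x_t = np.sqrt(abar[t]) * x + np.sqrt(1.0 - abar[t]) * eps
    return ((eps - eps_model(x_t, t)) ** 2).mean()   # ||eps - eps_theta||^2

x = rng.standard_normal((3, 32, 32))   # toy "image" batch
# A predictor that always outputs zeros stands in for eps_theta:
loss = dm_loss(x, eps_model=lambda x_t, t: np.zeros_like(x_t))
print(loss)                            # close to 1.0 for the zero predictor
```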
Generative Modeling of Latent Representations With
our trained perceptual compression models consisting of $\mathcal{E}$
and $\mathcal{D}$, we now have access to an efficient, low-dimensional
latent space in which high-frequency, imperceptible details
are abstracted away. Compared to the high-dimensional
pixel space, this space is more suitable for likelihood-based
generative models, as they can now (i) focus on the impor-
tant, semantic bits of the data and (ii) train in a lower di-
mensional, computationally much more efficient space.
Unlike previous work that relied on autoregressive,
attention-based transformer models in a highly compressed,
discrete latent space [23,66,103], we can take advantage of
image-specific inductive biases that our model offers. This
Figure 3. We condition LDMs either via concatenation or by a
more general cross-attention mechanism: conditioning inputs (semantic
maps, text, other representations) are mapped into the denoising U-Net,
which operates in latent space between the pixel-space encoder and
decoder. See Sec. 3.3.
includes the ability to build the underlying UNet primar-
ily from 2D convolutional layers, and further focusing the
objective on the perceptually most relevant bits using the
reweighted bound, which now reads

$L_{LDM} := \mathbb{E}_{\mathcal{E}(x), \epsilon \sim \mathcal{N}(0,1), t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]$.   (2)

The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a
time-conditional UNet [71]. Since the forward process is
fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training,
and samples from $p(z)$ can be decoded to image space with
a single pass through $\mathcal{D}$.
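The two-stage setup of Eq. (2) can be sketched as follows; average pooling and nearest-neighbour upsampling stand in for the learned $\mathcal{E}$ and $\mathcal{D}$ purely for illustration (they are not the paper's trained autoencoder):

```python
import numpy as np

f = 8                                             # downsampling factor

def E(x):   # stand-in encoder: f x f average pooling
    H, W = x.shape
    return x.reshape(H // f, f, W // f, f).mean(axis=(1, 3))

def D(z):   # stand-in decoder: nearest-neighbour upsampling
    return np.repeat(np.repeat(z, f, axis=0), f, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256))
z = E(x)
print(z.shape)                                    # (32, 32): 64x fewer positions
# Noise is applied to z, not x, so the denoiser runs in the small space:
eps = rng.standard_normal(z.shape)
abar_t = 0.5                                      # some \bar{alpha}_t value
z_t = np.sqrt(abar_t) * z + np.sqrt(1 - abar_t) * eps
# eps_theta(z_t, t) would be the UNet; the loss of Eq. (2) is on z_t.
x_hat = D(z)                                      # single decoder pass to pixels
print(x_hat.shape)                                # (256, 256)
```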
3.3. Conditioning Mechanisms
Similar to other types of generative models [56, 83],
diffusion models are in principle capable of modeling
conditional distributions of the form $p(z|y)$. This can
be implemented with a conditional denoising autoencoder
$\epsilon_\theta(z_t, t, y)$ and paves the way to controlling the synthesis
process through inputs $y$ such as text [68], semantic maps
[33, 61] or other image-to-image translation tasks [34].
In the context of image synthesis, however, combining
the generative power of DMs with other types of condition-
ings beyond class-labels [15] or blurred variants of the input
image [72] is so far an under-explored area of research.
We turn DMs into more flexible conditional image gener-
ators by augmenting their underlying UNet backbone with
the cross-attention mechanism [97], which is effective for
learning attention-based models of various input modali-
ties [35, 36]. To pre-process $y$ from various modalities (such
as language prompts) we introduce a domain-specific en-
coder $\tau_\theta$ that projects $y$ to an intermediate representation
$\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate
layers of the UNet via a cross-attention layer implementing
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, with

$Q = W^{(i)}_Q \cdot \varphi_i(z_t), \quad K = W^{(i)}_K \cdot \tau_\theta(y), \quad V = W^{(i)}_V \cdot \tau_\theta(y).$

Here, $\varphi_i(z_t) \in \mathbb{R}^{N \times d^i_\epsilon}$ denotes a (flattened) intermediate
representation of the UNet implementing $\epsilon_\theta$, and $W^{(i)}_V \in \mathbb{R}^{d \times d^i_\epsilon}$,
$W^{(i)}_Q \in \mathbb{R}^{d \times d_\tau}$ and $W^{(i)}_K \in \mathbb{R}^{d \times d_\tau}$ are learnable pro-
jection matrices [36, 97]. See Fig. 3 for a visual depiction.

Figure 4. Samples from LDMs trained on CelebA-HQ [39], FFHQ [41],
LSUN-Churches [102], LSUN-Bedrooms [102] and class-
conditional ImageNet [12], each with a resolution of 256×256. Best
viewed when zoomed in. For more samples cf. the supplement.
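A shape-level sketch of this cross-attention layer (illustrative dimensions; written in the row-vector convention $\varphi W$ rather than the paper's $W \varphi$, so the projection matrices are transposed relative to the text):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau, W_Q, W_K, W_V):
    Q = phi @ W_Q            # (N, d): queries from the UNet feature map
    K = tau @ W_K            # (M, d): keys from the conditioning tau(y)
    V = tau @ W_V            # (M, d): values from the conditioning tau(y)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, M) attention weights
    return A @ V             # each spatial query attends over the M tokens of y

rng = np.random.default_rng(0)
N, M, d_eps, d_tau, d = 64, 7, 32, 16, 24
phi = rng.normal(size=(N, d_eps))    # flattened UNet features phi_i(z_t)
tau = rng.normal(size=(M, d_tau))    # e.g. 7 text-token embeddings tau(y)
W_Q = rng.normal(size=(d_eps, d))
W_K = rng.normal(size=(d_tau, d))
W_V = rng.normal(size=(d_tau, d))
out = cross_attention(phi, tau, W_Q, W_K, W_V)
print(out.shape)                     # (64, 24)
```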
Based on image-conditioning pairs, we then learn the
conditional LDM via

$L_{LDM} := \mathbb{E}_{\mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0,1), t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2 \right]$,   (3)

where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This
conditioning mechanism is flexible as $\tau_\theta$ can be parameter-
ized with domain-specific experts, e.g. (unmasked) trans-
formers [97] when $y$ are text prompts (see Sec. 4.3.1).
4. Experiments
LDMs provide a means for flexible and computationally
tractable diffusion-based image synthesis across various image
modalities, which we empirically show in the following.
Firstly, however, we analyze the gains of our models com-
pared to pixel-based diffusion models in both training and
inference. Interestingly, we find that LDMs trained in VQ-
regularized latent spaces sometimes achieve better sample
quality, even though the reconstruction capabilities of VQ-
regularized first stage models slightly fall behind those of
their continuous counterparts, cf. Tab. 8. A visual compari-
son between the effects of first stage regularization schemes
on LDM training and their generalization abilities to resolu-
tions > 256² can be found in Appendix D.1. In E.2 we list
details on architecture, implementation, training and evalu-
ation for all results presented in this section.
4.1. On Perceptual Compression Tradeoffs
This section analyzes the behavior of our LDMs with dif-
ferent downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbre-
viated as LDM-f, where LDM-1 corresponds to pixel-based
DMs). To obtain a comparable test-field, we fix the com-
putational resources to a single NVIDIA A100 for all ex-
periments in this section and train all models for the same
number of steps and with the same number of parameters.
Tab. 8 shows hyperparameters and reconstruction perfor-
mance of the first stage models used for the LDMs compared
in this section. Fig. 6 shows sample quality as a func-
tion of training progress for 2M steps of class-conditional
models on the ImageNet [12] dataset. We see that i) small
downsampling factors for LDM-{1,2} result in slow train-
ing progress, whereas ii) overly large values of $f$ cause stag-
nating fidelity after comparably few training steps. Revis-
iting the analysis above (Fig. 1 and 2) we attribute this to
i) leaving most of perceptual compression to the diffusion
model and ii) too strong first stage compression resulting
in information loss and thus limiting the achievable qual-
ity. LDM-{4-16} strike a good balance between efficiency
and perceptually faithful results, which manifests in a sig-
nificant FID [29] gap of 38 between pixel-based diffusion
(LDM-1) and LDM-8 after 2M training steps.
In Fig. 7, we compare models trained on CelebA-
HQ [39] and ImageNet in terms of sampling speed for differ-
ent numbers of denoising steps with the DDIM sampler [84]
and plot it against FID-scores [29]. LDM-{4-8} outper-
form models with unsuitable ratios of perceptual and con-
ceptual compression. Especially compared to pixel-based
LDM-1, they achieve much lower FID scores while simulta-
neously significantly increasing sample throughput. Com-
plex datasets such as ImageNet require reduced compres-
sion rates to avoid reducing quality. In summary, LDM-4
and -8 offer the best conditions for achieving high-quality
synthesis results.
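One deterministic DDIM step ($\eta = 0$), the mechanism behind the reduced step counts discussed above, can be sketched as follows (the short $\bar\alpha$ schedule and the zero noise predictor are illustrative stand-ins):

```python
import numpy as np

def ddim_step(z_t, eps_pred, abar_t, abar_prev):
    # Predicted clean latent from the eps-parameterization:
    z0 = (z_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Deterministic (eta = 0) move to the previous, less noisy timestep:
    return np.sqrt(abar_prev) * z0 + np.sqrt(1 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
z = rng.standard_normal((32, 32))            # start from pure latent noise
# Walk a short, strided \bar{alpha} schedule (noisy -> clean) instead of
# all T steps; the specific values are illustrative:
abars = [0.1, 0.5, 0.9, 0.99, 0.9999]
for abar_t, abar_prev in zip(abars[:-1], abars[1:]):
    eps_pred = np.zeros_like(z)              # dummy eps_theta(z_t, t)
    z = ddim_step(z, eps_pred, abar_t, abar_prev)
print(z.shape)                               # (32, 32)
```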
4.2. Image Generation with Latent Diffusion
We train unconditional models of 256² images on
CelebA-HQ [39], FFHQ [41], LSUN-Churches and
-Bedrooms [102] and evaluate i) sample quality and ii)
their coverage of the data manifold using FID [29] and
Precision-and-Recall [50]. Tab. 1 summarizes our re-
sults. On CelebA-HQ, we report a new state-of-the-art FID
of 5.11, outperforming previous likelihood-based models as
well as GANs. We also outperform LSGM [93] where a la-
tent diffusion model is trained jointly together with the first
stage. In contrast, we train diffusion models in a fixed space
Text-to-Image Synthesis on LAION. 1.45B Model.
Prompts: ’A street sign that reads “Latent Diffusion” ’, ’A zombie in the
style of Picasso’, ’An image of an animal half mouse half octopus’,
’An illustration of a slightly conscious neural network’, ’A painting of a
squirrel eating a burger’, ’A watercolor painting of a chair that looks
like an octopus’, ’A shirt with the inscription: “I love generative models!” ’
Figure 5. Samples for user-defined text prompts from our model for
text-to-image synthesis, LDM-8 (KL), which was trained on the
LAION [78] database. Samples generated with 200 DDIM steps and
$\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.
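The guidance used for Fig. 5 combines a conditional and an unconditional noise prediction with scale $s$; a minimal sketch (the combination rule follows [32]; all tensors here are random stand-ins for model outputs):

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, s):
    # s = 1 recovers the conditional prediction; s > 1 amplifies the
    # direction pointing from unconditional toward conditional.
    return eps_uncond + s * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((8, 8))    # stand-in eps_theta(z_t, t)
eps_c = rng.standard_normal((8, 8))    # stand-in eps_theta(z_t, t, tau(y))
print(np.allclose(guided_eps(eps_u, eps_c, 1.0), eps_c))   # True
out = guided_eps(eps_u, eps_c, 10.0)   # s = 10 as in the Fig. 5 caption
print(out.shape)                       # (8, 8)
```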
Figure 6. Analyzing the training of class-conditional LDMs with
different downsampling factors $f$ over 2M train steps on the Im-
ageNet dataset. Pixel-based LDM-1 requires substantially larger
train times compared to models with larger downsampling factors
(LDM-{4-16}). Too much perceptual compression as in LDM-32
limits the overall sample quality. All models are trained on a sin-
gle NVIDIA A100 with the same computational budget. Results
obtained with 100 DDIM steps [84] and $\eta = 0$.
Figure 7. Comparing LDMs with varying compression on the
CelebA-HQ (left) and ImageNet (right) datasets. Different mark-
ers indicate {10, 20, 50, 100, 200} sampling steps using DDIM,
from right to left along each line. The dashed line shows the FID
scores for 200 steps, indicating the strong performance of LDM-
{4-8}. FID scores assessed on 5000 samples. All models were
trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.
and avoid the difficulty of weighing reconstruction quality
against learning the prior over the latent space, see Fig. 1-2.
We outperform prior diffusion based approaches on all
but the LSUN-Bedrooms dataset, where our score is close
to ADM [15], despite utilizing half its parameters and re-
quiring 4-times less train resources (see Appendix E.3.5).

CelebA-HQ 256×256
Method                    FID↓   Prec.↑  Recall↑
DC-VAE [63]               15.8   -       -
VQGAN+T. [23] (k=400)     10.2   -       -
PGGAN [39]                 8.0   -       -
LSGM [93]                  7.22  -       -
UDM [43]                   7.16  -       -
LDM-4 (ours, 500-s†)       5.11  0.72    0.49

FFHQ 256×256
Method                    FID↓   Prec.↑  Recall↑
ImageBART [21]             9.57  -       -
U-Net GAN (+aug) [77]     10.9 (7.6)  -  -
UDM [43]                   5.54  -       -
StyleGAN [41]              4.16  0.71    0.46
ProjectedGAN [76]          3.08  0.65    0.46
LDM-4 (ours, 200-s)        4.98  0.73    0.50

LSUN-Churches 256×256
Method                    FID↓   Prec.↑  Recall↑
DDPM [30]                  7.89  -       -
ImageBART [21]             7.32  -       -
PGGAN [39]                 6.42  -       -
StyleGAN [41]              4.21  -       -
StyleGAN2 [42]             3.86  -       -
ProjectedGAN [76]          1.59  0.61    0.44
LDM-8* (ours, 200-s)       4.02  0.64    0.52

LSUN-Bedrooms 256×256
Method                    FID↓   Prec.↑  Recall↑
ImageBART [21]             5.51  -       -
DDPM [30]                  4.9   -       -
UDM [43]                   4.57  -       -
StyleGAN [41]              2.35  0.59    0.48
ADM [15]                   1.90  0.66    0.51
ProjectedGAN [76]          1.52  0.61    0.34
LDM-4 (ours, 200-s)        2.95  0.66    0.48

Table 1. Evaluation metrics for unconditional image synthesis.
CelebA-HQ results reproduced from [43, 63, 100], FFHQ from
[42, 43]. †: N-s refers to N sampling steps with the DDIM [84]
sampler. *: trained in KL-regularized latent space. Additional re-
sults can be found in the supplementary.
Text-Conditional Image Synthesis on MS-COCO 256×256
Method              FID↓    IS↑          Nparams  Sampling details
CogView† [17]       27.10   18.20        4B       self-ranking, rejection rate 0.017
LAFITE† [109]       26.94   26.02        75M      -
GLIDE∗ [59]         12.24   -            6B       277 DDIM steps, c.f.g. [32] s=3
Make-A-Scene∗ [26]  11.84   -            4B       c.f.g. for AR models [98] s=5
LDM-KL-8            23.31   20.03±0.33   1.45B    250 DDIM steps
LDM-KL-8-G          12.63   30.29±0.42   1.45B    250 DDIM steps, c.f.g. [32] s=1.5

Table 2. Evaluation of text-conditional image synthesis on the
256×256-sized MS-COCO [51] dataset: with 250 DDIM [84]
steps our model is on par with the most recent diffusion [59] and
autoregressive [26] methods despite using significantly fewer pa-
rameters. †/∗: numbers taken from [109]/[26].
Moreover, LDMs consistently improve upon GAN-based
methods in Precision and Recall, thus confirming the ad-
vantages of their mode-covering likelihood-based training
objective over adversarial approaches. In Fig. 4 we also
show qualitative results on each dataset.
Figure 8. Layout-to-image synthesis with an LDM on COCO [4],
see Sec. 4.3.1. Quantitative evaluation in the supplement D.3.
4.3. Conditional Latent Diffusion
4.3.1 Transformer Encoders for LDMs
By introducing cross-attention based conditioning into
LDMs we open them up for various conditioning modali-
ties previously unexplored for diffusion models. For text-
to-image modeling, we train a 1.45B parameter
KL-regularized LDM conditioned on language prompts on
LAION-400M [78]. We employ the BERT-tokenizer [14]
and implement τθ as a transformer [97] to infer a latent
code which is mapped into the UNet via (multi-head) cross-
attention (Sec. 3.3). This combination of domain-specific
experts for learning a language representation and visual
synthesis results in a powerful model, which generalizes
well to complex, user-defined text prompts, cf. Fig. 8 and 5.
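The mechanism can be sketched as follows; this is a single-head NumPy toy with random projection weights standing in for the learned, multi-head cross-attention layers of the actual UNet, and the shapes of `phi` and `tau_y` are illustrative assumptions rather than the paper's exact dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(phi, tau_y, d=64):
    """Single-head cross-attention sketch: queries come from the
    flattened UNet feature map phi (n x d_phi), keys and values from
    the conditioning token sequence tau_y (m x d_tau)."""
    d_phi, d_tau = phi.shape[1], tau_y.shape[1]
    # random projection weights stand in for learned W_q, W_k, W_v
    W_q = rng.standard_normal((d_phi, d)) / np.sqrt(d_phi)
    W_k = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)
    W_v = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)
    Q, K, V = phi @ W_q, tau_y @ W_k, tau_y @ W_v
    logits = Q @ K.T / np.sqrt(d)                  # (n, m)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # (n, d)

phi = rng.standard_normal((16 * 16, 320))  # flattened 16x16 feature map
tau_y = rng.standard_normal((77, 768))     # token sequence from tau_theta
out = cross_attention(phi, tau_y)          # shape (256, 64)
```

Every spatial position of the feature map attends to every conditioning token, which is what makes the mechanism agnostic to the conditioning modality.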
For quantitative analysis, we follow prior work and evaluate
text-to-image generation on the MS-COCO [51] validation
set, where our model improves upon powerful AR [17, 66]
and GAN-based [109] methods, cf. Tab. 2. We note that ap-
plying classifier-free diffusion guidance [32] greatly boosts
sample quality, such that the guided LDM-KL-8-G is on par
with the recent state-of-the-art AR [26] and diffusion mod-
els [59] for text-to-image synthesis, while substantially re-
ducing parameter count. To further analyze the flexibility of
the cross-attention based conditioning mechanism we also
train models to synthesize images based on semantic lay-
outs on OpenImages [49], and finetune on COCO [4], see
Fig. 8. See Sec. D.3 for the quantitative evaluation and im-
plementation details.
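The classifier-free guidance used for LDM-KL-8-G combines an unconditional and a conditional noise estimate with a scale s at every sampling step. A minimal sketch, with `toy_eps` as a hypothetical stand-in for the UNet denoiser (the real model is far more complex):

```python
import numpy as np

def toy_eps(x, cond=None):
    """Hypothetical stand-in for the denoiser eps_theta (a UNet in the
    paper): a fixed linear map, shifted when conditioning is present."""
    return 0.5 * x + (0.1 if cond is not None else 0.0)

def guided_eps(x, cond, s=1.5):
    """Classifier-free guidance: extrapolate from the unconditional
    towards the conditional prediction with scale s; s = 1 recovers
    the plain conditional prediction."""
    eps_uncond = toy_eps(x, cond=None)
    eps_cond = toy_eps(x, cond=cond)
    return eps_uncond + s * (eps_cond - eps_uncond)

x = np.zeros((2, 4, 8, 8))  # a batch of noisy latents
eps = guided_eps(x, cond="a text prompt", s=1.5)
```

Scales s > 1 push samples towards the conditional mode at some cost in diversity, consistent with the FID/IS trade-off visible in Tab. 2.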
Lastly, following prior work [3, 15, 21, 23], we evaluate
our best-performing class-conditional ImageNet models
with f ∈ {4, 8} from Sec. 4.1 in Tab. 3, Fig. 4 and
Sec. D.4. Here we outperform the state-of-the-art diffusion
model ADM [15] while significantly reducing computa-
tional requirements and parameter count, cf. Tab. 18.
4.3.2 Convolutional Sampling Beyond 256²
By concatenating spatially aligned conditioning informa-
tion to the input of εθ, LDMs can serve as efficient general-
purpose image-to-image translation models.

Class-conditional ImageNet 256×256
Method           FID↓    IS↑           Precision↑  Recall↑  Nparams
BigGAN-deep [3]  6.95    203.6±2.6     0.87        0.28     340M   -
ADM [15]         10.94   100.98        0.69        0.63     554M   250 DDIM steps
ADM-G [15]       4.59    186.7         0.82        0.52     608M   250 DDIM steps
LDM-4 (ours)     10.56   103.49±1.24   0.71        0.62     400M   250 DDIM steps
LDM-4-G (ours)   3.60    247.67±5.59   0.87        0.48     400M   250 steps, c.f.g. [32], s=1.5

Table 3. Comparison of a class-conditional ImageNet LDM with
recent state-of-the-art methods for class-conditional image gener-
ation on ImageNet [12]. A more detailed comparison with addi-
tional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes
classifier-free guidance with a scale s as proposed in [32].

We use this
to train models for semantic synthesis, super-resolution
(Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthe-
sis, we use images of landscapes paired with semantic maps
[23, 61] and concatenate downsampled versions of the se-
mantic maps with the latent image representation of an f = 4
model (VQ-reg., see Tab. 8). We train on an input resolution
of 256² (crops from 384²) but find that our model general-
izes to larger resolutions and can generate images up to the
megapixel regime when evaluated in a convolutional man-
ner (see Fig. 9). We exploit this behavior to also apply the
super-resolution models in Sec. 4.4 and the inpainting mod-
els in Sec. 4.5 to generate large images between 512² and
1024². For this application, the signal-to-noise ratio (in-
duced by the scale of the latent space) significantly affects
the results. In Sec. D.1 we illustrate this when learning an
LDM on (i) the latent space as provided by an f = 4 model
(KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by
the component-wise standard deviation.
The latter, in combination with classifier-free guid-
ance [32], also enables the direct synthesis of >256² im-
ages for the text-conditional LDM-KL-8-G as in Fig. 13.
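Rescaling the latent space by its component-wise standard deviation can be sketched as below; we assume one scale per latent channel, estimated once from a batch of encoded training examples, with `z_batch` standing in for first-stage encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for first-stage encoder outputs z = E(x) over a batch;
# real latents would come from the trained autoencoder.
z_batch = 7.0 * rng.standard_normal((64, 4, 32, 32))

# Estimate one scale per latent channel (component-wise std) once,
# then divide it out so the diffusion prior sees ~unit-variance data,
# matching the signal-to-noise ratio the noise schedule assumes.
scale = z_batch.std(axis=(0, 2, 3), keepdims=True)  # shape (1, 4, 1, 1)
z_rescaled = z_batch / scale

# Before decoding, the scale is simply multiplied back in.
z_restored = z_rescaled * scale
```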
Figure 9. An LDM trained on 256² resolution can generalize to
larger resolutions (here: 512×1024) for spatially conditioned tasks
such as semantic synthesis of landscape images. See Sec. 4.3.2.
4.4. Super-Resolution with Latent Diffusion
LDMs can be efficiently trained for super-resolution by
directly conditioning on low-resolution images via concate-
nation (cf. Sec. 3.3). In a first experiment, we follow SR3
Figure 10. ImageNet 64→256 super-resolution on ImageNet-Val
(columns: bicubic, LDM-SR, SR3).
LDM-SR has advantages at rendering realistic textures but SR3
can synthesize more coherent fine structures. See appendix for
additional samples and cropouts. SR3 results from [72].
[72] and fix the image degradation to a bicubic interpola-
tion with 4×-downsampling and train on ImageNet follow-
ing SR3’s data processing pipeline. We use the f = 4 au-
toencoding model pretrained on OpenImages (VQ-reg., cf.
Tab. 8) and concatenate the low-resolution conditioning y
and the inputs to the UNet, i.e. τθ is the identity. Our quali-
tative and quantitative results (see Fig. 10 and Tab. 5) show
competitive performance and LDM-SR outperforms SR3
in FID while SR3 has a better IS. A simple image regres-
sion model achieves the highest PSNR and SSIM scores;
however these metrics do not align well with human per-
ception [106] and favor blurriness over imperfectly aligned
high frequency details [72]. Further, we conduct a user
study comparing the pixel-baseline with LDM-SR. We fol-
low SR3 [72] where human subjects were shown a low-res
image in between two high-res images and asked for pref-
erence. The results in Tab. 4 affirm the good performance
of LDM-SR. PSNR and SSIM can be pushed by using a
post-hoc guiding mechanism [15] and we implement this
image-based guider via a perceptual loss, see Sec. D.6.
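The concatenation conditioning used here (τθ the identity, so the low-resolution image is simply stacked with the noisy input along the channel axis) can be sketched as follows; the nearest-neighbour upsampling is our simplification, not a detail from the paper's pipeline:

```python
import numpy as np

def sr_unet_input(z_t, y_lowres):
    """Stack the low-res conditioning y with the noisy input z_t along
    the channel axis; the UNet then just takes more input channels.
    Nearest-neighbour upsampling of y is a simplification here."""
    h, w = z_t.shape[2:]
    yh, yw = y_lowres.shape[2:]
    y_up = y_lowres.repeat(h // yh, axis=2).repeat(w // yw, axis=3)
    return np.concatenate([z_t, y_up], axis=1)  # (b, c + c_y, h, w)

z_t = np.zeros((1, 3, 64, 64))  # noisy high-resolution input
y = np.ones((1, 3, 16, 16))     # 4x-downsampled conditioning image
inp = sr_unet_input(z_t, y)     # shape (1, 6, 64, 64)
```

Because the conditioning is spatially aligned and the network is convolutional, the same model can later be applied at larger resolutions than it was trained on.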
User Study                 SR on ImageNet            Inpainting on Places
                           Pixel-DM (f=1)   LDM-4    LAMA [88]   LDM-4
Task 1: Preference vs GT↑  16.0%            30.4%    13.6%       21.0%
Task 2: Preference Score↑  29.4%            70.6%    31.9%       68.1%

Table 4. Task 1: Subjects were shown ground truth and generated
image and asked for preference. Task 2: Subjects had to decide
between two generated images. More details in Sec. E.3.6.
Since the bicubic degradation process does not generalize
well to images which do not follow this pre-processing, we
also train a generic model, LDM-BSR, by using more di-
verse degradations. The results are shown in Sec. D.6.1.

Method                           FID↓       IS↑    PSNR↑      SSIM↑      Nparams  Throughput [samples/s]∗
Image Regression [72]            15.2       121.1  27.9       0.801      625M     N/A
SR3 [72]                         5.2        180.1  26.4       0.762      625M     N/A
LDM-4 (ours, 100 steps)          2.8†/4.8‡  166.3  24.4±3.8   0.69±0.14  169M     4.62
LDM-4 (ours, big, 100 steps)     2.4†/4.3‡  174.9  24.7±4.1   0.71±0.15  552M     4.5
LDM-4 (ours, 50 steps, guiding)  4.4†/6.4‡  153.7  25.8±3.7   0.74±0.12  184M     0.38

Table 5. ×4 upscaling results on ImageNet-Val. (256²); †: FID
features computed on validation split, ‡: FID features computed
on train split; ∗: assessed on an NVIDIA A100.
                        train throughput  sampling throughput†  train+val    FID@2k
Model (reg.-type)       samples/sec.      @256      @512        hours/epoch  epoch 6
LDM-1 (no first stage)  0.11              0.26      0.07        20.66        24.74
LDM-4 (KL, w/ attn)     0.32              0.97      0.34        7.66         15.21
LDM-4 (VQ, w/ attn)     0.33              0.97      0.34        7.04         14.99
LDM-4 (VQ, w/o attn)    0.35              0.99      0.36        6.66         15.95

Table 6. Assessing inpainting efficiency. †: Deviations from Fig. 7
due to varying GPU settings/batch sizes, cf. the supplement.
4.5. Inpainting with Latent Diffusion
Inpainting is the task of filling masked regions of an im-
age with new content, either because parts of the image
are corrupted or to replace existing but undesired content
within the image. We evaluate how our general approach
for conditional image generation compares to more special-
ized, state-of-the-art approaches for this task. Our evalua-
tion follows the protocol of LaMa [88], a recent inpainting
model that introduces a specialized architecture relying on
Fast Fourier Convolutions [8]. The exact training & evalua-
tion protocol on Places [108] is described in Sec. E.2.2.
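Conditioning for inpainting follows the same concatenation pattern as the other image-to-image tasks; a plausible sketch of assembling the spatially aligned input from an image and a binary mask (the exact mask encoding here is our assumption, not a specification from the text):

```python
import numpy as np

def inpainting_condition(image, mask):
    """Build spatially aligned conditioning for an inpainting model:
    the image with the masked region zeroed out, concatenated with
    the binary mask along the channel axis (mask == 1: region to fill)."""
    masked = image * (1.0 - mask)  # hide the region to be filled
    return np.concatenate([masked, mask], axis=1)

img = np.ones((1, 3, 32, 32))
mask = np.zeros((1, 1, 32, 32))
mask[:, :, 8:24, 8:24] = 1.0            # region to inpaint
cond = inpainting_condition(img, mask)  # shape (1, 4, 32, 32)
```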
We first analyze the effect of different design choices for
the first stage. In particular, we compare the inpainting effi-
ciency of LDM-1 (i.e. a pixel-based conditional DM) with
LDM-4, for both KL and VQ regularizations, as well as
VQ-LDM-4 without any attention in the first stage (see Tab. 8),
where the latter reduces GPU memory for decoding at high
resolutions. For comparability, we fix the number of param-
eters for all models. Tab. 6 reports the training and sampling
throughput at resolutions 256² and 512², the total training
time in hours per epoch and the FID score on the validation
split after six epochs. Overall, we observe a speed-up of at
least 2.7× between pixel- and latent-based diffusion models,
while improving FID scores by a factor of at least 1.6×.
The comparison with other inpainting approaches in
Tab. 7 shows that our model with attention improves the
overall image quality as measured by FID over that of [88].
LPIPS between the unmasked images and our samples is
slightly higher than that of [88]. We attribute this to [88]
only producing a single result which tends to recover more
of an average image compared to the diverse results pro-
duced by our LDM cf. Fig. 21. Additionally in a user study
(Tab. 4) human subjects favor our results over those of [88].
Based on these initial results, we also trained a larger dif-
fusion model (big in Tab. 7) in the latent space of the VQ-
regularized first stage without attention. Following [15],
the UNet of this diffusion model uses attention layers on
three levels of its feature hierarchy, the BigGAN [3] residual
block for up- and downsampling and has 387M parameters
Figure 11. Qualitative results on object removal with our big, w/
ft inpainting model (left: input, right: result). For more results,
see Fig. 22.
instead of 215M. After training, we noticed a discrepancy
in the quality of samples produced at resolutions 256² and
512², which we hypothesize to be caused by the additional
attention modules. However, fine-tuning the model for half
an epoch at resolution 512² allows the model to adjust to
the new feature statistics and sets a new state-of-the-art FID
on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).
5. Limitations & Societal Impact
Limitations While LDMs significantly reduce computa-
tional requirements compared to pixel-based approaches,
their sequential sampling process is still slower than that
of GANs. Moreover, the use of LDMs can be question-
able when high precision is required: although the loss of
image quality is very small in our f = 4 autoencoding mod-
els (see Fig. 1), their reconstruction capability can become
a bottleneck for tasks that require fine-grained accuracy in
pixel space. We assume that our super-resolution models
(Sec. 4.4) are already somewhat limited in this respect.
                           40-50% masked       All samples
Method                     FID↓   LPIPS↓       FID↓   LPIPS↓
LDM-4 (ours, big, w/ ft)   9.39   0.246±0.042  1.50   0.137±0.080
LDM-4 (ours, big, w/o ft)  12.89  0.257±0.047  2.40   0.142±0.085
LDM-4 (ours, w/ attn)      11.87  0.257±0.042  2.15   0.144±0.084
LDM-4 (ours, w/o attn)     12.60  0.259±0.041  2.37   0.145±0.084
LaMa [88]†                 12.31  0.243±0.038  2.23   0.134±0.080
LaMa [88]                  12.0   0.24         2.21   0.14
CoModGAN [107]             10.4   0.26         1.82   0.15
RegionWise [52]            21.3   0.27         4.75   0.15
DeepFill v2 [104]          22.1   0.28         5.20   0.16
EdgeConnect [58]           30.5   0.28         8.37   0.16

Table 7. Comparison of inpainting performance on 30k crops of
size 512×512 from test images of Places [108]. The column 40-
50% reports metrics computed over hard examples where 40-50%
of the image region has to be inpainted. †: recomputed on our test
set, since the original test set used in [88] was not available.

Societal Impact Generative models for media like im-
agery are a double-edged sword: On the one hand, they
enable various creative applications, and in particular ap-
proaches like ours that reduce the cost of training and in-
ference have the potential to facilitate access to this tech-
nology and democratize its exploration. On the other hand,
it also means that it becomes easier to create and dissemi-
nate manipulated data or spread misinformation and spam.
In particular, the deliberate manipulation of images (“deep
fakes”) is a common problem in this context, and women in
particular are disproportionately affected by it [13, 24].
Generative models can also reveal their training data
[5, 90], which is of great concern when the data contain
sensitive or personal information and were collected with-
out explicit consent. However, the extent to which this also
applies to DMs of images is not yet fully understood.
Finally, deep learning modules tend to reproduce or ex-
acerbate biases that are already present in the data [22, 38,
91]. While diffusion models achieve better coverage of the
data distribution than e.g. GAN-based approaches, the ex-
tent to which our two-stage approach that combines adver-
sarial training and a likelihood-based objective misrepre-
sents the data remains an important research question.
For a more general, detailed discussion of the ethical
considerations of deep generative models, see e.g. [13].
6. Conclusion
We have presented latent diffusion models, a simple and
efficient way to significantly improve both the training and
sampling efficiency of denoising diffusion models with-
out degrading their quality. Based on this and our cross-
attention conditioning mechanism, our experiments could
demonstrate favorable results compared to state-of-the-art
methods across a wide range of conditional image synthesis
tasks without task-specific architectures.
This work has been supported by the German Federal Ministry for
Economic Affairs and Energy within the project ’KI-Absicherung - Safe
AI for automated driving’ and by the German Research Foundation (DFG)
project 421703927.
References
[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 chal-
lenge on single image super-resolution: Dataset and study.
In2017 IEEE Conference on Computer Vision and Pattern
Recognition Workshops, CVPR Workshops 2017, Honolulu,
HI, USA, July 21-26, 2017 , pages 1122–1131. IEEE Com-
puter Society, 2017. 1
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein gan, 2017. 3
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
scale GAN training for high fidelity natural image synthe-
sis. In Int. Conf. Learn. Represent. , 2019. 1, 2, 7, 8, 22,
28
[4] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari.
Coco-stuff: Thing and stuff classes in context. In 2018
IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2018, Salt Lake City, UT, USA, June 18-
22, 2018 , pages 1209–1218. Computer Vision Foundation /
IEEE Computer Society, 2018. 7, 20, 22
[5] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew
Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam
Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al.
Extracting training data from large language models. In
30th USENIX Security Symposium (USENIX Security 21) ,
pages 2633–2650, 2021. 9
[6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-
woo Jun, David Luan, and Ilya Sutskever. Generative pre-
training from pixels. In ICML , volume 119 of Proceedings
of Machine Learning Research , pages 1691–1703. PMLR,
2020. 3
[7] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mo-
hammad Norouzi, and William Chan. Wavegrad: Estimat-
ing gradients for waveform generation. In ICLR . OpenRe-
view.net, 2021. 1
[8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolu-
tion. In NeurIPS , 2020. 8
[9] Rewon Child. Very deep vaes generalize autoregressive
models and can outperform them on images. CoRR ,
abs/2011.10650, 2020. 3
[10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever.
Generating long sequences with sparse transformers.
CoRR , abs/1904.10509, 2019. 3
[11] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE
models. In ICLR (Poster) . OpenReview.net, 2019. 2, 3
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Fei-Fei Li. Imagenet: A large-scale hierarchical im-
age database. In CVPR , pages 248–255. IEEE Computer
Society, 2009. 1, 5, 7, 22
[13] Emily Denton. Ethical considerations of generative ai. AI
for Content Creation Workshop, CVPR, 2021. 9
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. BERT: pre-training of deep bidirec-
tional transformers for language understanding. CoRR ,
abs/1810.04805, 2018. 7
[15] Prafulla Dhariwal and Alex Nichol. Diffusion models beat
gans on image synthesis. CoRR , abs/2105.05233, 2021. 1,
2, 3, 4, 6, 7, 8, 18, 22, 25, 26, 28
[16] Sander Dieleman. Musings on typicality, 2020. 1, 3
[17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng,
Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao,
Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-
image generation via transformers. CoRR , abs/2105.13290,
2021. 6, 7
[18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice:
Non-linear independent components estimation, 2015. 3
[19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Ben-
gio. Density estimation using real NVP. In 5th Inter-
national Conference on Learning Representations, ICLR
2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings . OpenReview.net, 2017. 1, 3
[20] Alexey Dosovitskiy and Thomas Brox. Generating images
with perceptual similarity metrics based on deep networks.
In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg,
Isabelle Guyon, and Roman Garnett, editors, Adv. Neural
Inform. Process. Syst. , pages 658–666, 2016. 3
[21] Patrick Esser, Robin Rombach, Andreas Blattmann, and
Björn Ommer. Imagebart: Bidirectional context with multi-
nomial diffusion for autoregressive image synthesis. CoRR,
abs/2108.08827, 2021. 6, 7, 22
[22] Patrick Esser, Robin Rombach, and Björn Ommer. A
note on data biases in generative models. arXiv preprint
arXiv:2012.02516, 2020. 9
[23] Patrick Esser, Robin Rombach, and Björn Ommer. Taming
transformers for high-resolution image synthesis. CoRR,
abs/2012.09841, 2020. 2, 3, 4, 6, 7, 21, 22, 29, 34, 36
[24] Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and
videotape: Deep fakes and free speech delusions. Md. L.
Rev., 78:892, 2018. 9
[25] Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw:
Exploring text-to-drawing synthesis through language-
image encoders. ArXiv , abs/2106.14843, 2021. 3
[26] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin,
Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-
based text-to-image generation with human priors. CoRR ,
abs/2203.13131, 2022. 6, 7, 16
[27] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville,
and Yoshua Bengio. Generative adversarial networks.
CoRR , 2014. 1, 2
[28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent
Dumoulin, and Aaron Courville. Improved training of
wasserstein gans, 2017. 3
[29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. Gans trained by
a two time-scale update rule converge to a local nash equi-
librium. In Adv. Neural Inform. Process. Syst. , pages 6626–
6637, 2017. 1, 5, 26
[30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif-
fusion probabilistic models. In NeurIPS , 2020. 1, 2, 3, 4,
6, 17
[31] Jonathan Ho, Chitwan Saharia, William Chan, David J.
Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded
diffusion models for high fidelity image generation. CoRR ,
abs/2106.15282, 2021. 1, 3, 22
[32] Jonathan Ho and Tim Salimans. Classifier-free diffusion
guidance. In NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications , 2021. 6, 7, 16, 22,
28, 37, 38
[33] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A.
Efros. Image-to-image translation with conditional adver-
sarial networks. In CVPR , pages 5967–5976. IEEE Com-
puter Society, 2017. 3, 4
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A.
Efros. Image-to-image translation with conditional adver-
sarial networks. 2017 IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR) , pages 5967–5976,
2017. 4
[35] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste
Alayrac, Carl Doersch, Catalin Ionescu, David Ding,
Skanda Koppula, Daniel Zoran, Andrew Brock, Evan
Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick,
Andrew Zisserman, Oriol Vinyals, and João Carreira.
Perceiver IO: A general architecture for structured inputs
& outputs. CoRR, abs/2107.14795, 2021. 4
[36] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals,
Andrew Zisserman, and João Carreira. Perceiver: General
perception with iterative attention. In Marina Meila and
Tong Zhang, editors, Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, 18-24 July
2021, Virtual Event, volume 139 of Proceedings of Machine
Learning Research, pages 4651–4664. PMLR, 2021. 4, 5
[37] Manuel Jahn, Robin Rombach, and Björn Ommer. High-
resolution complex scene synthesis with transformers.
CoRR, abs/2105.06458, 2021. 20, 22, 27
[38] Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia
Manikonda, and Subbarao Kambhampati. Imperfect ima-
ganation: Implications of gans exacerbating biases on fa-
cial data augmentation and snapchat selfie lenses. arXiv
preprint arXiv:2001.09528 , 2020. 9
[39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehti-
nen. Progressive growing of gans for improved quality, sta-
bility, and variation. CoRR , abs/1710.10196, 2017. 5, 6
[40] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks.
InIEEE Conf. Comput. Vis. Pattern Recog. , pages 4401–
4410, 2019. 1
[41] T. Karras, S. Laine, and T. Aila. A style-based gener-
ator architecture for generative adversarial networks. In
2019 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR) , 2019. 5, 6
[42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
Jaakko Lehtinen, and Timo Aila. Analyzing and improv-
ing the image quality of stylegan. CoRR , abs/1912.04958,
2019. 2, 6, 28
[43] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo
Kang, and Il-Chul Moon. Score matching model for un-
bounded data score. CoRR , abs/2106.05527, 2021. 6
[44] Durk P Kingma and Prafulla Dhariwal. Glow: Generative
flow with invertible 1x1 convolutions. In S. Bengio, H. Wal-
lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R.
Garnett, editors, Advances in Neural Information Process-
ing Systems, 2018. 3
[45] Diederik P. Kingma, Tim Salimans, Ben Poole, and
Jonathan Ho. Variational diffusion models. CoRR ,
abs/2107.00630, 2021. 1, 3, 16
[46] Diederik P. Kingma and Max Welling. Auto-Encoding Vari-
ational Bayes. In 2nd International Conference on Learn-
ing Representations, ICLR , 2014. 1, 3, 4, 29
[47] Zhifeng Kong and Wei Ping. On fast sampling of diffusion
probabilistic models. CoRR , abs/2106.00132, 2021. 3
[48] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and
Bryan Catanzaro. Diffwave: A versatile diffusion model
for audio synthesis. In ICLR . OpenReview.net, 2021. 1
[49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R.
Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali,
Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio
Ferrari. The open images dataset V4: unified image classi-
fication, object detection, and visual relationship detection
at scale. CoRR , abs/1811.00982, 2018. 7, 20, 22
[50] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko
Lehtinen, and Timo Aila. Improved precision and recall
metric for assessing generative models. CoRR,
abs/1904.06991, 2019. 5, 26
[51] Tsung-Yi Lin, Michael Maire, Serge J. Belongie,
Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro
Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zit-
nick. Microsoft COCO: common objects in context. CoRR,
abs/1405.0312, 2014. 6, 7, 27
[52] Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Ais-
han Liu, Dacheng Tao, and Edwin Hancock. Region-wise
generative adversarial image inpainting for large missing
areas. ArXiv, abs/1909.12507, 2019. 9