typos (#1491)

kashif · web-flow · commit 65db264cb593 · 2023-09-13T20:32:46.000+02:00
diff --git a/_blog.yml b/_blog.yml
@@ -2810,7 +2810,7 @@
     - nlp
 
 - local: wuertschen
-  title: "Introducing Würtschen: Fast Diffusion for Image Generation"
+  title: "Introducing Würstchen: Fast Diffusion for Image Generation"
   author: dome272
   thumbnail: /blog/assets/wuerstchen/thumbnail.jpg
   date: September 13, 2023
diff --git a/wuertschen.md b/wuertschen.md
@@ -1,5 +1,5 @@
 ---
-title: "Introducing Würtschen: Fast Diffusion for Image Generation" 
+title: "Introducing Würstchen: Fast Diffusion for Image Generation" 
 thumbnail: /blog/assets/wuerstchen/thumbnail.jpg
 authors:
 - user: dome272
@@ -11,14 +11,14 @@ authors:
 - user: pcuenq
 ---
 
-# Introducing Würtschen: Fast Diffusion for Image Generation
+# Introducing Würstchen: Fast Diffusion for Image Generation
 
 <!-- {blog_metadata} -->
 <!-- {authors} -->
 
-![Collage of images created with Würtschen](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/wuertschen/collage_compressed.jpg)
+![Collage of images created with Würstchen](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/wuertschen/collage_compressed.jpg)
 
-## What is Würtschen?
+## What is Würstchen?
 
 Würstchen is a diffusion model, whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024×1024 images is way more expensive than training on 32×32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, it achieves a 42x spatial compression! This had never been seen before, because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the  **[paper](https://arxiv.org/abs/2306.00637)**). Together Stage A and B are called the *Decoder*, because they decode the compressed images back into pixel space. A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference. We refer to Stage C as the *Prior*.