
Commit f698524

refactor; add docs; add tests; update conversion script
1 parent 0f9daec commit f698524

File tree

12 files changed: +629, -476 lines


docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -282,6 +282,8 @@
       title: PriorTransformer
     - local: api/models/sd3_transformer2d
       title: SD3Transformer2DModel
+    - local: api/models/sana_transformer2d
+      title: SanaTransformer2DModel
     - local: api/models/stable_audio_transformer
       title: StableAudioDiTModel
     - local: api/models/transformer2d
@@ -428,6 +430,8 @@
       title: PixArt-α
     - local: api/pipelines/pixart_sigma
       title: PixArt-Σ
+    - local: api/pipelines/sana
+      title: Sana
     - local: api/pipelines/self_attention_guidance
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
docs/source/en/api/models/sana_transformer2d.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. -->
+
+# SanaTransformer2DModel
+
+A Diffusion Transformer model for 2D data, introduced in [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
+
+The abstract from the paper is:
+
+*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*
+
+The model can be loaded with the following code snippet.
+
+```python
+TODO(aryan)
+```
+
+## SanaTransformer2DModel
+
+[[autodoc]] SanaTransformer2DModel
+
+## Transformer2DModelOutput
+
+[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
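
The `TODO(aryan)` snippet above is a placeholder in this commit. A minimal sketch of what loading just the transformer might look like, assuming the class is exported from the top-level `diffusers` namespace and the checkpoint was written by the conversion script updated below (the path is a placeholder, not a released repository):

```python
# Hedged sketch, not part of this commit: load the converted SANA transformer alone.
# "path/to/converted/checkpoint" is a placeholder for the conversion script's --dump_path.
import torch
from diffusers import SanaTransformer2DModel

transformer = SanaTransformer2DModel.from_pretrained(
    "path/to/converted/checkpoint",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
```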
docs/source/en/api/pipelines/sana.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License. -->
+
+# SanaPipeline
+
+[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
+
+The abstract from the paper is:
+
+*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*
+
+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
+This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model).
+
+## SanaPipeline
+
+[[autodoc]] SanaPipeline
+  - all
+  - __call__
+
+## SanaPipelineOutput
+
+[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput
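
A short usage sketch for the pipeline, mirroring the smoke test this commit removes from `scripts/convert_sana_to_diffusers.py` below (the checkpoint path is a placeholder for a converted or published checkpoint):

```python
# Hedged sketch based on the smoke test removed from the conversion script.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained("path/to/converted/checkpoint", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Same call the removed smoke test used.
image = pipe(
    "a dog",
    height=1024,
    width=1024,
    guidance_scale=5.0,
)[0]
image[0].save("sana.png")
```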

scripts/convert_sana_to_diffusers.py

Lines changed: 23 additions & 33 deletions
@@ -38,6 +38,7 @@
 def main(args):
     ckpt_id = ckpt_ids[0]
     cache_dir_path = os.path.expanduser("~/.cache/huggingface/hub")
+
     if args.orig_ckpt_path is None:
         snapshot_download(
             repo_id=ckpt_id,
@@ -52,6 +53,7 @@ def main(args):
         )
     else:
         file_path = args.orig_ckpt_path
+
     all_state_dict = torch.load(file_path, weights_only=True)
     state_dict = all_state_dict.pop("state_dict")
     converted_state_dict = {}
@@ -96,8 +98,8 @@ def main(args):
         converted_state_dict[f"transformer_blocks.{depth}.scale_shift_table"] = state_dict.pop(
             f"blocks.{depth}.scale_shift_table"
         )
+
         # Linear Attention is all you need 🤘
-
         # Self attention.
         q, k, v = torch.chunk(state_dict.pop(f"blocks.{depth}.attn.qkv.weight"), 3, dim=0)
         converted_state_dict[f"transformer_blocks.{depth}.attn1.to_q.weight"] = q
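
The hunk above splits the original checkpoint's fused `qkv` projection into the separate `to_q`/`to_k`/`to_v` weights that diffusers expects. A standalone sketch of that split (the sizes are illustrative, not read from the checkpoint):

```python
# Illustrative only: splitting a fused QKV projection weight row-wise.
import torch

inner_dim, hidden_dim = 1152, 1152  # example sizes
fused_qkv_weight = torch.randn(3 * inner_dim, hidden_dim)

q, k, v = torch.chunk(fused_qkv_weight, 3, dim=0)
assert q.shape == k.shape == v.shape == (inner_dim, hidden_dim)
```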
@@ -156,27 +158,20 @@ def main(args):
     # Transformer
     with CTX():
         transformer = SanaTransformer2DModel(
+            in_channels=32,
+            out_channels=32,
             num_attention_heads=model_kwargs[args.model_type]["num_attention_heads"],
             attention_head_dim=model_kwargs[args.model_type]["attention_head_dim"],
+            num_layers=model_kwargs[args.model_type]["num_layers"],
             num_cross_attention_heads=model_kwargs[args.model_type]["num_cross_attention_heads"],
             cross_attention_head_dim=model_kwargs[args.model_type]["cross_attention_head_dim"],
-            in_channels=32,
-            out_channels=32,
-            num_layers=model_kwargs[args.model_type]["num_layers"],
             cross_attention_dim=model_kwargs[args.model_type]["cross_attention_dim"],
             attention_bias=False,
             sample_size=32,
             patch_size=1,
-            upcast_attention=False,
-            norm_type="ada_norm_single",
             norm_elementwise_affine=False,
             norm_eps=1e-6,
-            use_additional_conditions=False,
             caption_channels=2304,
-            use_caption_norm=True,
-            caption_norm_scale_factor=0.1,
-            attention_type="default",
-            use_pe=False,
             expand_ratio=2.5,
         )
     if is_accelerate_available():
@@ -203,24 +198,17 @@ def main(args):
                 attrs=["bold"],
             )
         )
-        transformer.to(weight_dtype).save_pretrained(os.path.join(args.dump_path, "transformer"))
+        transformer.save_pretrained(os.path.join(args.dump_path, "transformer"), safe_serialization=True, max_shard_size="5GB", variant=variant)
     else:
         print(colored(f"Saving the whole SanaPipeline containing {args.model_type}", "green", attrs=["bold"]))
        # VAE
-        ae = AutoencoderDC.from_pretrained(
-            "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
-            torch_dtype=torch.bfloat16,
-        ).to(device)
+        ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",)

        # Text Encoder
         text_encoder_model_path = "google/gemma-2-2b-it"
         tokenizer = AutoTokenizer.from_pretrained(text_encoder_model_path)
         tokenizer.padding_side = "right"
-        text_encoder = (
-            AutoModelForCausalLM.from_pretrained(text_encoder_model_path, torch_dtype=torch.bfloat16)
-            .get_decoder()
-            .to(device)
-        )
+        text_encoder = AutoModelForCausalLM.from_pretrained(text_encoder_model_path).get_decoder()

        # Scheduler
         if args.scheduler_type == "flow-dpm_solver":
@@ -234,27 +222,27 @@ def main(args):
         else:
             raise ValueError(f"Scheduler type {args.scheduler_type} is not supported")

-        # transformer
-        transformer.to(device).to(weight_dtype)
-
         pipe = SanaPipeline(
             tokenizer=tokenizer,
             text_encoder=text_encoder,
             transformer=transformer,
             vae=ae,
             scheduler=scheduler,
         )
+        pipe.save_pretrained(args.dump_path, safe_serialization=True, max_shard_size="5GB", variant=variant)

-        image = pipe(
-            "a dog",
-            height=1024,
-            width=1024,
-            guidance_scale=5.0,
-        )[0]

-        image[0].save("sana.png")
+DTYPE_MAPPING = {
+    "fp32": torch.float32,
+    "fp16": torch.float16,
+    "bf16": torch.bfloat16,
+}

-        pipe.save_pretrained(args.dump_path)
+VARIANT_MAPPING = {
+    "fp32": None,
+    "fp16": "fp16",
+    "bf16": "bf16",
+}


 if __name__ == "__main__":
@@ -279,6 +267,7 @@ def main(args):
     )
     parser.add_argument("--dump_path", default=None, type=str, required=True, help="Path to the output pipeline.")
     parser.add_argument("--save_full_pipeline", action="store_true", help="save all the pipelien elemets in one.")
+    parser.add_argument("--dtype", default="fp32", type=str, choices=["fp32", "fp16", "bf16"], help="Weight dtype.")

     args = parser.parse_args()

@@ -302,6 +291,7 @@ def main(args):
     }

     device = "cuda" if torch.cuda.is_available() else "cpu"
-    weight_dtype = torch.float16
+    weight_dtype = DTYPE_MAPPING[args.dtype]
+    variant = VARIANT_MAPPING[args.dtype]

     main(args)
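
Because the script can now write weights with an `fp16`/`bf16` variant, loading a converted checkpoint later needs the matching `variant` argument. A hedged sketch of that round trip (the path is a placeholder for the `--dump_path` used during conversion):

```python
# Hedged sketch: loading a checkpoint that was converted with `--dtype bf16`.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "path/to/converted/checkpoint",  # the --dump_path used during conversion
    variant="bf16",
    torch_dtype=torch.bfloat16,
)
```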

src/diffusers/models/attention_processor.py

Lines changed: 8 additions & 38 deletions
@@ -5358,77 +5358,47 @@ def __call__(
         hidden_states: torch.Tensor,
         encoder_hidden_states: Optional[torch.Tensor] = None,
         attention_mask: Optional[torch.Tensor] = None,
-        temb: Optional[torch.Tensor] = None,
-        *args,
-        **kwargs,
     ) -> torch.Tensor:
-        if len(args) > 0 or kwargs.get("scale", None) is not None:
-            deprecation_message = "The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`."
-            deprecate("scale", "1.0.0", deprecation_message)
-
-        residual = hidden_states
-        if attn.spatial_norm is not None:
-            hidden_states = attn.spatial_norm(hidden_states, temb)
-
         input_ndim = hidden_states.ndim
+        original_dtype = hidden_states.dtype

-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
-
-        batch_size, sequence_length, _ = (
+        batch_size, _, _ = (
             hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
         )

-        query = attn.to_q(hidden_states)
-
         if encoder_hidden_states is None:
             encoder_hidden_states = hidden_states
-        elif attn.norm_cross:
-            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)

+        query = attn.to_q(hidden_states)
         key = attn.to_k(encoder_hidden_states)
         value = attn.to_v(encoder_hidden_states)

         inner_dim = key.shape[-1]
         head_dim = inner_dim // attn.heads

-        dtype = query.dtype
-
         query = query.transpose(-1, -2).reshape(batch_size, attn.heads, head_dim, -1)
         key = key.transpose(-1, -2).reshape(batch_size, attn.heads, head_dim, -1).transpose(-1, -2)
         value = value.transpose(-1, -2).reshape(batch_size, attn.heads, head_dim, -1)

-        query = self.kernel_func(query)  # B, h, h_d, N
+        query = self.kernel_func(query)
         key = self.kernel_func(key)

-        # need torch.float
         query, key, value = query.float(), key.float(), value.float()

         value = F.pad(value, (0, 0, 0, 1), mode="constant", value=self.pad_val)
-        vk = torch.matmul(value, key)
-        hidden_states = torch.matmul(vk, query)
+        scores = torch.matmul(value, key)
+        hidden_states = torch.matmul(scores, query)

         if hidden_states.dtype in [torch.float16, torch.bfloat16]:
             hidden_states = hidden_states.float()
+
         hidden_states = hidden_states[:, :, :-1] / (hidden_states[:, :, -1:] + self.eps)
-
         hidden_states = hidden_states.view(batch_size, attn.heads * head_dim, -1).permute(0, 2, 1)
-        hidden_states = hidden_states.to(dtype)
+        hidden_states = hidden_states.to(original_dtype)

-        # linear proj
         hidden_states = attn.to_out[0](hidden_states)
-        # dropout
         hidden_states = attn.to_out[1](hidden_states)

-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
-
-        if attn.residual_connection:
-            hidden_states = hidden_states + residual
-
-        hidden_states = hidden_states / attn.rescale_output_factor
-
         if hidden_states.dtype == torch.float16:
             hidden_states = hidden_states.clip(-65504, 65504)

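The refactored processor keeps SANA's linear attention: queries and keys go through `kernel_func`, the value matrix is padded with an extra row so one pair of matmuls produces both the attention output and its normalizer, and the result is divided by that last row. A self-contained sketch of the computation, assuming ReLU as the kernel function and a pad value of 1.0 (both assumptions about the processor's configuration):

```python
# Hedged, self-contained sketch of the linear attention math used above.
import torch
import torch.nn.functional as F

batch, heads, head_dim, seq_len = 2, 8, 32, 256
eps = 1e-15  # assumed epsilon

# Tensors already in the (batch, heads, head_dim, seq_len) layout the processor uses.
query = F.relu(torch.randn(batch, heads, head_dim, seq_len))                  # kernel_func(Q)
key = F.relu(torch.randn(batch, heads, head_dim, seq_len)).transpose(-1, -2)  # kernel_func(K), transposed
value = torch.randn(batch, heads, head_dim, seq_len)

# Append a row of ones so the same matmuls also accumulate the normalizer.
value = F.pad(value, (0, 0, 0, 1), mode="constant", value=1.0)  # (B, h, head_dim + 1, N)

scores = torch.matmul(value, key)  # (B, h, head_dim + 1, head_dim)
out = torch.matmul(scores, query)  # (B, h, head_dim + 1, N)

# The last row holds the per-position normalizer; divide it out and drop it.
out = out[:, :, :-1] / (out[:, :, -1:] + eps)
print(out.shape)  # torch.Size([2, 8, 32, 256])
```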
src/diffusers/models/normalization.py

Lines changed: 0 additions & 39 deletions
@@ -590,42 +590,3 @@ def get_normalization(
     else:
         raise ValueError(f"{norm_type=} is not supported.")
     return norm
-
-
-class RMSNormScaled(nn.Module):
-    def __init__(self, dim, eps: float, elementwise_affine: bool = True, scale_factor: float = 1.0, bias: bool = False):
-        super().__init__()
-        self.weight = nn.Parameter(torch.ones(dim) * scale_factor)
-
-        self.eps = eps
-        self.elementwise_affine = elementwise_affine
-
-        if isinstance(dim, numbers.Integral):
-            dim = (dim,)
-
-        self.dim = torch.Size(dim)
-
-        self.weight = None
-        self.bias = None
-
-        if elementwise_affine:
-            self.weight = nn.Parameter(torch.ones(dim) * scale_factor)
-            if bias:
-                self.bias = nn.Parameter(torch.zeros(dim))
-
-    def forward(self, hidden_states):
-        input_dtype = hidden_states.dtype
-        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
-
-        if self.weight is not None:
-            # convert into half-precision if necessary
-            if self.weight.dtype in [torch.float16, torch.bfloat16]:
-                hidden_states = hidden_states.to(self.weight.dtype)
-            hidden_states = hidden_states * self.weight
-            if self.bias is not None:
-                hidden_states = hidden_states + self.bias
-        else:
-            hidden_states = hidden_states.to(input_dtype)
-
-        return hidden_states
