Add RAE Diffusion Transformer inference/preliminary training pipelines#13231

Open
plugyawn wants to merge 13 commits into huggingface:main from plugyawn:rae-dit-training

Conversation

@plugyawn plugyawn commented Mar 9, 2026

What does this PR do?

This PR adds support for Diffusion Transformers with Representation Autoencoders in Diffusers.

It implements the Stage-2 side of the RAE setup:

  • RAEDiT2DModel
  • RAEDiTPipeline
  • checkpoint conversion for published upstream Stage-2 checkpoints
  • API docs
  • a small examples/research_projects/rae_dit/ training scaffold

This addresses #13225.

Reference implementation: bytetriper's repository

Validation

Inference parity with the official implementation is high. For matched class label / initial latent noise / schedule, I measured:

  • max_abs_error=0.00001717
  • mean_abs_error=0.00000122

Qualitative parity artifacts used during validation:

  • same published Stage-2 checkpoint
  • same class label
  • same initial latent noise
  • same 25-step shifted Euler schedule
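For reference, a minimal sketch of how parity numbers like the above are typically computed. The tensors here are synthetic stand-ins, not the actual RAE outputs; in the real comparison they would be the final latents or decoded images from the two matched runs:

```python
import torch

# Synthetic stand-ins for the two implementations' outputs.
torch.manual_seed(0)
ours = torch.randn(1, 3, 256, 256)
reference = ours + 1e-5 * torch.randn_like(ours)

# Elementwise absolute difference between the matched runs.
diff = (ours - reference).abs()
print(f"max_abs_error={diff.max().item():.8f}")
print(f"mean_abs_error={diff.mean().item():.8f}")
```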

Inference is also slightly faster in the current Diffusers port on a 40GB A100:

| Precision | CFG | Steps | Diffusers sec/img | Upstream sec/img | Diffusers img/s | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| bf16 | 1.0 | 25 | 0.817 | 0.913 | 1.225 | +11.8% |
| bf16 | 4.0 | 25 | 0.852 | 0.931 | 1.174 | +9.3% |
| bf16 | 1.0 | 50 | 1.568 | 1.761 | 0.638 | +12.3% |
| bf16 | 4.0 | 50 | 1.649 | 1.853 | 0.606 | +12.4% |

Notes

  • This PR intentionally does not add upstream autoguidance / guidance-model support.
  • The training script is a research-project scaffold under examples/research_projects, not a claim of full upstream training parity.
  • AutoencoderRAE.from_pretrained() is used for the Stage-1 component so the packaged RAEDiTPipeline.from_pretrained(...) path works with published RAE checkpoints.


@plugyawn plugyawn changed the title Add Stage-2 RAE DiT support with pipeline, conversion, and training tooling RAE DiT inference, checkpoint conversion, and preliminary training tooling Mar 9, 2026
@plugyawn plugyawn changed the title RAE DiT inference, checkpoint conversion, and preliminary training tooling Add RAE Diffusion Transformer inference/preliminary training pipelines Mar 9, 2026
@plugyawn plugyawn marked this pull request as draft March 9, 2026 05:46
@plugyawn plugyawn marked this pull request as ready for review March 9, 2026 05:51

plugyawn commented Mar 9, 2026

@kashif @sayakpaul would be great if you could review. Please note the no_init_weights() fix (details in the PR body); if you prefer, that could be a separate PR, but considering diffusers is supposed to be an extension to torch, I guess it makes sense?

@sayakpaul
Member

Thanks for the PR. To keep the scope manageable, could we break it down into separate PRs?

For example,

there is also a change to no_init_weights(). Specifically, it makes Diffusers' skip-weight-init behave more like normal PyTorch. Today, when no_init_weights() is active, the torch.nn.init.* stubs stop returning the tensor they were called on (plain PyTorch does return it). Most models never notice, but the RAE-DiT implementation relies on the return value during construction, which can make otherwise valid checkpoints fail to load through the standard from_pretrained() path.

could be a separate PR.

@sayakpaul sayakpaul left a comment


Thanks!

I left some initial comments, let me know if they make sense.

Comment on lines +13 to +16
- `examples/dreambooth/train_dreambooth_flux.py`
for the flow-matching training loop structure, checkpoint resume flow, and `accelerate.save_state(...)` hooks.
- `examples/flux-control/train_control_flux.py`
for the transformer-only save layout and SD3-style flow-matching timestep weighting helpers.
Member

Doesn't belong here.

Comment on lines +218 to +221
# Preserve the `torch.nn.init.*` return contract so third-party model
# constructors that chain on the returned tensor still work under
# `no_init_weights()`.
return args[0] if len(args) > 0 else None
Member

Can you provide an example?

super().test_effective_gradient_checkpointing(loss_tolerance=1e-4)

@unittest.skip(
"RAEDiT initializes the output head to zeros, so cosine-based layerwise casting checks are uninformative."
Member

I don't think this is the case? We can always skip layerwise casting for certain layer or layer groups here:

_skip_layerwise_casting_patterns = None

model.final_layer.linear.bias.data.normal_(mean=0.0, std=0.02)


class RAEDiT2DModelTests(ModelTesterMixin, unittest.TestCase):
Member

Test should use the newly added model tester mixins. You can find an example in #13046

Comment on lines +48 to +49
if shift is None:
shift = torch.zeros_like(scale)
Member

This is a small function; it would be fine to inline it at the call sites?

We also probably don't need _repeat_to_length().
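For illustration, the inlined form could look like the sketch below. This assumes the helper applies an adaLN-style `x * (1 + scale) + shift` modulation; the shapes and that modulation form are assumptions, only the shift-is-None guard comes from the quoted snippet:

```python
import torch

# Hypothetical shapes; `shift` may be None, as in the original guard.
hidden_states = torch.randn(2, 16, 8)
scale = torch.randn(2, 1, 8)
shift = None

# Inlined helper: treat a missing shift as zeros, then modulate.
shift = torch.zeros_like(scale) if shift is None else shift
hidden_states = hidden_states * (1 + scale) + shift
print(hidden_states.shape)
```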

Comment on lines +466 to +470
if self.use_pos_embed:
pos_embed = get_2d_sincos_pos_embed(
self.pos_embed.shape[-1], int(sqrt(self.pos_embed.shape[1])), output_type="pt"
)
self.pos_embed.data.copy_(pos_embed.float().unsqueeze(0))
Member

Can we use how #13046 initialized the position embeddings?

Author

Yeah, that makes sense, will do that.

)
return hidden_states

def _run_block(
Member

We don't need this. Let's instead follow this pattern:

for index_block, block in enumerate(self.transformer_blocks):
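A hedged sketch of the suggested pattern, iterating `self.transformer_blocks` directly in `forward()` instead of routing through a `_run_block` helper. The `Block` and `temb` details below are toy stand-ins, not the RAE-DiT internals, and the checkpointing branch mimics the common diffusers structure rather than its exact helper:

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint


class Block(nn.Module):
    """Toy stand-in for a transformer block taking (hidden_states, temb)."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, hidden_states, temb):
        return self.net(hidden_states) + temb


class ToyTransformer(nn.Module):
    def __init__(self, dim=8, depth=3):
        super().__init__()
        self.transformer_blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.gradient_checkpointing = False

    def forward(self, hidden_states, temb):
        # Iterate the blocks inline, as suggested in the review.
        for index_block, block in enumerate(self.transformer_blocks):
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                # diffusers models dispatch through a checkpointing helper here;
                # plain torch.utils.checkpoint stands in for it in this sketch.
                hidden_states = torch.utils.checkpoint.checkpoint(
                    block, hidden_states, temb, use_reentrant=False
                )
            else:
                hidden_states = block(hidden_states, temb)
        return hidden_states


model = ToyTransformer()
out = model(torch.randn(2, 8), torch.zeros(2, 8))
print(out.shape)
```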


return class_labels

def _prepare_latents(
Member

It should be called prepare_latents() similar to other pipelines.

Comment on lines +247 to +252
if output_type == "pt":
output = images
else:
output = images.cpu().permute(0, 2, 3, 1).float().numpy()
if output_type == "pil":
output = self.numpy_to_pil(output)
Member

We should use an image processor instead here. See:

image = self.image_processor.postprocess(image, output_type=output_type)

if not return_dict:
return (output,)

return ImagePipelineOutput(images=output)
Member

Let's give this pipeline a separate output class: RAEDiTPipelineOutput.
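A sketch of what that might look like. Real diffusers pipeline outputs subclass `BaseOutput`; a plain dataclass is used here to stay self-contained, and the field shape/name mirrors `ImagePipelineOutput` as an assumption:

```python
from dataclasses import dataclass
from typing import List, Union

import numpy as np


@dataclass
class RAEDiTPipelineOutput:
    """Sketch of a dedicated output class (the real one would subclass
    diffusers.utils.BaseOutput). `images` is a list of PIL images or an
    NHWC numpy array, depending on `output_type`."""

    # Whole annotation kept as a string so PIL need not be imported here.
    images: "Union[List[PIL.Image.Image], np.ndarray]"


out = RAEDiTPipelineOutput(images=np.zeros((1, 32, 32, 3), dtype=np.float32))
print(out.images.shape)
```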

@sayakpaul sayakpaul requested review from dg845 and kashif March 9, 2026 11:33

plugyawn commented Mar 10, 2026

@sayakpaul, from what I understand the RAE checkpoint -> DiT checkpoint -> generation pipeline necessarily requires the no_init_weights() change (otherwise the semantics become a bit muddled, imo).

Would it make more sense to open a PR for handling no_init_weights() behavior before this one?

@sayakpaul
Member

Could you explain why that's needed? I am still not sure about that, actually. I'd prefer specific examples that fail without the init change.


plugyawn commented Mar 10, 2026

Not sure how to link files, but it seems to be related to changes introduced in #13046.

A specific example,

  • AutoencoderRAE constructs Dinov2WithRegistersModel(config) in _build_encoder:84.
  • ModelMixin.from_pretrained() always does that construction under no_init_weights() first, even before low_cpu_mem_usage matters; see modeling_utils.py:1270.
  • In current transformers, DINOv2-with-registers has init code like this in modeling_dinov2_with_registers.py:464:

  module.weight.data = nn.init.trunc_normal_(
      module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range
  ).to(module.weight.dtype)

Under today’s no_init_weights(), nn.init.trunc_normal_ is replaced with a stub that just passes (and so returns None), so the assignment becomes None.to(...) and fails with AttributeError: 'NoneType' object has no attribute 'to'.

Codex has a better summary, I think:

failing example: AutoencoderRAE builds Dinov2WithRegistersModel(config) in its encoder path, and ModelMixin.from_pretrained() always instantiates models under no_init_weights() first. In current transformers, DINOv2’s init_weights() assigns the return value of nn.init.trunc_normal_(...) and then calls .to(...) on it. With the current no_init_weights() stub, that return value becomes None, so construction fails with AttributeError: 'NoneType' object has no attribute 'to'. The proposed change keeps skip-init behavior intact, but restores the normal PyTorch return contract so these constructors remain compatible.
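To make that concrete, here is a minimal, self-contained repro sketch of the failure mode. The context manager below is a toy stand-in for diffusers' no_init_weights(), not the real implementation; only the chained `.to(...)` init pattern is taken from the DINOv2 code quoted above:

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def no_init_weights_stub(preserve_return: bool):
    """Toy no_init_weights(): swap trunc_normal_ for a no-op stub.

    preserve_return=False mimics the current stub (returns None);
    preserve_return=True mimics the proposed fix (returns the tensor).
    """
    original = torch.nn.init.trunc_normal_

    def stub(tensor, *args, **kwargs):
        return tensor if preserve_return else None

    torch.nn.init.trunc_normal_ = stub
    try:
        yield
    finally:
        torch.nn.init.trunc_normal_ = original


def dinov2_style_init(module: nn.Linear):
    # The DINOv2-with-registers pattern: chain .to() on the init return value.
    module.weight.data = nn.init.trunc_normal_(
        module.weight.data.to(torch.float32), mean=0.0, std=0.02
    ).to(module.weight.dtype)


layer = nn.Linear(4, 4)

with no_init_weights_stub(preserve_return=True):
    dinov2_style_init(layer)  # fine: the stub returns the tensor

try:
    with no_init_weights_stub(preserve_return=False):
        dinov2_style_init(layer)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'to'
```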

Re: #13046, note test_models_autoencoder_rae.py:45, where the unit tests seem to be a little off, imo. Not sure the tests are aligned.

# ---------------------------------------------------------------------------
# Tiny test encoder for fast unit tests (no transformers dependency)
# ---------------------------------------------------------------------------
# (Excerpt: assumes `torch`, `torch.nn.functional as F`, and the RAE module's
# private `_ENCODER_FORWARD_FNS` / `_build_encoder` / `_rae_module` are in scope.)


class _TinyTestEncoderModule(torch.nn.Module):
    """Minimal encoder that mimics the patch-token interface without any HF model."""

    def __init__(self, hidden_size: int = 16, patch_size: int = 8, **kwargs):
        super().__init__()
        self.patch_size = patch_size
        self.hidden_size = hidden_size

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        pooled = F.avg_pool2d(images.mean(dim=1, keepdim=True), kernel_size=self.patch_size, stride=self.patch_size)
        tokens = pooled.flatten(2).transpose(1, 2).contiguous()
        return tokens.repeat(1, 1, self.hidden_size)


def _tiny_test_encoder_forward(model, images):
    return model(images)


def _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    return _TinyTestEncoderModule(hidden_size=hidden_size, patch_size=patch_size)


# Monkey-patch the dispatch tables so "tiny_test" is recognised by AutoencoderRAE
_ENCODER_FORWARD_FNS["tiny_test"] = _tiny_test_encoder_forward
_original_build_encoder = _build_encoder


def _patched_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers):
    if encoder_type == "tiny_test":
        return _build_tiny_test_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)
    return _original_build_encoder(encoder_type, hidden_size, patch_size, num_hidden_layers)


_rae_module._build_encoder = _patched_build_encoder

I'm new to diffusers idioms, but I was confused why this appeared to be a problem only now, and asked GPT:

no_init_weights() only becomes a problem when all of these are true at once:

  • a diffusers ModelMixin.from_pretrained() call is constructing the model
  • that model’s init() instantiates another model internally
  • that internal model uses torch.nn.init.* and also relies on its return value

RAE is unusual because it does exactly that. Inside autoencoder_rae.py, the AutoencoderRAE constructor directly builds a transformers vision backbone:

  • Dinov2WithRegistersModel:98
  • SiglipVisionModel:111
  • ViTMAEModel:124

That is not how most other diffusers integrations are structured. Most of the repo does one of these instead:

  • native diffusers models in src/diffusers/models, whose init code only relies on side effects
  • pipelines that accept transformers models as separate top-level components, rather than constructing them inside a ModelMixin

So other work usually does not run a transformers constructor inside diffusers’ patched no_init_weights() context.

@sayakpaul
Member

Not sure how to link files

Yes, we can link files and I think it's better this way. For example, it's much better to refer to specific lines like

def get_parameter_device(parameter: torch.nn.Module) -> torch.device:

instead of plain text.

Overall, I think that the explanation you provided in the above comment isn't that helpful. We need some specific (preferably very minimal) code snippet with and without that change to better understand what's happening and why.

For PRs like this, it's an expectation that contributors will take some time to understand the library code.
