
Conversation


@Peyton-Chen Peyton-Chen commented Aug 28, 2025

What does this PR do?

This PR adds support for the Step1X-Edit model for image editing tasks, integrating it into the Diffusers library. For further details on the Step1X-Edit model, please refer to the GitHub Repo and the Technical Report.

Example Code

import torch
from diffusers import Step1XEditPipeline
from diffusers.utils import load_image

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = Step1XEditPipeline.from_pretrained("stepfun-ai/Step1X-Edit-v1p1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the source image to edit.
image = load_image(
    "https://github.com/stepfun-ai/Step1X-Edit/blob/main/examples/0000.jpg?raw=true"
).convert("RGB")
prompt = "Add pendant with a ruby around this girl's neck."

# Run the edit with a fixed seed for reproducibility.
image = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=28,
    size_level=1024,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(1234),
).images[0]
image.save("output.png")

Result

Init Image: [image]
Edited Image: [image]
Mask: [image]

Who can review?

cc @a-r-r-o-w @sayakpaul

Member

@sayakpaul sayakpaul left a comment

Thanks for getting this started! Looks like a very cool model. I think this PR is already a very good start.

@linoytsaban / @asomoza in case you have some time to check it out.

processor._attention_backend = "_native_xla"
return processor

class Step1XEditAttnProcessor2_0_NPU:
Member

We can remove this processor for now.



def apply_gate(x, gate=None, tanh=False):
"""AI is creating summary for apply_gate
Member

What is this description?

return _get_projections(attn, hidden_states, encoder_hidden_states)


def get_activation_layer(act_type):
Member

If the activations don't vary across different blocks, can we remove this function and just use the activation functions in-place?
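For illustration, a minimal sketch of the suggested refactor, assuming silu is the only act_type this checkpoint ever requests (names are hypothetical):

import torch.nn as nn

# Before: act = get_activation_layer(act_type)()
# After, with the activation constructed in place:
act = nn.SiLU()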

return x * gate.unsqueeze(1)


def get_norm_layer(norm_layer):
Member

Same as above. It seems like the norm layers aren't changing. So, let's directly use nn.LayerNorm.
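Similarly, a sketch of constructing the norm in place (hidden_size is illustrative):

import torch.nn as nn

hidden_size = 3072  # illustrative
# Before: norm = get_norm_layer("layer")(hidden_size, eps=1e-6)
norm = nn.LayerNorm(hidden_size, eps=1e-6)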

Comment on lines 211 to 216
self.to_v_ip = nn.ModuleList(
[
nn.Linear(cross_attention_dim, hidden_size, bias=True, device=device, dtype=dtype)
for _ in range(len(num_tokens))
]
)
Member

Do we already support IP adapters for this model? If so, could you include an example? If not, let's remove this.

num_images_per_prompt: int = 1,
prompt_embeds: Optional[torch.Tensor] = None,
prompt_embeds_mask: Optional[torch.Tensor] = None,
max_sequence_length: int = 1024,
Member

If it's not used, let's remove.

if image is not None and not (isinstance(image, torch.Tensor) and image.size(1) == self.latent_channels):
img_info = image.size
width, height = img_info
r = width / height
Member

Suggested change
r = width / height
aspect_ratio = width / height

The prompt or prompts not to guide the image generation. If not defined, one has to pass
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is
not greater than `1`).
true_cfg_scale (`float`, *optional*, defaults to 6.0):
Member

I see we have both guidance_scale and true_cfg_scale. Is this to support future guidance-distilled models? The current model doesn't seem to be guidance-distilled.
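For context, a minimal sketch of the distinction (illustrative names, not the pipeline's actual code): true CFG runs two forward passes per step and blends them, whereas a distilled guidance_scale is fed to the transformer as a conditioning embedding in a single pass.

import torch

def combine_true_cfg(noise_pred_text, noise_pred_uncond, true_cfg_scale):
    # Classic classifier-free guidance: blend conditional and unconditional
    # predictions; only active when true_cfg_scale > 1.
    return noise_pred_uncond + true_cfg_scale * (noise_pred_text - noise_pred_uncond)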

Comment on lines 726 to 731
guidance_scale (`float`, *optional*, defaults to 6.0):
Guidance scale as defined in [Classifier-Free Diffusion
Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
`guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
the text `prompt`, usually at the expense of lower image quality.
Member

In the presence of the true_cfg_scale argument, we need to change this definition a bit:

guidance_scale (`float`, *optional*, defaults to 3.5):

Comment on lines 767 to 768
size_level (`int` defaults to 512): The maximum size level of the generated image in pixels. The height and width will be adjusted to fit this
area while maintaining the aspect ratio.
Member

Can't we derive this from the requested height and width parameters? Our pipelines don't ever contain arguments like size_level.
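For reference, a sketch of deriving an aspect-preserving size from a target pixel area, which is what height/width could replace size_level with (the 16-pixel snap is an assumption for the VAE/patch factor):

import math

def fit_to_area(width, height, size_level=1024, multiple=16):
    # Scale (width, height) to cover roughly size_level**2 pixels,
    # keeping the aspect ratio and snapping to a multiple.
    scale = math.sqrt((size_level * size_level) / (width * height))
    new_w = int(round(width * scale / multiple)) * multiple
    new_h = int(round(height * scale / multiple)) * multiple
    return new_w, new_h

print(fit_to_area(1920, 1080))  # (1360, 768)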

@sayakpaul sayakpaul requested a review from a-r-r-o-w August 30, 2025 07:12
Collaborator

@yiyixuxu yiyixuxu left a comment

thanks for the PR!
I left some comments

import numpy as np
import torch
import math
from qwen_vl_utils import process_vision_info
Collaborator

can you try to not have this dependency?
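For what it's worth, a rough sketch of doing the preprocessing with the transformers processor alone, assuming a Qwen2.5-VL text encoder as Step1X-Edit uses (the model id and prompt formatting are illustrative and would need to match what process_vision_info produced):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# prompt (str) and image (PIL.Image) are assumed to be defined.
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")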

self.gradient_checkpointing = False

@staticmethod
def timestep_embedding(
Collaborator

let's not make it a method of the transformer class.
Actually, is it the same as https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py#L1302?
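For comparison, a sketch of reusing the linked helper (the flip/shift arguments would need to be checked against this model's sin/cos ordering):

import torch
from diffusers.models.embeddings import get_timestep_embedding

t = torch.tensor([0, 250, 999])
emb = get_timestep_embedding(
    t,
    embedding_dim=256,
    flip_sin_to_cos=True,      # ordering differs between implementations
    downscale_freq_shift=0.0,  # match the original frequency schedule
)
print(emb.shape)  # torch.Size([3, 256])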

Comment on lines 1324 to 1335
if txt_ids.ndim == 3:
logger.warning(
"Passing `txt_ids` 3d torch.Tensor is deprecated."
"Please remove the batch dimension and pass it as a 2d torch Tensor"
)
txt_ids = txt_ids[0]
if img_ids.ndim == 3:
logger.warning(
"Passing `img_ids` 3d torch.Tensor is deprecated."
"Please remove the batch dimension and pass it as a 2d torch Tensor"
)
img_ids = img_ids[0]
Collaborator

Suggested change
if txt_ids.ndim == 3:
logger.warning(
"Passing `txt_ids` 3d torch.Tensor is deprecated."
"Please remove the batch dimension and pass it as a 2d torch Tensor"
)
txt_ids = txt_ids[0]
if img_ids.ndim == 3:
logger.warning(
"Passing `img_ids` 3d torch.Tensor is deprecated."
"Please remove the batch dimension and pass it as a 2d torch Tensor"
)
img_ids = img_ids[0]

we don't need the deprecation path for a new model class

Comment on lines 1340 to 1343
if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
ip_hidden_states = self.encoder_hid_proj(ip_adapter_image_embeds)
joint_attention_kwargs.update({"ip_hidden_states": ip_hidden_states})
Collaborator

Suggested change
if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
ip_hidden_states = self.encoder_hid_proj(ip_adapter_image_embeds)
joint_attention_kwargs.update({"ip_hidden_states": ip_hidden_states})

there is no IP adapter yet, no?

Comment on lines 1365 to 1375
# controlnet residual
if controlnet_block_samples is not None:
interval_control = len(self.transformer_blocks) / len(controlnet_block_samples)
interval_control = int(np.ceil(interval_control))
# For Xlabs ControlNet.
if controlnet_blocks_repeat:
hidden_states = (
hidden_states + controlnet_block_samples[index_block % len(controlnet_block_samples)]
)
else:
hidden_states = hidden_states + controlnet_block_samples[index_block // interval_control]
Collaborator

Suggested change
# controlnet residual
if controlnet_block_samples is not None:
interval_control = len(self.transformer_blocks) / len(controlnet_block_samples)
interval_control = int(np.ceil(interval_control))
# For Xlabs ControlNet.
if controlnet_blocks_repeat:
hidden_states = (
hidden_states + controlnet_block_samples[index_block % len(controlnet_block_samples)]
)
else:
hidden_states = hidden_states + controlnet_block_samples[index_block // interval_control]

let's add ControlNet when we have them :)

x: torch.Tensor,
t: torch.LongTensor,
mask: Optional[torch.LongTensor] = None,
y: torch.LongTensor=None,
Collaborator

Suggested change
y: torch.LongTensor=None,

Comment on lines 1030 to 1033
if self.need_CA:
self.input_embedder_CA = nn.Linear(
in_channels, hidden_size, bias=True, **factory_kwargs
)
Collaborator

Suggested change
if self.need_CA:
self.input_embedder_CA = nn.Linear(
in_channels, hidden_size, bias=True, **factory_kwargs
)

if this layer is not used in this checkpoint, let's just not have it

Comment on lines 1077 to 1080
if self.need_CA:
y = self.input_embedder_CA(y)
x = self.individual_token_refiner(x, c, mask, y)
else:
Collaborator

Suggested change
if self.need_CA:
y = self.input_embedder_CA(y)
x = self.individual_token_refiner(x, c, mask, y)
else:


global_out = self.global_proj_out(x_mean)

encoder_hidden_states = self.S(x,t,mask)
Collaborator

It seems like the SingleTokenRefiner should be its own layer, not part of the connector: the inputs are passing through without processing here.

So:
encoder_hidden_states, mask -> global_proj -> global_out
encoder_hidden_states, timesteps, mask -> single token refiner -> encoder_hidden_states
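To illustrate the proposed split, a sketch with made-up module names (not the PR's actual classes):

import torch
import torch.nn as nn

class GlobalProj(nn.Module):
    # encoder_hidden_states, mask -> masked mean -> global_out
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, mask):
        mask = mask.unsqueeze(-1).to(x.dtype)
        x_mean = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(x_mean)

# The SingleTokenRefiner would then be applied as its own layer:
# encoder_hidden_states, timesteps, mask -> refiner -> encoder_hidden_states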

Author

Thank you for your review. We have resolved all the other comments! As for the design of this connector, the class corresponds to the structural design outlined in the technical report, so we have retained it.
[image: connector design from the technical report]

@Peyton-Chen
Author

@sayakpaul @yiyixuxu Thank you very much for your patient review. We've made some changes according to your feedback. We sincerely appreciate your efforts once again!
