Merged

Changes from 14 commits

Commits (42)
b02915b
CogVideoX1_1PatchEmbed test
zRzRzRzRzRzRzR Nov 6, 2024
87535d6
1360 * 768
zRzRzRzRzRzRzR Nov 6, 2024
b033aad
refactor
a-r-r-o-w Nov 8, 2024
67cb373
make style
a-r-r-o-w Nov 8, 2024
de84a04
Merge branch 'main' into cogvideox1.1-5b
a-r-r-o-w Nov 8, 2024
e481843
update docs
a-r-r-o-w Nov 8, 2024
9edddc1
add modeling tests for cogvideox 1.5
a-r-r-o-w Nov 8, 2024
ea56788
update
a-r-r-o-w Nov 8, 2024
d833f72
make fix-copies
a-r-r-o-w Nov 8, 2024
b87b07e
add ofs embed(for convert)
zRzRzRzRzRzRzR Nov 9, 2024
e254bcb
add ofs embed(for convert)
zRzRzRzRzRzRzR Nov 9, 2024
5e96cae
Merge branch 'huggingface:main' into cogvideox1.1-5b
zRzRzRzRzRzRzR Nov 10, 2024
be80dbf
more resolution for cogvideox1.5-5b-i2v
zRzRzRzRzRzRzR Nov 10, 2024
be8aff7
Merge branch 'cogvideox1.1-5b' of github.com:zRzRzRzRzRzRzR/diffusers…
zRzRzRzRzRzRzR Nov 10, 2024
b94c704
use even number of latent frames only
a-r-r-o-w Nov 10, 2024
048a5f0
update pipeline implementations
a-r-r-o-w Nov 10, 2024
0c98aad
make style
a-r-r-o-w Nov 10, 2024
7a1b579
set patch_size_t as None by default
zRzRzRzRzRzRzR Nov 11, 2024
27441fc
#skip frames 0
zRzRzRzRzRzRzR Nov 11, 2024
7a15767
refactor
a-r-r-o-w Nov 11, 2024
e2a88cb
make style
a-r-r-o-w Nov 11, 2024
8966cb0
update docs
a-r-r-o-w Nov 11, 2024
f2213e8
fix ofs_embed
a-r-r-o-w Nov 11, 2024
8b28232
update docs
a-r-r-o-w Nov 11, 2024
3587317
invert_scale_latents
a-r-r-o-w Nov 11, 2024
17957d0
update
a-r-r-o-w Nov 11, 2024
3dba37f
Merge branch 'main' into cogvideox1.1-5b
a-r-r-o-w Nov 14, 2024
25a9e1c
fix
a-r-r-o-w Nov 14, 2024
a8ec9f2
Merge branch 'main' into cogvideox1.1-5b
a-r-r-o-w Nov 14, 2024
7990958
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 14, 2024
2c3b78d
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 14, 2024
e063e9d
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 14, 2024
f054c44
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 14, 2024
3849cae
Update src/diffusers/models/transformers/cogvideox_transformer_3d.py
a-r-r-o-w Nov 14, 2024
4d14abb
update conversion script
a-r-r-o-w Nov 14, 2024
9c846eb
remove copied from
a-r-r-o-w Nov 14, 2024
9ef66d1
fix test
a-r-r-o-w Nov 14, 2024
23abe7b
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 17, 2024
f47516d
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 17, 2024
b4d629d
Merge branch 'main' into cogvideox1.1-5b
a-r-r-o-w Nov 17, 2024
4a4df63
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 17, 2024
ea166f8
Update docs/source/en/api/pipelines/cogvideox.md
a-r-r-o-w Nov 17, 2024
14 changes: 8 additions & 6 deletions docs/source/en/api/pipelines/cogvideox.md
@@ -29,16 +29,18 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

There are two models available that can be used with the text-to-video and video-to-video CogVideoX pipelines:
- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `fp16`.
- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `bf16`.
There are three official models available that can be used with the text-to-video and video-to-video CogVideoX pipelines:
- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `torch.float16`.
- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `torch.bfloat16`.
- [`THUDM/CogVideoX-1.5-5b`](https://huggingface.co/THUDM/CogVideoX-1.5-5b): The recommended dtype for running this model is `torch.bfloat16`.

There are two models available that can be used with the image-to-video CogVideoX pipeline:
- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `bf16`.
- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `torch.bfloat16`.
- [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V): The recommended dtype for running this model is `torch.bfloat16`.

There are two models that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team):
- [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose): The recommended dtype for running this model is `bf16`.
- [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose): The recommended dtype for running this model is `bf16`.
- [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose): The recommended dtype for running this model is `torch.bfloat16`.
- [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose): The recommended dtype for running this model is `torch.bfloat16`.
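For reference, a minimal loading sketch that follows the recommended dtypes listed above (the checkpoint id and prompt are illustrative; the same pattern applies to the other checkpoints with their listed dtypes):

```python
# Minimal sketch, assuming the CogVideoX-5b checkpoint listed above and its recommended
# torch.bfloat16 dtype; swap in any of the other checkpoints with their recommended dtype.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # optional: reduces VRAM usage

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```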

## Inference

68 changes: 60 additions & 8 deletions scripts/convert_cogvideox_to_diffusers.py
@@ -80,6 +80,8 @@ def replace_up_keys_inplace(key: str, state_dict: Dict[str, Any]):
"post_attn1_layernorm": "norm2.norm",
"time_embed.0": "time_embedding.linear_1",
"time_embed.2": "time_embedding.linear_2",
"ofs_embed.0": "ofs_embedding.linear_1",
"ofs_embed.2": "ofs_embedding.linear_2",
"mixins.patch_embed": "patch_embed",
"mixins.final_layer.norm_final": "norm_out.norm",
"mixins.final_layer.linear": "proj_out",
@@ -140,6 +142,7 @@ def convert_transformer(
use_rotary_positional_embeddings: bool,
i2v: bool,
dtype: torch.dtype,
init_kwargs: Dict[str, Any],
):
PREFIX_KEY = "model.diffusion_model."

@@ -149,7 +152,9 @@
num_layers=num_layers,
num_attention_heads=num_attention_heads,
use_rotary_positional_embeddings=use_rotary_positional_embeddings,
use_learned_positional_embeddings=i2v,
ofs_embed_dim=512 if (i2v and init_kwargs["patch_size_t"] is not None) else None, # CogVideoX1.5-5B-I2V
use_learned_positional_embeddings=i2v and init_kwargs["patch_size_t"] is None, # CogVideoX-5B-I2V
**init_kwargs,
).to(dtype=dtype)

for key in list(original_state_dict.keys()):
@@ -163,6 +168,7 @@
if special_key not in key:
continue
handler_fn_inplace(key, original_state_dict)

transformer.load_state_dict(original_state_dict, strict=True)
return transformer

@@ -187,6 +193,34 @@ def convert_vae(ckpt_path: str, scaling_factor: float, dtype: torch.dtype):
return vae


def get_init_kwargs(version: str):
if version == "1.0":
vae_scale_factor_spatial = 8
init_kwargs = {
"patch_size": 2,
"patch_size_t": None,
"patch_bias": True,
"sample_height": 480 // vae_scale_factor_spatial,
"sample_width": 720 // vae_scale_factor_spatial,
"sample_frames": 49,
}

elif version == "1.5":
vae_scale_factor_spatial = 8
init_kwargs = {
"patch_size": 2,
"patch_size_t": 2,
"patch_bias": False,
"sample_height": 768 // vae_scale_factor_spatial,
"sample_width": 1360 // vae_scale_factor_spatial,
"sample_frames": 81, # TODO: Need Test with 161 for 10 seconds
Review comment (Contributor):

Suggested change:
- "sample_frames": 81,  # TODO: Need Test with 161 for 10 seconds
+ "sample_frames": 81,

This only determines the default number of frames for sampling, so we do not need to modify it here (doing so would affect the config.json of the converted transformer model). Users can still request 161 frames in the call to the pipeline, and generation remains compatible without any changes here.
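A hedged sketch of what the comment describes: the converted config keeps `sample_frames=81`, and a roughly 10-second clip is requested at call time instead (the checkpoint id is taken from the documentation list above and is illustrative).

```python
# Sketch only: override the default frame count at call time rather than in the converted config.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-1.5-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",
    num_frames=161,  # roughly 10 seconds; the config's sample_frames=81 only sets the default
).frames[0]
```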

}
else:
raise ValueError("Unsupported version of CogVideoX.")

return init_kwargs


def get_args():
parser = argparse.ArgumentParser()
parser.add_argument(
@@ -202,6 +236,12 @@ def get_args():
parser.add_argument(
"--text_encoder_cache_dir", type=str, default=None, help="Path to text encoder cache directory"
)
parser.add_argument(
"--typecast_text_encoder",
action="store_true",
default=False,
help="Whether or not to apply fp16/bf16 precision to text_encoder",
)
# For CogVideoX-2B, num_layers is 30. For 5B, it is 42
parser.add_argument("--num_layers", type=int, default=30, help="Number of transformer blocks")
# For CogVideoX-2B, num_attention_heads is 30. For 5B, it is 48
@@ -214,7 +254,18 @@
parser.add_argument("--scaling_factor", type=float, default=1.15258426, help="Scaling factor in the VAE")
# For CogVideoX-2B, snr_shift_scale is 3.0. For 5B, it is 1.0
parser.add_argument("--snr_shift_scale", type=float, default=3.0, help="Scaling factor in the VAE")
parser.add_argument("--i2v", action="store_true", default=False, help="Whether to save the model weights in fp16")
parser.add_argument(
"--i2v",
action="store_true",
default=False,
help="Whether the model to be converted is the Image-to-Video version of CogVideoX.",
)
parser.add_argument(
"--version",
choices=["1.0", "1.5"],
default="1.0",
help="Which version of CogVideoX to use for initializing default modeling parameters.",
)
return parser.parse_args()


@@ -230,21 +281,27 @@ def get_args():
dtype = torch.float16 if args.fp16 else torch.bfloat16 if args.bf16 else torch.float32

if args.transformer_ckpt_path is not None:
init_kwargs = get_init_kwargs(args.version)
transformer = convert_transformer(
args.transformer_ckpt_path,
args.num_layers,
args.num_attention_heads,
args.use_rotary_positional_embeddings,
args.i2v,
dtype,
init_kwargs,
)
if args.vae_ckpt_path is not None:
vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
# Keep VAE in float32 for better quality
vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, torch.float32)
Review comment (Contributor):

@zRzRzRzRzRzRzR This is a bit of a breaking change. The SAT VAE is in fp32, but the diffusers-format VAE is in bf16/fp16. This can lead to poorer quality, so it is best to keep the VAE in fp32 and let users decide what configuration to use. I will open a PR to the other CogVideoX model weight repositories with the updated VAE weights soon.

cc @yiyixuxu @DN6 The VAE quality doesn't take too much of a hit, but it is best to have the default in FP32 and update all existing checkpoints. Apologies that this slipped through earlier, but I definitely notice very minor differences in quality (at least in training, cc @sayakpaul). The transformer weights don't use variants because there are no FP32 weights; training is done in BF16.
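A minimal sketch of the user-side configuration the comment points to, assuming a converted checkpoint whose VAE is stored in fp32 (the checkpoint id is illustrative): load the pipeline at the transformer's recommended dtype and upcast only the VAE.

```python
# Sketch only: transformer and text encoder in bf16, VAE kept in fp32 for decode quality.
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-1.5-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.vae.to(torch.float32)  # undo the bf16 cast applied by from_pretrained, for the VAE only
pipe.enable_model_cpu_offload()
```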


text_encoder_id = "google/t5-v1_1-xxl"
tokenizer = T5Tokenizer.from_pretrained(text_encoder_id, model_max_length=TOKENIZER_MAX_LENGTH)
text_encoder = T5EncoderModel.from_pretrained(text_encoder_id, cache_dir=args.text_encoder_cache_dir)

if args.typecast_text_encoder:
text_encoder = text_encoder.to(dtype=dtype)

# Apparently, the conversion does not work anymore without this :shrug:
for param in text_encoder.parameters():
param.data = param.data.contiguous()
@@ -276,11 +333,6 @@ def get_args():
scheduler=scheduler,
)

if args.fp16:
Review comment (Contributor):

Due to the explanation above, we shouldn't typecast all weights in the pipeline. The VAE is best in FP32, the text encoder could be saved in FP32 but also works well at lower precisions, and the transformer is either in BF16, or FP16 for CogVideoX-2B text-to-video.

Reply (Contributor, Author):

Understood, this is the right thing to do.

pipe = pipe.to(dtype=torch.float16)
if args.bf16:
pipe = pipe.to(dtype=torch.bfloat16)

# We don't use variant here because the model must be run in fp16 (2B) or bf16 (5B). It would be weird
# for users to specify variant when the default is not fp32 and they want to run with the correct default (which
# is either fp16/bf16 here).
38 changes: 29 additions & 9 deletions src/diffusers/models/embeddings.py
@@ -338,6 +338,7 @@ class CogVideoXPatchEmbed(nn.Module):
def __init__(
self,
patch_size: int = 2,
patch_size_t: Optional[int] = None,
in_channels: int = 16,
embed_dim: int = 1920,
text_embed_dim: int = 4096,
@@ -355,6 +356,7 @@ def __init__(
super().__init__()

self.patch_size = patch_size
self.patch_size_t = patch_size_t
self.embed_dim = embed_dim
self.sample_height = sample_height
self.sample_width = sample_width
@@ -366,9 +368,15 @@ def __init__(
self.use_positional_embeddings = use_positional_embeddings
self.use_learned_positional_embeddings = use_learned_positional_embeddings

self.proj = nn.Conv2d(
in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias
)
if patch_size_t is None:
# CogVideoX 1.0 checkpoints
self.proj = nn.Conv2d(
in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias
)
else:
# CogVideoX 1.5 checkpoints
self.proj = nn.Linear(in_channels * patch_size * patch_size * patch_size_t, embed_dim)

self.text_proj = nn.Linear(text_embed_dim, embed_dim)

if use_positional_embeddings or use_learned_positional_embeddings:
@@ -407,12 +415,24 @@ def forward(self, text_embeds: torch.Tensor, image_embeds: torch.Tensor):
"""
text_embeds = self.text_proj(text_embeds)

batch, num_frames, channels, height, width = image_embeds.shape
image_embeds = image_embeds.reshape(-1, channels, height, width)
image_embeds = self.proj(image_embeds)
image_embeds = image_embeds.view(batch, num_frames, *image_embeds.shape[1:])
image_embeds = image_embeds.flatten(3).transpose(2, 3) # [batch, num_frames, height x width, channels]
image_embeds = image_embeds.flatten(1, 2) # [batch, num_frames x height x width, channels]
batch_size, num_frames, channels, height, width = image_embeds.shape

if self.patch_size_t is None:
image_embeds = image_embeds.reshape(-1, channels, height, width)
image_embeds = self.proj(image_embeds)
image_embeds = image_embeds.view(batch_size, num_frames, *image_embeds.shape[1:])
image_embeds = image_embeds.flatten(3).transpose(2, 3) # [batch, num_frames, height x width, channels]
image_embeds = image_embeds.flatten(1, 2) # [batch, num_frames x height x width, channels]
else:
p = self.patch_size
p_t = self.patch_size_t

image_embeds = image_embeds.permute(0, 1, 3, 4, 2)
image_embeds = image_embeds.reshape(
batch_size, num_frames // p_t, p_t, height // p, p, width // p, p, channels
)
image_embeds = image_embeds.permute(0, 1, 3, 5, 7, 2, 4, 6).flatten(4, 7).flatten(1, 3)
image_embeds = self.proj(image_embeds)

embeds = torch.cat(
[text_embeds, image_embeds], dim=1
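A small shape check for the new `patch_size_t` branch above (values are illustrative and chosen only to keep the tensors tiny): with `patch_size=2` and `patch_size_t=2`, each 2x2x2 spatio-temporal patch of the latent video is flattened and then projected by the `nn.Linear` layer.

```python
# Standalone sketch of the CogVideoX 1.5 patchify reshape above; not the module itself.
import torch

batch_size, num_frames, channels, height, width = 1, 4, 16, 6, 8
p, p_t = 2, 2  # patch_size, patch_size_t

x = torch.randn(batch_size, num_frames, channels, height, width)
x = x.permute(0, 1, 3, 4, 2)
x = x.reshape(batch_size, num_frames // p_t, p_t, height // p, p, width // p, p, channels)
x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).flatten(4, 7).flatten(1, 3)

# Shape is (batch, (F/p_t) * (H/p) * (W/p), C * p_t * p * p) == (1, 24, 128),
# which matches the in_features of nn.Linear(in_channels * patch_size * patch_size * patch_size_t, embed_dim).
print(x.shape)
```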
58 changes: 47 additions & 11 deletions src/diffusers/models/transformers/cogvideox_transformer_3d.py
@@ -170,14 +170,16 @@ class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
Whether to flip the sin to cos in the time embedding.
time_embed_dim (`int`, defaults to `512`):
Output dimension of timestep embeddings.
ofs_embed_dim (`int`, defaults to `None`):
Dimension of the "ofs" embedding used by the CogVideoX 1.5-5B Image-to-Video (I2V) checkpoints; when `None`, no ofs embedding is created.
text_embed_dim (`int`, defaults to `4096`):
Input dimension of text embeddings from the text encoder.
num_layers (`int`, defaults to `30`):
The number of layers of Transformer blocks to use.
dropout (`float`, defaults to `0.0`):
The dropout probability to use.
attention_bias (`bool`, defaults to `True`):
Whether or not to use bias in the attention projection layers.
Whether to use bias in the attention projection layers.
sample_width (`int`, defaults to `90`):
The width of the input latents.
sample_height (`int`, defaults to `60`):
@@ -198,7 +200,7 @@ class CogVideoXTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
timestep_activation_fn (`str`, defaults to `"silu"`):
Activation function to use when generating the timestep embeddings.
norm_elementwise_affine (`bool`, defaults to `True`):
Whether or not to use elementwise affine in normalization layers.
Whether to use elementwise affine in normalization layers.
norm_eps (`float`, defaults to `1e-5`):
The epsilon value to use in normalization layers.
spatial_interpolation_scale (`float`, defaults to `1.875`):
@@ -219,6 +221,7 @@ def __init__(
flip_sin_to_cos: bool = True,
freq_shift: int = 0,
time_embed_dim: int = 512,
ofs_embed_dim: Optional[int] = None,
text_embed_dim: int = 4096,
num_layers: int = 30,
dropout: float = 0.0,
@@ -227,6 +230,7 @@
sample_height: int = 60,
sample_frames: int = 49,
patch_size: int = 2,
patch_size_t: int = 2,
temporal_compression_ratio: int = 4,
max_text_seq_length: int = 226,
activation_fn: str = "gelu-approximate",
@@ -237,6 +241,7 @@
temporal_interpolation_scale: float = 1.0,
use_rotary_positional_embeddings: bool = False,
use_learned_positional_embeddings: bool = False,
patch_bias: bool = True,
):
super().__init__()
inner_dim = num_attention_heads * attention_head_dim
@@ -251,10 +256,11 @@ def __init__(
# 1. Patch embedding
self.patch_embed = CogVideoXPatchEmbed(
patch_size=patch_size,
patch_size_t=patch_size_t,
in_channels=in_channels,
embed_dim=inner_dim,
text_embed_dim=text_embed_dim,
bias=True,
bias=patch_bias,
sample_width=sample_width,
sample_height=sample_height,
sample_frames=sample_frames,
@@ -267,10 +273,16 @@ def __init__(
)
self.embedding_dropout = nn.Dropout(dropout)

# 2. Time embeddings
# 2. Time embedding and ofs embedding (only the CogVideoX 1.5-5B I2V checkpoints have the latter)

self.time_proj = Timesteps(inner_dim, flip_sin_to_cos, freq_shift)
self.time_embedding = TimestepEmbedding(inner_dim, time_embed_dim, timestep_activation_fn)

self.ofs_embedding = None

if ofs_embed_dim:
self.ofs_embedding = TimestepEmbedding(ofs_embed_dim, ofs_embed_dim, timestep_activation_fn) # same as time embeddings, for ofs

# 3. Define spatio-temporal transformers blocks
self.transformer_blocks = nn.ModuleList(
[
@@ -298,7 +310,15 @@ def __init__(
norm_eps=norm_eps,
chunk_dim=1,
)
self.proj_out = nn.Linear(inner_dim, patch_size * patch_size * out_channels)

if patch_size_t is None:
# For CogVideoX 1.0
output_dim = patch_size * patch_size * out_channels
else:
# For CogVideoX 1.5
output_dim = patch_size * patch_size * patch_size_t * out_channels

self.proj_out = nn.Linear(inner_dim, output_dim)

self.gradient_checkpointing = False

@@ -441,8 +461,21 @@ def forward(
# there might be better ways to encapsulate this.
t_emb = t_emb.to(dtype=hidden_states.dtype)
emb = self.time_embedding(t_emb, timestep_cond)
if self.ofs_embedding is not None:
emb_ofs = self.ofs_embedding(emb, timestep_cond)
emb = emb + emb_ofs

# 2. Patch embedding
p = self.config.patch_size
p_t = self.config.patch_size_t

# We know that the hidden states height and width will always be divisible by patch_size.
# But, the number of frames may not be divisible by patch_size_t. So, we pad with the beginning frames.
if p_t is not None:
remaining_frames = p_t - num_frames % p_t
first_frame = hidden_states[:, :1].repeat(1, 1 + remaining_frames, 1, 1, 1)
hidden_states = torch.cat([first_frame, hidden_states[:, 1:]], dim=1)

hidden_states = self.patch_embed(encoder_hidden_states, hidden_states)
hidden_states = self.embedding_dropout(hidden_states)

@@ -491,12 +524,15 @@ def custom_forward(*inputs):
hidden_states = self.proj_out(hidden_states)

# 5. Unpatchify
# Note: we use `-1` instead of `channels`:
# - It is okay to use `channels` for CogVideoX-2b and CogVideoX-5b (the number of input channels equals the number of output channels)
# - However, CogVideoX-5b-I2V also takes concatenated input image latents (the number of input channels is twice the number of output channels)
p = self.config.patch_size
output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
if p_t is None:
output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
else:
output = hidden_states.reshape(
batch_size, (num_frames + p_t - 1) // p_t, height // p, width // p, -1, p_t, p, p
)
output = output.permute(0, 1, 5, 4, 2, 6, 3, 7).flatten(6, 7).flatten(4, 5).flatten(1, 2)
output = output[:, remaining_frames:]

if USE_PEFT_BACKEND:
# remove `lora_scale` from each PEFT layer
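To make the temporal padding in the forward pass above concrete, here is a hedged, standalone sketch (toy shapes, not the model itself): a latent frame count that does not divide `patch_size_t` is padded by repeating the first frame, and the padded frames are dropped again after unpatchify.

```python
# Sketch only: mirrors the pad-then-trim logic around patch_size_t in the forward pass above.
import torch

p_t = 2
num_frames = 3  # not divisible by p_t
hidden_states = torch.randn(1, num_frames, 16, 8, 8)  # (batch, frames, channels, height, width)

remaining_frames = p_t - num_frames % p_t  # 1 extra frame needed
first_frame = hidden_states[:, :1].repeat(1, 1 + remaining_frames, 1, 1, 1)
hidden_states = torch.cat([first_frame, hidden_states[:, 1:]], dim=1)
print(hidden_states.shape[1])  # 4 frames, now divisible by p_t

# ... after the transformer blocks and unpatchify, the output still carries the padded frames ...
output = hidden_states  # stand-in for the unpatchified output
output = output[:, remaining_frames:]
print(output.shape[1])  # back to the original 3 frames
```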