Dreambooth finetune FLUX dev CLIPTextModel #10925

### Describe the bug

ValueError: Sequence length must be less than max_position_embeddings (got `sequence length`: 77 and max_position_embeddings: 0

I used four A100 GPUs to run a full fine-tune of the FLUX.1-dev model, following https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_flux.md

I used the toy dog dataset (5 images) for fine-tuning and ran into the max_position_embeddings error above, which is raised from CLIPTextModel.


### Reproduction

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/AIGC/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 1812, in <module>
[rank1]:     main(args)
[rank1]:   File "/data/AIGC/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 1351, in main
[rank1]:     instance_prompt_hidden_states, instance_pooled_prompt_embeds, instance_text_ids = compute_text_embeddings(
[rank1]:   File "/data/AIGC/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 1339, in compute_text_embeddings
[rank1]:     prompt_embeds, pooled_prompt_embeds, text_ids = encode_prompt(
[rank1]:   File "/data/AIGC/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 963, in encode_prompt
[rank1]:     pooled_prompt_embeds = _encode_prompt_with_clip(
[rank1]:   File "/data/AIGC/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 937, in _encode_prompt_with_clip
[rank1]:     prompt_embeds = text_encoder(text_input_ids.to(device), output_hidden_states=False)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 1056, in forward
[rank1]:     return self.text_model(
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 947, in forward
[rank1]:     hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/root/anaconda3/envs/flux/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 283, in forward
[rank1]:     raise ValueError(
[rank1]: ValueError: Sequence length must be less than max_position_embeddings (got `sequence length`: 77 and max_position_embeddings: 0

I tried overriding max_position_embeddings when loading the CLIPTextModel, but the same error is raised:
  text_encoder_one = class_one.from_pretrained(
      args.pretrained_model_name_or_path, subfolder="text_encoder",
      revision=args.revision, variant=args.variant,
      max_position_embeddings=77, ignore_mismatched_sizes=True,
  )
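
The override passes a config value, but the `0` in the error message may come from somewhere else. As far as I can tell from the traceback (`modeling_clip.py`, line 283, inside the embeddings module), the check compares the sequence length against the first dimension of the position-embedding weight tensor rather than against `config.max_position_embeddings`. If that reading is right, a weight that is empty on the current rank would trigger the error no matter what the config says. One possible, unconfirmed cause is a distributed setup that shards parameters across ranks. The sketch below is a small diagnostic, not a fix: `position_embedding_report` is a hypothetical helper, and the attribute path it reads is an assumption based on the module names in the traceback. It uses a stand-in object so it runs without loading a model.

```python
from types import SimpleNamespace


def position_embedding_report(text_encoder):
    """Compare the configured and the materialized position-embedding sizes.

    Assumes (not confirmed by this report) the attribute layout suggested by
    the traceback: `config.max_position_embeddings` and
    `text_model.embeddings.position_embedding.weight`.
    """
    configured = text_encoder.config.max_position_embeddings
    weight = text_encoder.text_model.embeddings.position_embedding.weight
    weight_shape = tuple(weight.shape)
    return {
        "configured": configured,
        "weight_shape": weight_shape,
        # A first dimension of 0 means no position rows are materialized
        # in this process, whatever the config says.
        "looks_empty": len(weight_shape) == 0 or weight_shape[0] == 0,
    }


# Stand-in object that mimics the failing state (config says 77, weight is
# empty) so the sketch runs without downloading a model. In the training
# script, pass `text_encoder_one` instead.
fake_encoder = SimpleNamespace(
    config=SimpleNamespace(max_position_embeddings=77),
    text_model=SimpleNamespace(
        embeddings=SimpleNamespace(
            position_embedding=SimpleNamespace(
                weight=SimpleNamespace(shape=(0,))
            )
        )
    ),
)

report = position_embedding_report(fake_encoder)
print(report)
```

In the training script, calling `position_embedding_report(text_encoder_one)` right after `from_pretrained` and again just before `encode_prompt` would show whether the weight is already empty at load time or becomes empty later.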

I launch training with the following command:

export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-flux"

accelerate launch train_dreambooth_flux.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub

### Logs

```shell

```

### System Info

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.16
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.29.1
- Transformers version: 4.49.0
- Accelerate version: 1.4.0
- PEFT version: 0.14.0
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: 4x NVIDIA A100-SXM4-40GB, 40960 MiB each
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

_No response_
