Added support for long captions (75+ tokens) for SDXL #740
FennelFetish started this conversation in Show and tell
Hello, I implemented chunking for long captions to overcome the 75-token limit of CLIP text encoders. For that I added two pipeline modules in MGDS, `ChunkTokenize` and `ChunkText`, and updated `EncodeClipText` and `DecodeTokens`.
I only tested it with SDXL so far. The modules are added to the dataloader and I also had to update the decoding of tokens for the debug prompt output. The (small) changes for OneTrainer are in this commit.
I left the debug prints in there for now but disabled them.
It doesn't work for sampling yet.
I first looked at @celll1's code, then at the implementation in Kohya's SD Scripts, but was intimidated by the Japanese comments. Then I thought it would probably make sense to encode the tokens the same way as the image generation apps we are training for, so I looked at ComfyUI's code: `comfy/sd1_clip.py` - `tokenize_with_weights()`
It does something very interesting:
Essentially, it tries to keep weight-groups together and avoids splitting them across chunks. It adds padding instead and places the tokens together into the next chunk.
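The idea can be sketched in a few lines of Python. This is a hypothetical, simplified version of the greedy grouping behavior described above; `chunk_token_groups`, `chunk_size`, and `pad_token` are illustrative names, not ComfyUI's actual API.

```python
# Hypothetical sketch of the greedy grouping idea, not ComfyUI's actual code.
# Each token group stays intact: if it doesn't fit into the current chunk,
# the chunk is padded out and the group starts the next chunk instead.
# Groups longer than chunk_size are assumed to be pre-split.
def chunk_token_groups(groups, chunk_size=75, pad_token=0):
    chunks, current = [], []
    for group in groups:
        if current and len(current) + len(group) > chunk_size:
            # pad the current chunk instead of splitting the group
            current += [pad_token] * (chunk_size - len(current))
            chunks.append(current)
            current = []
        current = current + list(group)
    if current:
        current += [pad_token] * (chunk_size - len(current))
        chunks.append(current)
    return chunks
```

With `chunk_size=5`, the groups `[1,2,3]`, `[4,5]`, `[6,7,8,9]` become `[[1,2,3,4,5], [6,7,8,9,0]]`: the last group moves whole into a new chunk rather than being split across the boundary.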
This is why I added `ChunkText`: it splits the prompt by multiple delimiters (`,.:;`) and returns a list of strings; `ChunkTokenize` then tries to keep these parts together. This avoids splitting tags across chunks, because `black pants` is not the same as `black BREAK pants` (in A1111's terms).

In `ChunkTokenize`, a maximum padding length can be specified (default `max_pad_length=7`), and parts longer than that (8+ tokens) are split anyway. In the last chunk, the maximum padding is set to 2, which encourages splitting to fill the chunk but still tries to exclude truncated tags (`black ...`).

I made two modules out of this procedure so the method for splitting the text can be replaced, but I'm not sure if that's worthwhile. I also thought about a priority hierarchy for the text parts, with low priority for sentences and high priority for single words: sentences may be split eagerly, tags not so much, and words should never be split.
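For reference, the delimiter splitting could look roughly like this. This is a hedged sketch of the idea only; `split_prompt` is an illustrative name, not the real `ChunkText` interface.

```python
import re

# Hypothetical sketch of the ChunkText idea: split the prompt at tag/sentence
# delimiters so each part can later be kept whole inside one token chunk.
# The delimiter stays attached to the part that precedes it.
def split_prompt(text, delimiters=",.:;"):
    pattern = "([" + re.escape(delimiters) + "])"
    parts, current = [], ""
    for piece in re.split(pattern, text):
        current += piece
        if piece and piece in delimiters:
            if current.strip():
                parts.append(current.strip())
            current = ""
    if current.strip():
        parts.append(current.strip())
    return parts
```

For example, `"black pants, red shirt. smiling"` splits into `["black pants,", "red shirt.", "smiling"]`, so `black pants` can never end up half in one chunk and half in the next.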
Another option to explore is to try changing the order of tags and fill the chunk with a later, smaller tag.
The number of chunks is specified by `max_num_chunks` (1: 75 tokens, 2: 150, 3: 225). And since adding padding instead of splitting wastes space, this should be accounted for when choosing the chunk count (opt for a larger value).

In a batch, all embeddings need to be padded to the same number of chunks. This is not required when the batch size is 1 (even when using accumulation steps). I didn't know how to retrieve the batch size, so currently it's always padded. Making the chunk count dynamic would require re-caching the embeddings when the batch size is increased from 1.
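A minimal sketch of that batch alignment, assuming plain Python lists and all-zero padding chunks (the names and shapes are illustrative, not the actual MGDS code):

```python
# Hypothetical sketch: pad every sample in a batch to the same number of
# chunks, so the stacked embeddings have a uniform shape. Padding chunks
# are filled with zeros and get an all-zero attention mask.
def pad_to_max_chunks(batch_chunks, batch_masks, chunk_size=75):
    max_chunks = max(len(c) for c in batch_chunks)
    for chunks, masks in zip(batch_chunks, batch_masks):
        while len(chunks) < max_chunks:
            chunks.append([0] * chunk_size)  # padding chunk
            masks.append([0] * chunk_size)   # fully masked out
    return batch_chunks, batch_masks
```

With a batch size of 1 this padding is unnecessary, which is exactly the case the text above describes: the per-sample chunk count could stay dynamic if the batch size were known at caching time.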
I ignored the attention mask from the tokenizer because I only saw `1` for all tokens. I'm building a new mask with `0`s for padding; padding chunks are completely masked with `0`s. Could this mask be augmented with different variations for the padding, to teach the model to ignore it?
A possible result for the tokens and mask looks like this (for max 3 chunks where the prompt only needs 2):
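As an illustration of the shape only (shortened to 7-token chunks for readability; real chunks are 77 tokens, and the token IDs and pad token here are assumptions, not the actual output):

```python
# Illustrative sketch: the mask is 1 for real tokens and 0 for padding;
# the unused third chunk is padding-only and fully masked.
def build_mask(token_chunks, pad_token=0):
    return [[0 if t == pad_token else 1 for t in chunk] for chunk in token_chunks]

tokens = [
    [49406, 320, 1125, 267, 49407, 0, 0],  # chunk 1: real tokens + trailing pad
    [49406, 2866, 49407, 0, 0, 0, 0],      # chunk 2: real tokens + trailing pad
    [0, 0, 0, 0, 0, 0, 0],                 # chunk 3: padding-only chunk
]
mask = build_mask(tokens)
```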
Not sure if that's the correct way. ComfyUI's implementation does not use an attention mask by default, and generated images differ depending on the mask. I saw celll1 do it in yet another way.
There's a comment in Kohya's code saying "I don't know if this is correct". Is there even a correct way?
The resulting embeddings are up to 3 times larger. This means almost 1 MB per prompt per variant.
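A quick back-of-envelope check of that figure, assuming fp16 caching and SDXL's concatenated text-encoder hidden size of 768 + 1280 = 2048 (these assumptions are mine, not stated above):

```python
# Rough size estimate for one cached prompt embedding at max chunk count.
tokens_per_chunk = 77        # 75 tokens + BOS/EOS
max_chunks = 3
hidden_size = 768 + 1280     # CLIP-L + OpenCLIP-bigG hidden states, concatenated
bytes_per_value = 2          # fp16, assumed
size = max_chunks * tokens_per_chunk * hidden_size * bytes_per_value
print(size / 2**20)          # ≈ 0.90 MiB per prompt
```

which is consistent with "almost 1 MB per prompt per variant" at three chunks.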
I tested it with 25 epochs on 1400 images with most captions >75 tokens and some >150. The result looks better than with truncated captions for the same dataset and it trained faster. No systematic comparison though, only subjective.
What needs to be done further? Can you help me prepare this for a pull request?
In `StableDiffusionXLBaseDataLoader`, what does `add_text_encoder_1_embeddings_to_prompt()` do exactly? Can the result be treated as strings, and is it compatible with `ChunkText`?

References:
#450 (Support for Token Lengths Exceeding 75 Tokens in Text Encoder, by @celll1)
https://github.com/celll1/OneTrainer/commits/dev/
https://github.com/celll1/mgds/commits/dev/
https://medium.com/@natsunoyuki/using-long-prompts-with-the-diffusers-package-with-prompt-embeddings-819657943050
huggingface/diffusers#2136 (comment)
Relevant code in Kohya Scripts:
https://github.com/kohya-ss/sd-scripts/blob/6e3c1d0b58f03522f294dc2b0acbbbecc944d018/library/train_util.py#L843
https://github.com/kohya-ss/sd-scripts/blob/main/library/train_util.py#L4836