Added support for long captions (75+ tokens) for SDXL #740
FennelFetish started this conversation in Show and tell
Hello, I implemented chunking for long captions to overcome the 75-token limit of CLIP text encoders. For that I added two pipeline modules in MGDS, `ChunkTokenize` and `ChunkText`, and updated `EncodeClipText` and `DecodeTokens`.
I only tested it with SDXL so far. The modules are added to the dataloader and I also had to update the decoding of tokens for the debug prompt output. The (small) changes for OneTrainer are in this commit.
I left the debug prints in there for now but disabled them.
It doesn't work for sampling yet.
I first looked at @celll1's code, then at the implementation in Kohya's SD Scripts, but was intimidated by the Japanese comments. Then I thought it would probably make sense to encode the tokens the same way as the image generation apps we are training for, so I looked at ComfyUI's code: `comfy/sd1_clip.py` - `tokenize_with_weights()`
It does something very interesting:
Essentially, it tries to keep weight-groups together and avoids splitting them across chunks. It adds padding instead and places the tokens together into the next chunk.
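The idea can be sketched in a few lines of Python. This is a hypothetical, simplified version of the greedy grouping behavior described above; `chunk_token_groups`, `chunk_size`, and `pad_token` are illustrative names, not ComfyUI's actual API.

```python
# Hypothetical sketch of the greedy grouping idea, not ComfyUI's actual code.
# Each token group stays intact: if it doesn't fit into the current chunk,
# the chunk is padded out and the group starts the next chunk instead.
# Groups longer than chunk_size are assumed to be pre-split.
def chunk_token_groups(groups, chunk_size=75, pad_token=0):
    chunks, current = [], []
    for group in groups:
        if current and len(current) + len(group) > chunk_size:
            # pad the current chunk instead of splitting the group
            current += [pad_token] * (chunk_size - len(current))
            chunks.append(current)
            current = []
        current = current + list(group)
    if current:
        current += [pad_token] * (chunk_size - len(current))
        chunks.append(current)
    return chunks
```

With `chunk_size=5`, the groups `[1,2,3]`, `[4,5]`, `[6,7,8,9]` become `[[1,2,3,4,5], [6,7,8,9,0]]`: the last group moves whole into a new chunk rather than being split across the boundary.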
This is why I added `ChunkText`: it splits the prompt by multiple delimiters (`,.:;`) and returns a list of strings; `ChunkTokenize` then tries to keep these parts together. This avoids splitting tags across chunks, because `black pants` is not the same as `black BREAK pants` (in A1111's terms).

In `ChunkTokenize`, a maximum padding length can be specified (default `max_pad_length=7`), and parts longer than that (8+ tokens) are split anyway. In the last chunk, the maximum padding is set to 2, which encourages splitting to fill the chunk but still tries to exclude truncated tags (`black ...`).

I made two modules out of this procedure so the method for splitting the text can be replaced, but I'm not sure if that's worthwhile. I also thought about a priority hierarchy for the text parts, with low priority for sentences and high priority for single words: sentences may be split eagerly, tags not so much, and words should never be split.
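For reference, the delimiter splitting could look roughly like this. This is a hedged sketch of the idea only; `split_prompt` is an illustrative name, not the real `ChunkText` interface.

```python
import re

# Hypothetical sketch of the ChunkText idea: split the prompt at tag/sentence
# delimiters so each part can later be kept whole inside one token chunk.
# The delimiter stays attached to the part that precedes it.
def split_prompt(text, delimiters=",.:;"):
    pattern = "([" + re.escape(delimiters) + "])"
    parts, current = [], ""
    for piece in re.split(pattern, text):
        current += piece
        if piece and piece in delimiters:
            if current.strip():
                parts.append(current.strip())
            current = ""
    if current.strip():
        parts.append(current.strip())
    return parts
```

For example, `"black pants, red shirt. smiling"` splits into `["black pants,", "red shirt.", "smiling"]`, so `black pants` can never end up half in one chunk and half in the next.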
Another option to explore is to try changing the order of tags and fill the chunk with a later, smaller tag.
The number of chunks is specified by `max_num_chunks` (1: 75 tokens, 2: 150, 3: 225). And since adding padding instead of splitting wastes space, this should be accounted for when choosing the chunk count (opt for a larger value).

In a batch, all embeddings need to be padded to the same number of chunks. This is not required when the batch size is 1 (even when using accumulation steps). I didn't know how to retrieve the batch size, so currently it's always padded. Making the chunk count dynamic would require re-caching the embeddings when the batch size is increased from 1.
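A minimal sketch of that batch alignment, assuming plain Python lists and all-zero padding chunks (the names and shapes are illustrative, not the actual MGDS code):

```python
# Hypothetical sketch: pad every sample in a batch to the same number of
# chunks, so the stacked embeddings have a uniform shape. Padding chunks
# are filled with zeros and get an all-zero attention mask.
def pad_to_max_chunks(batch_chunks, batch_masks, chunk_size=75):
    max_chunks = max(len(c) for c in batch_chunks)
    for chunks, masks in zip(batch_chunks, batch_masks):
        while len(chunks) < max_chunks:
            chunks.append([0] * chunk_size)  # padding chunk
            masks.append([0] * chunk_size)   # fully masked out
    return batch_chunks, batch_masks
```

With a batch size of 1 this padding is unnecessary, which is exactly the case the text above describes: the per-sample chunk count could stay dynamic if the batch size were known at caching time.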
I ignored the attention mask from the tokenizer because I only saw `1` for all tokens. I'm building a new mask with `0`s for padding; padding chunks are completely masked with `0`s. Could this mask be augmented with different variations for the padding, to teach the model to ignore it?
A possible result for the tokens and mask looks like this (for max 3 chunks where the prompt only needs 2):
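As an illustration of the shape only (shortened to 7-token chunks for readability; real chunks are 77 tokens, and the token IDs and pad token here are assumptions, not the actual output):

```python
# Illustrative sketch: the mask is 1 for real tokens and 0 for padding;
# the unused third chunk is padding-only and fully masked.
def build_mask(token_chunks, pad_token=0):
    return [[0 if t == pad_token else 1 for t in chunk] for chunk in token_chunks]

tokens = [
    [49406, 320, 1125, 267, 49407, 0, 0],  # chunk 1: real tokens + trailing pad
    [49406, 2866, 49407, 0, 0, 0, 0],      # chunk 2: real tokens + trailing pad
    [0, 0, 0, 0, 0, 0, 0],                 # chunk 3: padding-only chunk
]
mask = build_mask(tokens)
```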
Not sure if that's the correct way. ComfyUI's implementation does not use an attention mask by default, and generated images differ depending on the mask. I saw celll1 do it in yet another way.
There's a comment in Kohya's code saying "I don't know if this is correct". Is there even a correct way?
The resulting embeddings are up to 3 times larger. This means almost 1 MB per prompt per variant.
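A quick back-of-envelope check of that figure, assuming fp16 caching and SDXL's concatenated text-encoder hidden size of 768 + 1280 = 2048 (these assumptions are mine, not stated above):

```python
# Rough size estimate for one cached prompt embedding at max chunk count.
tokens_per_chunk = 77        # 75 tokens + BOS/EOS
max_chunks = 3
hidden_size = 768 + 1280     # CLIP-L + OpenCLIP-bigG hidden states, concatenated
bytes_per_value = 2          # fp16, assumed
size = max_chunks * tokens_per_chunk * hidden_size * bytes_per_value
print(size / 2**20)          # ≈ 0.90 MiB per prompt
```

which is consistent with "almost 1 MB per prompt per variant" at three chunks.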
I tested it with 25 epochs on 1400 images with most captions >75 tokens and some >150. The result looks better than with truncated captions for the same dataset and it trained faster. No systematic comparison though, only subjective.
What needs to be done further? Can you help me prepare this for a pull request?
In `StableDiffusionXLBaseDataLoader`, what does `add_text_encoder_1_embeddings_to_prompt()` do exactly? Can the result be treated as strings, and is it compatible with `ChunkText`?

References:
#450 (Support for Token Lengths Exceeding 75 Tokens in Text Encoder, by @celll1)
https://github.com/celll1/OneTrainer/commits/dev/
https://github.com/celll1/mgds/commits/dev/
https://medium.com/@natsunoyuki/using-long-prompts-with-the-diffusers-package-with-prompt-embeddings-819657943050
huggingface/diffusers#2136 (comment)
Relevant code in Kohya Scripts:
https://github.com/kohya-ss/sd-scripts/blob/6e3c1d0b58f03522f294dc2b0acbbbecc944d018/library/train_util.py#L843
https://github.com/kohya-ss/sd-scripts/blob/main/library/train_util.py#L4836