Question about prompt template length and dropped tokens in text encoder #92
Description
Hello! Thank you for your great work.
I may be misunderstanding the implementation, but I noticed something that looks a bit inconsistent regarding the prompt template length and the tokens that are actually used.
According to the image_edit task implementation, the text encoder uses a template of the form
```
<|im_start|>system\nYou are a promt engineer. Based on the provided source image (first image) and target image (second image), create an interesting text prompt that can be used together with the source image to create the target image:<|im_end|><|im_start|>user{}<|vision_start|><|image_pad|><|vision_end|><|im_end|>
```
where the prefix before {user_prompt} is 50 tokens.
However, after embedding the text tokens, the implementation discards the first 55 tokens and uses only the remaining part of the sequence. As a result, it seems the first 5 tokens of the user’s prompt never reach the model.
References:
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L79
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L145
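To illustrate the mismatch, here is a minimal sketch (not the actual Kandinsky code; the token strings and helper name are made up) showing what happens if the template prefix tokenizes to 50 tokens but the post-embedding slice drops 55:

```python
PREFIX_LEN = 50  # tokens in the template before {user_prompt}, per the issue
DROP_LEN = 55    # tokens discarded after embedding, per text_embedders.py

def effective_user_tokens(user_tokens):
    """Simulate which user-prompt tokens survive the slice."""
    # Full sequence: template prefix followed by the user's prompt tokens.
    sequence = [f"tpl_{i}" for i in range(PREFIX_LEN)] + list(user_tokens)
    # The implementation keeps only sequence[DROP_LEN:].
    return sequence[DROP_LEN:]

user = [f"user_{i}" for i in range(10)]
kept = effective_user_tokens(user)
# DROP_LEN - PREFIX_LEN = 5, so the first 5 user tokens never survive:
print(kept)  # ['user_5', 'user_6', 'user_7', 'user_8', 'user_9']
```

If this model of the slicing is right, every prompt effectively loses its first 5 tokens, which is exactly the behavior I'd like to confirm below.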
So I wanted to check:
- Is this behavior intentional (i.e., always dropping the first 50 template tokens + 5 user tokens)?
- If intentional, what is the reasoning behind discarding the first 5 user-prompt tokens?
- Should users expect the effective prompt to start only from token 6?
Any clarification would be appreciated.