Question about prompt template length and dropped tokens in text encoder #92

@JEONG8652

Description

Hello, thank you for your great work.

I may be misunderstanding the implementation, but I noticed what looks like an inconsistency between the prompt template length and the tokens that are actually used.

According to the image_edit task implementation, the text encoder uses a template of the form
"<|im_start|>system\nYou are a promt engineer. Based on the provided source image (first image) and target image (second image), create an interesting text prompt that can be used together with the source image to create the target image:<|im_end|><|im_start|>user{}<|vision_start|><|image_pad|><|vision_end|><|im_end|>"

where the prefix before {user_prompt} is 50 tokens.

However, after embedding the text tokens, the implementation discards the first 55 tokens of the sequence and keeps only the remainder. Since the template prefix accounts for only 50 of those tokens, it seems the first 5 tokens of the user's prompt never reach the model.

References:
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L79
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L145

So I wanted to check:

  • Is this behavior intentional (i.e., always dropping the first 50 template tokens + 5 user tokens)?
  • If intentional, what is the reasoning behind discarding the first 5 user-prompt tokens?
  • Should users expect the effective prompt to start only from token 6?

Any clarification would be appreciated.
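To illustrate the mismatch concretely, here is a minimal toy sketch (not the actual `text_embedders.py` code; the constants and placeholder token names are assumptions based on the counts described above) showing what slicing off 55 tokens does when the template prefix is only 50 tokens long:

```python
# Hypothetical illustration of the reported off-by-5 drop.
PREFIX_LEN = 50  # tokens in the system/template prefix, as counted in this issue
DROP_LEN = 55    # tokens discarded after embedding, per text_embedders.py

# Toy token sequence: the template prefix followed by a 10-token user prompt.
tokens = [f"tpl_{i}" for i in range(PREFIX_LEN)] + [f"user_{i}" for i in range(10)]

kept = tokens[DROP_LEN:]            # what the model actually sees
lost_user = DROP_LEN - PREFIX_LEN   # user-prompt tokens that never reach the model

print(lost_user)  # 5
print(kept[0])    # 'user_5' -- the effective prompt starts at the 6th user token
```

If the intended behavior is to strip only the template prefix, the slice would presumably need to match the tokenized prefix length exactly rather than a hard-coded 55.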
