Question about prompt template length and dropped tokens in text encoder #92

@JEONG8652

Description

Hello, thank you for your great work.

I may be misunderstanding the implementation, but I noticed what looks like an inconsistency between the prompt template length and the tokens that are actually used.

According to the image_edit task implementation, the text encoder uses a template of the form
"<|im_start|>system\nYou are a promt engineer. Based on the provided source image (first image) and target image (second image), create an interesting text prompt that can be used together with the source image to create the target image:<|im_end|><|im_start|>user{}<|vision_start|><|image_pad|><|vision_end|><|im_end|>"

where the prefix before {user_prompt} is 50 tokens.

However, after embedding the text tokens, the implementation discards the first 55 tokens of the sequence and keeps only the remainder. Since the template prefix accounts for only 50 of those tokens, it seems the first 5 tokens of the user's prompt never reach the model.

References:
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L79
https://github.com/kandinskylab/kandinsky-5/blob/main/kandinsky/models/text_embedders.py#L145

So I wanted to check:

  • Is this behavior intentional (i.e., always dropping the first 50 template tokens + 5 user tokens)?
  • If intentional, what is the reasoning behind discarding the first 5 user-prompt tokens?
  • Should users expect the effective prompt to start only from token 6?

Any clarification would be appreciated.
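To illustrate the mismatch concretely, here is a minimal toy sketch (not the actual `text_embedders.py` code; the constants and placeholder token names are assumptions based on the counts described above) showing what slicing off 55 tokens does when the template prefix is only 50 tokens long:

```python
# Hypothetical illustration of the reported off-by-5 drop.
PREFIX_LEN = 50  # tokens in the system/template prefix, as counted in this issue
DROP_LEN = 55    # tokens discarded after embedding, per text_embedders.py

# Toy token sequence: the template prefix followed by a 10-token user prompt.
tokens = [f"tpl_{i}" for i in range(PREFIX_LEN)] + [f"user_{i}" for i in range(10)]

kept = tokens[DROP_LEN:]            # what the model actually sees
lost_user = DROP_LEN - PREFIX_LEN   # user-prompt tokens that never reach the model

print(lost_user)  # 5
print(kept[0])    # 'user_5' -- the effective prompt starts at the 6th user token
```

If the intended behavior is to strip only the template prefix, the slice would presumably need to match the tokenized prefix length exactly rather than a hard-coded 55.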
