
Commit 6f2f9be

fix comments, .md, show warning when crop prompt

1 parent 4ee05ce

File tree

6 files changed: +183 -235 lines

docs/source/en/api/pipelines/kandinsky5_image.md

Lines changed: 21 additions & 19 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team and Kandinsky Lab Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
@@ -9,9 +9,7 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 5.0 Image
 
-Kandinsky 5.0 Image is created by the Kandinsky team: Nikolay Vaulin, Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
-
-Kandinsky 5.0 is a family of diffusion models for Video & Image generation.
+[Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.
 
 Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters)
 
@@ -29,20 +27,15 @@ The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.
 Kandinsky 5.0 Image Lite:
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
-| **kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers** | 6B image Supervised Fine-Tuned model | Highest generation quality |
-| **kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers** | 6B image editing Supervised Fine-Tuned model | Highest generation quality |
-| **kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers** | 6B image Base pretrained model | Research and fine-tuning |
-| **kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers** | 6B image editing Base pretrained model | Research and fine-tuning |
-
-## Kandinsky5T2IPipeline
-
-[[autodoc]] Kandinsky5T2IPipeline
-- all
-- __call__
+| [**kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers) | 6B image Supervised Fine-Tuned model | Highest generation quality |
+| [**kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers) | 6B image editing Supervised Fine-Tuned model | Highest generation quality |
+| [**kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers) | 6B image Base pretrained model | Research and fine-tuning |
+| [**kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers) | 6B image editing Base pretrained model | Research and fine-tuning |
 
 ## Usage Examples
 
 ### Basic Text-to-Image Generation
+
 ```python
 import torch
 from diffusers import Kandinsky5T2IPipeline
@@ -65,11 +58,7 @@ output = pipe(
 ).image[0]
 ```
 
-## Kandinsky5I2IPipeline
-
-[[autodoc]] Kandinsky5I2IPipeline
-- all
-- __call__
+### Basic Image-to-Image Generation
 
 ```python
 import torch
@@ -99,6 +88,19 @@ output = pipe(
 ```
 
 
+## Kandinsky5T2IPipeline
+
+[[autodoc]] Kandinsky5T2IPipeline
+- all
+- __call__
+
+## Kandinsky5I2IPipeline
+
+[[autodoc]] Kandinsky5I2IPipeline
+- all
+- __call__
+
+
 ## Citation
 ```bibtex
 @misc{kandinsky2025,
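
The hunks above show only the edges of the relocated usage examples. For orientation, a minimal end-to-end sketch of the basic text-to-image call follows, assuming the sft checkpoint from the table; the prompt, resolution, and step count are illustrative, and `.image[0]` mirrors the indexing shown in the hunk rather than the `.images[0]` most diffusers pipelines use:

```python
import torch
from diffusers import Kandinsky5T2IPipeline

# Checkpoint id taken from the model table above; dtype and device are assumptions.
pipe = Kandinsky5T2IPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Illustrative generation parameters, not taken from the diff.
output = pipe(
    prompt="A red vintage convertible parked by the sea at sunset",
    height=1024,
    width=1024,
    num_inference_steps=50,
).image[0]

output.save("t2i_output.png")
```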

docs/source/en/api/pipelines/kandinsky5_video.md

Lines changed: 16 additions & 16 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team and Kandinsky Lab Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
@@ -9,9 +9,7 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 5.0 Video
 
-Kandinsky 5.0 Video is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
-
-Kandinsky 5.0 is a family of diffusion models for Video & Image generation.
+[Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.
 
 Kandinsky 5.0 Lite is a line-up of lightweight video generation models (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.
 
@@ -27,7 +25,7 @@ The model introduces several key innovations:
 The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).
 
 > [!TIP]
-> Check out the [AI Forever](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
+> Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
 
 ## Available Models
 
@@ -49,11 +47,6 @@ Kandinsky 5.0 T2V Lite:
 | **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5 second Base pretrained model | Research and fine-tuning |
 | **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10 second Base pretrained model | Research and fine-tuning |
 
-## Kandinsky5T2VPipeline
-
-[[autodoc]] Kandinsky5T2VPipeline
-- all
-- __call__
 
 ## Usage Examples
 
@@ -171,13 +164,8 @@ output = pipe(
 export_to_video(output, "output.mp4", fps=24, quality=9)
 ```
 
-## Kandinsky5I2VPipeline
-
-[[autodoc]] Kandinsky5I2VPipeline
-- all
-- __call__
 
-## Usage Examples
+### Basic Image-to-Video Generation
 **⚠️ Warning!** All Pro models should be run with `pipeline.enable_model_cpu_offload()`
 ```python
 import torch
@@ -297,6 +285,18 @@ The evaluation is based on the expanded prompts from the [Movie Gen benchmark](h
 
 </table>
 
+## Kandinsky5T2VPipeline
+
+[[autodoc]] Kandinsky5T2VPipeline
+- all
+- __call__
+
+## Kandinsky5I2VPipeline
+
+[[autodoc]] Kandinsky5I2VPipeline
+- all
+- __call__
+
 
 ## Citation
 ```bibtex
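
The warning in the image-to-video hunk refers to diffusers' standard offloading hook, `enable_model_cpu_offload()`, which keeps only the active submodule on the GPU at the cost of some speed. A minimal sketch of its use; the checkpoint id is hypothetical and the generation parameters are assumptions, with the output handling mirroring the doc's own `export_to_video` call:

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Hypothetical checkpoint id for illustration; see the model table in the doc.
pipe = Kandinsky5T2VPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# Swaps submodules between CPU and GPU during inference to cap peak VRAM;
# this is what the Pro-model warning above asks for.
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A paper boat drifting down a rain-soaked street at dusk",
    num_inference_steps=50,
)

export_to_video(output, "output.mp4", fps=24, quality=9)
```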

src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py

Lines changed: 33 additions & 12 deletions
@@ -120,13 +120,14 @@ class Kandinsky5T2VPipeline(DiffusionPipeline, KandinskyLoraLoaderMixin):
         transformer ([`Kandinsky5Transformer3DModel`]):
             Conditional Transformer to denoise the encoded video latents.
         vae ([`AutoencoderKLHunyuanVideo`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
+            Variational Auto-Encoder Model [hunyuanvideo-community/HunyuanVideo (vae)](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) to encode and decode videos to and from latent representations.
         text_encoder ([`Qwen2_5_VLForConditionalGeneration`]):
-            Frozen text-encoder (Qwen2.5-VL).
+            Frozen text-encoder [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
         tokenizer ([`AutoProcessor`]):
             Tokenizer for Qwen2.5-VL.
         text_encoder_2 ([`CLIPTextModel`]):
-            Frozen CLIP text encoder.
+            Frozen [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
+            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
         tokenizer_2 ([`CLIPTokenizer`]):
             Tokenizer for CLIP.
         scheduler ([`FlowMatchEulerDiscreteScheduler`]):
@@ -315,12 +316,32 @@ def _encode_prompt_qwen(
         dtype = dtype or self.text_encoder.dtype
 
         full_texts = [self.prompt_template.format(p) for p in prompt]
+        max_allowed_len = self.prompt_template_encode_start_idx + max_sequence_length
+
+        untruncated_ids = self.tokenizer(
+            text=full_texts,
+            images=None,
+            videos=None,
+            return_tensors="pt",
+            padding="longest",
+        )["input_ids"]
+
+        if untruncated_ids.shape[-1] > max_allowed_len:
+            for i, text in enumerate(full_texts):
+                tokens = untruncated_ids[i][self.prompt_template_encode_start_idx:-2]
+                removed_text = self.tokenizer.decode(tokens[max_sequence_length - 2:])
+                if len(removed_text) > 0:
+                    full_texts[i] = text[:-len(removed_text)]
+                    logger.warning(
+                        "The following part of your input was truncated because `max_sequence_length` is set to "
+                        f"{max_sequence_length} tokens: {removed_text}"
+                    )
 
         inputs = self.tokenizer(
             text=full_texts,
             images=None,
             videos=None,
-            max_length=max_sequence_length + self.prompt_template_encode_start_idx,
+            max_length=max_allowed_len,
             truncation=True,
             return_tensors="pt",
             padding=True,
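
The block added here decodes the overflow tokens and trims the raw prompt string, so the warning shows exactly what will be cut before the real tokenizer pass truncates. The pattern is easy to exercise standalone; below is a minimal sketch with an off-the-shelf tokenizer, where the model id and budget are arbitrary assumptions and the pipeline's chat-template offset and `-2` end-token bookkeeping are simplified away:

```python
from transformers import AutoTokenizer

# Assumptions: any HF tokenizer demonstrates the pattern; the budget is arbitrary.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
max_sequence_length = 8  # token budget for the prompt body
template_start_idx = 0   # the pipeline skips its chat-template prefix; none here

texts = ["a watercolor fox leaping over mossy stones in morning fog"]
untruncated = tokenizer(texts, padding="longest", return_tensors="pt")["input_ids"]
max_allowed_len = template_start_idx + max_sequence_length

if untruncated.shape[-1] > max_allowed_len:
    for i, text in enumerate(texts):
        tokens = untruncated[i][template_start_idx:]
        removed = tokenizer.decode(tokens[max_sequence_length:], skip_special_tokens=True)
        if removed:
            # Surface the decoded tail, like the pipeline's logger.warning above.
            print(f"Truncated (max_sequence_length={max_sequence_length}): {removed!r}")
```

As in the hunk, warning with the decoded tail rather than a token count alone keeps the message meaningful even when one token maps to several characters.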
@@ -481,6 +502,7 @@ def check_inputs(
         prompt_cu_seqlens=None,
         negative_prompt_cu_seqlens=None,
         callback_on_step_end_tensor_inputs=None,
+        max_sequence_length=None,
     ):
         """
         Validate input parameters for the pipeline.
@@ -501,6 +523,10 @@
         Raises:
             ValueError: If inputs are invalid
         """
+
+        if max_sequence_length is not None and max_sequence_length > 1024:
+            raise ValueError("`max_sequence_length` must be at most 1024.")
+
         if height % 16 != 0 or width % 16 != 0:
             raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")
 
@@ -622,11 +648,6 @@ def guidance_scale(self):
         """Get the current guidance scale value."""
         return self._guidance_scale
 
-    @property
-    def do_classifier_free_guidance(self):
-        """Check if classifier-free guidance is enabled."""
-        return self._guidance_scale > 1.0
-
     @property
     def num_timesteps(self):
         """Get the number of denoising timesteps."""
@@ -664,7 +685,6 @@ def __call__(
         ] = None,
         callback_on_step_end_tensor_inputs: List[str] = ["latents"],
         max_sequence_length: int = 512,
-        **kwargs,
     ):
         r"""
         The call function to the pipeline for generation.
@@ -729,6 +749,7 @@
             prompt_cu_seqlens=prompt_cu_seqlens,
             negative_prompt_cu_seqlens=negative_prompt_cu_seqlens,
             callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
+            max_sequence_length=max_sequence_length,
         )
 
         if num_frames % self.vae_scale_factor_temporal != 1:
@@ -762,7 +783,7 @@
             dtype=dtype,
         )
 
-        if self.do_classifier_free_guidance:
+        if self.guidance_scale > 1.0:
             if negative_prompt is None:
                 negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
 
@@ -847,7 +868,7 @@
             return_dict=True,
         ).sample
 
-        if self.do_classifier_free_guidance and negative_prompt_embeds_qwen is not None:
+        if self.guidance_scale > 1.0 and negative_prompt_embeds_qwen is not None:
             uncond_pred_velocity = self.transformer(
                 hidden_states=latents.to(dtype),
                 encoder_hidden_states=negative_prompt_embeds_qwen.to(dtype),
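
With the `do_classifier_free_guidance` property removed, the inline `self.guidance_scale > 1.0` comparison is the single switch for the unconditional pass. The mixing step it gates is not shown in this diff; as a reminder, the standard classifier-free guidance combination looks like the sketch below, with stand-in tensors and variable names echoing the hunk:

```python
import torch

guidance_scale = 5.0  # assumption; values > 1.0 enable the unconditional pass

# Stand-ins for the two transformer outputs in the hunk above.
pred_velocity = torch.randn(1, 16, 32, 32)         # conditional prediction
uncond_pred_velocity = torch.randn(1, 16, 32, 32)  # unconditional prediction

if guidance_scale > 1.0:
    # Standard CFG: move the prediction away from the unconditional output
    # in the direction of the conditional one, scaled by guidance_scale.
    pred_velocity = uncond_pred_velocity + guidance_scale * (
        pred_velocity - uncond_pred_velocity
    )
```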
