
Commit 6f2f9be

fix comments, .md, show warning when crop prompt

1 parent 4ee05ce

File tree

6 files changed: +183 -235 lines

docs/source/en/api/pipelines/kandinsky5_image.md

Lines changed: 21 additions & 19 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team and Kandinsky Lab Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
@@ -9,9 +9,7 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 5.0 Image
 
-Kandinsky 5.0 Image is created by the Kandinsky team: Nikolay Vaulin, Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
-
-Kandinsky 5.0 is a family of diffusion models for Video & Image generation.
+[Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.
 
 Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters)
 
@@ -29,20 +27,15 @@ The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.
 Kandinsky 5.0 Image Lite:
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
-| **kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers** | 6B image Supervised Fine-Tuned model | Highest generation quality |
-| **kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers** | 6B image editing Supervised Fine-Tuned model | Highest generation quality |
-| **kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers** | 6B image Base pretrained model | Research and fine-tuning |
-| **kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers** | 6B image editing Base pretrained model | Research and fine-tuning |
-
-## Kandinsky5T2IPipeline
-
-[[autodoc]] Kandinsky5T2IPipeline
-- all
-- __call__
+| [**kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers) | 6B image Supervised Fine-Tuned model | Highest generation quality |
+| [**kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers) | 6B image editing Supervised Fine-Tuned model | Highest generation quality |
+| [**kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers) | 6B image Base pretrained model | Research and fine-tuning |
+| [**kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers) | 6B image editing Base pretrained model | Research and fine-tuning |
 
 ## Usage Examples
 
 ### Basic Text-to-Image Generation
+
 ```python
 import torch
 from diffusers import Kandinsky5T2IPipeline
@@ -65,11 +58,7 @@ output = pipe(
 ).image[0]
 ```
 
-## Kandinsky5I2IPipeline
-
-[[autodoc]] Kandinsky5I2IPipeline
-- all
-- __call__
+### Basic Image-to-Image Generation
 
 ```python
 import torch
@@ -99,6 +88,19 @@ output = pipe(
 ```
 
 
+## Kandinsky5T2IPipeline
+
+[[autodoc]] Kandinsky5T2IPipeline
+- all
+- __call__
+
+## Kandinsky5I2IPipeline
+
+[[autodoc]] Kandinsky5I2IPipeline
+- all
+- __call__
+
+
 ## Citation
 ```bibtex
 @misc{kandinsky2025,
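
The hunks above show only the edges of the relocated usage examples. For orientation, a minimal end-to-end sketch of the basic text-to-image call follows, assuming the sft checkpoint from the table; the prompt, resolution, and step count are illustrative, and `.image[0]` mirrors the indexing shown in the hunk rather than the `.images[0]` most diffusers pipelines use:

```python
import torch
from diffusers import Kandinsky5T2IPipeline

# Checkpoint id taken from the model table above; dtype and device are assumptions.
pipe = Kandinsky5T2IPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Illustrative generation parameters, not taken from the diff.
output = pipe(
    prompt="A red vintage convertible parked by the sea at sunset",
    height=1024,
    width=1024,
    num_inference_steps=50,
).image[0]

output.save("t2i_output.png")
```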

docs/source/en/api/pipelines/kandinsky5_video.md

Lines changed: 16 additions & 16 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+<!--Copyright 2025 The HuggingFace Team and Kandinsky Lab Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
@@ -9,9 +9,7 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 5.0 Video
 
-Kandinsky 5.0 Video is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
-
-Kandinsky 5.0 is a family of diffusion models for Video & Image generation.
+[Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.
 
 Kandinsky 5.0 Lite is a line-up of lightweight video generation models (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.
 
@@ -27,7 +25,7 @@ The model introduces several key innovations:
 The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).
 
 > [!TIP]
-> Check out the [AI Forever](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
+> Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
 
 ## Available Models
 
@@ -49,11 +47,6 @@ Kandinsky 5.0 T2V Lite:
 | **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5 second Base pretrained model | Research and fine-tuning |
 | **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10 second Base pretrained model | Research and fine-tuning |
 
-## Kandinsky5T2VPipeline
-
-[[autodoc]] Kandinsky5T2VPipeline
-- all
-- __call__
 
 ## Usage Examples
 
@@ -171,13 +164,8 @@ output = pipe(
 export_to_video(output, "output.mp4", fps=24, quality=9)
 ```
 
-## Kandinsky5I2VPipeline
-
-[[autodoc]] Kandinsky5I2VPipeline
-- all
-- __call__
 
-## Usage Examples
+### Basic Image-to-Video Generation
 **⚠️ Warning!** All Pro models should be run with `pipeline.enable_model_cpu_offload()`
 ```python
 import torch
@@ -297,6 +285,18 @@ The evaluation is based on the expanded prompts from the [Movie Gen benchmark](h
 
 </table>
 
+## Kandinsky5T2VPipeline
+
+[[autodoc]] Kandinsky5T2VPipeline
+- all
+- __call__
+
+## Kandinsky5I2VPipeline
+
+[[autodoc]] Kandinsky5I2VPipeline
+- all
+- __call__
+
 
 ## Citation
 ```bibtex
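
The warning in the image-to-video hunk refers to diffusers' standard offloading hook, `enable_model_cpu_offload()`, which keeps only the active submodule on the GPU at the cost of some speed. A minimal sketch of its use; the checkpoint id is hypothetical and the generation parameters are assumptions, with the output handling mirroring the doc's own `export_to_video` call:

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Hypothetical checkpoint id for illustration; see the model table in the doc.
pipe = Kandinsky5T2VPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers",
    torch_dtype=torch.bfloat16,
)

# Swaps submodules between CPU and GPU during inference to cap peak VRAM;
# this is what the Pro-model warning above asks for.
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A paper boat drifting down a rain-soaked street at dusk",
    num_inference_steps=50,
)

export_to_video(output, "output.mp4", fps=24, quality=9)
```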

src/diffusers/pipelines/kandinsky5/pipeline_kandinsky.py

Lines changed: 33 additions & 12 deletions
@@ -120,13 +120,14 @@ class Kandinsky5T2VPipeline(DiffusionPipeline, KandinskyLoraLoaderMixin):
         transformer ([`Kandinsky5Transformer3DModel`]):
             Conditional Transformer to denoise the encoded video latents.
         vae ([`AutoencoderKLHunyuanVideo`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
+            Variational Auto-Encoder Model [hunyuanvideo-community/HunyuanVideo (vae)](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) to encode and decode videos to and from latent representations.
         text_encoder ([`Qwen2_5_VLForConditionalGeneration`]):
-            Frozen text-encoder (Qwen2.5-VL).
+            Frozen text-encoder [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct).
         tokenizer ([`AutoProcessor`]):
             Tokenizer for Qwen2.5-VL.
         text_encoder_2 ([`CLIPTextModel`]):
-            Frozen CLIP text encoder.
+            Frozen [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
+            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
         tokenizer_2 ([`CLIPTokenizer`]):
             Tokenizer for CLIP.
         scheduler ([`FlowMatchEulerDiscreteScheduler`]):
@@ -315,12 +316,32 @@ def _encode_prompt_qwen(
         dtype = dtype or self.text_encoder.dtype
 
         full_texts = [self.prompt_template.format(p) for p in prompt]
+        max_allowed_len = self.prompt_template_encode_start_idx + max_sequence_length
+
+        untruncated_ids = self.tokenizer(
+            text=full_texts,
+            images=None,
+            videos=None,
+            return_tensors="pt",
+            padding="longest",
+        )["input_ids"]
+
+        if untruncated_ids.shape[-1] > max_allowed_len:
+            for i, text in enumerate(full_texts):
+                tokens = untruncated_ids[i][self.prompt_template_encode_start_idx:-2]
+                removed_text = self.tokenizer.decode(tokens[max_sequence_length - 2:])
+                if len(removed_text) > 0:
+                    full_texts[i] = text[:-len(removed_text)]
+                    logger.warning(
+                        "The following part of your input was truncated because `max_sequence_length` is set to "
+                        f"{max_sequence_length} tokens: {removed_text}"
+                    )
 
         inputs = self.tokenizer(
             text=full_texts,
             images=None,
             videos=None,
-            max_length=max_sequence_length + self.prompt_template_encode_start_idx,
+            max_length=max_allowed_len,
             truncation=True,
             return_tensors="pt",
             padding=True,
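
The block added here decodes the overflow tokens and trims the raw prompt string, so the warning shows exactly what will be cut before the real tokenizer pass truncates. The pattern is easy to exercise standalone; below is a minimal sketch with an off-the-shelf tokenizer, where the model id and budget are arbitrary assumptions and the pipeline's chat-template offset and `-2` end-token bookkeeping are simplified away:

```python
from transformers import AutoTokenizer

# Assumptions: any HF tokenizer demonstrates the pattern; the budget is arbitrary.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
max_sequence_length = 8  # token budget for the prompt body
template_start_idx = 0   # the pipeline skips its chat-template prefix; none here

texts = ["a watercolor fox leaping over mossy stones in morning fog"]
untruncated = tokenizer(texts, padding="longest", return_tensors="pt")["input_ids"]
max_allowed_len = template_start_idx + max_sequence_length

if untruncated.shape[-1] > max_allowed_len:
    for i, text in enumerate(texts):
        tokens = untruncated[i][template_start_idx:]
        removed = tokenizer.decode(tokens[max_sequence_length:], skip_special_tokens=True)
        if removed:
            # Surface the decoded tail, like the pipeline's logger.warning above.
            print(f"Truncated (max_sequence_length={max_sequence_length}): {removed!r}")
```

As in the hunk, warning with the decoded tail rather than a token count alone keeps the message meaningful even when one token maps to several characters.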
@@ -481,6 +502,7 @@ def check_inputs(
         prompt_cu_seqlens=None,
         negative_prompt_cu_seqlens=None,
         callback_on_step_end_tensor_inputs=None,
+        max_sequence_length=None,
     ):
         """
         Validate input parameters for the pipeline.
@@ -501,6 +523,10 @@
         Raises:
             ValueError: If inputs are invalid
         """
+
+        if max_sequence_length is not None and max_sequence_length > 1024:
+            raise ValueError("`max_sequence_length` must be at most 1024.")
+
         if height % 16 != 0 or width % 16 != 0:
             raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")
 
@@ -622,11 +648,6 @@ def guidance_scale(self):
         """Get the current guidance scale value."""
         return self._guidance_scale
 
-    @property
-    def do_classifier_free_guidance(self):
-        """Check if classifier-free guidance is enabled."""
-        return self._guidance_scale > 1.0
-
     @property
     def num_timesteps(self):
         """Get the number of denoising timesteps."""
@@ -664,7 +685,6 @@ def __call__(
         ] = None,
         callback_on_step_end_tensor_inputs: List[str] = ["latents"],
         max_sequence_length: int = 512,
-        **kwargs,
     ):
         r"""
         The call function to the pipeline for generation.
@@ -729,6 +749,7 @@
             prompt_cu_seqlens=prompt_cu_seqlens,
             negative_prompt_cu_seqlens=negative_prompt_cu_seqlens,
             callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
+            max_sequence_length=max_sequence_length,
         )
 
         if num_frames % self.vae_scale_factor_temporal != 1:
@@ -762,7 +783,7 @@
             dtype=dtype,
         )
 
-        if self.do_classifier_free_guidance:
+        if self.guidance_scale > 1.0:
             if negative_prompt is None:
                 negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
 
@@ -847,7 +868,7 @@
             return_dict=True,
         ).sample
 
-        if self.do_classifier_free_guidance and negative_prompt_embeds_qwen is not None:
+        if self.guidance_scale > 1.0 and negative_prompt_embeds_qwen is not None:
             uncond_pred_velocity = self.transformer(
                 hidden_states=latents.to(dtype),
                 encoder_hidden_states=negative_prompt_embeds_qwen.to(dtype),
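
With the `do_classifier_free_guidance` property removed, the inline `self.guidance_scale > 1.0` comparison is the single switch for the unconditional pass. The mixing step it gates is not shown in this diff; as a reminder, the standard classifier-free guidance combination looks like the sketch below, with stand-in tensors and variable names echoing the hunk:

```python
import torch

guidance_scale = 5.0  # assumption; values > 1.0 enable the unconditional pass

# Stand-ins for the two transformer outputs in the hunk above.
pred_velocity = torch.randn(1, 16, 32, 32)         # conditional prediction
uncond_pred_velocity = torch.randn(1, 16, 32, 32)  # unconditional prediction

if guidance_scale > 1.0:
    # Standard CFG: move the prediction away from the unconditional output
    # in the direction of the conditional one, scaled by guidance_scale.
    pred_velocity = uncond_pred_velocity + guidance_scale * (
        pred_velocity - uncond_pred_velocity
    )
```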
