<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLLTX2Audio
The 3D variational autoencoder (VAE) model with KL loss used in [LTX-2](https://huggingface.co/Lightricks/LTX-2) was introduced by Lightricks. It is used for encoding audio into and decoding audio from latent representations.
The model can be loaded with the following code snippet.
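The sketch below follows the usual diffusers `from_pretrained` pattern; the `subfolder` value is an assumption about the checkpoint layout, so verify it against the repository before use.

```python
import torch
from diffusers import AutoencoderKLLTX2Audio

# The subfolder name is an assumption; check the Lightricks/LTX-2 repository layout.
vae = AutoencoderKLLTX2Audio.from_pretrained(
    "Lightricks/LTX-2", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")
```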
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# LTX-2
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
You can find all the original LTX-2 checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
The original codebase for LTX-2 can be found [here](https://github.com/Lightricks/LTX-2).
- The recommended dtype for the transformer, VAE, and text encoder is `torch.bfloat16`. The VAE and text encoder can also be `torch.float32` or `torch.float16`.
- For guidance-distilled variants of LTX-Video, set `guidance_scale` to `1.0`. The `guidance_scale` for any other model should be set higher, like `5.0`, for good generation quality (see the sketch after this list).
- For timestep-aware VAE variants (LTX-Video 0.9.1 and above), set `decode_timestep` to `0.05` and `image_cond_noise_scale` to `0.025`.
- For variants that support interpolation between multiple conditioning images and videos (LTX-Video 0.9.5 and above), use similar images and videos for the best results. Divergence from the conditioning inputs may lead to abrupt transitions in the generated video.
- LTX-Video 0.9.7 includes a spatial latent upscaler and a 13B parameter transformer. During inference, a low resolution video is quickly generated first and then upscaled and refined.
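As a concrete reference, here is a minimal text-to-video sketch that applies the recommendations above. The prompt and frame count are illustrative, and the checkpoint shown is the base `Lightricks/LTX-Video` repository rather than any specific distilled variant:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

# Load the pipeline in the recommended dtype.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A woman with long brown hair walks down a rain-soaked city street at night",
    num_frames=161,
    guidance_scale=5.0,    # use 1.0 for guidance-distilled variants
    decode_timestep=0.05,  # for timestep-aware VAE variants (0.9.1 and above)
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```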
The general rule of thumb to keep in mind when preparing inputs for the VACE pipeline is that the input images, or the frames of a video that you want to use for conditioning, should have a corresponding mask that is black in color. The black mask signifies that the model will not generate new content for that area and will only use those parts to condition the generation process. For parts/frames that should be generated by the model, the mask should be white in color, as illustrated in the sketch below.
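To make the convention concrete, here is a minimal sketch of a first/last-frame interpolation setup; the helper name and the gray placeholder frames are illustrative choices, not part of the pipeline API:

```python
import PIL.Image

def prepare_video_and_mask(first_img, last_img, height, width, num_frames):
    first_img = first_img.resize((width, height))
    last_img = last_img.resize((width, height))
    # Conditioning frames at the start and end; placeholder frames in between.
    frames = [first_img]
    frames += [PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames - 2)
    frames.append(last_img)
    # Black mask (0) = keep this frame for conditioning only;
    # white mask (255) = let the model generate this frame.
    black = PIL.Image.new("L", (width, height), 0)
    white = PIL.Image.new("L", (width, height), 255)
    mask = [black] + [white] * (num_frames - 2) + [black]
    return frames, mask
```

The returned lists would then be passed as the pipeline's `video` and `mask` inputs.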
### Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
[Wan-Animate](https://huggingface.co/papers/2509.14055) by the Wan Team.