
[Model] support LTX-2 text-to-video image-to-video #841

Open
david6666666 wants to merge 35 commits into vllm-project:main from david6666666:ltx2

Conversation

david6666666 (Collaborator) commented Jan 19, 2026:


Purpose

Support LTX-2 text-to-video and image-to-video generation. Reference implementation: huggingface/diffusers#12915.

Test Plan

t2v:

python text_to_video.py \
  --model "/workspace/models/Lightricks/LTX-2" \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --negative-prompt "vivid colors, overexposed, static, blurred details, subtitles, style, artwork, painting, still frame, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, many people in the background, walking backwards" \
  --height 512 --width 768 --num_frames 121 \
  --num-inference-steps 40 --guidance-scale 4.0 \
  --frame-rate 24 --fps 24 \
  --seed 0 \
  --enable-cpu-offload \
  --output ltx2_t2v_diff.mp4

diffusers:

import torch
from diffusers.pipelines.ltx2 import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video

generator = torch.Generator("cuda").manual_seed(0)
pipe = LTX2Pipeline.from_pretrained("/workspace/models/Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A cinematic close-up of ocean waves at golden hour."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

frame_rate = 24.0
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    output_type="np",
    generator=generator,
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_sample.mp4",
)

i2v:

python examples/offline_inference/image_to_video/image_to_video.py \
  --model "/workspace/models/Lightricks/LTX-2" \
  --model_class_name "LTX2ImageToVideoPipeline" \
  --image astronaut.jpg \
  --prompt "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot." \
  --negative-prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." \
  --height 512 --width 768 --num-frames 121 \
  --num-inference-steps 40 --guidance-scale 4.0 \
  --frame-rate 24 \
  --seed 0 \
  --output ltx2_i2v_diff.mp4

diffusers:

import torch
from diffusers.pipelines.ltx2 import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image

generator = torch.Generator("cuda").manual_seed(0)
pipe = LTX2ImageToVideoPipeline.from_pretrained("/workspace/models/Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = load_image(
    "./astronaut.jpg"
)
prompt = "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot."
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

frame_rate = 24.0
video, audio = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    output_type="np",
    generator=generator,
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_i2v.mp4",
)

online serving:

vllm serve "/workspace/models/Lightricks/LTX-2" \
  --omni \
  --port 8093 \
  --model-class-name LTX2ImageToVideoPipeline

curl -X POST http://localhost:8093/v1/videos \
  -H "Accept: application/json" \
  -F 'prompt=An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera'\''s movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot.' \
  -F 'negative_prompt=shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static.' \
  -F 'input_reference=@astronaut.jpg' \
  -F 'width=384' \
  -F 'height=256' \
  -F 'num_frames=121' \
  -F 'fps=24' \
  -F 'num_inference_steps=40' \
  -F 'guidance_scale=4.0' \
  -F 'seed=0' \
  | jq -r '.data[0].b64_json' | base64 -d > ltx2_i2v_diff1.mp4
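For programmatic access, the same request can be issued from Python. This is a minimal sketch that assumes only what the curl example shows: the `/v1/videos` endpoint, the multipart field names, and a response of the form `{"data": [{"b64_json": ...}]}`. The helper names and the `requests` dependency are illustrative, not part of this PR.

```python
import base64


def build_video_form(prompt: str, negative_prompt: str, **params) -> dict:
    """Assemble the multipart form fields for a /v1/videos request."""
    fields = {"prompt": prompt, "negative_prompt": negative_prompt}
    # Numeric parameters travel as strings in multipart form data.
    fields.update({key: str(value) for key, value in params.items()})
    return fields


def decode_first_video(response_json: dict) -> bytes:
    """Extract and base64-decode the first video in the response."""
    return base64.b64decode(response_json["data"][0]["b64_json"])


def post_video(base_url: str, form: dict, image_path: str) -> bytes:
    """POST the request and return the decoded video bytes (not invoked here)."""
    import requests  # third-party; pip install requests

    with open(image_path, "rb") as image:
        resp = requests.post(
            f"{base_url}/v1/videos",
            data=form,
            files={"input_reference": image},
            headers={"Accept": "application/json"},
        )
    resp.raise_for_status()
    return decode_first_video(resp.json())


form = build_video_form(
    "An astronaut hatches from a fragile egg on the surface of the Moon.",
    "shaky, glitchy, low quality",
    width=384, height=256, num_frames=121, fps=24,
    num_inference_steps=40, guidance_scale=4.0, seed=0,
)
# With the server from the command above running:
# video_bytes = post_video("http://localhost:8093", form, "astronaut.jpg")
# open("ltx2_i2v_client.mp4", "wb").write(video_bytes)
```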

Test Result

t2v:

ltx2_t2v_diff.mp4

i2v:

ltx2_i2v_diff.mp4

A100-80G, height=256, width=384 (baseline 39s):

  • cache-dit: 39s -> 26s
  • ulysses_degree 2: 39s -> 38s
  • ring_degree 2: 39s -> 38s
  • cfg 2: 39s -> 29s
  • tp 2: 39s -> 38s


Checklist

LTX-2

  • Accuracy alignment achieved
  • Support TI2V inference
  • Ensure structural and code-style consistency across modules
  • Support joint audio generation
  • Support SP
  • Support TP
  • Support CFG parallel
  • Validate Cache-DiT
  • Validate CPU offloading
  • Clean up code
  • Fix comments
  • Support online serving

LTX-2 follow-up PRs:

  • Two stages
  • Text encoder Gemma 3: support TP (tensor parallelism)
  • Performance optimization


@david6666666 david6666666 force-pushed the ltx2 branch 6 times, most recently from cb1a09e to 3f3a885 Compare January 21, 2026 09:17
@david6666666 david6666666 added this to the v0.14.0 milestone Jan 26, 2026
@david6666666 david6666666 force-pushed the ltx2 branch 4 times, most recently from 5c4a679 to 72bb6c8 Compare January 27, 2026 08:59
@david6666666 david6666666 marked this pull request as ready for review January 27, 2026 09:33
david6666666 (Collaborator, Author) commented:

@ZJY0516 @SamitHuang @wtomin please take a look, thanks.

chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 346be1b2ba


Comment on lines +544 to +552
sp_size = getattr(self.od_config.parallel_config, "sequence_parallel_size", 1)
if sp_size > 1 and latent_length < sp_size:
    pad_len = sp_size - latent_length
    if latents is not None:
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
    latent_length = sp_size


P2: Pad audio latents for sequence-parallel sharding

When sequence_parallel_size > 1, the LTX2 transformer shards audio_hidden_states with SequenceParallelInput (auto-pad is off), so the sequence length must be evenly divisible across ranks. Here prepare_audio_latents only pads when latent_length < sp_size, but it does nothing when latent_length is larger yet not divisible (e.g., default 121 frames @ 24fps → latent_length≈126, sp_size=4). That yields uneven shards and will fail during all‑gather or produce mismatched audio in SP runs. Consider padding latent_length up to the next multiple of sp_size (or enabling auto‑pad in the SP plan) instead of only handling the < sp_size case.
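The reviewer's suggestion, rounding the latent length up to the next multiple of sp_size, can be sketched as follows. This is a standalone illustration with hypothetical helper names, not the vllm-omni code:

```python
def padded_latent_length(latent_length: int, sp_size: int) -> int:
    """Smallest multiple of sp_size that is >= latent_length."""
    if sp_size <= 1:
        return latent_length
    return ((latent_length + sp_size - 1) // sp_size) * sp_size


def pad_amount(latent_length: int, sp_size: int) -> int:
    """How many zero frames to append along the sequence dimension."""
    return padded_latent_length(latent_length, sp_size) - latent_length


# The reviewer's example: latent_length 126 with sp_size 4 pads by 2 to 128,
# whereas the original `latent_length < sp_size` check would pad nothing.
assert padded_latent_length(126, 4) == 128
```

In `prepare_audio_latents`, `pad_amount(latent_length, sp_size)` zero frames would then be concatenated along dim=2 whether `latent_length` is above or below `sp_size`, so every rank receives an equal shard.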


Copilot AI (Contributor) left a comment:

Pull request overview

This pull request adds comprehensive support for the LTX-2 (Lightricks) text-to-video and image-to-video models with integrated audio generation capabilities, aligning with the diffusers library implementation (PR #12915).

Changes:

  • Implements LTX2 text-to-video and image-to-video pipelines with joint audio generation
  • Adds LTX2VideoTransformer3DModel with audio-video cross-attention blocks
  • Integrates cache-dit support for LTX2 transformer blocks
  • Extends example scripts to handle audio output alongside video frames

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.

Summary per file:

  • vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py: Core LTX2 text-to-video pipeline with audio generation support
  • vllm_omni/diffusion/models/ltx2/pipeline_ltx2_image2video.py: LTX2 image-to-video pipeline with conditioning mask and audio
  • vllm_omni/diffusion/models/ltx2/ltx2_transformer.py: Audio-visual transformer with a2v/v2a cross-attention blocks and RoPE
  • vllm_omni/diffusion/models/ltx2/__init__.py: Module exports for LTX2 components
  • vllm_omni/diffusion/registry.py: Registers LTX2 pipeline classes and post-processing functions
  • vllm_omni/diffusion/request.py: Adds audio_latents, frame_rate, output_type, and decode parameters
  • vllm_omni/diffusion/diffusion_engine.py: Extends engine to extract and route audio payloads from dict outputs
  • vllm_omni/entrypoints/omni_diffusion.py: Allows model_class_name override for custom pipeline selection
  • vllm_omni/entrypoints/async_omni_diffusion.py: Allows model_class_name override in async entrypoint
  • vllm_omni/diffusion/cache/cache_dit_backend.py: Adds cache-dit support for LTX2 transformer blocks
  • examples/offline_inference/text_to_video/text_to_video.py: Enhanced to handle LTX2 audio+video output and encode_video export
  • examples/offline_inference/text_to_video/text_to_video.md: Documents LTX2 usage example with frame_rate and audio_sample_rate
  • examples/offline_inference/image_to_video/image_to_video.py: Enhanced for LTX2 I2V with audio output and model class override
Comments suppressed due to low confidence (1)

examples/offline_inference/text_to_video/text_to_video.py:100

  • This assignment to 'parallel_config' is unnecessary as it is redefined before this value is used.
    parallel_config = DiffusionParallelConfig(


Comment on lines +39 to +40
class LTX2ImageToVideoPipeline(LTX2Pipeline):
    support_image_input = True
Copilot AI commented Jan 28, 2026:

The LTX2ImageToVideoPipeline should also inherit from SupportAudioOutput and declare support_audio_output = True. Although it inherits support_image_input from the pattern in the codebase, it also produces audio output like its parent LTX2Pipeline.

Both class variables should be declared:

  • support_image_input = True (already present)
  • support_audio_output = True (missing)

And the class should inherit from both protocols:
class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):

Note: Once LTX2Pipeline properly inherits from SupportAudioOutput, this class will inherit it automatically, but it's clearer to be explicit about all supported interfaces.

Suggested change:

-class LTX2ImageToVideoPipeline(LTX2Pipeline):
-    support_image_input = True
+class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
+    support_image_input = True
+    support_audio_output = True

Comment on lines +105 to +110

# Configure parallel settings (only SP is supported for Wan)
# Note: cfg_parallel and tensor_parallel are not implemented for Wan models
parallel_config = DiffusionParallelConfig(
    ulysses_degree=args.ulysses_degree,
    ring_degree=args.ring_degree,
)
Copilot AI commented Jan 28, 2026:

The parallel_config is defined twice with identical content (lines 100-103 and lines 107-110). This is redundant code duplication. Remove one of these duplicate blocks.

The comment also mentions "only SP is supported for Wan" which may not be accurate for all models in this script (e.g., LTX2).

    num_inference_steps=args.num_inference_steps,
    num_frames=args.num_frames,
    frame_rate=frame_rate,
    enable_cpu_offload=True,
Copilot AI commented Jan 28, 2026:

The enable_cpu_offload parameter is hardcoded to True in the generate call, but it should respect the command-line argument args.enable_cpu_offload. This overrides the user's choice and always enables CPU offloading.

Change to: enable_cpu_offload=args.enable_cpu_offload,

Suggested change:

-    enable_cpu_offload=True,
+    enable_cpu_offload=args.enable_cpu_offload,

return mu


class LTX2Pipeline(nn.Module):
Copilot AI commented Jan 28, 2026:

The LTX2Pipeline class should inherit from SupportAudioOutput and declare support_audio_output = True as a class variable. This is necessary for the diffusion engine to properly identify that this pipeline produces audio output and handle it correctly.

The pattern is established in other audio-producing pipelines like StableAudioPipeline (see vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py:61). Without this, the supports_audio_output() check in diffusion_engine.py:32-36 will return False, causing audio output to be incorrectly handled.

Add the import: from vllm_omni.diffusion.models.interface import SupportAudioOutput
And update the class declaration to: class LTX2Pipeline(nn.Module, SupportAudioOutput):
Then add: support_audio_output = True as a class variable.

Comment on lines +375 to +379

    width,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
Copilot AI commented Jan 28, 2026:

The overridden method's signature does not match the call, which passes too many arguments; the overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
The overridden method's signature does not match the call, which passes an argument named 'image'; the overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
The overridden method's signature does not match the call, which passes an argument named 'latents'; the overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.

Suggested change:

     width,
+    image=None,
+    latents=None,
     prompt_embeds=None,
     negative_prompt_embeds=None,
     prompt_attention_mask=None,
     negative_prompt_attention_mask=None,
+    **kwargs,

    dtype: torch.dtype | None = None,
    device: torch.device | None = None,
    generator: torch.Generator | None = None,
    latents: torch.Tensor | None = None,
Copilot AI commented Jan 28, 2026:

The overridden method's signature does not match the call, which passes too many arguments; the overriding method LTX2ImageToVideoPipeline.prepare_latents matches the call.

Suggested change:

     latents: torch.Tensor | None = None,
+    *args: Any,
+    **kwargs: Any,

Comment on lines +129 to +140

def check_inputs(
    self,
    image,
    height,
    width,
    prompt,
    latents=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
):
Copilot AI commented Jan 28, 2026:

This method requires at least 5 positional arguments, whereas overridden LTX2Pipeline.check_inputs may be called with 4. This call correctly calls the base method, but does not match the signature of the overriding method.

Comment on lines +232 to +233

except Exception:
    pass
Copilot AI commented Jan 28, 2026:

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

-except Exception:
-    pass
+except Exception as exc:  # noqa: BLE001
+    # If ring-parallel utilities are unavailable or misconfigured,
+    # fall back to using the unsharded attention_mask.
+    logger.debug(
+        "Failed to shard attention mask for sequence parallelism; "
+        "continuing without sharding: %s",
+        exc,
+    )

@@ -2,11 +2,12 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""
A contributor left a comment:

Update this model name in docs/models/supported_models.md, and if acceleration methods are applicable, update this model's name in docs/user_guide/diffusion/diffusion_acceleration.md and docs/user_guide/diffusion/parallelism_acceleration.md .

@david6666666 david6666666 removed this from the v0.14.0 milestone Jan 28, 2026
@david6666666 david6666666 linked an issue Jan 29, 2026 that may be closed by this pull request
1 task
@david6666666 david6666666 force-pushed the ltx2 branch 6 times, most recently from c2dc5df to 84e0305 Compare February 2, 2026 09:17
@hsliuustc0106 hsliuustc0106 requested a review from Copilot February 2, 2026 11:42
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--tensor_parallel_size`: tensor parallel size (effective for models that support TP, e.g. LTX2).
A collaborator commented:

How about the other inference examples?

david6666666 (Collaborator, Author) commented Feb 12, 2026:

I will rebase the code and add online video serving support after the holiday, on Feb 24.

lishunyang12 (Contributor) commented:

@david6666666 Hey, the LTX-2 T2V/I2V support looks solid with the transformer, pipeline, and scheduler all ported from the diffusers PR. Are you still testing this? Any issues with the video generation quality or the 17-file rebase against current main?

david6666666 and others added 4 commits February 27, 2026 15:31
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
david6666666 removed the 'ready' label (trigger for buildkite CI) Feb 27, 2026
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
lishunyang12 (Contributor) left a comment:

Left a few comments. The core model port looks thorough. Main concerns are around duplicated code and a variable shadowing issue in the engine.

)


def _unwrap_request_tensor(value: Any) -> Any:

_unwrap_request_tensor and _get_prompt_field are duplicated verbatim from pipeline_ltx2.py. Since this file already imports from .pipeline_ltx2, just import these too instead of redefining them.

output_idx = end_idx

if supports_audio_output(self.od_config.model_class_name):
    audio_payload = request_outputs[0] if len(request_outputs) == 1 else request_outputs

audio_payload is set at function scope from the dict output, then re-assigned inside the loop when supports_audio_output() is true. This shadowing is fragile -- use a different variable name for the per-request audio (e.g. request_audio_payload) to keep the two sources distinct.
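A toy illustration of why the shadowing is fragile (the function and variable names here are illustrative, not the actual engine code): once the loop re-assigns the function-scope name, a later request that lacks its own audio silently inherits the previous request's audio instead of the dict-level payload.

```python
def route_shadowed(dict_audio, per_request_audio):
    """Re-uses one name for both sources, as flagged in the review."""
    audio_payload = dict_audio  # function-scope payload from the dict output
    routed = []
    for req_audio in per_request_audio:
        if req_audio is not None:
            audio_payload = req_audio  # clobbers the dict-level payload
        routed.append(audio_payload)
    return routed


def route_distinct(dict_audio, per_request_audio):
    """Distinct per-request name keeps the two sources separate."""
    routed = []
    for req_audio in per_request_audio:
        request_audio_payload = req_audio if req_audio is not None else dict_audio
        routed.append(request_audio_payload)
    return routed


# route_shadowed("dict", ["req0", None]) -> ["req0", "req0"]
#   (the second request wrongly receives the first request's audio)
# route_distinct("dict", ["req0", None]) -> ["req0", "dict"]
```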

sample (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
The hidden states output conditioned on the `encoder_hidden_states` input, representing the visual output
of the model. This is typically a video (spatiotemporal) output.
audio_sample (`torch.Tensor` of shape `(batch_size, TODO)`):

Nit: docstring says audio_sample shape but the field name and usage suggest this describes the output format. Verify the shape description matches the actual output dimensions.

# LTX2 blocks return (hidden_states, audio_hidden_states)
forward_pattern=ForwardPattern.Pattern_0,
# Treat audio_hidden_states as encoder_hidden_states in Pattern_0
check_forward_pattern=False,

check_forward_pattern=False with ForwardPattern.Pattern_0 -- does Pattern_0 handle the dual-tensor return (hidden_states, audio_hidden_states) from LTX2 blocks correctly, or does cache-dit only cache the first element? Worth a comment explaining what happens to the audio branch during cached steps.

Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
@david6666666 david6666666 force-pushed the ltx2 branch 2 times, most recently from 592b426 to 3ec6b0f Compare February 28, 2026 04:46
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
hsliuustc0106 (Collaborator) left a comment:

PR #841 Review: [Model] Support LTX-2 text-to-video image-to-video

Overview

This PR adds support for LTX-2, a text-to-video and image-to-video model from Lightricks that generates both video and audio. It includes SP, TP, CFG parallel support, and Cache-DiT optimization.

Features Supported ✅

  • Text-to-Video (T2V)
  • Image-to-Video (I2V)
  • Audio joint generation
  • Sequence Parallel (SP)
  • Tensor Parallel (TP)
  • CFG Parallel
  • Cache-DiT

Performance Results ✅

A100-80G (height=256, width=384):

Config    | Time | Improvement
Base      | 39s  | -
Cache-DiT | 26s  | 33% faster
CFG 2     | 29s  | 26% faster

MRO Pattern Check ✅

No MRO issues detected. Classes follow proper inheritance order (nn.Module first).


Important Issues: 2 found

1. No Unit Tests for LTX2 Model

With 4480 lines of new code, having tests for the core transformer and pipeline functionality would be valuable.

2. Hardcoded Audio Sample Rate in Serving

audio_sample_rate = 24000 is hardcoded. Should come from vocoder config.


Suggestions

  1. Address TODO comment at ltx2_transformer.py:1198
  2. Consider consolidating fps and frame_rate fields in inputs/data.py

Strengths

  • ✅ Comprehensive T2V/I2V implementation with audio support
  • ✅ Good performance optimizations (33% faster with Cache-DiT)
  • ✅ Proper parallelism support (SP, TP, CFG)
  • ✅ Clean architecture with clear class hierarchy
  • ✅ Documentation updated

Recommendation

Add basic unit tests and fix hardcoded audio sample rate, then ready for merge.


result = await self._run_generation(prompt, gen_params, request_id, raw_request)
videos = self._extract_video_outputs(result)
audios = self._extract_audio_outputs(result, expected_count=len(videos))

Hardcoded audio sample rate

Consider getting the sample rate from the vocoder config instead of hardcoding:

# Instead of:
audio_sample_rate = 24000

# Use:
audio_sample_rate = self.engine.model.vocoder.config.output_sampling_rate

This ensures consistency if the model uses a different sample rate.

freqs = freqs.transpose(-1, -2).flatten(2) # [B, num_patches, self.dim // 2]

# 5. Get real, interleaved (cos, sin) frequencies, padded to self.dim
# TODO: consider implementing this as a utility and reuse in `connectors.py`.

TODO comment

Consider addressing this before merge or creating a tracking issue for the utility refactoring.

height: int | None = None
width: int | None = None
fps: int | None = None
frame_rate: float | None = None

Duplicate fields: fps vs frame_rate

Both fps: int | None and frame_rate: float | None exist. Consider consolidating these to avoid confusion, or document why both are needed (e.g., fps for video encoding, frame_rate for model inference).



Development

Successfully merging this pull request may close these issues.

[New Model]: Lightricks/LTX-2

6 participants