Comprehensive documentation for temporal generative models that handle audio, speech, music, and video synthesis. These models extend diffusion and autoregressive techniques to the temporal domain, enabling high-quality generation of time-series data with complex dependencies.
CogVideoX:
- Expert transformer for text-to-video generation
- 3D causal attention (spatial + temporal)
- Progressive training strategy
- Accepted at ICLR 2025
VideoPoet:
- Large language model approach to video
- Unified tokenization for video, audio, text
- Multi-task pre-training
- Zero-shot capabilities
VALL-E:
- Neural codec language modeling for TTS
- Zero-shot voice cloning from 3-second prompt
- Autoregressive + non-autoregressive architecture
- EnCodec discrete tokens
Voicebox:
- Non-autoregressive speech generation
- Flow matching for speech synthesis
- In-context learning for voice styles
- Fast, high-quality generation
SoundStorm:
- Parallel audio generation
- Confidence-based iterative decoding
- MaskGIT-style generation
- Generates 30 seconds of audio in 0.5 seconds (on TPU-v4)
MusicGen:
- Text-to-music generation
- Controllable music synthesis
- Multi-stream transformer
- Melody conditioning
NaturalSpeech 3:
- Factorized diffusion for speech
- Disentangled prosody and content
- Neural codec integration
- State-of-the-art quality
Video-specific:
- Spatiotemporal coherence: Maintaining consistency across frames
- Long-range dependencies: Modeling motion over time
- Memory constraints: High-dimensional data (B, T, C, H, W)
- Compression: VAE for spatial, tokenization for temporal
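The memory constraint is easy to quantify with back-of-envelope arithmetic; the batch shape below is illustrative, not tied to any specific model:

```python
# Memory cost of one raw video batch (B, T, C, H, W) in fp16.
B, T, C, H, W = 4, 64, 3, 480, 720   # 64 frames of 480x720 RGB
bytes_per_el = 2                      # fp16
input_gb = B * T * C * H * W * bytes_per_el / 1024**3
# ~0.5 GB for the inputs alone; activations and gradients multiply this
# several-fold, which is why latent-space compression is essential.
```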
Audio-specific:
- High sample rates: 16-48 kHz raw audio
- Long sequences: Seconds of audio = thousands of timesteps
- Discrete representations: Neural codecs (EnCodec, SoundStream)
- Multi-scale structure: Prosody, phonemes, acoustics
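The sequence-length problem is concrete: the sketch below compares raw waveform length to codec-token length for 10 seconds of audio, using EnCodec-style figures (24 kHz, 75 frames/sec):

```python
# Sequence lengths for 10 seconds of audio, raw vs. codec frames.
seconds, sample_rate = 10, 24_000
raw_len = seconds * sample_rate         # 240,000 waveform timesteps
codec_frame_rate = 75                   # codec frames per second
codec_len = seconds * codec_frame_rate  # 750 frames after tokenization
shrink = raw_len // codec_len           # 320x shorter sequence to model
```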
| Approach | Examples | Pros | Cons |
|---|---|---|---|
| Autoregressive | VALL-E, VideoPoet | High quality, flexible | Slow sampling |
| Non-autoregressive | Voicebox, SoundStorm | Fast sampling | Training complexity |
| Diffusion | NaturalSpeech 3, CogVideoX | SOTA quality | Slow, many steps |
| Hybrid | VALL-E (AR+NAR) | Balanced | Architecture complexity |
Purpose: Compress raw audio to discrete tokens
Examples:
- EnCodec: RVQ with 8 codebooks at 75 frames/sec (24 kHz input)
- SoundStream: similar RVQ design, used in AudioLM
- DAC: Descript Audio Codec, an RVQ codec with improved fidelity
Benefits:
- 100-300x compression
- Discrete tokens enable language modeling
- Residual vector quantization (RVQ) for quality
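The RVQ idea can be sketched in a toy 1-D form: each codebook quantizes the residual error left by the previous one, so later codebooks add progressively finer detail. The codebook values below are invented for illustration:

```python
def rvq_encode(x, codebooks):
    """Residual vector quantization (toy 1-D version): each stage
    quantizes the residual left over by the previous stage."""
    indices, residual = [], x
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        i = min(range(len(cb)), key=lambda j: abs(residual - cb[j]))
        indices.append(i)
        residual -= cb[i]
    return indices, residual  # residual = remaining reconstruction error

def rvq_decode(indices, codebooks):
    # reconstruction is the sum of the selected codewords
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# 3-stage toy codec: coarse, medium, fine codebooks
codebooks = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]
idx, err = rvq_encode(0.85, codebooks)
```

Each extra codebook shrinks the error, which is why real codecs trade off bitrate against quality by varying the number of RVQ levels.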
Spatial Compression:
```
Image (3, 256, 256) -> VAE -> Latent (4, 32, 32)
```
Temporal Compression:
```
Video (T, 4, 32, 32) -> Transformer -> Tokens (T', D)
```
Full Pipeline:
```
Video -> 3D VAE -> Latent Video -> Diffusion Transformer -> Generated Latent -> 3D VAE Decoder -> Video
```
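Shape arithmetic makes the payoff concrete. Assuming 8x spatial and 4x temporal downsampling with 16 latent channels (figures in the range used by 3D VAEs, chosen here for illustration):

```python
# Element counts before and after 3D VAE compression.
T, C, H, W = 16, 3, 256, 256                  # raw video clip
lt, lc, lh, lw = T // 4, 16, H // 8, W // 8   # latent video
raw = T * C * H * W
latent = lt * lc * lh * lw
ratio = raw / latent   # the diffusion model sees 48x fewer elements
```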
For Video (CogVideoX):
- Stage 1: Image generation (T=1)
- Stage 2: Short videos (T=16)
- Stage 3: Long videos (T=64)
Benefits:
- Faster convergence
- Better motion modeling
- Reduced memory requirements
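The staged schedule amounts to a lookup from optimizer step to clip length. The stage lengths below are invented for illustration, not CogVideoX's real schedule:

```python
# (num_frames, steps_in_stage): illustrative values only
STAGES = [(1, 10_000), (16, 20_000), (64, 30_000)]

def frames_for_step(step):
    """Return the clip length to train on at a given optimizer step."""
    boundary = 0
    for num_frames, steps in STAGES:
        boundary += steps
        if step < boundary:
            return num_frames
    return STAGES[-1][0]  # stay at the longest setting afterwards
```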
For VideoPoet:
- Video generation
- Video inpainting
- Video outpainting
- Video-to-audio
- Stylization
Benefits:
- Shared representations
- Zero-shot transfer
- Better generalization
For VALL-E:
- AR Stage: Generate first codebook level
- NAR Stage: Generate remaining codebook levels
Benefits:
- Balance quality (AR) and speed (NAR)
- Coarse-to-fine generation
- Scalability
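The two-stage flow can be sketched with stub model calls; `ar_step` and `nar_level` below are hypothetical stand-ins for the real networks:

```python
def two_stage_generate(text, ar_step, nar_level, max_len=4, n_codebooks=3):
    # Stage 1 (AR): emit the first (coarsest) codebook token by token
    level0 = []
    for _ in range(max_len):
        level0.append(ar_step(text, level0))
    # Stage 2 (NAR): predict each remaining codebook level in one
    # parallel pass, conditioned on all previously generated levels
    levels = [level0]
    for q in range(1, n_codebooks):
        levels.append(nar_level(text, levels, q))
    return levels
```

Only the first level pays the sequential cost; the remaining RVQ levels are filled in with a constant number of parallel passes.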
T5/CLIP Embeddings:
```python
text_emb = t5_encoder(text_tokens)  # (B, L, D)
cross_attention(video_tokens, text_emb)
```
Classifier-Free Guidance:
```python
output = uncond_output + scale * (cond_output - uncond_output)
```
Acoustic Prompting:
```python
# Encode reference audio
prompt_tokens = codec.encode(prompt_audio)

# Generate conditioned on prompt
generated = model(text, prompt_tokens)
```
Chroma Features:
```python
melody_features = extract_chroma(melody_audio)
music = musicgen(text, melody_features)
```
Quantitative:
- FVD (Fréchet Video Distance): Distribution similarity
- IS (Inception Score): Quality and diversity
- FID (per-frame): Image quality
- CLIP Score: Text-video alignment
Qualitative:
- Temporal consistency
- Motion realism
- Object permanence
- Scene transitions
Objective Metrics:
- PESQ: Perceptual Evaluation of Speech Quality
- STOI: Short-Time Objective Intelligibility
- SECS: Speaker Encoder Cosine Similarity
Subjective (human evaluation):
- MOS (Mean Opinion Score): Overall quality rated 1-5
- Naturalness
- Prosody
- Speaker similarity
- Audio quality
Music:
- FAD (Fréchet Audio Distance): Distribution similarity
- Melodic coherence: Musical structure
- Harmonic consistency: Chord progressions
- Rhythmic accuracy: Beat alignment
Essential for long sequences:
```python
from torch.utils.checkpoint import checkpoint

def forward_block(self, block, x):
    # Trade compute for memory: drop activations in the forward pass
    # and recompute them during backward
    if self.training:
        return checkpoint(block, x, use_reentrant=False)
    return block(x)
```
Memory savings: 2-4x
Efficient attention for long sequences:
```python
import torch.nn.functional as F
from flash_attn import flash_attn_func

# Standard attention: O(N^2) memory
attn = F.scaled_dot_product_attention(q, k, v)

# FlashAttention: O(N) memory
attn = flash_attn_func(q, k, v)
```
Speedup: 2-3x; Memory: 5-10x less
Mixed precision training:
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

with autocast():
    loss = model(video)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
Process video in chunks:
```python
chunk_size = 16  # frames per chunk
for i in range(0, T, chunk_size):
    chunk = video[:, i:i+chunk_size]
    process(chunk)
```
Use neural codecs to reduce sequence length:
```python
# Raw audio: 24,000 samples/sec
# EnCodec: 75 frames/sec -> 320x shorter sequence
tokens = codec.encode(audio)  # (B, T_raw) -> (B, 8, T_compressed)
```
Problem: Flickering, jittering between frames
Solutions:
- Temporal attention layers
- Frame differencing loss
- Optical flow consistency
- Higher frame rate training
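One of the solutions above, a frame-differencing loss, can be sketched on plain Python lists (frames as flat lists of floats). It penalizes mismatched frame-to-frame change rather than per-frame error, so a constant brightness offset costs nothing while jitter is punished:

```python
def frame_difference_loss(pred, target):
    """Mean absolute error between predicted and target frame deltas."""
    total, count = 0.0, 0
    for t in range(1, len(pred)):
        for p0, p1, g0, g1 in zip(pred[t - 1], pred[t],
                                  target[t - 1], target[t]):
            total += abs((p1 - p0) - (g1 - g0))
            count += 1
    return total / count
```

In practice this would be one term alongside the usual reconstruction or diffusion loss, weighted to taste.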
Problem: OOM errors with long sequences
Solutions:
- Gradient checkpointing
- Reduce batch size
- Temporal chunking
- Lower resolution training
Problem: Generation takes too long
Solutions:
- Distillation to fewer steps
- Non-autoregressive models
- Cached KV for autoregressive
- Parallel generation (SoundStorm)
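SoundStorm's parallel decoding follows the MaskGIT recipe. A minimal sketch, with `scores_fn` as a hypothetical model call returning a `(token, confidence)` pair per position:

```python
MASK = -1

def maskgit_decode(scores_fn, length, steps=4):
    """Iteratively unmask the most confident positions over a few steps."""
    seq = [MASK] * length
    for s in range(steps, 0, -1):
        preds = scores_fn(seq)  # (token, confidence) for every position
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # commit roughly 1/s of the remaining masked positions this step,
        # so everything is filled by the final step
        k = max(1, len(masked) - (len(masked) * (s - 1)) // s)
        masked.sort(key=lambda i: -preds[i][1])  # most confident first
        for i in masked[:k]:
            seq[i] = preds[i][0]
    return seq
```

A few parallel passes replace hundreds of sequential autoregressive steps, which is where the speedup comes from.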
Problem: Pops, clicks, robotic voice
Solutions:
- Higher codec quality
- Post-filtering
- Better neural vocoder
- Sufficient context length
Problem: Generated video doesn't match text
Solutions:
- Higher guidance scale
- Better text encoder (T5 vs CLIP)
- Multi-stage generation
- Reward model fine-tuning
| Model | Resolution | FPS | Length | FVD | Speed | Notes |
|---|---|---|---|---|---|---|
| CogVideoX | 720p | 8 | 6s | 82 | Medium | SOTA quality |
| VideoPoet | 1080p | 24 | 10s | 95 | Slow | Multi-modal |
| Make-A-Video | 512p | 16 | 5s | 118 | Fast | Image-based |
| Imagen Video | 1280p | 24 | 5.3s | 74 | Slow | Best quality |
| Model | Type | Quality (MOS) | Speed | Zero-Shot | Notes |
|---|---|---|---|---|---|
| VALL-E | TTS | 4.2 | Medium | Yes | Voice cloning |
| Voicebox | TTS | 4.5 | Fast | Yes | Flow matching |
| SoundStorm | TTS | 4.1 | Very Fast | No | Parallel decoding |
| MusicGen | Music | 4.3 | Medium | Partial | Melody control |
| NaturalSpeech 3 | TTS | 4.6 | Slow | Yes | SOTA quality |
```
nexus/models/
├── video/
│   ├── cogvideox.py      # Text-to-video transformer
│   └── videopoet.py      # LLM for video
└── audio/
    ├── valle.py          # Neural codec TTS
    ├── voicebox.py       # Flow-based TTS
    ├── soundstorm.py     # Parallel audio generation
    ├── musicgen.py       # Text-to-music
    └── naturalspeech3.py # Factorized diffusion TTS
```
- Hong et al., "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" (2024)
- Kondratyuk et al., "VideoPoet: A Large Language Model for Zero-Shot Video Generation" (2023)
- Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers" (VALL-E, 2023)
- Le et al., "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale" (2023)
- Borsos et al., "SoundStorm: Efficient Parallel Audio Generation" (2023)
- Copet et al., "Simple and Controllable Music Generation" (MusicGen, 2023)
- Ju et al., "NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models" (2024)
- Study CogVideoX for state-of-the-art video generation
- Implement VALL-E for zero-shot voice cloning
- Try Voicebox for fast, high-quality TTS
- Explore MusicGen for controllable music synthesis
Implementations in Nexus/nexus/models/video/ and Nexus/nexus/models/audio/