An experimentation platform for LLM-DiT image and video generation, built on PyTorch and Diffusers (depending on the model / pipeline). Pluggable backends, quantization, and quality-of-life features for research.
| Pipeline | Task | Encoder | Steps | Notes |
|---|---|---|---|---|
| FLUX.2 Klein | text-to-image, image editing | Qwen3-8B/4B (12288/7680 dim) | 4 | Distilled, multi-layer extraction, configurable text encoding |
| Z-Image | text-to-image, img2img | Qwen3-4B (2560 dim) | 8-9 | CFG=0 baked, 1504 token limit |
| LTX-2 | text-to-video | Gemma3-12B (3840 dim) | 15-40 | Pure PyTorch impl, FP8 quantization |
| Qwen-Image-Layered | image decomposition | Qwen2.5-VL-7B (3584 dim) | 50 | Fixed 640/1024 res, outputs RGBA layers |
| Qwen-Image-Edit-2511 | instruction editing | Qwen2.5-VL-7B (3584 dim) | 40 | Multi-image composition support |
Prompt -> Qwen3Formatter -> TextEncoder -> hidden_states[layer] -> DiT -> VAE -> Image
The text encoder extracts embeddings from the LLM's hidden states (default: layer -2). The DiT uses flow matching to generate latents; the VAE decodes them to RGB/RGBA.
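The extraction step can be sketched as follows. This is a minimal sketch, not the project's actual implementation: the function names are mine, and the Qwen3Formatter/chat-template step is omitted.

```python
import torch


def select_hidden_layer(hidden_states: tuple, layer: int = -2) -> torch.Tensor:
    """Pick the conditioning embedding from the LLM's stacked hidden states.

    With output_hidden_states=True, transformers returns a tuple of
    (num_layers + 1) tensors of shape [batch, seq_len, dim]; layer -2
    selects the penultimate entry, one layer before the final output.
    """
    return hidden_states[layer]


def encode_prompt(prompt: str, model_name: str, layer: int = -2) -> torch.Tensor:
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy deps, imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return select_hidden_layer(out.hidden_states, layer)
```

The resulting `[batch, seq_len, dim]` tensor is what the DiT consumes as text conditioning.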
uv sync

# FLUX.2 Klein (text-to-image with FP8 and block offload for 24GB GPU)
uv run scripts/generate.py --model-type flux2 \
--flux2-model-name klein-9b-fp8 \
--flux2-block-offload \
--flux2-model-path /path/to/FLUX.2-klein-9b-fp8 \
--flux2-vae-path /path/to/FLUX.2-klein-9B \
"A photo of a cat"
# FLUX.2 Klein with longer prompts (configurable text encoding)
uv run scripts/generate.py --model-type flux2 \
--flux2-model-name klein-9b-fp8 \
--flux2-block-offload \
--flux2-max-text-length 1024 \
--flux2-model-path /path/to/FLUX.2-klein-9b-fp8 \
--flux2-vae-path /path/to/FLUX.2-klein-9B \
"A highly detailed description of a complex scene..."
# FLUX.2 Klein image editing with multiple references
uv run scripts/generate.py --model-type flux2 \
--flux2-model-name klein-9b-fp8 \
--flux2-block-offload \
--flux2-model-path /path/to/FLUX.2-klein-9b-fp8 \
--flux2-vae-path /path/to/FLUX.2-klein-9B \
--flux2-input-image ref1.jpg ref2.jpg ref3.jpg \
"Combine elements from the reference images"
# Z-Image (text-to-image)
uv run scripts/generate.py --model-path /path/to/z-image-turbo "A cat sleeping"
# LTX-2 (text-to-video) - Pure PyTorch pipeline
uv run scripts/generate.py --model-type ltx2 \
--ltx2-model-path /path/to/LTX-2 \
--ltx2-num-frames 33 --width 768 --height 512 \
"A cat walking through a sunny garden"
# LTX-2 (text-to-video) - PyTorch pipeline with explicit device placement
uv run python scripts/generate.py --model-type ltx2 \
--ltx2-model-path /path/to/LTX-2 \
--ltx2-text-encoder-device cpu \
--ltx2-transformer-device cuda \
--ltx2-quantize fp8 \
"A cat walking"
# Web UI
uv run web/server.py --config config.toml

See docs/reference/cli_flags.md for the full CLI reference.
Quantization (VRAM reduction):
- BitsAndBytes: 4-bit NF4 (~75%), 8-bit INT8 (~50%)
- TorchAO: FP8 dynamic (~50%, RTX 4090+), INT8 weight-only (~50%)
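The savings follow directly from bits per weight: relative to bf16 (16 bits), 4-bit NF4 cuts weight memory by ~75% and 8-bit formats by ~50%. A sketch of the arithmetic, plus a hedged example of 4-bit loading via the transformers/bitsandbytes integration (the helper names here are mine, not this project's API):

```python
def estimated_vram_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight memory only: parameters x bits per weight, ignoring
    activations, KV cache, and quantization metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e9


def load_nf4(model_name: str):
    """Load a causal LM with bitsandbytes 4-bit NF4 quantization."""
    import torch  # heavy deps, imported lazily
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    return AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb)


# A 9B-parameter encoder: bf16 needs ~18 GB for weights, NF4 ~4.5 GB
# (~75% less), INT8/FP8 ~9 GB (~50% less) - matching the figures above.
```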
Generation:
- Vision Conditioning via Qwen3-VL (zero-shot style transfer)
- Skip Layer Guidance for improved anatomy
- DyPE for high-resolution (2K-4K)
- Long prompt compression (4 modes for >1504 tokens)
- LoRA with multi-stack support
Backends:
- Attention: Flash Attention 2/3, SageAttention, xFormers, SDPA (auto-detect)
- Text Encoder: local (transformers), remote API, vLLM
- Distributed: encode on Mac, generate on CUDA
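Attention backend auto-detection can be approximated with an import probe, falling back to PyTorch's built-in SDPA when no optimized kernel is installed. The preference order below is an assumption for illustration, not the project's documented order:

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return the first available attention backend, in preference order.

    SDPA (torch.nn.functional.scaled_dot_product_attention) ships with
    PyTorch, so it is the always-available fallback.
    """
    candidates = [
        ("flash_attn", "flash_attn"),        # Flash Attention 2/3
        ("sage_attention", "sageattention"),
        ("xformers", "xformers"),
    ]
    for backend, module in candidates:
        if importlib.util.find_spec(module) is not None:
            return backend
    return "sdpa"
```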
Configuration:
- TOML-based with hardware profiles
- Web UI config management (edit params, switch profiles, restart server)
- Modular component system
- CLI overrides
cp config.toml.example config.toml
uv run web/server.py --config config.toml --profile rtx4090

Key sections: [encoder], [generation], [qwen_image], [vl], [rewriter]
See config.toml.example for all options.
| Endpoint | Method | Description |
|---|---|---|
| `/api/generate` | POST | Z-Image generation |
| `/api/qwen-image/decompose` | POST | Image decomposition |
| `/api/qwen-image/edit` | POST | Instruction editing |
| `/api/vl/generate` | POST | Vision-conditioned generation |
| `/api/rewrite` | POST | Prompt expansion |
| `/api/config/session` | GET/PUT | Session config management |
| `/api/server/restart` | POST | Server restart with profile |
See docs/reference/api_endpoints.md for full reference.
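A minimal client call against `/api/generate` can be built with the standard library alone. The payload fields and default port here are illustrative assumptions; check docs/reference/api_endpoints.md for the actual request schema:

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:7860") -> urllib.request.Request:
    """Construct a POST request for the Z-Image generation endpoint."""
    payload = json.dumps({"prompt": prompt}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_image(prompt: str, base_url: str = "http://localhost:7860") -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_request(prompt, base_url)) as resp:
        return resp.read()
```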
Ablation sweeps and comparison tools in experiments/. Interactive viewer on port 7861.
Models:
- Z-Image - performance tuning, device placement
- LTX-2 - video generation with pure PyTorch pipeline
- Qwen-Image-Layered - decomposition details
- Qwen-Image-Edit-2511 - instruction editing
Guides:
- Config Management - web UI config editing
- VL Conditioning - vision-based style transfer
- LoRA - loading and fusing
- Distributed - multi-machine setup
- Profiler - performance testing
Reference:
- CLI Flags - all command-line options
- API Endpoints - REST API
- Configuration - TOML structure
- Web Architecture - modular JS/CSS structure
- DyPE - high-resolution generation
- Long Prompts - token compression
Internal: CLAUDE.md for development reference.