Professional ComfyUI custom nodes for Tencent HunyuanImage-3.0, the powerful 80B parameter native multimodal image generation model.
Latest: v1.3.0
- ✅ Instruct (full) INT8 now working: Five bugs fixed: INT8 block swap, CB/SCB device tracking, memory budget estimation, and a VAE decode super() crash caused by closure cell corruption. All Instruct INT8 variants (Distil and full) are fully operational.
- 📦 v2 pre-quantized models on Hugging Face: Improved quantization with better block swap defaults. Available for all 6 base + Instruct variants (INT8 and NF4). See download links below.
- 🧪 Experimental latent/image input nodes (base models): Hunyuan Empty Latent, Hunyuan Latent Noise Shaping, and Hunyuan Generate with Latent provide composition control, img2img, and custom noise injection for the base (non-Instruct) pipeline. See Latent Control Nodes.
- 🛠️ Unified Generate V2 node: Single node replaces all base-model generate variants; it auto-detects NF4/INT8/BF16 and handles block swap, memory budgets, and VRAM management.
Previous highlights (v1.2.0):
- 🎨 Instruct resolution overhaul: all 33 model-native bucket resolutions.
- Multi-Image Fusion expanded to 5 inputs.
- 🛡️ Transformers 5.x compatibility. NF4 Low VRAM OOM fix (Issue #16). Multi-GPU device mismatch fix (Issue #15).
- HighRes Efficient generate node enables 3MP–4K+ generation on 96GB GPUs by replacing the memory-hungry MoE dispatch_mask with loop-based expert routing that uses ~75× less VRAM. See High-Resolution Generation below.
- ✅ NF4 Low VRAM Loader + Low VRAM Budget generator are verified on 24–32 GB cards thanks to the custom device-map strategy that pins every quantized layer on GPU.
- ✅ INT8 Budget loader works end-to-end and pre-quantized checkpoints are available on Hugging Face.
- 📸 Added a reference workflow (below) showing the recommended node pair for Low VRAM setups.
Acknowledgment: This project integrates the HunyuanImage-3.0 model developed by Tencent Hunyuan Team and uses their official system prompts. The model and original code are licensed under Apache 2.0. This integration code is separately licensed under CC BY-NC 4.0 for non-commercial use.
- NF4 Low VRAM Loader: Custom device map keeps NF4 layers on GPU so 24–32 GB cards can use the Low VRAM Budget workflow without bitsandbytes errors.
- Provide example workflow/screenshot for Low VRAM users (see below).
- [x] Instruct nodes: Loader, Generate, Image Edit, Multi-Image Fusion, and Unload nodes for Instruct models.
- Instruct block swap: BF16, INT8, and NF4 Distil models verified working with block swap.
- Instruct non-distil INT8: ✅ Fixed in v1.3.0. Five bugs resolved: (1) blocks_to_swap forced to 0 for INT8, (2) missing _load_int8_block_swap() method, (3) INT8 model size estimated at 40GB instead of 80GB, (4) CB/SCB not fixed after .to(device), (5) wrong gb_per_block for INT8 in optimal config. Plus VAE decode super() crash from closure cell corruption.
- Add screenshots/documentation for every node (in progress).
- Test and document multi-GPU setup.
- Continue long-run stability testing on the INT8 Budget loader with CPU offload edge cases.
- LoRA loading support: Load pre-trained PEFT LoRA adapters for inference (merge-and-unload or live adapter swapping). Shelved until the HunyuanImage-3.0 LoRA ecosystem matures; only a handful of LoRAs exist currently. See PhotonAISG/hunyuan-image3-finetune for training scripts.
- Multiple Loading Modes: Full BF16, INT8/NF4 Quantized, Single GPU, Multi-GPU
- Smart Memory Management: Automatic VRAM tracking, cleanup, and optimization
- High-Quality Image Generation:
- Standard generation (<2MP) - Fast, GPU-only
- Large image generation (2MP-8MP+) - CPU offload support
- Instruct Model Support (NEW):
- Built-in Chain-of-Thought (CoT) prompt enhancement, no external API needed
- Image-to-Image editing with natural language instructions
- Multi-image fusion (combine 2–5 reference images; 4–5 experimental)
- Block swap for BF16/INT8/NF4 models on 48–96GB GPUs
- Advanced Prompting:
- Optional prompt enhancement using official HunyuanImage-3.0 system prompts
- Supports any OpenAI-compatible LLM API (DeepSeek, OpenAI, Claude, local LLMs)
- Two professional rewriting modes: en_recaption (structured) and en_think_recaption (advanced)
- Professional Resolution Control:
- All 33 model-native bucket resolutions (~1MP each) in the dropdown
- Ordered tallest portrait (512×2048) → square (1024×1024) → widest landscape (2048×512)
- Aspect ratio labels and Auto resolution option
- Production Ready: Comprehensive error handling, detailed logging, VRAM monitoring
- ComfyUI installed and working
- NVIDIA GPU with CUDA support
- Minimum 24GB VRAM for NF4 quantized model
- Minimum 80GB VRAM (or multi-GPU) for full BF16 model
- Python 3.10+
- PyTorch 2.9+ with CUDA 12.8+ (recommended for best performance)
The hardware requirements depend heavily on which model version you use.
This is the uncompressed 80B parameter model. It is massive.
- Model Size: ~160GB on disk.
- VRAM:
- Ideal: 80GB+ (A100, H100, RTX 6000 Ada). Runs entirely on GPU.
- Minimum: 24GB (RTX 3090/4090). Requires massive System RAM.
- System RAM (CPU Memory):
- If you have <80GB VRAM, the model weights that don't fit on GPU are stored in RAM.
- Requirement: 192GB+ System RAM is recommended if using a 24GB card.
- Example: On a 24GB card, ~140GB of weights will live in RAM.
- Performance:
- On low VRAM cards, generation will be slow due to swapping data between RAM and VRAM.
This version is compressed to 4-bit, reducing size by roughly 3.5x (from ~160GB to ~45GB) with minimal quality loss.
- Model Size: ~45GB on disk.
- VRAM:
- Ideal: 48GB+ (RTX 6000, A6000). Runs entirely on GPU.
- Minimum: 24GB (RTX 3090/4090).
- Note: Since 45GB > 24GB, about half the model will live in System RAM.
- Performance: Slower than 48GB cards, but functional.
- System RAM:
- 64GB+ recommended (especially for 24GB VRAM cards to hold the offloaded weights).
- Performance: Much faster on consumer hardware.
This version is compressed to 8-bit, offering near-original quality with reduced memory usage.
- Model Size: ~85GB on disk.
- VRAM:
- Ideal: 80GB+ (A100, H100). Runs entirely on GPU.
- Minimum: 24GB (RTX 3090/4090).
- Note: Significant CPU offloading required (~60GB of weights in RAM).
- System RAM:
- 128GB+ recommended.
- Performance:
- Quality: ~98% of full precision (better than NF4).
- Speed: Faster inference than NF4 (less dequantization overhead) but requires more memory transfer if offloading.
- Practical Requirement: For the selective INT8 checkpoints shipped here, you realistically need a 96GB-class GPU (RTX 6000 Pro Blackwell, H100, etc.). Forcing CPU offload on smaller cards makes each step take minutes because the quantized tensors continually stream over PCIe.
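The system RAM figures quoted for each variant follow from a simple subtraction: whatever does not fit in VRAM has to sit in RAM. The snippet below is a rough sketch of that estimate using the on-disk sizes listed above. The helper name is hypothetical, and it assumes the whole card is available for weights, so the results are lower bounds rather than what the loaders actually compute.

```python
# On-disk sizes (GB) quoted in this section.
MODEL_SIZE_GB = {"bf16": 160, "int8": 85, "nf4": 45}


def weights_in_ram_gb(variant: str, vram_gb: float) -> float:
    """Rough GB of weights that spill to system RAM (hypothetical helper).

    Assumes the full card can hold weights. In practice some VRAM is
    reserved for inference, so the real spill is larger.
    """
    return max(0.0, MODEL_SIZE_GB[variant] - vram_gb)


if __name__ == "__main__":
    for variant in MODEL_SIZE_GB:
        spill = weights_in_ram_gb(variant, 24)
        print(f"{variant}: ~{spill:.0f} GB of weights in RAM on a 24GB card")
```

On a 24GB card this gives about 136 GB for BF16, 61 GB for INT8, and 21 GB for NF4, in line with the ~140GB, ~60GB, and "about half" figures above.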
The Instruct family adds built-in prompt enhancement, image editing, and multi-image fusion. They come in two variants:
- Instruct (full): 50-step inference, uses CFG (batch=2 during diffusion), highest quality.
- Instruct-Distil: 8-step inference, CFG-distilled (batch=1), ~6× faster, comparable quality.
Both variants are available in BF16, INT8, and NF4 quantization.
| Variant | Quant | Disk Size | Min VRAM | Block Swap | Status |
|---|---|---|---|---|---|
| Instruct-Distil | NF4 | ~45 GB | 48 GB | Optional | ✅ Working |
| Instruct-Distil | INT8 | ~81 GB | 96 GB | Required | ✅ Working |
| Instruct-Distil | BF16 | ~160 GB | 96 GB | Required | ✅ Working |
| Instruct (full) | NF4 | ~45 GB | 48 GB | Optional | ✅ Working |
| Instruct (full) | INT8 | ~81 GB | 96 GB | Required | ✅ Working |
| Instruct (full) | BF16 | ~160 GB | 96 GB | Required | ✅ Working |
- Clone this repository into your ComfyUI custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/ericRollei/Eric_Hunyuan3.git
- Install dependencies:
cd Eric_Hunyuan3
pip install -r requirements.txt
- Download model weights:
Option A: Full BF16 Model (~160GB)
# Download to ComfyUI/models/
cd ../../models
huggingface-cli download tencent/HunyuanImage-3.0 --local-dir HunyuanImage-3
Option B: Download Pre-Quantized NF4 Model (~45GB) - Recommended for single GPU <96GB
You can download the pre-quantized weights directly from Hugging Face:
- v2 (recommended): EricRollei/HunyuanImage-3-NF4-v2 (improved quantization with better block swap defaults)
- v1: EricRollei/HunyuanImage-3-NF4-ComfyUI
# Download v2 (recommended) to ComfyUI/models/
cd ../../models
huggingface-cli download EricRollei/HunyuanImage-3-NF4-v2 --local-dir HunyuanImage-3-NF4-v2
Option C: Download Pre-Quantized INT8 Model (~85GB)
If you want maximum fidelity without running the INT8 quantizer locally, grab the ready-to-go checkpoint:
- v2 (recommended): EricRollei/HunyuanImage-3-INT8-v2 (improved quantization with better block swap defaults)
- v1: EricRollei/Hunyuan_Image_3_Int8
# Download v2 (recommended) to ComfyUI/models/
cd ../../models
huggingface-cli download EricRollei/HunyuanImage-3-INT8-v2 --local-dir HunyuanImage-3-INT8-v2
Option D: Quantize Yourself (from Full Model)
If you prefer to quantize it yourself:
# First download full model (Option A), then quantize
# Paths below are relative to ComfyUI/custom_nodes/Eric_Hunyuan3/quantization
cd path/to/Eric_Hunyuan3/quantization
python hunyuan_quantize_nf4.py \
  --model-path "../../../models/HunyuanImage-3" \
  --output-path "../../../models/HunyuanImage-3-NF4"
Option E: Download Instruct Models
Download Instruct models into your ComfyUI/models/ directory so the loader can find them automatically. Pre-quantized INT8 and NF4 variants (v2 recommended) are available from EricRollei on Hugging Face:
cd ComfyUI/models
# Pre-quantized INT8 Instruct-Distil v2 (~81GB), RECOMMENDED for 96GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-INT8-v2
# Pre-quantized NF4 Instruct-Distil v2 (~45GB), for 48GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-NF4-v2
See Instruct Models for the full list, HF links, and extra_model_paths.yaml setup for custom locations.
- Restart ComfyUI
Pair Hunyuan 3 Loader (NF4 Low VRAM+) with Hunyuan 3 Generate (Low VRAM Budget) for 24–32 GB cards. The loader's GPU budget slider keeps 18–20 GB free for inference while the generator adds smart telemetry and optional prompt rewriting. The screenshot above shows a full working chain (loader → rewriter → Low VRAM Budget generator → save/display) producing a 1600×1600 render on a 24 GB test rig without triggering bitsandbytes validation.
| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Loader (NF4) | • 45-48GB+ VRAM • Fast quantized loading • Simple setup | • 24-32GB VRAM (use Low VRAM+) • When you need device_map control | Fast NF4 load. Full model on GPU. ~45GB VRAM. |
| Hunyuan 3 Loader (NF4 Low VRAM+) | • 24-32GB VRAM • Budget-constrained setups • device_map offloading | • 48GB+ VRAM (use standard NF4) | Auto memory management. external_vram_gb for shared GPU. |
| Hunyuan 3 Loader (INT8 Budget) | • 80-96GB VRAM • Better quality than NF4 • Production workflows | • <80GB VRAM (use NF4) • Speed-critical workflows | Auto memory management. ~82GB model. Higher quality than NF4. |
| Hunyuan 3 Loader (Full BF16) | • 80GB+ VRAM • Maximum quality • No quantization artifacts | • <80GB VRAM • Quick experimentation | Full precision. ~80GB VRAM. Best quality. |
| Hunyuan 3 Loader (Full BF16 GPU) | • Single 80GB+ GPU • Memory control needed | • Multi-GPU setups • <80GB VRAM | Single GPU with reserve slider. |
| Hunyuan 3 Loader (Multi-GPU BF16) | • Multi-GPU setups • Distributed inference | • Single GPU setups | Splits model across GPUs. |
| Hunyuan 3 Loader (88GB GPU Optimized) | DEPRECATED | Everything | Use Full BF16 Loader instead. |
| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Unload | • Simple VRAM cleanup • Between workflows • Manual control | • Fast model switching (use Soft Unload) • Emergency cleanup (use Force Unload) | Standard cleanup. clear_for_downstream option. |
| Hunyuan 3 Soft Unload (Fast) | • Fast model switching • Multi-model workflows • NF4/INT8 models | • Final cleanup (model stays in RAM) • RAM-constrained systems | Parks model in CPU RAM. ~5s restore vs ~90s reload. Requires bitsandbytes ≥0.48.2 for quantized. |
| Hunyuan 3 Force Unload (Nuclear) | • Emergency VRAM clearing • Cross-tab VRAM pollution • Stuck memory situations | • Normal workflows (too aggressive) • When you want fast restore | Clears ALL cached models. gc.collect + cache clear. Nuclear option. |
| Hunyuan 3 Clear Downstream Models | • Hunyuan + other models • End of workflow cleanup • Keep Hunyuan loaded | • Clearing Hunyuan itself • Simple single-model workflows | Clears Flux/SAM2/etc but KEEPS Hunyuan. Connect to final output. |
| Hunyuan 3 GPU Info | • Debugging • GPU detection • Memory diagnostics | • Actual workflow operations | Shows VRAM stats. Multi-GPU detection. |
| Feature | NF4 | NF4 Low VRAM+ | INT8 Budget | Full BF16 | BF16 GPU | Multi-GPU |
|---|---|---|---|---|---|---|
external_vram_gb option |
β | β | β | β | β | β |
| Auto inference reserve | β | β (6GB) | β (12GB) | β | β | β |
device_map support |
β | β | β | β | β | β |
| Soft unload support | β | β | β | β | β | |
| Min VRAM | ~45GB | ~18GB | ~55GB | ~80GB | ~75GB | 80GB split |
| Model quality | Good | Good | Better | Best | Best | Best |
These loaders use automatic memory management:
| Parameter | What It Does | When to Change |
|---|---|---|
| external_vram_gb | Reserve VRAM for other apps (browsers, other models) | Only if GPU is shared. Default 0 = dedicated to Hunyuan. |
| Inference reserve | Automatic - reserves space for VAE decode + activations | Never - handled internally (6GB for NF4, 12GB for INT8) |
Example: On a 96GB GPU with external_vram_gb=0:
- INT8 Budget: 96GB - 12GB (inference) = 84GB for model weights ✅
Example: On a 24GB GPU with external_vram_gb=2 (browser running):
- NF4 Low VRAM+: 24GB - 6GB (inference) - 2GB (browser) = 16GB for model, rest spills to CPU RAM ✅
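The two worked examples above reduce to one subtraction. A minimal sketch of that arithmetic follows; the function name and loader keys are hypothetical, and only the 6GB and 12GB reserves come from the table. The loaders' internal logic may differ.

```python
# Inference reserves stated in the table above (GB).
INFERENCE_RESERVE_GB = {"nf4_low_vram": 6, "int8_budget": 12}


def weight_budget_gb(total_vram_gb: float, loader: str,
                     external_vram_gb: float = 0.0) -> float:
    """VRAM left for model weights after the reserves (hypothetical helper)."""
    budget = total_vram_gb - INFERENCE_RESERVE_GB[loader] - external_vram_gb
    # A negative budget means nothing fits on the GPU at all.
    return max(0.0, budget)


if __name__ == "__main__":
    print(f"{weight_budget_gb(96, 'int8_budget'):.0f} GB for INT8 weights")
    print(f"{weight_budget_gb(24, 'nf4_low_vram', 2):.0f} GB for NF4 weights")
```

The two calls reproduce the 84GB and 16GB figures from the examples.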
| Situation | Use This Node | Why |
|---|---|---|
| Normal end of workflow | post_action: soft_unload | Fast restore next run |
| Running Flux after Hunyuan | Clear Downstream Models | Keeps Hunyuan, clears Flux/SAM2 |
| VRAM stuck from other tab | Force Unload (Nuclear) | Clears orphaned allocations |
| Simple cleanup, slow reload OK | Unload | Standard, reliable cleanup |
| Manual control, will reload | Unload + clear_for_downstream | Explicit cleanup |
Simplified Memory Management: The NF4 Low VRAM+ and INT8 Budget loaders now use automatic memory management. Inference overhead is handled internally - just set external_vram_gb if other apps need GPU memory (default 0 for dedicated GPU).
Do you have 48GB+ VRAM AND model fully on GPU?
├─ YES → Is your target resolution 3MP+ (e.g. 2048x1536)?
│   ├─ YES → Use "HighRes Efficient" (efficient MoE, no OOM)
│   └─ NO → Use "Hunyuan 3 Generate" (fastest)
└─ NO → Are you using NF4/INT8 quantized model?
    ├─ YES → Use "Low VRAM" or "Low VRAM Budget"
    └─ NO → Use "Large/Offload" or "Large Budget"
| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Generate | • 48GB+ VRAM • Model fully on GPU • Fast iteration • Standard resolutions (≤2MP) | • Low VRAM (<48GB) • Quantized models with CPU offload • Very large images (>2MP) | Fastest option. No offload overhead. Simple controls. |
| Hunyuan 3 Generate (Telemetry) | • Same as above • Debugging VRAM issues • Performance monitoring | • Same as above | Adds RAM/VRAM stats to status output. Same speed as base Generate. |
| Hunyuan 3 Generate (Large/Offload) | • Large images (2-8MP+) • BF16 models on <80GB VRAM • When you need offload_mode control | • Quantized models (INT8/NF4) • Quick iteration (slower startup) | Has offload_mode (smart/always/disabled). CPU offload for large inference. |
| Hunyuan 3 Generate (Large Budget) | • Same as Large/Offload • Need GPU budget override • Want telemetry stats | • Same as Large/Offload | Adds gpu_budget_gb slider and memory telemetry. |
| Hunyuan 3 Generate (Low VRAM) | • NF4/INT8 quantized models • 24-48GB VRAM • Models loaded with device_map | • BF16 models • 96GB+ GPUs (use base Generate) | Skips conflicting CPU offload calls. Works with accelerate's device_map. |
| Hunyuan 3 Generate (Low VRAM Budget) | • Same as Low VRAM • Fine-tuning memory usage • Production workflows | • Same as Low VRAM | Best for quantized models. GPU budget + telemetry + smart defaults. |
| Hunyuan 3 Generate (HighRes Efficient) | • 3MP+ resolutions (2K, 4K) • BF16 models on 96GB GPUs • When standard Large node OOMs | • Low VRAM (<48GB) • NF4/INT8 quantized models | Memory-efficient MoE dispatch. ~75× less MoE intermediate VRAM. See technical details. |
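The generator decision tree above can be condensed into a small helper. This is only a sketch that mirrors the branches of the tree; the function and parameter names are hypothetical and it is not part of the node pack.

```python
def pick_generate_node(vram_gb: float, model_fully_on_gpu: bool,
                       target_megapixels: float, quantized: bool) -> str:
    """Return the generator suggested by the decision tree (hypothetical helper)."""
    if vram_gb >= 48 and model_fully_on_gpu:
        # Plenty of VRAM: only the target resolution matters.
        if target_megapixels >= 3:
            return "HighRes Efficient"
        return "Hunyuan 3 Generate"
    # Model is partly offloaded or VRAM is limited.
    if quantized:
        return "Low VRAM Budget"   # or "Low VRAM"
    return "Large/Offload"         # or "Large Budget"


if __name__ == "__main__":
    print(pick_generate_node(96, True, 1.0, quantized=True))
    print(pick_generate_node(24, False, 1.0, quantized=True))
    print(pick_generate_node(48, False, 4.0, quantized=False))
```

Like the tree, the helper does not check the model type on the 3MP+ branch; the table lists HighRes Efficient as unsuitable for NF4/INT8 models, so treat that branch as applying to BF16 only.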
| Feature | Generate | Telemetry | Large | Large Budget | Low VRAM | Low VRAM Budget | HighRes Efficient |
|---|---|---|---|---|---|---|---|
offload_mode control |
β | β | β | β | β | β | β |
gpu_budget_gb slider |
β | β | β | β | β | β | β |
| Memory telemetry | β | β | β | β | β | β | β (logging) |
vae_tiling option |
β | β | β | β | β | β | β |
post_action control |
β | β | β | β | β | β | β |
| Prompt rewriting | β | β | β | β | β | β | β |
| device_map friendly | β | β | β | ||||
| Efficient MoE dispatch | β | β | β | β | β | β | β |
| Default post_action | keep | keep | keep | keep | soft_unload | soft_unload | keep |
| GPU VRAM | Model Type | Loader | Generator |
|---|---|---|---|
| 96GB | INT8 | INT8 Budget | Generate (fastest) |
| 96GB | NF4 | NF4 | Generate (fastest) |
| 96GB | BF16, 3MP+ | Full BF16 | HighRes Efficient |
| 48-80GB | NF4 | NF4 | Generate or Low VRAM |
| 48-80GB | BF16 | Full BF16 | Large/Offload |
| 24-32GB | NF4 | NF4 Low VRAM+ | Low VRAM Budget |
| 24-32GB | INT8 | INT8 Budget | Low VRAM Budget |
| Mistake | Problem | Solution |
|---|---|---|
| Using "Generate" with NF4 on 24GB | May conflict with device_map offloading | Use "Low VRAM" or "Low VRAM Budget" |
| Using "Large/Offload" with NF4 | CPU offload hooks conflict with quantization | Use "Low VRAM" variants |
| Using "Low VRAM" on 96GB GPU | Unnecessary overhead | Use base "Generate" for speed |
| Setting offload_mode: always with device_map | Double offloading causes stalls | Low VRAM nodes auto-disable conflicting offload |
The HunyuanImage-3.0-Instruct models extend the base model with powerful new capabilities:
- Built-in prompt enhancement: no external LLM API needed
- Chain-of-Thought (CoT) reasoning: the model "thinks" about your prompt before generating
- Image editing: modify images with natural language instructions
- Multi-image fusion: combine elements from 2–5 reference images (4–5 experimental)
| Feature | Base (T2I) | Instruct |
|---|---|---|
| Text-to-image | ✅ | ✅ |
| Built-in prompt enhancement | ❌ (needs API) | ✅ (CoT built-in) |
| Image editing | ❌ | ✅ |
| Multi-image fusion | ❌ | ✅ |
| Bot task modes | ❌ | ✅ (image, recaption, think_recaption) |
| CFG-distilled variant | ❌ | ✅ (8-step fast inference) |
| Node | Purpose | Key Features |
|---|---|---|
| Hunyuan Instruct Loader | Load any Instruct model variant | Auto-detects BF16/INT8/NF4 and Distil vs Full. Block swap support. |
| Hunyuan Instruct Generate | Text-to-image with Instruct model | Bot task modes (image/recaption/think_recaption). Returns CoT reasoning text. |
| Hunyuan Instruct Image Edit | Edit images with instructions | Takes input image + instruction. "Change the background to sunset." |
| Hunyuan Instruct Multi-Image Fusion | Combine 2–5 reference images | "Place the cat from image 1 into the scene from image 2." Images 4–5 are experimental. |
| Hunyuan Instruct Unload | Free memory | Clears cached Instruct model from VRAM and RAM. |
| Parameter | Description | Default |
|---|---|---|
| model_name | Auto-detected from ComfyUI/models/ and extra_model_paths.yaml | - |
| force_reload | Force reload even if cached | False |
| attention_impl | Attention implementation (sdpa recommended) | sdpa |
| moe_impl | MoE implementation (keep eager unless you have flashinfer) | eager |
| vram_reserve_gb | VRAM to keep free for inference (auto-boosted for CFG models) | 30.0 |
| blocks_to_swap | Number of transformer blocks to swap between GPU↔CPU (0 = no swap) | 0 |
Block swap enables running large models on GPUs that can't fit the entire model. It moves transformer blocks between GPU and CPU during the forward pass, keeping only 1–10 blocks on the GPU at a time, with async prefetching.
When to use block swap:
- BF16 Instruct models (~160 GB) on 96 GB GPUs → blocks_to_swap=20-24
- INT8 Instruct-Distil models (~81 GB) on 96 GB GPUs → blocks_to_swap=24-28
Recommended settings:
| Model | blocks_to_swap | VRAM Used | Notes |
|---|---|---|---|
| Instruct-Distil NF4 | 0 | ~29 GB | Fits on 48 GB without swap |
| Instruct-Distil INT8 | 24 | ~30 GB | ✅ Tested, working on 96 GB |
| Instruct-Distil BF16 | 22 | ~50 GB | ✅ Tested, working on 96 GB |
| Instruct (full) NF4 | 0 | ~29 GB | Fits on 48 GB without swap |
| Instruct (full) INT8 | 28-31 | ~10-17 GB | ✅ Working (fixed in v1.3.0) |
| Instruct (full) BF16 | 22 | ~50 GB | ✅ Tested, working on 96 GB |
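To make the residency pattern behind block swap concrete, here is a toy simulation in plain Python. It is a sketch of the general sliding-window technique, not the loader's implementation: it assumes a 32-block transformer (an assumption chosen so that swapping 22 to 31 blocks leaves the 1–10 resident blocks mentioned above), and it models each swap as an instant set update instead of an asynchronous copy.

```python
def simulate_block_swap(num_blocks: int, blocks_to_swap: int) -> list[set[int]]:
    """Return the set of GPU-resident blocks at each forward step (toy model)."""
    if not 0 <= blocks_to_swap < num_blocks:
        raise ValueError("blocks_to_swap must be in [0, num_blocks)")
    resident_count = num_blocks - blocks_to_swap
    on_gpu = set(range(resident_count))      # first blocks start on the GPU
    history = []
    for i in range(num_blocks):
        if i not in on_gpu:
            raise RuntimeError(f"block {i} was not prefetched in time")
        history.append(set(on_gpu))          # snapshot while block i runs
        if i < blocks_to_swap:
            on_gpu.discard(i)                # finished block goes back to CPU
            on_gpu.add(resident_count + i)   # next offloaded block comes in
    return history


if __name__ == "__main__":
    steps = simulate_block_swap(num_blocks=32, blocks_to_swap=24)
    print("peak resident blocks:", max(len(s) for s in steps))
```

In this model the number of resident blocks stays constant at num_blocks minus blocks_to_swap, which is why raising blocks_to_swap lowers VRAM use at the cost of more transfers.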
The Instruct Generate node supports three modes via the bot_task parameter:
| Mode | Description | Speed | Best For |
|---|---|---|---|
| image | Direct generation: prompt used as-is | Fast | When you have a detailed prompt already |
| recaption | Model rewrites prompt into detailed description | Medium | General use: improves composition/lighting |
| think_recaption | CoT reasoning → rewrite → generate | Slow | Best quality: model analyzes intent first |
The cot_reasoning output returns the model's thought process (for recaption and think_recaption).
┌─────────────────────────────────┐
│  Hunyuan Instruct Loader        │
│  model_name: HunyuanImage-3.0-  │
│    Instruct-Distil-INT8         │
│  blocks_to_swap: 24             │
└───────────┬─────────────────────┘
            │ HUNYUAN_INSTRUCT_MODEL
            ▼
┌─────────────────────────────────┐
│  Hunyuan Instruct Generate      │
│  prompt: "A cat astronaut..."   │
│  bot_task: think_recaption      │
│  resolution: 1024x1024          │
└──────┬────────┬─────────────────┘
       │        │
     IMAGE   STRING (cot_reasoning)
       │
       ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
┌─────────────────────┐
│  Load Image         │
└───────┬─────────────┘
        │ IMAGE
        ▼
┌─────────────────────────────────┐
│  Hunyuan Instruct Image Edit    │
│  instruction: "Add sunglasses   │
│    to the person"               │
│  bot_task: image                │
│  edit_strength: 0.7             │
└──────┬──────────────────────────┘
       │ IMAGE (edited)
       ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Load Image 1 │  │ Load Image 2 │  │ Load Image 3 │  (+ optional image_4, image_5)
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
┌──────────────────────────────────────────────────────┐
│  Hunyuan Instruct Multi-Image Fusion                 │
│  instruction: "Place the cat from image 1 into the   │
│    scene from image 2 with lighting from image 3"    │
│  bot_task: think_recaption (recommended)             │
│  resolution: 1024x1024 (1:1 Square)                  │
└───────────┬──────────────────────────────────────────┘
            │ IMAGE
            ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
Note: The model officially supports up to 3 input images. Slots 4 and 5 are experimental: the pipeline accepts them, but they increase VRAM usage significantly and results may vary.
Where to put them: Download all models (base and Instruct) into your ComfyUI/models/ directory. All loaders automatically scan that folder.
If you store models on a separate drive, add the path to your extra_model_paths.yaml:
comfyui:
    # Base models (NF4, INT8, BF16)
    hunyuan: |
        models/
        H:/MyModels/
    # Instruct models
    hunyuan_instruct: |
        models/
        H:/MyModels/
Note: Both hunyuan and hunyuan_instruct categories default to ComfyUI/models/. You only need the extra_model_paths.yaml entries if your models live somewhere else.
| Model | Size | Link | Notes |
|---|---|---|---|
| Instruct-Distil INT8 v2 | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 | Recommended: 8-step, fast |
| Instruct-Distil NF4 v2 | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 | Best for 48 GB GPUs |
| Instruct (full) INT8 v2 | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-INT8-v2 | Working |
| Instruct (full) NF4 v2 | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-NF4-v2 | 50-step, highest quality |
| Instruct-Distil INT8 (v1) | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8 | Legacy |
| Instruct-Distil NF4 (v1) | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4 | Legacy |
| Instruct (full) INT8 (v1) | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-INT8 | Legacy |
| Instruct (full) NF4 (v1) | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-NF4 | Legacy |
| Instruct BF16 (full) | ~160 GB | tencent/HunyuanImage-3.0-Instruct | Tencent original |
| Instruct-Distil BF16 | ~160 GB | tencent/HunyuanImage-3.0-Instruct-Distil | Tencent original |
Download directly into ComfyUI/models/:
cd ComfyUI/models
# INT8 Instruct-Distil v2 (~81GB), RECOMMENDED for 96GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-INT8-v2
# NF4 Instruct-Distil v2 (~45GB), for 48GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-NF4-v2
# NF4 Instruct Full v2 (~45GB)
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-NF4-v2
# INT8 Instruct Full v2 (~81GB), now fully working
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-INT8-v2
# BF16 originals from Tencent (~160GB each)
huggingface-cli download tencent/HunyuanImage-3.0-Instruct \
  --local-dir HunyuanImage-3.0-Instruct
huggingface-cli download tencent/HunyuanImage-3.0-Instruct-Distil \
  --local-dir HunyuanImage-3.0-Instruct-Distil
Quantize yourself (from BF16 source):
cd quantization
# Instruct-Distil INT8
python hunyuan_quantize_instruct_distil_int8.py \
--model-path /path/to/HunyuanImage-3.0-Instruct-Distil \
--output-path ComfyUI/models/HunyuanImage-3.0-Instruct-Distil-INT8
# Instruct-Distil NF4
python hunyuan_quantize_instruct_distil_nf4.py \
--model-path /path/to/HunyuanImage-3.0-Instruct-Distil \
  --output-path ComfyUI/models/HunyuanImage-3.0-Instruct-Distil-NF4
- INT8 block swap on Windows: On Windows WDDM, pinned CPU memory (cudaHostAlloc) is mapped into the GPU's address space and cuMemGetInfo counts it as consumed GPU memory. With 28 blocks × ~2.4GB = ~67GB of pinned buffers, this can exhaust a 96GB GPU. The fix skips pinned buffers for INT8 and uses block.to() instead (same approach as NF4). Additionally, weight.CB and weight.data are aliased to prevent memory doubling on GPU. Tradeoff: Without pinned buffers, CRT heap fragmentation may cause slow RAM growth over many generations on Windows. For INT8 Instruct-Distil on a 96GB GPU where the model fits without block swap, use blocks_to_swap=0 for best performance.
- Block swap requires blocks_to_swap > 0 in the loader: Setting it to 0 disables block swap entirely. For BF16/INT8 Instruct models, block swap is effectively required on 96 GB GPUs.
- RAM usage accumulates: The Instruct Unload node clears the model cache, but some references may persist across successive loads. If you notice RAM creeping up, restart ComfyUI. A fix is planned.
- Instruct models need trust_remote_code=True: The loader handles this automatically. The Instruct models include custom model code that must be executed.
Managing VRAM is critical when running Hunyuan3 alongside other models. Here are the recommended workflows:
[Loader] → [Generate] → [Save Image]
- Set keep_model_loaded: True for successive runs
- Model stays cached, fast subsequent generations
Option A: Keep Hunyuan Loaded (Recommended for successive runs)
[Hunyuan Loader] → [Generate] → [Flux Detailer] → [SAM2] → [Save] → [Clear Downstream Models]
        ↑                                                                      │
        └──────────────────────────── next run ────────────────────────────────┘
- Place "Hunyuan 3 Clear Downstream Models" at the END of workflow
- Connect its trigger input to your final output
- This clears Flux/SAM2/Florence VRAM while keeping Hunyuan loaded
- Next run: Hunyuan is already loaded (fast!), downstream models reload (small)
Option B: Clear Everything Between Runs
[Hunyuan Loader] → [Generate] → [Unload] → [Flux Detailer] → [SAM2] → [Save]
                                    ↓
                        enable "clear_for_downstream"
- Use standard Unload node with clear_for_downstream: True
- Hunyuan unloads after generation, freeing VRAM for downstream
- Hunyuan reloads from disk on next run (slower but simpler)
When running workflows in multiple browser tabs, models from other tabs stay in VRAM but ComfyUI "forgets" about them.
Solution: Use the BF16 Loader's clear_vram_before_load option:
[Hunyuan Loader (BF16)] ← enable "clear_vram_before_load"
        ↓
[Generate]
- Clears ALL orphaned VRAM before loading Hunyuan
- Helps when VRAM is "mysteriously" full
| Node | When to Use |
|---|---|
| Unload | Clear Hunyuan model between runs |
| Clear Downstream Models | Keep Hunyuan, clear Flux/SAM2/etc. |
| Force Unload (Nuclear) | After OOM errors, stuck VRAM |
| Soft Unload |
The Soft Unload node moves the model to CPU RAM instead of deleting it, allowing much faster restore times.
Requirements:
- bitsandbytes >= 0.48.2 (install with pip install "bitsandbytes>=0.48.2")
- Model loaded with offload_mode='disabled' (no meta tensors)
Now works with:
- ✅ INT8 quantized models
- ✅ NF4 quantized models
- ✅ BF16 models (loaded entirely on GPU)
Does NOT work with:
- ❌ Models loaded with device_map offloading (has meta tensors)
Performance:
- Soft unload + restore: ~10-30 seconds (scales with model size)
- Full unload + reload from disk: ~2+ minutes
Use Case - Multi-Model Workflows:
[Hunyuan Generate] → [Soft Unload] → [Flux Detailer] → [SAM2] → [Restore to GPU] → [Next Hunyuan Gen]
This keeps Hunyuan in CPU RAM while other models use GPU, then restores without disk reload.
Actions:
- soft_unload: Move model to CPU RAM, free VRAM
- restore_to_gpu: Move model back to GPU
- check_status: Report current model location
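The idea behind the soft_unload and restore_to_gpu actions can be sketched in a few lines. This is an illustration of the general pattern rather than the node's code: both helper names are hypothetical, `model` is assumed to be any object with a PyTorch-style `.to(device)` method that moves it in place (as `torch.nn.Module.to` does), and the cache-clearing call is passed in so the sketch does not depend on PyTorch being installed.

```python
import gc
from typing import Callable, Optional


def soft_unload(model, empty_cache: Optional[Callable[[], None]] = None):
    """Park the model in CPU RAM and release cached VRAM (illustrative only)."""
    model.to("cpu")            # weights stay in RAM instead of being deleted
    gc.collect()               # drop unreachable Python objects
    if empty_cache is not None:
        empty_cache()          # e.g. torch.cuda.empty_cache
    return model


def restore_to_gpu(model, device: str = "cuda"):
    """Move a parked model back to the GPU without reloading from disk."""
    model.to(device)
    return model
```

With PyTorch this would be called as `soft_unload(model, torch.cuda.empty_cache)`. As the requirements above note, quantized models need a recent bitsandbytes for the device move to work.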
The BF16 loader has a target_resolution dropdown that reserves VRAM for inference:
| Option | VRAM Reserved | Use When |
|---|---|---|
| Auto (safe default) | 35GB | Unknown resolution, auto-detection |
| 1MP Fast (96GB+) | 8GB | 96GB+ cards, max speed at 1MP |
| 1MP (1024x1024) | 15GB | Standard 1MP generation |
| 2MP (1920x1080) | 30GB | HD/2K generation |
| 3MP (2048x1536) | 55GB | Large images |
| 4MP+ (2560x1920) | 75GB | Very large images |
When using the Hunyuan 3 Loader (INT8 Budget) + Hunyuan 3 Generate (Large Budget) path (recommended for 96 GB RTX 6000 Pro / similar cards), keep these controls in mind:
- `reserve_memory_gb` (loader): Amount of VRAM the loader leaves free for inference overhead. Set to ~20 GB for 1.5-2 MP work; increase when targeting 4K+ so Smart mode has headroom without forcing CPU offload.
- `gpu_memory_target_gb` (loader): How much VRAM the INT8 weights are allowed to occupy. 80 GB is a good sweet spot on 96 GB GPUs. Lower values keep more VRAM free but may require CPU spillover during load.
- `offload_mode` (Large Budget generator):
  - `smart` (default) reads the loader's reserve/budget metadata, estimates per-megapixel cost (~15 GB/MP + 6 GB), and only enables accelerate CPU offload when VRAM is critically low.
  - `disabled` forces fully-on-GPU sampling (fastest). Use this for ≤2 MP jobs when you already fit in VRAM.
  - `always` moves the model into accelerate offload before sampling. Only select this for extreme resolutions where you expect to exceed available VRAM; otherwise it slows INT8 inference to CPU speeds.
- `keep_model_loaded` (Large Budget generator): When `False`, the node calls the same cache-clearing logic as Hunyuan 3 Unload after a run, freeing the INT8 weights from VRAM/RAM. Leave it `True` for iterative workflows to avoid reload times.
- `gpu_budget_gb` override (Large Budget generator): Temporarily overrides the loader's `_loader_gpu_budget_gb` value for Smart-mode math without remapping tensors. Set it if you want Smart mode to assume a different baseline (e.g., 65 GB) without reloading the model; leave at `-1` to use the loader's recorded budget.

These generator settings never remap the model by themselves; they only influence whether accelerate offload engages and whether we clear the cached model afterward. Adjust the loader slider first (20 GB reserve / 80 GB target recommended), then use the generator controls to decide when to keep models in memory or trigger CPU offload for very large renders.
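To make the Smart-mode estimate concrete, here is a minimal sketch of the "~15 GB/MP + 6 GB" heuristic quoted above. The function names and the headroom calculation are assumptions for illustration; how the node turns the estimate into an offload decision (its "critically low" threshold) is not reproduced here.

```python
def estimate_inference_gb(width, height, gb_per_mp=15.0, base_gb=6.0):
    """Estimated inference overhead in GB: ~15 GB per megapixel plus ~6 GB."""
    megapixels = width * height / 1_000_000
    return gb_per_mp * megapixels + base_gb


def headroom_gb(total_vram_gb, gpu_budget_gb, width, height):
    """VRAM left after weights and estimated inference cost (hypothetical helper).

    A negative result suggests the job would not fit fully on the GPU.
    """
    free_after_weights = total_vram_gb - gpu_budget_gb
    return free_after_weights - estimate_inference_gb(width, height)


# Example: a 96 GB card at a few resolutions, with a 65 GB weight budget.
for width, height in [(1024, 1024), (1920, 1080), (3840, 2160)]:
    need = estimate_inference_gb(width, height)
    spare = headroom_gb(96, 65, width, height)
    print(f"{width}x{height}: ~{need:.1f} GB needed, headroom {spare:+.1f} GB")
```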
| Loader Node | Compatible Generate Node | Why? |
|---|---|---|
| Hunyuan 3 Loader (NF4) | Hunyuan 3 Generate | Keeps model on GPU. Best for standard sizes (<2MP). |
| Hunyuan 3 Loader (Full BF16) | Hunyuan 3 Generate (Large/Offload) | Keeps model in RAM. Allows CPU offloading for massive images (4K+). |
| Hunyuan 3 Loader (Full BF16) | Hunyuan 3 Generate (HighRes Efficient) | Memory-efficient MoE for 3MP+ on 96GB GPUs. |
| Hunyuan Unified Generate V2 | (built-in loader) | Single node: auto-detects NF4/INT8/BF16, handles block swap and memory. |
| Hunyuan Generate with Latent | (built-in loader, inherits V2) | 🧪 Experimental: adds image/latent inputs for composition, img2img, noise shaping. |
| Hunyuan Instruct Loader | Hunyuan Instruct Generate | T2I with CoT reasoning and prompt enhancement. |
| Hunyuan Instruct Loader | Hunyuan Instruct Image Edit | Edit images with natural language. |
| Hunyuan Instruct Loader | Hunyuan Instruct Multi-Image Fusion | Combine 2-5 reference images (4-5 experimental). |
Do not mix them! Base model loaders are NOT compatible with Instruct generate nodes (and vice versa). The NF4 Loader with the Large/Offload node will also cause errors because the quantized model cannot be moved to CPU correctly.
These nodes provide composition control, img2img, and custom noise injection for base (non-Instruct) models. They are experimental: the HunyuanImage-3 architecture is autoregressive (not diffusion-based), so traditional latent manipulation behaves differently than with Stable Diffusion.
| Node | Purpose | Category |
|---|---|---|
| Hunyuan Empty Latent | Create a random noise tensor at a specific resolution and seed | Latent utility |
| Hunyuan Latent Noise Shaping | Transform latent tensors: frequency filter, amplify, invert | Latent utility |
| Hunyuan Generate with Latent | All-in-one generate node with optional image and latent inputs | Generate (inherits V2 Unified) |
This node extends the Unified Generate V2: it inherits all memory management, block swap, model loading, and VRAM features. On top of that, it adds two optional inputs:
| Optional Input | What It Does |
|---|---|
| `image` | A ComfyUI IMAGE tensor, encoded to latent space via the model's VAE, then combined with noise according to `image_mode` |
| `latent` | A HUNYUAN_LATENT tensor from Empty Latent or Noise Shaping, used as the noise source instead of random generation |
Image Modes (when an image is connected):
| Mode | Description | Best For |
|---|---|---|
| `composition` | Extracts broad spatial layout from the image (low-pass filter in latent space) and modulates noise amplitude. No ghosting; only macro composition transfers. | Guiding where subjects/backgrounds appear |
| `img2img` | Traditional latent mix: (1−σ)·clean + σ·noise. Low denoise preserves the image; high denoise adds variation. Can ghost at low denoise. | Classic image-to-image workflows |
| `energy_map` | Uses per-channel energy (absolute magnitude) of the image latent to scale noise spatially. More abstract than composition. | Abstract/artistic noise shaping |
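The img2img mode described above blends the clean image latent with noise. The sketch below shows that linear mix on plain Python lists standing in for latent tensors. It assumes the mixing weight is the node's `denoise_strength` value, which is an inference from the description rather than something confirmed from the node's source; `mix_latent` is a hypothetical name.

```python
import random


def mix_latent(clean, noise, denoise):
    """Linear img2img-style mix: (1 - denoise) * clean + denoise * noise.

    `clean` and `noise` are flat lists of floats standing in for latent tensors.
    """
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoise must be between 0.0 and 1.0")
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    return [(1.0 - denoise) * c + denoise * n for c, n in zip(clean, noise)]


rng = random.Random(42)
clean_latent = [0.5, -1.0, 2.0, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]

# Low denoise stays close to the image latent; high denoise approaches pure noise.
for strength in (0.0, 0.5, 1.0):
    mixed = mix_latent(clean_latent, noise, strength)
    print(strength, [round(v, 3) for v in mixed])
```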
Usage Modes (determined by which optional inputs are connected):
| Image | Latent | Behavior |
|---|---|---|
| ❌ | ❌ | Identical to V2 Unified (pure text-to-image) |
| ✅ | ❌ | Image-guided generation using `image_mode` |
| ❌ | ✅ | Custom noise / latent injection |
| ✅ | ✅ | Image-guided with custom noise base |
┌──────────────┐
│  Load Image  │
└──────┬───────┘
       │ IMAGE
       ▼
┌──────────────────────────────────────────┐
│  Hunyuan Generate with Latent            │
│  model_name: HunyuanImage-3-NF4-v2       │
│  prompt: "A cyberpunk cityscape..."      │
│  image_mode: composition                 │
│  denoise_strength: 0.5                   │
│  resolution: 1024x1024                   │
└───────────┬──────────────────────────────┘
            │ IMAGE
            ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
Note: Because HunyuanImage-3 is autoregressive, img2img results differ significantly from diffusion-model img2img. The `composition` mode is generally recommended for more predictable layout guidance.
┌──────────────────────────────────┐
│  Hunyuan 3 Loader (NF4)          │
│  model_name: HunyuanImage-3-NF4  │
│  keep_in_cache: True             │
└───────────┬──────────────────────┘
            │ HUNYUAN_MODEL
            ▼
┌──────────────────────────────────┐
│  Hunyuan 3 Generate              │
│  prompt: "..."                   │
│  steps: 50                       │
│  resolution: 1024x1024           │
│  guidance_scale: 7.5             │
└───────────┬──────────────────────┘
            │ IMAGE
            ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
✨ Feature: Uses official HunyuanImage-3.0 system prompts to professionally expand your prompts for better results.
Supported APIs (any OpenAI-compatible endpoint):
- DeepSeek (default, recommended for cost)
- OpenAI GPT-4/GPT-3.5
- Claude (via OpenAI-compatible proxy)
- Local LLMs (via LM Studio, Ollama with OpenAI API)
Setup (Secure):
- Rename `api_config.ini.example` to `api_config.ini` in the custom node folder.
- Add your API key to the file:

      [API]
      api_key = sk-your-key-here

- Alternatively, set environment variables: `HUNYUAN_API_KEY`, `HUNYUAN_API_URL`.
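For readers scripting around the same configuration, the following is a minimal sketch of reading these settings with the standard library. The `[API]` section, the `api_key` option, and the two environment variable names come from the setup steps above; the function name and the choice to let environment variables take precedence over the file are assumptions, not the node's confirmed behavior.

```python
import configparser
import os


def load_api_settings(config_path="api_config.ini"):
    """Return (api_key, api_url) from environment variables or the INI file.

    Assumption in this sketch: environment variables win over the file.
    """
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is skipped without raising
    file_key = parser.get("API", "api_key", fallback=None)

    api_key = os.environ.get("HUNYUAN_API_KEY") or file_key
    api_url = os.environ.get("HUNYUAN_API_URL")  # None if not set
    return api_key, api_url


api_key, api_url = load_api_settings()
print("API key configured:", api_key is not None)
print("Custom API URL:", api_url or "(not set)")
```

Printing only whether a key is present, rather than the key itself, keeps the secret out of logs.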
Usage:
- Option 1 (Integrated): Enable `enable_prompt_rewrite` in the Generate node.
- Option 2 (Standalone): Use the Hunyuan Prompt Rewriter node to rewrite prompts before passing them to any model.
┌─────────────────────────┐      ┌─────────────────────────┐
│ Hunyuan Prompt Rewriter │      │ Hunyuan 3 Generate      │
│ prompt: "dog running"   │ ───► │ prompt: (rewritten)     │
│ rewrite_style: ...      │      │ ...                     │
└─────────────────────────┘      └─────────────────────────┘
Result: Automatically expands to:
"An energetic brown and white border collie running across a sun-drenched meadow filled with wildflowers, motion blur on legs showing speed, golden hour lighting, shallow depth of field, professional photography, high detail, 8k quality"
For high-resolution outputs (2K, 4K, 6MP+):
┌─────────────────────────────────┐
│ Hunyuan 3 Generate (Large)      │
│ resolution: 3840x2160 - 4K UHD  │
│ cpu_offload: True               │
│ steps: 50                       │
└─────────────────────────────────┘
The Hunyuan 3 Generate (HighRes Efficient) node enables 3MP to 4K+ generation on 96GB GPUs where the standard Large/Offload node runs out of memory.
Recommended Workflow:
┌─────────────────────────────────┐
│ Hunyuan 3 Loader (Full BF16)    │
│ target_resolution: 3MP          │
└───────────┬─────────────────────┘
            │ HUNYUAN_MODEL
            ▼
┌─────────────────────────────────┐
│ Hunyuan 3 Generate              │
│ (HighRes Efficient)             │
│ resolution: 2048x1536 (3.1MP)   │
│ guidance_scale: 7.5             │
│ steps: 50                       │
│ offload_mode: smart             │
└───────────┬─────────────────────┘
            │ IMAGE
            ▼
┌──────────────┐
│  Save Image  │
└──────────────┘
Why use this instead of the standard Large/Offload node?
The standard node OOMs at 3MP+ because the upstream MoE routing creates a massive intermediate tensor (the "dispatch_mask") that grows quadratically with token count. The HighRes Efficient node replaces this with a loop-based dispatch that is ~75× more memory-efficient. See technical details below.
VRAM Requirements (BF16, 96GB GPU):
| Resolution | Standard Large Node | HighRes Efficient Node |
|---|---|---|
| 1MP (1024×1024) | ✅ ~42 GB total | ✅ ~42 GB total |
| 2MP (1920×1080) | ✅ ~55 GB total | ✅ ~52 GB total |
| 3MP (2048×1536) | ❌ OOM (~83 GB) | ✅ ~55 GB total |
| 4MP (2560×1920) | ❌ OOM | ✅ ~70 GB total |
| 8MP (3840×2160) | ❌ OOM | |
Quality: Identical to the standard node: same routing decisions, same expert MLPs, just dispatched more efficiently.
Speed: Similar, because the same expert computations run on the same data. The loop-based dispatch adds negligible overhead compared to the MLP computations themselves.
HunyuanImage-3.0 is a Mixture-of-Experts (MoE) model with 64 expert MLPs per layer, but each token only uses its top-8. The question is how to route tokens to experts.
The VAE has a 16× downsample factor in each spatial dimension. Each latent pixel becomes one token:
| Resolution | Latent Size | Image Tokens |
|---|---|---|
| 1024×1024 (1MP) | 64×64 | 4,096 |
| 1920×1080 (2MP) | 120×68 | ~8,160 |
| 2048×1536 (3MP) | 128×96 | 12,288 |
| 3840×2160 (8MP) | 240×135 | 32,400 |
When CFG (Classifier-Free Guidance) is enabled (guidance_scale > 1.0), the batch is
doubled: the model runs a conditional and unconditional pass together. So at 3MP
with CFG, N ≈ 25,088 tokens flow through each MoE layer.
Note on CFG: The `guidance_scale` value (4.0, 5.0, 7.5, 20.0) only controls the blending ratio between conditional and unconditional predictions. It does not change memory usage. The batch doubling is binary: `guidance_scale = 1.0` → no doubling, `guidance_scale > 1.0` → always doubled.
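The token counts above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: partial latent rows are rounded up (which matches the 120×68 entry for 1920×1080), and the gap between 2 × 12,288 and the quoted N ≈ 25,088 is treated as 256 non-image tokens per pass. That 256 figure is inferred from the numbers in this section, not a documented constant.

```python
import math

VAE_DOWNSAMPLE = 16  # 16x per spatial dimension, as stated above


def image_tokens(width, height):
    """One token per latent pixel; partial latent rows/columns round up."""
    latent_w = math.ceil(width / VAE_DOWNSAMPLE)
    latent_h = math.ceil(height / VAE_DOWNSAMPLE)
    return latent_w * latent_h


def moe_tokens(width, height, guidance_scale, extra_tokens_per_pass=0):
    """Tokens flowing through each MoE layer; CFG doubles the batch."""
    passes = 2 if guidance_scale > 1.0 else 1
    return passes * (image_tokens(width, height) + extra_tokens_per_pass)


for width, height in [(1024, 1024), (1920, 1080), (2048, 1536), (3840, 2160)]:
    print(f"{width}x{height}: {image_tokens(width, height)} image tokens")

# 3MP with CFG, assuming 256 extra (non-image) tokens per pass.
print("3MP with CFG:", moe_tokens(2048, 1536, 7.5, extra_tokens_per_pass=256))
# The guidance value itself does not matter, only whether it exceeds 1.0.
print("same at scale 20.0:", moe_tokens(2048, 1536, 20.0, extra_tokens_per_pass=256))
```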
The upstream "eager" MoE implementation builds a 3D boolean tensor, the dispatch_mask,
of shape [N_tokens, 64_experts, expert_capacity].
Think of it as a seating chart for 64 rooms (experts), each with a fixed number of
chairs. For every token, the mask records which room and chair it sits in. With
drop_tokens=True, the capacity per expert is:
expert_capacity = N × top_k / num_experts = N × 8 / 64 = N / 8
The full dispatch_mask shape is therefore [N, 64, N/8], so it grows quadratically with N.
Two einsum operations use it:
- Dispatch: `einsum("sec,sm->ecm", dispatch_mask, input)` gathers tokens into expert buffers
- Combine: `einsum("sec,ecm->sm", combine_weights, expert_output)` scatters results back
This is mathematically elegant (fully parallelized, single matrix op) but extremely memory-wasteful: the mask is at least 87.5% zeros, since each token visits only 8 of the 64 experts.
| Tensor | dispatch_mask approach | Loop approach |
|---|---|---|
| Routing info | [25088, 64, 3136] = 5 GB (bool) | [25088, 8] = ~1.6 MB (int64) |
| Cast to bf16 for einsum | 10 GB | not needed |
| combine_weights | 10 GB | not needed |
| Einsum intermediates | ~15 GB | not needed |
| Per-expert gather/scatter | - | ~200 MB peak |
| Total MoE intermediate | ~37 GB | ~200 MB |
Model weights (30 GB) + KV cache (16 GB) + MoE intermediates (37 GB) = 83 GB → OOM.
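The routing-tensor sizes can be checked with simple arithmetic. The sketch below computes the byte count of a dense dispatch mask of shape [N, 64, N/8] and of a top-k index tensor of shape [N, 8]. The function names are illustrative, and element sizes (1 byte for bool, 2 for bf16, 8 for int64) are standard tensor dtype sizes.

```python
def dispatch_mask_bytes(n_tokens, num_experts=64, top_k=8, bytes_per_elem=1):
    """Size of a dense [N, num_experts, capacity] mask, with capacity = N * top_k / num_experts."""
    capacity = n_tokens * top_k // num_experts
    return n_tokens * num_experts * capacity * bytes_per_elem


def topk_index_bytes(n_tokens, top_k=8, bytes_per_elem=8):
    """Size of a [N, top_k] index tensor (int64 by default)."""
    return n_tokens * top_k * bytes_per_elem


N = 25_088  # 3MP with CFG, from the section above
GB = 1e9
MB = 1e6

print(f"dispatch_mask as bool: {dispatch_mask_bytes(N) / GB:.2f} GB")
print(f"dispatch_mask as bf16: {dispatch_mask_bytes(N, bytes_per_elem=2) / GB:.2f} GB")
print(f"top-k indices (int64): {topk_index_bytes(N) / MB:.2f} MB")

# Doubling the token count quadruples the mask, but only doubles the index tensor.
print("mask growth at 2N:", dispatch_mask_bytes(2 * N) / dispatch_mask_bytes(N))
print("index growth at 2N:", topk_index_bytes(2 * N) / topk_index_bytes(N))
```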
Instead of building the giant seating chart, the HighRes node patches the MoE forward to use a simple loop:
For each of the 64 experts:
1. Find which tokens chose this expert (index lookup from top-k)
2. Gather those ~N/8 tokens
3. Run the expert MLP on them
4. Scatter-add weighted results back to the output
Same routing decisions, same expert computations, same output quality. The only data
structure needed is the top-k indices and weights, of shape [N, 8], which is tiny.
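The four steps above can be sketched in plain Python. This is a toy illustration, not the node's patched forward: lists stand in for tensors, each "expert" is a hypothetical scaling function rather than an MLP, and the routing weights are fixed. It checks that the per-expert loop produces the same output as computing each token's weighted expert sum directly.

```python
import math
import random


def make_expert(scale):
    """Stand-in for an expert MLP: scales every feature by a constant."""
    return lambda x: [scale * v for v in x]


def moe_per_token(tokens, topk_idx, topk_w, experts):
    """Reference: for each token, sum its weighted top-k expert outputs."""
    out = []
    for x, ids, weights in zip(tokens, topk_idx, topk_w):
        acc = [0.0] * len(x)
        for e, w in zip(ids, weights):
            acc = [a + w * v for a, v in zip(acc, experts[e](x))]
        out.append(acc)
    return out


def moe_loop_dispatch(tokens, topk_idx, topk_w, experts):
    """Loop over experts; only the [N, top_k] indices and weights are needed."""
    out = [[0.0] * len(x) for x in tokens]
    for e, expert in enumerate(experts):
        # 1. Find which tokens chose this expert, with the matching weight.
        chosen = [(t, topk_w[t][slot])
                  for t, ids in enumerate(topk_idx)
                  for slot, idx in enumerate(ids) if idx == e]
        if not chosen:
            continue
        # 2. Gather those tokens into a small per-expert batch.
        batch = [tokens[t] for t, _ in chosen]
        # 3. Run the expert on the gathered batch only.
        results = [expert(x) for x in batch]
        # 4. Scatter-add the weighted results back to the output.
        for (t, w), y in zip(chosen, results):
            out[t] = [o + w * v for o, v in zip(out[t], y)]
    return out


rng = random.Random(0)
num_experts, top_k, dim = 8, 2, 4
experts = [make_expert(e + 1.0) for e in range(num_experts)]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(16)]
topk_idx = [rng.sample(range(num_experts), top_k) for _ in tokens]
topk_w = [[0.7, 0.3] for _ in tokens]

ref = moe_per_token(tokens, topk_idx, topk_w, experts)
out = moe_loop_dispatch(tokens, topk_idx, topk_w, experts)
assert all(math.isclose(a, b, abs_tol=1e-12)
           for r, o in zip(ref, out) for a, b in zip(r, o))
print("loop dispatch matches the per-token reference")
```

In a tensor framework, steps 1, 2 and 4 would typically be an index lookup, a gather, and an indexed add, so the peak extra memory is roughly one expert's batch at a time.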
- 832x1280 - Portrait (1.0MP) [<2MP] ✅ Safe, fast
- 1024x1024 - Square (1.0MP) [<2MP] ✅ Safe, fast
- 1280x832 - Landscape (1.0MP) [<2MP] ✅ Safe, fast
- 1536x1024 - Landscape (1.5MP) [<2MP] ✅ Safe, fast
- 2048x2048 - Square (4.0MP) [>2MP] ⚠️ May OOM
- 2560x1440 - Landscape 2K (3.7MP) ✅ With CPU offload
- 3840x2160 - Landscape 4K UHD (8.3MP) ✅ With CPU offload
- 3072x2048 - Landscape 6MP (6.3MP) ✅ With CPU offload
Tip: Test prompts at small resolutions (fast), then render finals in large node.
Single GPU (24-48GB VRAM):
Use: Hunyuan 3 Loader (NF4)
Settings:
- keep_in_cache: True (for multiple generations)
- Use standard Generate node for <2MP
Single GPU (80-96GB VRAM):
Use: Hunyuan 3 Loader (88GB GPU Optimized)
Settings:
- reserve_memory_gb: 14.0 (leaves room for inference)
- Full BF16 quality
Multi-GPU Setup:
Use: Hunyuan 3 Loader (Multi-GPU BF16)
Settings:
- primary_gpu: 0 (where inference runs)
- reserve_memory_gb: 12.0
- Automatically distributes across all GPUs
To get the maximum speed and avoid unnecessary offloading (which slows down generation):
- Reserve Enough VRAM:
  - Use the `reserve_memory_gb` slider in the Loader.
  - Set it high enough to cover the generation overhead for your target resolution (e.g., 30GB+ for 4K).
  - Why? If you reserve space upfront, the model stays on the GPU. If you don't, the "Smart Offload" might panic and move everything to RAM to prevent a crash.
- Select Specific Resolutions:
  - Avoid using "Auto (model default)" in the Large Generate node if you are optimizing for speed.
  - Auto Mode Safety: When "Auto" is selected, the node assumes a large resolution (~2.5MP) to be safe. This might trigger offloading even if your actual image is small.
  - Specific Mode: Selecting "1024x1024" tells the node exactly how much VRAM is needed, allowing it to skip offload if you have the space.
Create your own NF4 quantized model:
cd quantization
python hunyuan_quantize_nf4.py \
--model-path "/path/to/HunyuanImage-3" \
--output-path "/path/to/HunyuanImage-3-NF4"

Benefits:
- ~4x smaller (80GB → 20GB model size)
- ~45GB VRAM usage (vs 80GB+ for BF16)
- Minimal quality loss
- Attention layers kept in full precision for stability
Configuration: PyTorch 2.9.1 + CUDA 12.8, NF4 Loader + Standard Generate
| Resolution | Steps | Time | Speed | VRAM |
|---|---|---|---|---|
| 1024x1024 (~1MP) | 40 | ~58s | ~1.45s/step | ~45GB |
This is the fastest configuration for Blackwell GPUs thanks to PyTorch 2.9's optimizations. Key settings:
- Use NF4 Loader (standard or Low VRAM+ variant)
- Use Hunyuan 3 Generate (standard node, no offload needed)
- Keep model loaded between runs for successive generations
- Soft Unload supported with bitsandbytes 0.48.2+
Configuration: INT8 Loader + Large Generate (Budget) with offload_mode=disabled
| Resolution | Steps | Time | Speed | VRAM |
|---|---|---|---|---|
| 1152x864 (~1MP) | 40 | 2:35 | 3.9s/step | 85GB → 95GB |
INT8 offers slightly better quality than NF4 but is slower. Key settings:
- Use INT8 Loader (Budget) with ~80GB GPU target
- Use Large Generate (Budget) with `offload_mode=disabled`
- Keep model loaded between runs for successive generations
- Load time: ~35 seconds
- 1024x1024 @ 50 steps: ~4 seconds/step
- VRAM usage: ~45GB
- Load time: ~60 seconds
- 1024x1024 @ 50 steps: ~3.5 seconds/step
- VRAM usage: ~70GB + 10GB distributed
- Load time: ~25 seconds
- 1024x1024 @ 50 steps: ~3 seconds/step
- VRAM usage: ~80GB
Solutions:
- For 3MP+ BF16 generations: Use "Hunyuan 3 Generate (HighRes Efficient)". The standard Large node OOMs at 3MP+ due to MoE dispatch_mask memory. See technical details.
- Use NF4 quantized model instead of full BF16
- Reduce resolution (pick options marked `[<2MP]`)
- Lower `steps` (try 30-40 instead of 50)
- Use "Hunyuan 3 Generate (Large/Offload)" node with `cpu_offload: True` (for ≤2MP)
- Run "Hunyuan 3 Unload" node before generating
- Set `keep_in_cache: False` in loader
If using NF4 quantization:
- Re-quantize with the updated script (includes attention layer fix)
- Old quantized models may produce artifacts
Check:
- Run "Hunyuan 3 GPU Info" node
- Look for `CUDA_VISIBLE_DEVICES` environment variable
- Ensure ComfyUI can see all GPUs: `torch.cuda.device_count()`

Fix:
# Remove GPU visibility restrictions
unset CUDA_VISIBLE_DEVICES
# Restart ComfyUI

Optimizations:
- Use NF4 quantized model (faster than BF16)
- Reduce `steps` (30-40 is often sufficient)
- Keep model in cache (`keep_in_cache: True`)
- Use smaller resolutions for testing
Good prompts include:
- Subject: What is the main focus
- Action: What is happening
- Environment: Where it takes place
- Style: Artistic style, mood, atmosphere
- Technical: Lighting, composition, quality keywords
Example:
A majestic snow leopard prowling through a misty mountain forest at dawn,
dappled golden light filtering through pine trees, shallow depth of field,
wildlife photography, National Geographic style, 8k, highly detailed fur texture
Note: HunyuanImage-3.0 uses an autoregressive architecture (like GPT) rather than diffusion, so it doesn't support negative prompts. Instead, be explicit in your prompt about what you want to include.
Set a specific seed (0-18446744073709551615) to get the same image:
seed: 42 # Use any number, same seed = same image
HunyuanImage-3.0:
- Architecture: Native multimodal autoregressive transformer
- Parameters: 80B total (13B active per token)
- Experts: 64 experts (Mixture of Experts architecture)
- Training: Text-to-image with RLHF post-training
- License: Apache 2.0 (see Tencent repo for details)
Paper: HunyuanImage 3.0 Technical Report
Official Repo: Tencent-Hunyuan/HunyuanImage-3.0
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
Dual License (Non-Commercial and Commercial Use):
- Non-Commercial Use: Licensed under Creative Commons Attribution-NonCommercial 4.0 International License
- Commercial Use: Requires separate license. Contact eric@historic.camera or eric@rollei.us
See LICENSE for full details.
Note: The HunyuanImage-3.0 model itself is licensed under Apache 2.0 by Tencent. This license only covers the ComfyUI integration code.
Copyright (c) 2025-2026 Eric Hiss. All rights reserved.
- Author: Eric Hiss (GitHub: EricRollei)
- License: CC BY-NC 4.0 (Non-Commercial) / Commercial License Available
- Developed by: Tencent Hunyuan Team
- Official Repository: Tencent-Hunyuan/HunyuanImage-3.0
- Model License: Apache License 2.0
- Paper: HunyuanImage 3.0 Technical Report
- This integration uses the official HunyuanImage-3.0 system prompts and model architecture developed by Tencent
- Tencent Hunyuan Team for creating and open-sourcing the incredible HunyuanImage-3.0 model
- ComfyUI Community for the excellent extensible framework
- All contributors and testers
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: eric@historic.camera or eric@rollei.us
- Tencent Official: WeChat | Discord
INT8 Instruct Fix (5 bugs):
- `blocks_to_swap` was forcibly set to 0 for INT8 models in `generate()`; it now respects the user setting
- Missing `_load_int8_block_swap()` method in loader: added INT8-specific block swap loading with CB/SCB guard hooks
- INT8 model size estimated at 40GB instead of 80GB in memory budget: corrected
- `_move_non_block_components_to_gpu()` didn't fix INT8 CB/SCB after `.to(device)`; it now calls `_fix_int8_module_devices()`
- `_calculate_optimal_config()` used BF16's `gb_per_block` for INT8; it now uses quant-specific values (NF4: 0.72, INT8: 2.4, BF16: 4.7)
VAE Decode Crash Fix:
- Fixed `super(): __class__ is not a type (NoneType)` crash during VAE decode. Root cause: the cache-clearing code's "external monkey-patch remover" used `PyCell_Set(cell, None)` to nuke ALL closure cells, including `__class__` cells used by `super()` in Conv3d and other classes. Fixed with an `isinstance(val, type)` guard in `hunyuan_shared.py` and `hunyuan_cache_v2.py`.
- Cleaned up instance-level `forward` attributes left by `remove_hook_from_module` across 4 files.
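The failure mode behind this fix can be reproduced in a few lines. The sketch below is not the project's code: it uses the writable `cell_contents` attribute (Python 3.7+) instead of `PyCell_Set`, and `clear_closure_cells`, `Base`, and `Child` are hypothetical names. It shows that zero-argument `super()` depends on a hidden `__class__` closure cell, and that skipping cells that hold a type keeps it working.

```python
class Base:
    def decode(self):
        return "base"


class Child(Base):
    def decode(self):
        # Zero-argument super() reads the hidden __class__ closure cell.
        return "child+" + super().decode()


def clear_closure_cells(fn, skip_class_cells=True):
    """Set a function's closure cells to None; return how many were cleared.

    With skip_class_cells=True, cells holding a type (such as __class__)
    are left alone, mirroring an isinstance(val, type) guard.
    """
    cleared = 0
    for cell in fn.__closure__ or ():
        try:
            val = cell.cell_contents
        except ValueError:  # empty cell, nothing to clear
            continue
        if skip_class_cells and isinstance(val, type):
            continue
        cell.cell_contents = None
        cleared += 1
    return cleared


# Guarded clearing leaves the __class__ cell intact, so super() still works.
print("guarded, cells cleared:", clear_closure_cells(Child.decode))
print(Child().decode())

# Unguarded clearing wipes the __class__ cell and breaks super().
clear_closure_cells(Child.decode, skip_class_cells=False)
try:
    Child().decode()
except RuntimeError as exc:
    print("unguarded clear breaks super():", exc)
```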
v2 Pre-Quantized Models:
- All 6 INT8/NF4 models (base + Instruct-Distil + Instruct-Full) re-quantized and uploaded to Hugging Face with improved model cards and block swap guidance.
Experimental Latent Control Nodes:
- `Hunyuan Empty Latent`: creates random noise tensors at specific resolution/seed
- `Hunyuan Latent Noise Shaping`: frequency filtering, amplification, inversion of latent tensors
- `Hunyuan Generate with Latent`: all-in-one node with optional image/latent inputs for composition control, img2img, and energy map modes
Unified Generate V2:
- Single node replacing all base-model generate variants: auto-detects NF4/INT8/BF16, handles block swap, memory budgets, and VRAM management
Resolution & UX Improvements:
- Instruct resolution dropdown expanded to all 33 model-native bucket resolutions (~1MP each), ordered tallest portrait → square → widest landscape
- Multi-Image Fusion node expanded from 3 to 5 image inputs (slots 4-5 experimental)
- Resolution tooltips updated across all Instruct generate nodes
Bug Fixes:
- Issue #16, NF4 Low VRAM OOM: Two-stage `max_memory` estimation in quantized loader replaces one-shot approach that left no headroom for inference
- Issue #15, Multi-GPU device mismatch: Explicit `.to(device)` on `freqs_cis` / `image_pos_id` prevents cross-device errors during block-swap forward pass
- Issue #12, Transformers 5.x compatibility: `_lookup` dict guard, `BitsAndBytesConfig` import path, and `modeling_utils` attribute checks updated
Code Quality:
- Instruct nodes: added missing multi-GPU block-swap patch, OOM error handlers for Image Edit and Multi-Fusion
- Removed dead `gc` import from `hunyuan_highres_nodes.py`
- Cache v2: added `clear_generation_cache()` helper used by all generate nodes
- Shared utilities: centralized `_aggressive_vram_cleanup()` with stale KV-cache detection
- Block swap: `_lookup` guard for INT8 `Module._apply` hook (transformers 5.x)
- Instruct model nodes (Loader, Generate, Image Edit, Multi-Fusion, Unload)
- Block swap, HighRes Efficient node, Unified V2 node
- Flexible model paths via `extra_model_paths.yaml`
- Soft Unload, Force Unload, Clear Downstream nodes
- Initial release
- Full BF16 and NF4 quantized model support
- Multi-GPU loading support
- Optional prompt rewriting with DeepSeek API
- Improved resolution organization
- Large image generation with CPU offload
- Comprehensive error handling and VRAM management
Made with ❤️ for the ComfyUI community
