EricRollei/Comfy_HunyuanImage3

🎨 HunyuanImage-3.0 ComfyUI Custom Nodes

License: CC BY-NC 4.0 Python 3.10+ ComfyUI

Professional ComfyUI custom nodes for Tencent HunyuanImage-3.0, the powerful 80B parameter native multimodal image generation model.

Latest: v1.3.0

  • ✅ Instruct (full) INT8 now working: Five bugs fixed (covering INT8 block swap, CB/SCB device tracking, and memory budget estimation), plus a VAE decode super() crash caused by closure cell corruption. All Instruct INT8 variants (Distil and full) are fully operational.
  • 📦 v2 pre-quantized models on Hugging Face: Improved quantization with better block swap defaults. Available for all 6 base + Instruct variants (INT8 and NF4). See download links below.
  • 🧪 Experimental latent/image input nodes (base models): Hunyuan Empty Latent, Hunyuan Latent Noise Shaping, and Hunyuan Generate with Latent provide composition control, img2img, and custom noise injection for the base (non-Instruct) pipeline. See Latent Control Nodes.
  • 🛠️ Unified Generate V2 node: A single node replaces all base-model generate variants. It auto-detects NF4/INT8/BF16 and handles block swap, memory budgets, and VRAM management.

Previous highlights (v1.2.0):

  • 🎨 Instruct resolution overhaul: all 33 model-native bucket resolutions.
  • 🔀 Multi-Image Fusion expanded to 5 inputs.
  • 🛡️ Transformers 5.x compatibility. NF4 Low VRAM OOM fix (Issue #16). Multi-GPU device mismatch fix (Issue #15).
  • 🚀 HighRes Efficient generate node enables 3MP–4K+ generation on 96GB GPUs by replacing the memory-hungry MoE dispatch_mask with loop-based expert routing that uses ~75× less VRAM. See High-Resolution Generation below.
  • ✅ NF4 Low VRAM Loader + Low VRAM Budget generator are verified on 24‑32 GB cards thanks to the custom device-map strategy that pins every quantized layer on GPU.
  • ✅ INT8 Budget loader works end-to-end and pre-quantized checkpoints are available on Hugging Face.
  • 📸 Added a reference workflow (below) showing the recommended node pair for Low VRAM setups.

πŸ™ Acknowledgment: This project integrates the HunyuanImage-3.0 model developed by Tencent Hunyuan Team and uses their official system prompts. The model and original code are licensed under Apache 2.0. This integration code is separately licensed under CC BY-NC 4.0 for non-commercial use.

📋 TODO / Known Issues

  • NF4 Low VRAM Loader: Custom device map keeps NF4 layers on GPU so 24‑32 GB cards can use the Low VRAM Budget workflow without bitsandbytes errors.
  • Provide example workflow/screenshot for Low VRAM users (see below).
  • Instruct nodes (done): Loader, Generate, Image Edit, Multi-Image Fusion, and Unload nodes for Instruct models.
  • Instruct block swap: BF16, INT8, and NF4 Distil models verified working with block swap.
  • Instruct non-distil INT8: ✅ Fixed in v1.3.0. Five bugs resolved: (1) blocks_to_swap forced to 0 for INT8, (2) missing _load_int8_block_swap() method, (3) INT8 model size estimated at 40GB instead of 80GB, (4) CB/SCB not fixed after .to(device), (5) wrong gb_per_block for INT8 in optimal config. Also fixed: a VAE decode super() crash from closure cell corruption.
  • Add screenshots/documentation for every node (in progress).
  • Test and document multi-GPU setup.
  • Continue long-run stability testing on the INT8 Budget loader with CPU offload edge cases.
  • LoRA loading support: Load pre-trained PEFT LoRA adapters for inference (merge-and-unload or live adapter swapping). Shelved until the HunyuanImage-3.0 LoRA ecosystem matures; only a handful of LoRAs exist currently. See PhotonAISG/hunyuan-image3-finetune for training scripts.

🎯 Features

  • Multiple Loading Modes: Full BF16, INT8/NF4 Quantized, Single GPU, Multi-GPU
  • Smart Memory Management: Automatic VRAM tracking, cleanup, and optimization
  • High-Quality Image Generation:
    • Standard generation (<2MP) - Fast, GPU-only
    • Large image generation (2MP-8MP+) - CPU offload support
  • Instruct Model Support (NEW):
    • Built-in Chain-of-Thought (CoT) prompt enhancement (no external API needed)
    • Image-to-Image editing with natural language instructions
    • Multi-image fusion (combine 2–5 reference images; 4–5 experimental)
    • Block swap for BF16/INT8/NF4 models on 48–96GB GPUs
  • Advanced Prompting:
    • Optional prompt enhancement using official HunyuanImage-3.0 system prompts
    • Supports any OpenAI-compatible LLM API (DeepSeek, OpenAI, Claude, local LLMs)
    • Two professional rewriting modes: en_recaption (structured) and en_think_recaption (advanced)
  • Professional Resolution Control:
    • All 33 model-native bucket resolutions (~1MP each) in the dropdown
    • Ordered tallest portrait (512×2048) → square (1024×1024) → widest landscape (2048×512)
    • Aspect ratio labels and Auto resolution option
  • Production Ready: Comprehensive error handling, detailed logging, VRAM monitoring

📦 Installation

Prerequisites

  • ComfyUI installed and working
  • NVIDIA GPU with CUDA support
  • Minimum 24GB VRAM for NF4 quantized model
  • Minimum 80GB VRAM (or multi-GPU) for full BF16 model
  • Python 3.10+
  • PyTorch 2.9+ with CUDA 12.8+ (recommended for best performance)

System Requirements & Hardware Recommendations

The hardware requirements depend heavily on which model version you use.

1. Full BF16 Model (Original)

This is the uncompressed 80B parameter model. It is massive.

  • Model Size: ~160GB on disk.
  • VRAM:
    • Ideal: 80GB+ (A100, H100, RTX 6000 Ada). Runs entirely on GPU.
    • Minimum: 24GB (RTX 3090/4090). Requires massive System RAM.
  • System RAM (CPU Memory):
    • If you have <80GB VRAM, the model weights that don't fit on GPU are stored in RAM.
    • Requirement: 192GB+ System RAM is recommended if using a 24GB card.
    • Example: On a 24GB card, ~140GB of weights will live in RAM.
  • Performance:
    • On low VRAM cards, generation will be slow due to swapping data between RAM and VRAM.

2. NF4 Quantized Model (Recommended)

This version is compressed to 4-bit, reducing size by ~4x with minimal quality loss.

  • Model Size: ~45GB on disk.
  • VRAM:
    • Ideal: 48GB+ (RTX 6000, A6000). Runs entirely on GPU.
    • Minimum: 24GB (RTX 3090/4090).
      • Note: Since 45GB > 24GB, about half the model will live in System RAM.
      • Performance: Slower than 48GB cards, but functional.
  • System RAM:
    • 64GB+ recommended (especially for 24GB VRAM cards to hold the offloaded weights).
  • Performance: Much faster than the full BF16 model on consumer hardware.

3. INT8 Quantized Model (High Quality)

This version is compressed to 8-bit, offering near-original quality with reduced memory usage.

  • Model Size: ~85GB on disk.
  • VRAM:
    • Ideal: 80GB+ (A100, H100). Runs entirely on GPU.
    • Minimum: 24GB (RTX 3090/4090).
      • Note: Significant CPU offloading required (~60GB of weights in RAM).
  • System RAM:
    • 128GB+ recommended.
  • Performance:
    • Quality: ~98% of full precision (better than NF4).
    • Speed: Faster inference than NF4 (less dequantization overhead) but requires more memory transfer if offloading.
    • Practical Requirement: For the selective INT8 checkpoints shipped here, you realistically need a 96GB-class GPU (RTX 6000 Pro Blackwell, H100, etc.). Forcing CPU offload on smaller cards makes each step take minutes because the quantized tensors continually stream over PCIe.
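
As a rough cross-check of the offload figures quoted in sections 1–3, the sketch below subtracts the card's VRAM from the approximate on-disk model size. The function name is hypothetical, and the estimate ignores inference overhead, so real RAM usage will be somewhat higher than these numbers.

```python
def offloaded_weights_gb(model_size_gb: float, vram_gb: float) -> float:
    """Estimate how many GB of weights spill to system RAM (illustrative only)."""
    # Whatever does not fit in VRAM has to live in CPU memory.
    return max(0.0, model_size_gb - vram_gb)


# Approximate on-disk sizes quoted in this README.
MODEL_SIZES_GB = {"BF16": 160, "NF4": 45, "INT8": 85}

for name, size_gb in MODEL_SIZES_GB.items():
    spill = offloaded_weights_gb(size_gb, vram_gb=24)
    print(f"{name}: about {spill:.0f} GB of weights in RAM on a 24 GB card")
```

The results land close to the "~140GB", "about half", and "~60GB" figures given for a 24GB card.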

4. Instruct Models (NEW)

The Instruct family adds built-in prompt enhancement, image editing, and multi-image fusion. They come in two variants:

  • Instruct (full): 50-step inference, uses CFG (batch=2 during diffusion), highest quality.
  • Instruct-Distil: 8-step inference, CFG-distilled (batch=1), ~6× faster, same quality.

Both variants are available in BF16, INT8, and NF4 quantization.

| Variant | Quant | Disk Size | Min VRAM | Block Swap | Status |
|---|---|---|---|---|---|
| Instruct-Distil | NF4 | ~45 GB | 48 GB | Optional | ✅ Working |
| Instruct-Distil | INT8 | ~81 GB | 96 GB | Required | ✅ Working |
| Instruct-Distil | BF16 | ~160 GB | 96 GB | Required | ✅ Working |
| Instruct (full) | NF4 | ~45 GB | 48 GB | Optional | ✅ Working |
| Instruct (full) | INT8 | ~81 GB | 96 GB | Required | ✅ Working |
| Instruct (full) | BF16 | ~160 GB | 96 GB | Required | ✅ Working |

Quick Install

  1. Clone this repository into your ComfyUI custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/ericRollei/Eric_Hunyuan3.git
  2. Install dependencies:
cd Eric_Hunyuan3
pip install -r requirements.txt
  3. Download model weights:

Option A: Full BF16 Model (~160GB)

# Download to ComfyUI/models/
cd ../../models
huggingface-cli download tencent/HunyuanImage-3.0 --local-dir HunyuanImage-3

Option B: Download Pre-Quantized NF4 Model (~45GB) - Recommended for single GPU <96GB

You can download the pre-quantized weights directly from Hugging Face:

# Download v2 (recommended) to ComfyUI/models/
cd ../../models
huggingface-cli download EricRollei/HunyuanImage-3-NF4-v2 --local-dir HunyuanImage-3-NF4-v2

Option C: Download Pre-Quantized INT8 Model (~85GB)

If you want maximum fidelity without running the INT8 quantizer locally, grab the ready-to-go checkpoint:

# Download v2 (recommended) to ComfyUI/models/
cd ../../models
huggingface-cli download EricRollei/HunyuanImage-3-INT8-v2 --local-dir HunyuanImage-3-INT8-v2

Option D: Quantize Yourself (from Full Model)

If you prefer to quantize it yourself:

# First download full model (Option A), then quantize
cd path/to/Eric_Hunyuan3/quantization
# From custom_nodes/Eric_Hunyuan3/quantization, ComfyUI/models is three levels up
python hunyuan_quantize_nf4.py \
  --model-path "../../../models/HunyuanImage-3" \
  --output-path "../../../models/HunyuanImage-3-NF4"

Option E: Download Instruct Models

Download Instruct models into your ComfyUI/models/ directory so the loader can find them automatically. Pre-quantized INT8 and NF4 variants (v2 recommended) are available from EricRollei on Hugging Face:

cd ComfyUI/models

# Pre-quantized INT8 Instruct-Distil v2 (~81GB) - RECOMMENDED for 96GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-INT8-v2

# Pre-quantized NF4 Instruct-Distil v2 (~45GB) - for 48GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-NF4-v2

See Instruct Models for the full list, HF links, and extra_model_paths.yaml setup for custom locations.

  4. Restart ComfyUI

🚀 Usage

📸 Low VRAM Workflow Reference

[Screenshot: Low VRAM Workflow]

Pair Hunyuan 3 Loader (NF4 Low VRAM+) with Hunyuan 3 Generate (Low VRAM Budget) for 24‑32 GB cards. The loader’s GPU budget slider keeps 18–20 GB free for inference while the generator adds smart telemetry and optional prompt rewriting. The screenshot above shows a full working chain (loader → rewriter → Low VRAM Budget generator → save/display) producing a 1600×1600 render on a 24 GB test rig without triggering bitsandbytes validation errors.

Node Overview

📦 Loader Nodes

| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Loader (NF4) | 45-48GB+ VRAM; fast quantized loading; simple setup | 24-32GB VRAM (use Low VRAM+); when you need device_map control | Fast NF4 load. Full model on GPU. ~45GB VRAM. |
| Hunyuan 3 Loader (NF4 Low VRAM+) | 24-32GB VRAM; budget-constrained setups; device_map offloading | 48GB+ VRAM (use standard NF4) | Auto memory management. external_vram_gb for shared GPU. |
| Hunyuan 3 Loader (INT8 Budget) | 80-96GB VRAM; better quality than NF4; production workflows | <80GB VRAM (use NF4); speed-critical workflows | Auto memory management. ~82GB model. Higher quality than NF4. |
| Hunyuan 3 Loader (Full BF16) | 80GB+ VRAM; maximum quality; no quantization artifacts | <80GB VRAM; quick experimentation | Full precision. ~80GB VRAM. Best quality. |
| Hunyuan 3 Loader (Full BF16 GPU) | Single 80GB+ GPU; memory control needed | Multi-GPU setups; <80GB VRAM | Single GPU with reserve slider. |
| Hunyuan 3 Loader (Multi-GPU BF16) | Multi-GPU setups; distributed inference | Single GPU setups | Splits model across GPUs. |
| Hunyuan 3 Loader (88GB GPU Optimized) | ❌ DEPRECATED | Everything | Use Full BF16 Loader instead. |

🔧 Utility Nodes

| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Unload | Simple VRAM cleanup; between workflows; manual control | Fast model switching (use Soft Unload); emergency cleanup (use Force Unload) | Standard cleanup. clear_for_downstream option. |
| Hunyuan 3 Soft Unload (Fast) | Fast model switching; multi-model workflows; NF4/INT8 models | Final cleanup (model stays in RAM); RAM-constrained systems | Parks model in CPU RAM. ~5s restore vs ~90s reload. Requires bitsandbytes ≥0.48.2 for quantized. |
| Hunyuan 3 Force Unload (Nuclear) | Emergency VRAM clearing; cross-tab VRAM pollution; stuck memory situations | Normal workflows (too aggressive); when you want fast restore | Clears ALL cached models. gc.collect + cache clear. Nuclear option. |
| Hunyuan 3 Clear Downstream Models | Hunyuan + other models; end of workflow cleanup; keep Hunyuan loaded | Clearing Hunyuan itself; simple single-model workflows | Clears Flux/SAM2/etc but KEEPS Hunyuan. Connect to final output. |
| Hunyuan 3 GPU Info | Debugging; GPU detection; memory diagnostics | Actual workflow operations | Shows VRAM stats. Multi-GPU detection. |

Loader Feature Matrix

| Feature | NF4 | NF4 Low VRAM+ | INT8 Budget | Full BF16 | BF16 GPU | Multi-GPU |
|---|---|---|---|---|---|---|
| external_vram_gb option | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Auto inference reserve | ❌ | ✅ (6GB) | ✅ (12GB) | ❌ | ❌ | ❌ |
| device_map support | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Soft unload support | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| Min VRAM | ~45GB | ~18GB | ~55GB | ~80GB | ~75GB | 80GB split |
| Model quality | Good | Good | Better | Best | Best | Best |

Memory Management (NF4 Low VRAM+ & INT8 Budget)

These loaders use automatic memory management:

| Parameter | What It Does | When to Change |
|---|---|---|
| external_vram_gb | Reserve VRAM for other apps (browsers, other models) | Only if GPU is shared. Default 0 = dedicated to Hunyuan. |
| Inference reserve | Automatic: reserves space for VAE decode + activations | Never. Handled internally (6GB for NF4, 12GB for INT8). |

Example: On a 96GB GPU with external_vram_gb=0:

  • INT8 Budget: 96GB - 12GB (inference) = 84GB for model weights ✅

Example: On a 24GB GPU with external_vram_gb=2 (browser running):

  • NF4 Low VRAM+: 24GB - 6GB (inference) - 2GB (browser) = 16GB for model, rest spills to CPU RAM ✅
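
The two examples above follow the same subtraction, which can be sketched as a small helper. This is only an illustration of the arithmetic: the function and dictionary names are hypothetical, not the loaders' actual internals, and the reserve values are the 6GB/12GB figures stated in this section.

```python
# Inference reserves stated in this README (GB); the key names are illustrative.
INFERENCE_RESERVE_GB = {"nf4_low_vram": 6.0, "int8_budget": 12.0}


def weight_budget_gb(total_vram_gb: float, loader: str,
                     external_vram_gb: float = 0.0) -> float:
    """VRAM left for model weights after the automatic and external reserves."""
    reserve = INFERENCE_RESERVE_GB[loader]
    return max(0.0, total_vram_gb - reserve - external_vram_gb)


print(weight_budget_gb(96, "int8_budget"))                       # 96GB dedicated GPU
print(weight_budget_gb(24, "nf4_low_vram", external_vram_gb=2))  # 24GB GPU shared with a browser
```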

Utility Node Quick Reference

| Situation | Use This Node | Why |
|---|---|---|
| Normal end of workflow | post_action: soft_unload | Fast restore next run |
| Running Flux after Hunyuan | Clear Downstream Models | Keeps Hunyuan, clears Flux/SAM2 |
| VRAM stuck from other tab | Force Unload (Nuclear) | Clears orphaned allocations |
| Simple cleanup, slow reload OK | Unload | Standard, reliable cleanup |
| Manual control, will reload | Unload + clear_for_downstream | Explicit cleanup |

🆕 Simplified Memory Management: The NF4 Low VRAM+ and INT8 Budget loaders now use automatic memory management. Inference overhead is handled internally - just set external_vram_gb if other apps need GPU memory (default 0 for dedicated GPU).

🎯 Generate Node Reference

Quick Decision Tree

Do you have 48GB+ VRAM AND model fully on GPU?
  └─ YES → Is your target resolution 3MP+ (e.g. 2048x1536)?
              └─ YES → Use "HighRes Efficient" (efficient MoE, no OOM)
              └─ NO  → Use "Hunyuan 3 Generate" (fastest)
  └─ NO  → Are you using NF4/INT8 quantized model?
              └─ YES → Use "Low VRAM" or "Low VRAM Budget"
              └─ NO  → Use "Large/Offload" or "Large Budget"
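
The same decision tree can be written as a small function, which may be handy when scripting workflow selection. This is a sketch only: the function name and parameters are hypothetical, and the thresholds (48GB, 3MP) are taken directly from the tree above.

```python
def pick_generate_node(vram_gb: float, model_fully_on_gpu: bool,
                       target_megapixels: float, quantized: bool) -> str:
    """Mirror the Quick Decision Tree; returns the suggested node name(s)."""
    if vram_gb >= 48 and model_fully_on_gpu:
        if target_megapixels >= 3:
            return "HighRes Efficient"
        return "Hunyuan 3 Generate"
    # Not enough VRAM, or part of the model is offloaded.
    if quantized:
        return "Low VRAM or Low VRAM Budget"
    return "Large/Offload or Large Budget"


print(pick_generate_node(96, True, 1.0, quantized=True))
print(pick_generate_node(24, False, 1.0, quantized=True))
```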

Detailed Node Comparison

| Node | Best For | NOT For | Key Features |
|---|---|---|---|
| Hunyuan 3 Generate | 48GB+ VRAM; model fully on GPU; fast iteration; standard resolutions (≤2MP) | Low VRAM (<48GB); quantized models with CPU offload; very large images (>2MP) | Fastest option. No offload overhead. Simple controls. |
| Hunyuan 3 Generate (Telemetry) | Same as above; debugging VRAM issues; performance monitoring | Same as above | Adds RAM/VRAM stats to status output. Same speed as base Generate. |
| Hunyuan 3 Generate (Large/Offload) | Large images (2-8MP+); BF16 models on <80GB VRAM; when you need offload_mode control | Quantized models (INT8/NF4); quick iteration (slower startup) | Has offload_mode (smart/always/disabled). CPU offload for large inference. |
| Hunyuan 3 Generate (Large Budget) | Same as Large/Offload; need GPU budget override; want telemetry stats | Same as Large/Offload | Adds gpu_budget_gb slider and memory telemetry. |
| Hunyuan 3 Generate (Low VRAM) | NF4/INT8 quantized models; 24-48GB VRAM; models loaded with device_map | BF16 models; 96GB+ GPUs (use base Generate) | Skips conflicting CPU offload calls. Works with accelerate's device_map. |
| Hunyuan 3 Generate (Low VRAM Budget) | Same as Low VRAM; fine-tuning memory usage; production workflows | Same as Low VRAM | Best for quantized models. GPU budget + telemetry + smart defaults. |
| Hunyuan 3 Generate (HighRes Efficient) | 3MP+ resolutions (2K, 4K); BF16 models on 96GB GPUs; when standard Large node OOMs | Low VRAM (<48GB); NF4/INT8 quantized models | Memory-efficient MoE dispatch. ~75× less MoE intermediate VRAM. See technical details. |

Feature Matrix

| Feature | Generate | Telemetry | Large | Large Budget | Low VRAM | Low VRAM Budget | HighRes Efficient |
|---|---|---|---|---|---|---|---|
| offload_mode control | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| gpu_budget_gb slider | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
| Memory telemetry | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ (logging) |
| vae_tiling option | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| post_action control | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Prompt rewriting | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| device_map friendly | ⚠️ | ⚠️ | ⚠️ | ⚠️ | ✅ | ✅ | ✅ |
| Efficient MoE dispatch | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Default post_action | keep | keep | keep | keep | soft_unload | soft_unload | keep |

Recommended Pairings

| GPU VRAM | Model Type | Loader | Generator |
|---|---|---|---|
| 96GB | INT8 | INT8 Budget | Generate (fastest) |
| 96GB | NF4 | NF4 | Generate (fastest) |
| 96GB | BF16, 3MP+ | Full BF16 | HighRes Efficient |
| 48-80GB | NF4 | NF4 | Generate or Low VRAM |
| 48-80GB | BF16 | Full BF16 | Large/Offload |
| 24-32GB | NF4 | NF4 Low VRAM+ | Low VRAM Budget |
| 24-32GB | INT8 | INT8 Budget | Low VRAM Budget |

Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| Using "Generate" with NF4 on 24GB | May conflict with device_map offloading | Use "Low VRAM" or "Low VRAM Budget" |
| Using "Large/Offload" with NF4 | CPU offload hooks conflict with quantization | Use "Low VRAM" variants |
| Using "Low VRAM" on 96GB GPU | Unnecessary overhead | Use base "Generate" for speed |
| Setting offload_mode: always with device_map | Double offloading causes stalls | Low VRAM nodes auto-disable conflicting offload |

🆕 Instruct Models (NEW)

The HunyuanImage-3.0-Instruct models extend the base model with powerful new capabilities:

  • Built-in prompt enhancement: no external LLM API needed
  • Chain-of-Thought (CoT) reasoning: the model "thinks" about your prompt before generating
  • Image editing: modify images with natural language instructions
  • Multi-image fusion: combine elements from 2–5 reference images (4–5 experimental)

Instruct vs Base Model

| Feature | Base (T2I) | Instruct |
|---|---|---|
| Text-to-image | ✅ | ✅ |
| Built-in prompt enhancement | ❌ (needs API) | ✅ (CoT built-in) |
| Image editing | ❌ | ✅ |
| Multi-image fusion | ❌ | ✅ |
| Bot task modes | ❌ | ✅ (image, recaption, think_recaption) |
| CFG-distilled variant | ❌ | ✅ (8-step fast inference) |

Instruct Node Overview

| Node | Purpose | Key Features |
|---|---|---|
| Hunyuan Instruct Loader | Load any Instruct model variant | Auto-detects BF16/INT8/NF4 and Distil vs Full. Block swap support. |
| Hunyuan Instruct Generate | Text-to-image with Instruct model | Bot task modes (image/recaption/think_recaption). Returns CoT reasoning text. |
| Hunyuan Instruct Image Edit | Edit images with instructions | Takes input image + instruction, e.g. "Change the background to sunset." |
| Hunyuan Instruct Multi-Image Fusion | Combine 2–5 reference images | E.g. "Place the cat from image 1 into the scene from image 2." Images 4–5 are experimental. |
| Hunyuan Instruct Unload | Free memory | Clears cached Instruct model from VRAM and RAM. |

Instruct Loader Options

| Parameter | Description | Default |
|---|---|---|
| model_name | Auto-detected from ComfyUI/models/ and extra_model_paths.yaml | (none) |
| force_reload | Force reload even if cached | False |
| attention_impl | Attention implementation (sdpa recommended) | sdpa |
| moe_impl | MoE implementation (keep eager unless you have flashinfer) | eager |
| vram_reserve_gb | VRAM to keep free for inference (auto-boosted for CFG models) | 30.0 |
| blocks_to_swap | Number of transformer blocks to swap between GPU↔CPU (0 = no swap) | 0 |

Block Swap (VRAM Management)

Block swap enables running large models on GPUs that can't fit the entire model. It moves transformer blocks between GPU and CPU during the forward pass, keeping only 1–10 blocks on GPU at a time, with async prefetching.
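
To make the idea concrete, here is a toy, torch-free simulation of one forward pass with block swap. It is not the loader's implementation: it assumes (hypothetically) that the first blocks stay resident on the GPU, that the last `blocks_to_swap` blocks are streamed in one step ahead of use and evicted right after, and it uses a 32-block stack purely as an example size.

```python
def simulate_block_swap(num_blocks: int, blocks_to_swap: int) -> dict:
    """Count transfers and peak GPU-resident blocks for one forward pass (toy model)."""
    if not 0 <= blocks_to_swap <= num_blocks:
        raise ValueError("blocks_to_swap must be between 0 and num_blocks")
    # Assumption: the leading blocks stay on the GPU permanently.
    resident = set(range(num_blocks - blocks_to_swap))
    streamed = set()          # swapped blocks currently copied to the GPU
    transfers = 0
    peak = len(resident)
    for i in range(num_blocks):
        # Make sure the current block is present, then prefetch the next one.
        for j in (i, i + 1):
            if j < num_blocks and j not in resident and j not in streamed:
                streamed.add(j)
                transfers += 1
        peak = max(peak, len(resident) + len(streamed))
        # ... block i would run here ...
        streamed.discard(i)   # a swapped block returns to CPU after use
    return {"peak_blocks_on_gpu": peak, "transfers": transfers}


print(simulate_block_swap(num_blocks=32, blocks_to_swap=24))
```

In this toy model, a larger blocks_to_swap lowers the peak number of resident blocks at the cost of more CPU-to-GPU transfers per pass.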

When to use block swap:

  • BF16 Instruct models (~160 GB) on 96 GB GPUs → blocks_to_swap=20-24
  • INT8 Instruct-Distil models (~81 GB) on 96 GB GPUs → blocks_to_swap=24-28

Recommended settings:

| Model | blocks_to_swap | VRAM Used | Notes |
|---|---|---|---|
| Instruct-Distil NF4 | 0 | ~29 GB | Fits on 48 GB without swap |
| Instruct-Distil INT8 | 24 | ~30 GB | ✅ Tested, working on 96 GB |
| Instruct-Distil BF16 | 22 | ~50 GB | ✅ Tested, working on 96 GB |
| Instruct (full) NF4 | 0 | ~29 GB | Fits on 48 GB without swap |
| Instruct (full) INT8 | 28-31 | ~10-17 GB | ✅ Working (fixed in v1.3.0) |
| Instruct (full) BF16 | 22 | ~50 GB | ✅ Tested, working on 96 GB |

Bot Task Modes

The Instruct Generate node supports three modes via the bot_task parameter:

| Mode | Description | Speed | Best For |
|---|---|---|---|
| image | Direct generation: prompt used as-is | Fast | When you have a detailed prompt already |
| recaption | Model rewrites prompt into detailed description | Medium | General use: improves composition/lighting |
| think_recaption | CoT reasoning → rewrite → generate | Slow | Best quality: model analyzes intent first |

The cot_reasoning output returns the model's thought process (for recaption and think_recaption).

Instruct Basic Workflow

┌─────────────────────────────────┐
│ Hunyuan Instruct Loader         │
│  model_name: HunyuanImage-3.0-  │
│    Instruct-Distil-INT8         │
│  blocks_to_swap: 24             │
└───────────┬─────────────────────┘
            │ HUNYUAN_INSTRUCT_MODEL
            ▼
┌─────────────────────────────────┐
│ Hunyuan Instruct Generate       │
│  prompt: "A cat astronaut..."   │
│  bot_task: think_recaption      │
│  resolution: 1024x1024          │
└──────┬────────┬─────────────────┘
       │        │
  IMAGE      STRING (cot_reasoning)
       │
       ▼
┌──────────────┐
│ Save Image   │
└──────────────┘

Instruct Image Edit Workflow

┌─────────────────────┐
│ Load Image          │
└───────┬─────────────┘
        │ IMAGE
        ▼
┌─────────────────────────────────┐
│ Hunyuan Instruct Image Edit     │
│  instruction: "Add sunglasses   │
│    to the person"               │
│  bot_task: image                │
│  edit_strength: 0.7             │
└──────┬──────────────────────────┘
       │ IMAGE (edited)
       ▼
┌──────────────┐
│ Save Image   │
└──────────────┘

Instruct Multi-Image Fusion Workflow

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Load Image 1 │   │ Load Image 2 │   │ Load Image 3 │  (+ optional image_4, image_5)
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────────────────────────────────────────────┐
│ Hunyuan Instruct Multi-Image Fusion                  │
│  instruction: "Place the cat from image 1 into the   │
│    scene from image 2 with lighting from image 3"    │
│  bot_task: think_recaption (recommended)             │
│  resolution: 1024x1024 (1:1 Square)                  │
└───────────┬──────────────────────────────────────────┘
            │ IMAGE
            ▼
     ┌──────────────┐
     │ Save Image   │
     └──────────────┘

Note: The model officially supports up to 3 input images. Slots 4 and 5 are experimental: the pipeline accepts them, but they increase VRAM usage significantly and results may vary.

Downloading Instruct Models

Where to put them: Download all models (base and Instruct) into your ComfyUI/models/ directory. All loaders automatically scan that folder.

If you store models on a separate drive, add the path to your extra_model_paths.yaml:

comfyui:
    # Base models (NF4, INT8, BF16)
    hunyuan: |
        models/
        H:/MyModels/
    # Instruct models
    hunyuan_instruct: |
        models/
        H:/MyModels/

Note: Both hunyuan and hunyuan_instruct categories default to ComfyUI/models/. You only need the extra_model_paths.yaml entries if your models live somewhere else.

Available Models on Hugging Face

| Model | Size | Link | Notes |
|---|---|---|---|
| Instruct-Distil INT8 v2 | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 | ⭐ Recommended: 8-step, fast |
| Instruct-Distil NF4 v2 | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 | Best for 48 GB GPUs |
| Instruct (full) INT8 v2 | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-INT8-v2 | ✅ Working |
| Instruct (full) NF4 v2 | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-NF4-v2 | 50-step, highest quality |
| Instruct-Distil INT8 (v1) | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8 | Legacy |
| Instruct-Distil NF4 (v1) | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4 | Legacy |
| Instruct (full) INT8 (v1) | ~81 GB | EricRollei/HunyuanImage-3.0-Instruct-INT8 | Legacy |
| Instruct (full) NF4 (v1) | ~45 GB | EricRollei/HunyuanImage-3.0-Instruct-NF4 | Legacy |
| Instruct BF16 (full) | ~160 GB | tencent/HunyuanImage-3.0-Instruct | Tencent original |
| Instruct-Distil BF16 | ~160 GB | tencent/HunyuanImage-3.0-Instruct-Distil | Tencent original |

Download Commands

Download directly into ComfyUI/models/:

cd ComfyUI/models

# INT8 Instruct-Distil v2 (~81GB) - RECOMMENDED for 96GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-INT8-v2

# NF4 Instruct-Distil v2 (~45GB) - for 48GB GPUs
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-Distil-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-Distil-NF4-v2

# NF4 Instruct Full v2 (~45GB)
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-NF4-v2 \
  --local-dir HunyuanImage-3.0-Instruct-NF4-v2

# INT8 Instruct Full v2 (~81GB) - now fully working
huggingface-cli download EricRollei/HunyuanImage-3.0-Instruct-INT8-v2 \
  --local-dir HunyuanImage-3.0-Instruct-INT8-v2

# BF16 originals from Tencent (~160GB each)
huggingface-cli download tencent/HunyuanImage-3.0-Instruct \
  --local-dir HunyuanImage-3.0-Instruct
huggingface-cli download tencent/HunyuanImage-3.0-Instruct-Distil \
  --local-dir HunyuanImage-3.0-Instruct-Distil

Quantize yourself (from BF16 source):

cd quantization

# Instruct-Distil INT8
python hunyuan_quantize_instruct_distil_int8.py \
  --model-path /path/to/HunyuanImage-3.0-Instruct-Distil \
  --output-path ComfyUI/models/HunyuanImage-3.0-Instruct-Distil-INT8

# Instruct-Distil NF4
python hunyuan_quantize_instruct_distil_nf4.py \
  --model-path /path/to/HunyuanImage-3.0-Instruct-Distil \
  --output-path ComfyUI/models/HunyuanImage-3.0-Instruct-Distil-NF4

Known Limitations

  1. INT8 block swap on Windows: On Windows WDDM, pinned CPU memory (cudaHostAlloc) is mapped into the GPU’s address space and cuMemGetInfo counts it as consumed GPU memory. With 28 blocks × ~2.4GB = ~67GB of pinned buffers, this can exhaust a 96GB GPU. The fix skips pinned buffers for INT8 and uses block.to() instead (same approach as NF4). Additionally, weight.CB and weight.data are aliased to prevent memory doubling on GPU. Tradeoff: Without pinned buffers, CRT heap fragmentation may cause slow RAM growth over many generations on Windows. For INT8 Instruct-Distil on a 96GB GPU where the model fits without block swap, use blocks_to_swap=0 for best performance.

  2. Block swap requires blocks_to_swap > 0 in the loader: Setting it to 0 disables block swap entirely. For BF16/INT8 Instruct models, block swap is effectively required on 96 GB GPUs.

  3. RAM usage accumulates: The Instruct Unload node clears the model cache, but some references may persist across successive loads. If you notice RAM creeping up, restart ComfyUI. A fix is planned.

  4. Instruct models need trust_remote_code=True: The loader handles this automatically. The Instruct models include custom model code that must be executed.
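
The pinned-buffer figure in limitation 1 is simple multiplication, sketched below for other blocks_to_swap values. The helper name is hypothetical, and the ~2.4GB per block is the INT8 figure quoted above; other quantizations will differ.

```python
def pinned_buffer_gb(blocks_to_swap: int, gb_per_block: float = 2.4) -> float:
    """Approximate pinned host memory needed if every swapped block gets a pinned buffer."""
    return blocks_to_swap * gb_per_block


for blocks in (24, 28, 31):
    total = pinned_buffer_gb(blocks)
    share = total / 96  # fraction of a 96GB card that WDDM would count as consumed
    print(f"{blocks} blocks: ~{total:.1f} GB pinned ({share:.0%} of a 96 GB GPU)")
```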


🧠 VRAM Management & Workflow Strategies

Managing VRAM is critical when running Hunyuan3 alongside other models. Here are the recommended workflows:

Scenario 1: Hunyuan Only (Simple)

[Loader] → [Generate] → [Save Image]
  • Set keep_model_loaded: True for successive runs
  • Model stays cached, fast subsequent generations

Scenario 2: Hunyuan with Downstream Models (Flux, SAM2, Florence)

Option A: Keep Hunyuan Loaded (Recommended for successive runs)

[Hunyuan Loader] → [Generate] → [Flux Detailer] → [SAM2] → [Save] → [Clear Downstream Models]
       ↑                                                                      ↓
       └──────────────────────── next run ────────────────────────────────────┘
  • Place "Hunyuan 3 Clear Downstream Models" at the END of workflow
  • Connect its trigger input to your final output
  • This clears Flux/SAM2/Florence VRAM while keeping Hunyuan loaded
  • Next run: Hunyuan is already loaded (fast!), downstream models reload (small)

Option B: Clear Everything Between Runs

[Hunyuan Loader] → [Generate] → [Unload] → [Flux Detailer] → [SAM2] → [Save]
                                   ↑
                    enable "clear_for_downstream"
  • Use standard Unload node with clear_for_downstream: True
  • Hunyuan unloads after generation, freeing VRAM for downstream
  • Hunyuan reloads from disk on next run (slower but simpler)

Scenario 3: Cross-Tab VRAM Pollution

When running workflows in multiple browser tabs, models from other tabs stay in VRAM but ComfyUI "forgets" about them.

Solution: Use the BF16 Loader's clear_vram_before_load option:

[Hunyuan Loader (BF16)]  ← enable "clear_vram_before_load"
         ↓
    [Generate]
  • Clears ALL orphaned VRAM before loading Hunyuan
  • Helps when VRAM is "mysteriously" full

VRAM Management Node Summary

| Node | When to Use |
|---|---|
| Unload | Clear Hunyuan model between runs |
| Clear Downstream Models | Keep Hunyuan, clear Flux/SAM2/etc. |
| Force Unload (Nuclear) | After OOM errors, stuck VRAM |
| Soft Unload | ⚠️ See limitations below |

✅ Soft Unload Node - Fast VRAM Release (NEW!)

The Soft Unload node moves the model to CPU RAM instead of deleting it, allowing much faster restore times.

Requirements:

  • bitsandbytes >= 0.48.2 (install with pip install "bitsandbytes>=0.48.2"; the quotes stop the shell from treating >= as a redirect)
  • Model loaded with offload_mode='disabled' (no meta tensors)

Now works with:

  • ✅ INT8 quantized models
  • ✅ NF4 quantized models
  • ✅ BF16 models (loaded entirely on GPU)

Does NOT work with:

  • ❌ Models loaded with device_map offloading (has meta tensors)

Performance:

  • Soft unload + restore: ~10-30 seconds (scales with model size)
  • Full unload + reload from disk: ~2+ minutes

Use Case - Multi-Model Workflows:

[Hunyuan Generate] → [Soft Unload] → [Flux Detailer] → [SAM2] → [Restore to GPU] → [Next Hunyuan Gen]

This keeps Hunyuan in CPU RAM while other models use GPU, then restores without disk reload.

Actions:

  • soft_unload: Move model to CPU RAM, free VRAM
  • restore_to_gpu: Move model back to GPU
  • check_status: Report current model location

BF16 Loader Target Resolution Options

The BF16 loader has a target_resolution dropdown that reserves VRAM for inference:

| Option | VRAM Reserved | Use When |
|---|---|---|
| Auto (safe default) | 35GB | Unknown resolution, auto-detection |
| 1MP Fast (96GB+) | 8GB | 96GB+ cards, max speed at 1MP |
| 1MP (1024x1024) | 15GB | Standard 1MP generation |
| 2MP (1920x1080) | 30GB | HD/2K generation |
| 3MP (2048x1536) | 55GB | Large images |
| 4MP+ (2560x1920) | 75GB | Very large images |

INT8 Budget Workflow Controls

When using the Hunyuan 3 Loader (INT8 Budget) + Hunyuan 3 Generate (Large Budget) path (recommended for 96 GB RTX 6000 Pro / similar cards), keep these controls in mind:

  • reserve_memory_gb (loader): Amount of VRAM the loader leaves free for inference overhead. Set to ~20 GB for 1.5–2 MP work; increase when targeting 4K+ so Smart mode has headroom without forcing CPU offload.
  • gpu_memory_target_gb (loader): How much VRAM the INT8 weights are allowed to occupy. 80 GB is a good sweet spot on 96 GB GPUs. Lower values keep more VRAM free but may require CPU spillover during load.
  • offload_mode (Large Budget generator):
    • smart (default) reads the loader's reserve/budget metadata, estimates per-megapixel cost (~15 GB/MP + 6 GB), and only enables accelerate CPU offload when VRAM is critically low.
    • disabled forces fully-on-GPU sampling (fastest). Use this for ≤2 MP jobs when you already fit in VRAM.
    • always moves the model into accelerate offload before sampling. Only select this for extreme resolutions where you expect to exceed available VRAM; otherwise it slows INT8 inference to CPU speeds.
  • keep_model_loaded (Large Budget generator): When False, the node calls the same cache-clearing logic as Hunyuan 3 Unload after a run, freeing the INT8 weights from VRAM/RAM. Leave it True for iterative workflows to avoid reload times.
  • gpu_budget_gb override (Large Budget generator): Temporarily overrides the loader's _loader_gpu_budget_gb value for Smart-mode math without remapping tensors. Set it if you want Smart mode to assume a different baseline (e.g., 65 GB) without reloading the model; leave at -1 to use the loader's recorded budget.

These generator settings never remap the model by themselves; they only influence whether accelerate offload engages and whether we clear the cached model afterward. Adjust the loader slider first (20 GB reserve / 80 GB target recommended), then use the generator controls to decide when to keep models in memory or trigger CPU offload for very large renders.
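
The Smart-mode estimate mentioned above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function names `estimate_inference_gb` and `should_offload` are hypothetical, and the comparison rule is a guess at what "critically low" might mean. Only the ~15 GB/MP + 6 GB cost figure comes from the text.

```python
def estimate_inference_gb(width: int, height: int,
                          gb_per_mp: float = 15.0,
                          base_gb: float = 6.0) -> float:
    """Rough inference overhead: ~15 GB per megapixel plus ~6 GB fixed cost."""
    megapixels = (width * height) / 1_000_000
    return gb_per_mp * megapixels + base_gb


def should_offload(width: int, height: int, free_vram_gb: float) -> bool:
    """Hypothetical rule: offload only when the estimate exceeds free VRAM."""
    return estimate_inference_gb(width, height) > free_vram_gb


if __name__ == "__main__":
    for w, h in [(1024, 1024), (1920, 1080), (3840, 2160)]:
        need = estimate_inference_gb(w, h)
        print(f"{w}x{h}: ~{need:.1f} GB needed, offload with 40 GB free: "
              f"{should_offload(w, h, 40.0)}")
```

Under this estimate, cost scales linearly with pixel count, which is why picking an explicit resolution gives Smart mode a tighter figure to work with than a conservative default.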

Node Compatibility Guide

⚠️ IMPORTANT: Match the Loader to the correct Generate node for stability.

| Loader Node | Compatible Generate Node | Why? |
|---|---|---|
| Hunyuan 3 Loader (NF4) | Hunyuan 3 Generate | Keeps model on GPU. Best for standard sizes (<2MP). |
| Hunyuan 3 Loader (Full BF16) | Hunyuan 3 Generate (Large/Offload) | Keeps model in RAM. Allows CPU offloading for massive images (4K+). |
| Hunyuan 3 Loader (Full BF16) | Hunyuan 3 Generate (HighRes Efficient) | Memory-efficient MoE for 3MP+ on 96GB GPUs. |
| Hunyuan Unified Generate V2 | (built-in loader) | Single node: auto-detects NF4/INT8/BF16, handles block swap and memory. |
| Hunyuan Generate with Latent | (built-in loader, inherits V2) | 🧪 Experimental: adds image/latent inputs for composition, img2img, noise shaping. |
| Hunyuan Instruct Loader | Hunyuan Instruct Generate | T2I with CoT reasoning and prompt enhancement. |
| Hunyuan Instruct Loader | Hunyuan Instruct Image Edit | Edit images with natural language. |
| Hunyuan Instruct Loader | Hunyuan Instruct Multi-Image Fusion | Combine 2–5 reference images (4–5 experimental). |

Do not mix them! Base model loaders are NOT compatible with Instruct generate nodes (and vice versa). The NF4 Loader with the Large/Offload node will also cause errors because the quantized model cannot be moved to CPU correctly.


🧪 Latent Control Nodes (Experimental)

These nodes provide composition control, img2img, and custom noise injection for base (non-Instruct) models. They are experimental: the HunyuanImage-3 architecture is autoregressive (not diffusion-based), so traditional latent manipulation behaves differently than with Stable Diffusion.

Latent Node Overview

| Node | Purpose | Category |
|---|---|---|
| Hunyuan Empty Latent | Create a random noise tensor at a specific resolution and seed | Latent utility |
| Hunyuan Latent Noise Shaping | Transform latent tensors: frequency filter, amplify, invert | Latent utility |
| Hunyuan Generate with Latent | All-in-one generate node with optional image and latent inputs | Generate (inherits V2 Unified) |

Hunyuan Generate with Latent

This node extends the Unified Generate V2: it inherits all memory management, block swap, model loading, and VRAM features. On top of that, it adds two optional inputs:

| Optional Input | What It Does |
|---|---|
| image | A ComfyUI IMAGE tensor, encoded to latent space via the model's VAE, then combined with noise according to image_mode |
| latent | A HUNYUAN_LATENT tensor from Empty Latent or Noise Shaping, used as the noise source instead of random generation |

Image Modes (when an image is connected):

| Mode | Description | Best For |
|---|---|---|
| composition | Extracts broad spatial layout from the image (low-pass filter in latent space) and modulates noise amplitude. No ghosting: only macro composition transfers. | Guiding where subjects/backgrounds appear |
| img2img | Traditional latent mix: (1−σ)·clean + σ·noise. Low denoise preserves the image; high denoise adds variation. Can ghost at low denoise. | Classic image-to-image workflows |
| energy_map | Uses per-channel energy (absolute magnitude) of the image latent to scale noise spatially. More abstract than composition. | Abstract/artistic noise shaping |
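
The img2img mix formula can be illustrated on a flattened latent with plain Python. This is a minimal sketch, not the node's implementation: `mix_latent` is a hypothetical name, the latent is modelled as a flat list of floats, and treating σ as the denoise strength is an assumption about how the node maps its input.

```python
import random


def mix_latent(clean, denoise, seed=0):
    """Blend a clean latent with Gaussian noise: (1 - sigma) * clean + sigma * noise.

    `clean` is a flat list of floats standing in for a latent tensor
    (an illustrative simplification); `denoise` plays the role of sigma.
    """
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoise must be in [0, 1]")
    rng = random.Random(seed)
    return [(1.0 - denoise) * c + denoise * rng.gauss(0.0, 1.0) for c in clean]


if __name__ == "__main__":
    latent = [0.2, -1.0, 0.7]
    print(mix_latent(latent, 0.0))  # sigma = 0 returns the clean latent unchanged
    print(mix_latent(latent, 0.5))  # halfway blend of image and noise
```

At σ = 0 the image latent passes through untouched, and at σ = 1 only noise remains, which matches the "low denoise preserves the image" behaviour described in the table.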

Usage Modes (determined by which optional inputs are connected):

| Image | Latent | Behavior |
|---|---|---|
| ❌ | ❌ | Identical to V2 Unified (pure text-to-image) |
| ✅ | ❌ | Image-guided generation using image_mode |
| ❌ | ✅ | Custom noise / latent injection |
| ✅ | ✅ | Image-guided with custom noise base |

Example Workflow: Image-Guided Composition

┌──────────────┐
│ Load Image   │
└──────┬───────┘
       │ IMAGE
       ▼
┌──────────────────────────────────────────┐
│ Hunyuan Generate with Latent             │
│  model_name: HunyuanImage-3-NF4-v2       │
│  prompt: "A cyberpunk cityscape..."      │
│  image_mode: composition                 │
│  denoise_strength: 0.5                   │
│  resolution: 1024x1024                   │
└───────────┬──────────────────────────────┘
            │ IMAGE
            ▼
     ┌──────────────┐
     │ Save Image   │
     └──────────────┘

Note: Because HunyuanImage-3 is autoregressive, img2img results differ significantly from diffusion-model img2img. The composition mode is generally recommended for more predictable layout guidance.

Basic Workflow

┌──────────────────────────────────┐
│ Hunyuan 3 Loader (NF4)           │
│  model_name: HunyuanImage-3-NF4  │
│  keep_in_cache: True             │
└───────────┬──────────────────────┘
            │ HUNYUAN_MODEL
            ▼
┌─────────────────────────┐
│ Hunyuan 3 Generate      │
│  prompt: "..."          │
│  steps: 50              │
│  resolution: 1024x1024  │
│  guidance_scale: 7.5    │
└───────────┬─────────────┘
            │ IMAGE
            ▼
     ┌──────────────┐
     │ Save Image   │
     └──────────────┘

Advanced: Prompt Rewriting

✨ Feature: Uses official HunyuanImage-3.0 system prompts to professionally expand your prompts for better results.

⚠️ Note: Requires access to an LLM API (a paid service or a local OpenAI-compatible server; see the list below). This feature is optional - you can use the nodes without it.

Supported APIs (any OpenAI-compatible endpoint):

  • DeepSeek (default, recommended for cost)
  • OpenAI GPT-4/GPT-3.5
  • Claude (via OpenAI-compatible proxy)
  • Local LLMs (via LM Studio, Ollama with OpenAI API)

Setup (Secure):

  1. Rename api_config.ini.example to api_config.ini in the custom node folder.
  2. Add your API key to the file:
    [API]
    api_key = sk-your-key-here
  3. Alternatively, set environment variables: HUNYUAN_API_KEY, HUNYUAN_API_URL.
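
The two configuration routes above can be combined in one small loader. The following is a sketch only, not the node's actual code: `load_api_settings` is a hypothetical helper, the `api_url` ini key is assumed by analogy with `api_key`, and letting environment variables override the file is a design assumption.

```python
import configparser
import os
from pathlib import Path


def load_api_settings(config_path="api_config.ini"):
    """Return (api_key, api_url), preferring environment variables over the ini file."""
    key = os.environ.get("HUNYUAN_API_KEY")
    url = os.environ.get("HUNYUAN_API_URL")

    parser = configparser.ConfigParser()
    if Path(config_path).is_file():
        parser.read(config_path)

    # Fall back to the [API] section; a missing section or key yields None.
    if key is None:
        key = parser.get("API", "api_key", fallback=None)
    if url is None:
        url = parser.get("API", "api_url", fallback=None)  # 'api_url' is an assumed key name
    return key, url


if __name__ == "__main__":
    api_key, api_url = load_api_settings()
    print("API key configured:", api_key is not None)
    print("API URL:", api_url or "(not set)")
```

Keeping the key in api_config.ini or the environment, rather than in a workflow field, means it is not saved into exported workflow files.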

Usage:

  • Option 1 (Integrated): Enable enable_prompt_rewrite in the Generate node.
  • Option 2 (Standalone): Use the Hunyuan Prompt Rewriter node to rewrite prompts before passing them to any model.

┌─────────────────────────┐      ┌─────────────────────────┐
│ Hunyuan Prompt Rewriter │      │ Hunyuan 3 Generate      │
│  prompt: "dog running"  │ ───► │  prompt: (rewritten)    │
│  rewrite_style: ...     │      │  ...                    │
└─────────────────────────┘      └─────────────────────────┘

Result: Automatically expands to:

"An energetic brown and white border collie running across a sun-drenched meadow filled with wildflowers, motion blur on legs showing speed, golden hour lighting, shallow depth of field, professional photography, high detail, 8k quality"

Large Image Generation

For high-resolution outputs (2K, 4K, 6MP+):

┌─────────────────────────────────┐
│ Hunyuan 3 Generate (Large)      │
│  resolution: 3840x2160 - 4K UHD │
│  cpu_offload: True              │
│  steps: 50                      │
└─────────────────────────────────┘

🚀 High-Resolution Generation (HighRes Efficient)

The Hunyuan 3 Generate (HighRes Efficient) node enables 3MP–4K+ generation on 96GB GPUs where the standard Large/Offload node runs out of memory.

Recommended Workflow:

┌─────────────────────────────────┐
│ Hunyuan 3 Loader (Full BF16)    │
│  target_resolution: 3MP         │
└───────────┬─────────────────────┘
            │ HUNYUAN_MODEL
            ▼
┌─────────────────────────────────┐
│ Hunyuan 3 Generate              │
│  (HighRes Efficient)            │
│  resolution: 2048x1536 (3.1MP)  │
│  guidance_scale: 7.5            │
│  steps: 50                      │
│  offload_mode: smart            │
└───────────┬─────────────────────┘
            │ IMAGE
            ▼
     ┌──────────────┐
     │ Save Image   │
     └──────────────┘

Why use this instead of the standard Large/Offload node?

The standard node OOMs at 3MP+ because the upstream MoE routing creates a massive intermediate tensor (the "dispatch_mask") that grows quadratically with token count. The HighRes Efficient node replaces this with a loop-based dispatch that is ~75× more memory-efficient. See technical details below.

VRAM Requirements (BF16, 96GB GPU):

| Resolution | Standard Large Node | HighRes Efficient Node |
|---|---|---|
| 1MP (1024×1024) | ✅ ~42 GB total | ✅ ~42 GB total |
| 2MP (1920×1080) | ✅ ~55 GB total | ✅ ~52 GB total |
| 3MP (2048×1536) | ❌ OOM (~83 GB) | ✅ ~55 GB total |
| 4MP (2560×1920) | ❌ OOM | ✅ ~70 GB total |
| 8MP (3840×2160) | ❌ OOM | ⚠️ ~90 GB (tight on 96GB) |

Quality: Identical to the standard node, with the same routing decisions and the same expert MLPs, just dispatched more efficiently.

Speed: Similar, since the same expert computations run on the same data. The loop-based dispatch adds negligible overhead compared to the MLP computations themselves.


🔬 Why Standard Nodes OOM at 3MP+ (MoE dispatch_mask)

HunyuanImage-3.0 is a Mixture-of-Experts (MoE) model with 64 expert MLPs per layer, but each token only uses its top-8. The question is how to route tokens to experts.

How Image Tokens Work

The VAE has a 16× downsample factor in each spatial dimension. Each latent pixel becomes one token:

| Resolution | Latent Size | Image Tokens |
|---|---|---|
| 1024×1024 (1MP) | 64×64 | 4,096 |
| 1920×1080 (2MP) | 120×68 | ~8,160 |
| 2048×1536 (3MP) | 128×96 | 12,288 |
| 3840×2160 (8MP) | 240×135 | 32,400 |

When CFG (Classifier-Free Guidance) is enabled (guidance_scale > 1.0), the batch is doubled: the model runs a conditional and unconditional pass together. So at 3MP with CFG, N ≈ 25,088 tokens flow through each MoE layer.

Note on CFG: The guidance_scale value (4.0, 5.0, 7.5, 20.0) only controls the blending ratio between conditional and unconditional predictions. It does not change memory usage. The batch doubling is binary: guidance_scale = 1.0 → no doubling, guidance_scale > 1.0 → always doubled.
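
The token counts in the table and the CFG doubling rule can be reproduced with a short calculation. This is an illustrative sketch: the function names are hypothetical, and rounding the latent dimensions up is an assumption chosen because it matches the 120×68 entry for 1920×1080.

```python
import math

VAE_DOWNSAMPLE = 16  # 16x per spatial dimension, as stated in the text


def image_tokens(width: int, height: int) -> int:
    """One token per latent pixel; latent dimensions are rounded up (assumption)."""
    return math.ceil(width / VAE_DOWNSAMPLE) * math.ceil(height / VAE_DOWNSAMPLE)


def moe_image_tokens(width: int, height: int, guidance_scale: float) -> int:
    """CFG doubles the batch whenever guidance_scale > 1.0, regardless of its value."""
    batch = 2 if guidance_scale > 1.0 else 1
    return batch * image_tokens(width, height)


if __name__ == "__main__":
    for w, h in [(1024, 1024), (1920, 1080), (2048, 1536), (3840, 2160)]:
        print(f"{w}x{h}: {image_tokens(w, h)} image tokens, "
              f"{moe_image_tokens(w, h, 7.5)} with CFG")
```

This counts image tokens only. At 3MP with CFG it gives 2 × 12,288 = 24,576, slightly below the N ≈ 25,088 quoted in the text, presumably because the real sequence also carries non-image tokens.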

The dispatch_mask Approach (Standard Nodes)

The upstream "eager" MoE implementation builds a 3D boolean tensor, the dispatch_mask, of shape [N_tokens, 64_experts, expert_capacity].

Think of it as a seating chart for 64 rooms (experts), each with a fixed number of chairs. For every token, the mask records which room and chair it sits in. With drop_tokens=True, the capacity per expert is:

$$\text{expert\_capacity} = \frac{\text{topk} \times N}{\text{num\_experts}} = \frac{8 \times N}{64} = \frac{N}{8}$$

The full dispatch_mask shape is [N, 64, N/8], so it grows quadratically with N.
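
The quadratic growth is easy to check numerically. The sketch below uses the shape and capacity formula given above; the function names are hypothetical and the byte sizes (1 byte per bool, 2 per bf16 value, 8 per int64 index) are the usual element sizes rather than anything measured on the model.

```python
def dispatch_mask_bytes(n_tokens: int, num_experts: int = 64,
                        topk: int = 8, bytes_per_elem: int = 1) -> int:
    """Size of a dense [N, num_experts, capacity] mask with capacity = topk * N / num_experts."""
    capacity = topk * n_tokens // num_experts
    return n_tokens * num_experts * capacity * bytes_per_elem


def topk_index_bytes(n_tokens: int, topk: int = 8, bytes_per_elem: int = 8) -> int:
    """Size of the [N, topk] index tensor that a loop-based dispatch needs instead."""
    return n_tokens * topk * bytes_per_elem


if __name__ == "__main__":
    n = 25_088  # 3MP with CFG, per the text
    print(f"bool mask: {dispatch_mask_bytes(n) / 1e9:.2f} GB")
    print(f"bf16 mask: {dispatch_mask_bytes(n, bytes_per_elem=2) / 1e9:.2f} GB")
    print(f"top-k indices: {topk_index_bytes(n) / 1e6:.2f} MB")
    print(f"growth when N doubles: {dispatch_mask_bytes(2 * n) / dispatch_mask_bytes(n):.0f}x")
```

Doubling the token count quadruples the mask, while the top-k index tensor only doubles and stays in the megabyte range.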

Two einsum operations use it:

  1. Dispatch: einsum("sec,sm→ecm", dispatch_mask, input) — gathers tokens into expert buffers
  2. Combine: einsum("sec,ecm→sm", combine_weights, expert_output) — scatters results back

This is mathematically elegant (fully parallelized, single matrix op) but extremely memory-wasteful: the mask is at least 87.5% zeros, since each token only visits 8 of 64 experts.

Memory at 3MP with CFG (N ≈ 25,088)

| Tensor | dispatch_mask approach | Loop approach |
|---|---|---|
| Routing info | [25088, 64, 3136] = 5 GB (bool) | [25088, 8] = 1.6 MB (int64) |
| Cast to bf16 for einsum | 10 GB | not needed |
| combine_weights | 10 GB | not needed |
| Einsum intermediates | ~15 GB | not needed |
| Per-expert gather/scatter | n/a | ~200 MB peak |
| Total MoE intermediate | ~37 GB | ~200 MB |

Model weights (30 GB) + KV cache (16 GB) + MoE intermediates (37 GB) = 83 GB → OOM.

The Efficient Approach (HighRes Node)

Instead of building the giant seating chart, the HighRes node patches the MoE forward to use a simple loop:

For each of the 64 experts:
    1. Find which tokens chose this expert (index lookup from top-k)
    2. Gather those ~N/8 tokens
    3. Run the expert MLP on them
    4. Scatter-add weighted results back to the output

Same routing decisions, same expert computations, same output quality. The only data structure needed is the top-k indices and weights, of shape [N, 8], which is tiny.
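
The four loop steps can be shown on a toy example in plain Python. This is a sketch of the general technique, not the node's patched forward: tokens are single floats, each "expert" is a hypothetical stand-in that scales its input, and capacity-based token dropping is left out for brevity.

```python
def make_expert(scale):
    """Toy stand-in for an expert MLP: multiplies each input by a constant."""
    return lambda batch: [scale * x for x in batch]


def loop_moe(tokens, topk_idx, topk_w, experts):
    """Loop-based MoE dispatch using only per-token top-k indices and weights."""
    out = [0.0] * len(tokens)
    for e, expert in enumerate(experts):
        # 1. Find which (token, slot) pairs chose this expert.
        hits = [(t, k) for t, row in enumerate(topk_idx)
                for k, chosen in enumerate(row) if chosen == e]
        if not hits:
            continue
        # 2. Gather those tokens into a small batch.
        gathered = [tokens[t] for t, _ in hits]
        # 3. Run the expert on the gathered batch.
        results = expert(gathered)
        # 4. Scatter-add the weighted results back to the output.
        for (t, k), y in zip(hits, results):
            out[t] += topk_w[t][k] * y
    return out


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 2.0, 3.0]
topk_idx = [[0, 1], [1, 3], [2, 0]]            # two experts chosen per token
topk_w = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
print(loop_moe(tokens, topk_idx, topk_w, experts))  # [1.5, 7.0, 9.0]
```

No dense token-by-expert-by-capacity tensor is ever built; peak extra memory is one expert's gathered batch at a time.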


📐 Resolution Guide

Standard Generation Node

  • 832x1280 - Portrait (1.0MP) [<2MP] ✅ Safe, fast
  • 1024x1024 - Square (1.0MP) [<2MP] ✅ Safe, fast
  • 1280x832 - Landscape (1.0MP) [<2MP] ✅ Safe, fast
  • 1536x1024 - Landscape (1.5MP) [<2MP] ✅ Safe, fast
  • 2048x2048 - Square (4.0MP) [>2MP] ⚠️ May OOM

Large/Offload Node

  • 2560x1440 - Landscape 2K (3.7MP) ✅ With CPU offload
  • 3840x2160 - Landscape 4K UHD (8.3MP) ✅ With CPU offload
  • 3072x2048 - Landscape 6MP (6.3MP) ✅ With CPU offload

Tip: Test prompts at small resolutions (fast), then render finals with the Large/Offload node.

πŸ”§ Configuration

Memory Management

Single GPU (24-48GB VRAM):

Use: Hunyuan 3 Loader (NF4)
Settings:
  - keep_in_cache: True (for multiple generations)
  - Use standard Generate node for <2MP

Single GPU (80-96GB VRAM):

Use: Hunyuan 3 Loader (88GB GPU Optimized)
Settings:
  - reserve_memory_gb: 14.0 (leaves room for inference)
  - Full BF16 quality

Multi-GPU Setup:

Use: Hunyuan 3 Loader (Multi-GPU BF16)
Settings:
  - primary_gpu: 0 (where inference runs)
  - reserve_memory_gb: 12.0
  - Automatically distributes across all GPUs

⚡ Performance Optimization Guide

To get the maximum speed and avoid unnecessary offloading (which slows down generation):

  1. Reserve Enough VRAM:

    • Use the reserve_memory_gb slider in the Loader.
    • Set it high enough to cover the generation overhead for your target resolution (e.g., 30GB+ for 4K).
    • Why? If you reserve space upfront, the model stays on the GPU. If you don't, the "Smart Offload" might panic and move everything to RAM to prevent a crash.
  2. Select Specific Resolutions:

    • Avoid using "Auto (model default)" in the Large Generate node if you are optimizing for speed.
    • Auto Mode Safety: When "Auto" is selected, the node assumes a large resolution (~2.5MP) to be safe. This might trigger offloading even if your actual image is small.
    • Specific Mode: Selecting "1024x1024" tells the node exactly how much VRAM is needed, allowing it to skip offload if you have the space.

🛠️ Quantization

Create your own NF4 quantized model:

cd quantization
python hunyuan_quantize_nf4.py \
  --model-path "/path/to/HunyuanImage-3" \
  --output-path "/path/to/HunyuanImage-3-NF4"

Benefits:

  • ~4x smaller (80GB → 20GB model size)
  • ~45GB VRAM usage (vs 80GB+ for BF16)
  • Minimal quality loss
  • Attention layers kept in full precision for stability

📊 Performance Benchmarks

RTX 6000 Pro Blackwell (96GB) - NF4 Quantized ⭐ FASTEST

Configuration: PyTorch 2.9.1 + CUDA 12.8, NF4 Loader + Standard Generate

| Resolution | Steps | Time | Speed | VRAM |
|---|---|---|---|---|
| 1024x1024 (~1MP) | 40 | ~58s | ~1.45s/step | ~45GB |

This is the fastest configuration for Blackwell GPUs thanks to PyTorch 2.9's optimizations. Key settings:

  • Use NF4 Loader (standard or Low VRAM+ variant)
  • Use Hunyuan 3 Generate (standard node, no offload needed)
  • Keep model loaded between runs for successive generations
  • Soft Unload supported with bitsandbytes 0.48.2+

RTX 6000 Pro Blackwell (96GB) - INT8 Quantized

Configuration: INT8 Loader + Large Generate (Budget) with offload_mode=disabled

| Resolution | Steps | Time | Speed | VRAM |
|---|---|---|---|---|
| 1152x864 (~1MP) | 40 | 2:35 | 3.9s/step | 85GB → 95GB |

INT8 offers slightly better quality than NF4 but is slower. Key settings:

  • Use INT8 Loader (Budget) with ~80GB GPU target
  • Use Large Generate (Budget) with offload_mode=disabled
  • Keep model loaded between runs for successive generations

RTX 6000 Ada (48GB) - NF4 Quantized

  • Load time: ~35 seconds
  • 1024x1024 @ 50 steps: ~4 seconds/step
  • VRAM usage: ~45GB

2x RTX 4090 (48GB each) - Multi-GPU BF16

  • Load time: ~60 seconds
  • 1024x1024 @ 50 steps: ~3.5 seconds/step
  • VRAM usage: ~70GB + 10GB distributed

RTX 6000 Blackwell (96GB) - Full BF16

  • Load time: ~25 seconds
  • 1024x1024 @ 50 steps: ~3 seconds/step
  • VRAM usage: ~80GB

🐛 Troubleshooting

Out of Memory Errors

Solutions:

  1. For 3MP+ BF16 generations: Use "Hunyuan 3 Generate (HighRes Efficient)". The standard Large node OOMs at 3MP+ due to MoE dispatch_mask memory. See technical details.
  2. Use NF4 quantized model instead of full BF16
  3. Reduce resolution (pick options marked [<2MP])
  4. Lower steps (try 30-40 instead of 50)
  5. Use "Hunyuan 3 Generate (Large/Offload)" node with cpu_offload: True (for ≤2MP)
  6. Run "Hunyuan 3 Unload" node before generating
  7. Set keep_in_cache: False in loader

Pixelated/Corrupted Output

If using NF4 quantization:

  • Re-quantize with the updated script (includes attention layer fix)
  • Old quantized models may produce artifacts

Multi-GPU Not Detecting Second GPU

Check:

  1. Run "Hunyuan 3 GPU Info" node
  2. Look for CUDA_VISIBLE_DEVICES environment variable
  3. Ensure ComfyUI can see all GPUs: torch.cuda.device_count()
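
Checks 2 and 3 can be run together from the Python environment that ComfyUI uses. This is a small diagnostic sketch: `visible_gpu_ids` and `report` are hypothetical helper names, and the PyTorch import is guarded because it is only available inside the ComfyUI environment.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES; None means the variable is unset (no restriction)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def report():
    ids = visible_gpu_ids()
    print("CUDA_VISIBLE_DEVICES:", "unset (all GPUs visible)" if ids is None else ids)
    try:
        import torch  # only importable inside the ComfyUI Python environment
    except ImportError:
        print("PyTorch is not importable from this interpreter")
        return
    print("torch.cuda.device_count():", torch.cuda.device_count())


if __name__ == "__main__":
    report()
```

If the variable lists fewer IDs than the machine has GPUs, the device count will match the restricted list, not the hardware.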

Fix:

# Remove GPU visibility restrictions
unset CUDA_VISIBLE_DEVICES
# Restart ComfyUI

Slow Generation

Optimizations:

  1. Use NF4 quantized model (faster than BF16)
  2. Reduce steps (30-40 is often sufficient)
  3. Keep model in cache (keep_in_cache: True)
  4. Use smaller resolutions for testing

📝 Advanced Tips

Prompt Engineering

Good prompts include:

  1. Subject: What is the main focus
  2. Action: What is happening
  3. Environment: Where it takes place
  4. Style: Artistic style, mood, atmosphere
  5. Technical: Lighting, composition, quality keywords

Example:

A majestic snow leopard prowling through a misty mountain forest at dawn,
dappled golden light filtering through pine trees, shallow depth of field,
wildlife photography, National Geographic style, 8k, highly detailed fur texture

Note: HunyuanImage-3.0 uses an autoregressive architecture (like GPT) rather than diffusion, so it doesn't support negative prompts. Instead, be explicit in your prompt about what you want to include.

Reproducible Results

Set a specific seed (0-18446744073709551615) to get the same image:

seed: 42  # Use any number, same seed = same image

📚 Model Information

HunyuanImage-3.0:

  • Architecture: Native multimodal autoregressive transformer
  • Parameters: 80B total (13B active per token)
  • Experts: 64 experts (Mixture of Experts architecture)
  • Training: Text-to-image with RLHF post-training
  • License: Apache 2.0 (see Tencent repo for details)

Paper: HunyuanImage 3.0 Technical Report

Official Repo: Tencent-Hunyuan/HunyuanImage-3.0

🀝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

Dual License (Non-Commercial and Commercial Use):

  1. Non-Commercial Use: Licensed under Creative Commons Attribution-NonCommercial 4.0 International License
  2. Commercial Use: Requires separate license. Contact eric@historic.camera or eric@rollei.us

See LICENSE for full details.

Note: The HunyuanImage-3.0 model itself is licensed under Apache 2.0 by Tencent. This license only covers the ComfyUI integration code.

Copyright (c) 2025-2026 Eric Hiss. All rights reserved.

🙏 Credits

ComfyUI Integration

  • Author: Eric Hiss (GitHub: EricRollei)
  • License: CC BY-NC 4.0 (Non-Commercial) / Commercial License Available

HunyuanImage-3.0 Model

Special Thanks

  • Tencent Hunyuan Team for creating and open-sourcing the incredible HunyuanImage-3.0 model
  • ComfyUI Community for the excellent extensible framework
  • All contributors and testers

🆘 Support

🔄 Changelog

v1.3.0 (2026-02-12)

INT8 Instruct Fix (5 bugs):

  • blocks_to_swap was forcibly set to 0 for INT8 models in generate(); it now respects the user setting
  • Missing _load_int8_block_swap() method in loader: added INT8-specific block swap loading with CB/SCB guard hooks
  • INT8 model size estimated at 40GB instead of 80GB in memory budget: corrected
  • _move_non_block_components_to_gpu() didn't fix INT8 CB/SCB after .to(device); it now calls _fix_int8_module_devices()
  • _calculate_optimal_config() used BF16's gb_per_block for INT8; it now uses quant-specific values (NF4: 0.72, INT8: 2.4, BF16: 4.7)

VAE Decode Crash Fix:

  • Fixed super(): __class__ is not a type (NoneType) crash during VAE decode. Root cause: the cache-clearing code's "external monkey-patch remover" used PyCell_Set(cell, None) to nuke ALL closure cells, including __class__ cells used by super() in Conv3d and other classes. Fixed with isinstance(val, type) guard in hunyuan_shared.py and hunyuan_cache_v2.py.
  • Cleaned up instance-level forward attributes left by remove_hook_from_module across 4 files.

v2 Pre-Quantized Models:

  • All 6 INT8/NF4 models (base + Instruct-Distil + Instruct-Full) re-quantized and uploaded to Hugging Face with improved model cards and block swap guidance.

Experimental Latent Control Nodes:

  • Hunyuan Empty Latent: creates random noise tensors at specific resolution/seed
  • Hunyuan Latent Noise Shaping: frequency filtering, amplification, inversion of latent tensors
  • Hunyuan Generate with Latent: all-in-one node with optional image/latent inputs for composition control, img2img, and energy map modes

Unified Generate V2:

  • Single node replacing all base-model generate variants: auto-detects NF4/INT8/BF16, handles block swap, memory budgets, and VRAM management

v1.2.0 (2026-02-11)

Resolution & UX Improvements:

  • Instruct resolution dropdown expanded to all 33 model-native bucket resolutions (~1MP each), ordered tallest portrait → square → widest landscape
  • Multi-Image Fusion node expanded from 3 to 5 image inputs (slots 4–5 experimental)
  • Resolution tooltips updated across all Instruct generate nodes

Bug Fixes:

  • Issue #16 - NF4 Low VRAM OOM: Two-stage max_memory estimation in quantized loader replaces one-shot approach that left no headroom for inference
  • Issue #15 - Multi-GPU device mismatch: Explicit .to(device) on freqs_cis / image_pos_id prevents cross-device errors during block-swap forward pass
  • Issue #12 - Transformers 5.x compatibility: _lookup dict guard, BitsAndBytesConfig import path, and modeling_utils attribute checks updated

Code Quality:

  • Instruct nodes: added missing multi-GPU block-swap patch, OOM error handlers for Image Edit and Multi-Fusion
  • Removed dead gc import from hunyuan_highres_nodes.py
  • Cache v2: added clear_generation_cache() helper used by all generate nodes
  • Shared utilities: centralized _aggressive_vram_cleanup() with stale KV-cache detection
  • Block swap: _lookup guard for INT8 Module._apply hook (transformers 5.x)

v1.1.0 (2026-02-09)

  • Instruct model nodes (Loader, Generate, Image Edit, Multi-Fusion, Unload)
  • Block swap, HighRes Efficient node, Unified V2 node
  • Flexible model paths via extra_model_paths.yaml
  • Soft Unload, Force Unload, Clear Downstream nodes

v1.0.0 (2025-11-17)

  • Initial release
  • Full BF16 and NF4 quantized model support
  • Multi-GPU loading support
  • Optional prompt rewriting with DeepSeek API
  • Improved resolution organization
  • Large image generation with CPU offload
  • Comprehensive error handling and VRAM management

Made with ❤️ for the ComfyUI community