
Commit a1486a3

Merge pull request #402 from AInVFX/main
v2.5.20: expanded attention backends (FA2/FA3/SA2/SA3), macOS MPS dtype fixes, bitsandbytes ROCm shim
2 parents 2006fa3 + bbf649d commit a1486a3

File tree

16 files changed

+774
-242
lines changed


README.md

Lines changed: 34 additions & 19 deletions
@@ -4,7 +4,7 @@

 Official release of [SeedVR2](https://github.com/ByteDance-Seed/SeedVR) for ComfyUI that enables high-quality video and image upscaling.

-Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#-run-as-standalone-cli) section.
+Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#-run-as-standalone-cli) section.

 [![SeedVR2 v2.5 Deep Dive Tutorial](https://img.youtube.com/vi/MBtWYXq_r60/maxresdefault.jpg)](https://youtu.be/MBtWYXq_r60)

@@ -14,8 +14,8 @@ Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#
 ## 📋 Quick Access

-- [🆙 Future Releases](#-future-releases)
-- [🚀 Updates](#-updates)
+- [🆙 Future Work](#-future-work)
+- [🚀 Release Notes](#-release-notes)
 - [🎯 Features](#-features)
 - [🔧 Requirements](#-requirements)
 - [📦 Installation](#-installation)
@@ -26,15 +26,24 @@ Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#
 - [🙏 Credits](#-credits)
 - [📜 License](#-license)

-## 🆙 Future Releases
+## 🆙 Future Work

 We're actively working on improvements and new features. To stay informed:

 - **📌 Track Active Development**: Visit [Issues](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/issues) to see active development, report bugs, and request new features
 - **💬 Join the Community**: Learn from others, share your workflows, and get help in the [Discussions](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/discussions)
 - **🔮 Next Model Survey**: We're looking for community input on the next open-source super-powerful generic restoration model. Share your suggestions in [Issue #164](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/issues/164)

-## 🚀 Updates
+## 🚀 Release Notes
+
+**2025.12.12 - Version 2.5.20**
+
+- **⚡ Expanded attention backends** - Full support for Flash Attention 2 (Ampere+), Flash Attention 3 (Hopper+), SageAttention 2, and SageAttention 3 (Blackwell/RTX 50xx), with automatic fallback chains to PyTorch SDPA when unavailable *(based on PR by [@naxci1](https://github.com/naxci1) - thank you!)*
+- **🍎 macOS/Apple Silicon compatibility** - Replaced MPS autocast with explicit dtype conversion throughout the VAE and DiT pipelines, resolving hangs and crashes on M-series Macs. BlockSwap now auto-disables with a warning (unified memory makes it unnecessary)
+- **🛡️ Flash Attention graceful fallback** - Added compatibility shims for corrupted or partially installed flash_attn/xformers DLLs, preventing startup crashes
+- **🛡️ AMD ROCm: bitsandbytes conflict fix** - Prevent kernel registration errors when diffusers attempts to re-import broken bitsandbytes installations
+- **📦 ComfyUI Manager: macOS classifier fix** - Removed the NVIDIA CUDA classifier that caused false "GPU not supported" warnings on macOS
+- **📚 Documentation updates** - Updated README with attention backend details, BlockSwap macOS notes, and clarified model caching descriptions

 **2025.12.10 - Version 2.5.19**
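The "automatic fallback chains" mentioned in the release notes can be sketched as a small lookup: each requested backend degrades to the next candidate until an installed one is found, ending at PyTorch SDPA. This is a hypothetical illustration, not the extension's actual code; the backend names come from this release, but the chain ordering and function name are assumptions.

```python
# Hypothetical fallback-chain sketch. SDPA ships with PyTorch, so it is the
# terminal fallback for every chain; SageAttention 3 degrades through
# SageAttention 2, Flash Attention 3 through Flash Attention 2 (assumed order).
FALLBACK_CHAIN = {
    "sageattn_3": "sageattn_2",
    "sageattn_2": "sdpa",
    "flash_attn_3": "flash_attn_2",
    "flash_attn_2": "sdpa",
}

def resolve_backend(requested: str, available: set) -> str:
    """Walk the chain until an available backend is reached."""
    backend = requested
    while backend not in available:
        backend = FALLBACK_CHAIN.get(backend, "sdpa")
        if backend == "sdpa":
            break  # always-available terminal fallback
    return backend
```

On a Hopper machine with only `flash-attn` installed, for example, a `sageattn_3` request would walk down to `sdpa` unless `sageattention` is also present.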

@@ -232,7 +241,7 @@ We're actively working on improvements and new features. To stay informed:
 **2025.07.03**

-- 🛠️ Can run as **standalone mode** with **Multi GPU** see [🖥️ Run as Standalone](#️-run-as-standalone-cli)
+- 🛠️ Can run as **standalone mode** with **Multi GPU** see [🖥️ Run as Standalone](#run-as-standalone-cli)

 **2025.06.30**

@@ -279,8 +288,8 @@ We're actively working on improvements and new features. To stay informed:
 ### Performance Features
 - **torch.compile Integration**: Optional 20-40% DiT speedup and 15-25% VAE speedup with PyTorch 2.0+ compilation
 - **Multi-GPU CLI**: Distribute workload across multiple GPUs with automatic temporal overlap blending
-- **Model Caching**: Keep models loaded in memory for faster batch processing
-- **Flexible Attention Backends**: Choose between PyTorch SDPA (stable, always available) or Flash Attention 2 (faster on supported hardware)
+- **Model Caching**: Keep models loaded between generations for single-GPU directory processing or multi-GPU streaming
+- **Flexible Attention Backends**: Choose between PyTorch SDPA (stable, always available), Flash Attention 2/3, or SageAttention 2/3 for faster computation on supported hardware

 ### Quality Control
 - **Advanced Color Correction**: Five methods including LAB (recommended for highest fidelity), wavelet, wavelet adaptive, HSV, and AdaIN
@@ -309,7 +318,7 @@ With the current optimizations (tiling, BlockSwap, GGUF quantization), SeedVR2 c
 - **Python**: 3.12+ (Python 3.12 and 3.13 tested and recommended)
 - **PyTorch**: 2.0+ for torch.compile support (optional but recommended)
 - **Triton**: Required for torch.compile with inductor backend (optional)
-- **Flash Attention 2**: Provides faster attention computation on supported hardware (optional, falls back to PyTorch SDPA)
+- **Flash Attention / SageAttention**: Flash Attention 2 (Ampere+), Flash Attention 3 (Hopper+), SageAttention 2 or SageAttention 3 (Blackwell) provide faster attention computation on supported hardware (optional, falls back to PyTorch SDPA)

 ## 📦 Installation

@@ -424,14 +433,21 @@ Configure the DiT (Diffusion Transformer) model for video upscaling.
   - Requires offload_device to be set and different from device

 - **attention_mode**: Attention computation backend
-  - `sdpa`: PyTorch scaled_dot_product_attention (default, stable, always available)
-  - `flash_attn`: Flash Attention 2 (faster on supported hardware, requires flash-attn package)
+  - `sdpa`: PyTorch scaled_dot_product_attention (default, always available)
+  - `flash_attn_2`: Flash Attention 2 (Ampere+, requires flash-attn package)
+  - `flash_attn_3`: Flash Attention 3 (Hopper+, requires flash-attn with FA3 support)
+  - `sageattn_2`: SageAttention 2 (requires sageattention package)
+  - `sageattn_3`: SageAttention 3 (Blackwell/RTX 50xx, requires sageattn3 package)

 - **torch_compile_args**: Connect to SeedVR2 Torch Compile Settings node for 20-40% speedup

 **BlockSwap Explained:**

-BlockSwap enables running large models on GPUs with limited VRAM by dynamically swapping transformer blocks between GPU and CPU memory during inference. Here's how it works:
+BlockSwap enables running large models on GPUs with limited VRAM by dynamically swapping transformer blocks between GPU and CPU memory during inference.
+
+> **Note:** BlockSwap is not available on macOS. Apple Silicon Macs use a unified memory architecture where the GPU and CPU share the same memory pool, so there is nothing to swap. The option is automatically disabled with a warning if requested on macOS.
+
+Here's how it works:

 - **What it does**: Keeps only the currently-needed transformer blocks on the GPU, while storing the rest on CPU or another device
 - **When to use it**: When you get OOM (Out of Memory) errors during the upscaling phase
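To make the swapping idea concrete, here is a deliberately simplified, framework-free sketch. The class name, the transfer counter, and the bookkeeping are invented for illustration; the real implementation moves actual transformer blocks between devices with `.to()` calls rather than counting simulated moves.

```python
class ToyBlockSwap:
    """Toy model of BlockSwap: the first `blocks_to_swap` blocks live off-GPU
    when idle and are brought onto the GPU only while they execute."""

    def __init__(self, blocks, blocks_to_swap):
        self.blocks = blocks
        self.swapped = set(range(blocks_to_swap))  # idle location: CPU
        self.transfers = 0                         # simulated device moves

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if i in self.swapped:
                self.transfers += 1  # real code: move block to GPU before use
            x = block(x)
            if i in self.swapped:
                self.transfers += 1  # real code: move block back to free VRAM
        return x
```

The trade-off is visible in the counter: every swapped block costs two transfers per forward pass, which is why BlockSwap trades speed for VRAM headroom.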
@@ -867,9 +883,8 @@ python inference_cli.py media_folder/ \
 **Memory Management:**
 - `--dit_offload_device`: Device to offload DiT model: 'none' (keep on GPU), 'cpu', or 'cuda:X' (default: none)
 - `--vae_offload_device`: Device to offload VAE model: 'none', 'cpu', or 'cuda:X' (default: none)
-- `--blocks_to_swap`: Number of transformer blocks to swap (0=disabled, 3B: 0-32, 7B: 0-36). Requires dit_offload_device (default: 0)
-- `--swap_io_components`: Offload I/O components for additional VRAM savings. Requires dit_offload_device
-- `--use_non_blocking`: Use non-blocking memory transfers for BlockSwap (recommended)
+- `--blocks_to_swap`: Number of transformer blocks to swap (0=disabled, 3B: 0-32, 7B: 0-36). Requires dit_offload_device (default: 0). Not available on macOS.
+- `--swap_io_components`: Offload I/O components for additional VRAM savings. Requires dit_offload_device. Not available on macOS.

 **VAE Tiling:**
 - `--vae_encode_tiled`: Enable VAE encode tiling to reduce VRAM during encoding
@@ -882,7 +897,7 @@ python inference_cli.py media_folder/ \

 **Performance Optimization:**
 - `--allow_vram_overflow`: Allow VRAM overflow to system RAM. Prevents OOM but may cause severe slowdown
-- `--attention_mode`: Attention backend: 'sdpa' (default, stable) or 'flash_attn' (faster, requires package)
+- `--attention_mode`: Attention backend: 'sdpa' (default), 'flash_attn_2' (Ampere+), 'flash_attn_3' (Hopper+), 'sageattn_2', or 'sageattn_3' (Blackwell)
 - `--compile_dit`: Enable torch.compile for DiT model (20-40% speedup, requires PyTorch 2.0+ and Triton)
 - `--compile_vae`: Enable torch.compile for VAE model (15-25% speedup, requires PyTorch 2.0+ and Triton)
 - `--compile_backend`: Compilation backend: 'inductor' (full optimization) or 'cudagraphs' (lightweight) (default: inductor)
@@ -893,8 +908,8 @@ python inference_cli.py media_folder/ \
 - `--compile_dynamo_recompile_limit`: Max recompilation attempts before fallback (default: 128)

 **Model Caching (batch processing):**
-- `--cache_dit`: Cache DiT model between files (single GPU only, speeds up directory processing)
-- `--cache_vae`: Cache VAE model between files (single GPU only, speeds up directory processing)
+- `--cache_dit`: Keep DiT model in memory between generations. Works with single-GPU directory processing or multi-GPU streaming (`--chunk_size`). Requires `--dit_offload_device`
+- `--cache_vae`: Keep VAE model in memory between generations. Works with single-GPU directory processing or multi-GPU streaming (`--chunk_size`). Requires `--vae_offload_device`

 **Multi-GPU:**
 - `--cuda_device`: CUDA device id(s). Single id (e.g., '0') or comma-separated list '0,1' for multi-GPU
@@ -997,7 +1012,7 @@ For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

 This ComfyUI implementation is a collaborative project by **[NumZ](https://github.com/numz)** and **[AInVFX](https://www.youtube.com/@AInVFX)** (Adrien Toupet), based on the original [SeedVR2](https://github.com/ByteDance-Seed/SeedVR) by ByteDance Seed Team.

-Special thanks to our community contributors including [benjaminherb](https://github.com/benjaminherb), [cmeka](https://github.com/cmeka), [FurkanGozukara](https://github.com/FurkanGozukara), [JohnAlcatraz](https://github.com/JohnAlcatraz), [lihaoyun6](https://github.com/lihaoyun6), [Luchuanzhao](https://github.com/Luchuanzhao), [Luke2642](https://github.com/Luke2642), [naxci1](https://github.com/naxci1), [q5sys](https://github.com/q5sys), and many others for their improvements, bug fixes, and testing.
+Special thanks to our community contributors including [naxci1](https://github.com/naxci1), [benjaminherb](https://github.com/benjaminherb), [cmeka](https://github.com/cmeka), [FurkanGozukara](https://github.com/FurkanGozukara), [JohnAlcatraz](https://github.com/JohnAlcatraz), [lihaoyun6](https://github.com/lihaoyun6), [Luchuanzhao](https://github.com/Luchuanzhao), [Luke2642](https://github.com/Luke2642), [proxyid](https://github.com/proxyid), [q5sys](https://github.com/q5sys), and many others for their improvements, bug fixes, and testing.

 ## 📜 License

inference_cli.py

Lines changed: 9 additions & 24 deletions
@@ -1327,9 +1327,10 @@ def parse_arguments() -> argparse.Namespace:
     blockswap_group = parser.add_argument_group('Memory optimization (BlockSwap)')
     blockswap_group.add_argument("--blocks_to_swap", type=int, default=0,
                                  help="Transformer blocks to swap for VRAM savings. 0-32 (3B) or 0-36 (7B). "
-                                      "Requires --dit_offload_device. Default: 0 (disabled)")
+                                      "Requires --dit_offload_device. Not available on macOS. Default: 0 (disabled)")
     blockswap_group.add_argument("--swap_io_components", action="store_true",
-                                 help="Offload DiT I/O layers for extra VRAM savings. Requires --dit_offload_device")
+                                 help="Offload DiT I/O layers for extra VRAM savings. Requires --dit_offload_device. "
+                                      "Not available on macOS")

     # VAE Tiling
     vae_group = parser.add_argument_group('VAE tiling (for high resolution upscale)')
@@ -1351,8 +1352,8 @@ def parse_arguments() -> argparse.Namespace:
     # Performance
     perf_group = parser.add_argument_group('Performance optimization')
     perf_group.add_argument("--attention_mode", type=str, default="sdpa",
-                            choices=["sdpa", "flash_attn"],
-                            help="Attention backend: 'sdpa' (default, always available) or 'flash_attn' (faster, requires package)")
+                            choices=["sdpa", "flash_attn_2", "flash_attn_3", "sageattn_2", "sageattn_3"],
+                            help="Attention backend: 'sdpa' (default), 'flash_attn_2', 'flash_attn_3', 'sageattn_2', or 'sageattn_3' (Blackwell GPUs)")
     perf_group.add_argument("--compile_dit", action="store_true",
                             help="Enable torch.compile for DiT model (20-40%% speedup, requires PyTorch 2.0+ and Triton)")
     perf_group.add_argument("--compile_vae", action="store_true",
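The expanded `choices` list does the validation work by itself: argparse rejects any value outside it before the program runs. A minimal standalone sketch, rebuilding just the `--attention_mode` option with the names from the diff above:

```python
import argparse

# Rebuild only the --attention_mode option to show how the expanded choices
# are validated; argparse raises SystemExit for any value not in the list.
parser = argparse.ArgumentParser()
parser.add_argument("--attention_mode", type=str, default="sdpa",
                    choices=["sdpa", "flash_attn_2", "flash_attn_3",
                             "sageattn_2", "sageattn_3"])

args = parser.parse_args(["--attention_mode", "flash_attn_2"])
print(args.attention_mode)  # flash_attn_2
```

Passing the old value `flash_attn` now fails fast with a usage error listing the valid choices, which surfaces outdated scripts immediately.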
@@ -1374,9 +1375,11 @@ def parse_arguments() -> argparse.Namespace:
     # Model Caching (for batch processing)
    cache_group = parser.add_argument_group('Model caching (batch processing)')
     cache_group.add_argument("--cache_dit", action="store_true",
-                             help="Cache DiT model between files (single GPU only, speeds up directory processing)")
+                             help="Keep DiT model in memory between generations. Works with single-GPU directory processing "
+                                  "or multi-GPU streaming (--chunk_size). Requires --dit_offload_device")
     cache_group.add_argument("--cache_vae", action="store_true",
-                             help="Cache VAE model between files (single GPU only, speeds up directory processing)")
+                             help="Keep VAE model in memory between generations. Works with single-GPU directory processing "
+                                  "or multi-GPU streaming (--chunk_size). Requires --vae_offload_device")

     # Debugging
     debug_group = parser.add_argument_group('Debugging')
@@ -1435,24 +1438,6 @@ def main() -> None:
            debug.log(f"VAE decode tile overlap ({args.vae_decode_tile_overlap}) must be smaller than tile size ({args.vae_decode_tile_size})", level="ERROR", category="vae", force=True)
            sys.exit(1)

-    # Validate BlockSwap configuration - either blocks_to_swap or swap_io_components requires dit_offload_device
-    blockswap_enabled = args.blocks_to_swap > 0 or args.swap_io_components
-    if blockswap_enabled and args.dit_offload_device == "none":
-        config_details = []
-        if args.blocks_to_swap > 0:
-            config_details.append(f"blocks_to_swap={args.blocks_to_swap}")
-        if args.swap_io_components:
-            config_details.append("swap_io_components=True")
-
-        debug.log(
-            f"BlockSwap enabled ({', '.join(config_details)}) but dit_offload_device='none'. "
-            "BlockSwap requires dit_offload_device to be set (typically 'cpu'). "
-            "Either set --dit_offload_device cpu or disable BlockSwap "
-            "(--blocks_to_swap 0 and do not use --swap_io_components)",
-            level="ERROR", category="blockswap", force=True
-        )
-        sys.exit(1)
-
     # Inform about caching defaults
     if args.cache_dit and args.dit_offload_device == "none":
         offload_target = "system memory (CPU)" if get_gpu_backend() != "mps" else "unified memory"

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -1,15 +1,14 @@
 [project]
 name = "seedvr2_videoupscaler"
 description = "SeedVR2 official ComfyUI integration: ByteDance-Seed's one-step diffusion-based video/image upscaling with memory-efficient inference"
-version = "2.5.19"
+version = "2.5.20"
 authors = [
     {name = "numz"},
     {name = "adrientoupet"}
 ]
 license = {file = "LICENSE"}
 classifiers = [
-    "Operating System :: OS Independent",
-    "Environment :: GPU :: NVIDIA CUDA"
+    "Operating System :: OS Independent"
 ]
 dependencies = [
     "torch",

src/core/generation_phases.py

Lines changed: 3 additions & 2 deletions
@@ -700,17 +700,18 @@ def _add_noise(x, aug_noise):
     )
     conditions = [condition]

-    # Detect DiT model dtype (handle FP8CompatibleDiT wrapper)
+    # Detect DiT model dtype (handle CompatibleDiT wrapper)
     dit_model = runner.dit.dit_model if hasattr(runner.dit, 'dit_model') else runner.dit
     try:
         dit_dtype = next(dit_model.parameters()).dtype
     except StopIteration:
         dit_dtype = ctx['compute_dtype']  # Fallback for meta device or empty model

     # Use autocast if DiT dtype differs from compute dtype
+    # Skip autocast on MPS (CompatibleDiT already handles dtype conversion)
     debug.start_timer(f"dit_inference_{upscale_idx+1}")
     with torch.no_grad():
-        if dit_dtype != ctx['compute_dtype']:
+        if dit_dtype != ctx['compute_dtype'] and ctx['dit_device'].type != 'mps':
             with torch.autocast(ctx['dit_device'].type, ctx['compute_dtype'], enabled=True):
                 upscaled_latents = runner.inference(
                     noises=noises,

src/core/generation_utils.py

Lines changed: 4 additions & 4 deletions
@@ -457,7 +457,7 @@ def prepare_runner(
        decode_tile_size: Tile size for decoding (height, width)
        decode_tile_overlap: Tile overlap for decoding (height, width)
        tile_debug: Tile visualization mode (false/encode/decode)
-       attention_mode: Attention computation backend ('sdpa' or 'flash_attn')
+       attention_mode: Attention computation backend ('sdpa', 'flash_attn_2', 'flash_attn_3', 'sageattn_2', or 'sageattn_3')
        torch_compile_args_dit: Optional torch.compile configuration for DiT model
        torch_compile_args_vae: Optional torch.compile configuration for VAE model

@@ -529,8 +529,8 @@ def load_text_embeddings(script_directory: str, device: torch.device,
     - Memory-efficient embedding preparation
     - Consistent movement logging
     """
-    text_pos_embeds = torch.load(os.path.join(script_directory, 'pos_emb.pt'))
-    text_neg_embeds = torch.load(os.path.join(script_directory, 'neg_emb.pt'))
+    text_pos_embeds = torch.load(os.path.join(script_directory, 'pos_emb.pt'), weights_only=True)
+    text_neg_embeds = torch.load(os.path.join(script_directory, 'neg_emb.pt'), weights_only=True)

     text_pos_embeds = manage_tensor(
         tensor=text_pos_embeds,

@@ -819,4 +819,4 @@ def ensure_precision_initialized(
         debug.log(f"Model precision: {', '.join(parts)}", category="precision")

     except Exception as e:
-        debug.log(f"Could not log model dtypes: {e}", level="WARNING", category="precision", force=True)
+        debug.log(f"Could not log model dtypes: {e}", level="WARNING", category="precision", force=True)
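The `weights_only=True` switch in the hunk above hardens `torch.load` against untrusted checkpoints: the restricted unpickler reconstructs only tensors and allow-listed types instead of executing arbitrary pickled code. A minimal round-trip sketch (the file name is a placeholder, not one of the repo's embedding files):

```python
import os
import tempfile

import torch

# Save a tensor, then load it back with the restricted unpickler enabled.
path = os.path.join(tempfile.mkdtemp(), "pos_emb_demo.pt")
torch.save(torch.ones(2, 3), path)

emb = torch.load(path, weights_only=True)
print(emb.shape)  # torch.Size([2, 3])
```

Plain tensor files like the embeddings here load unchanged; only checkpoints that pickle arbitrary Python objects would be rejected under the restricted mode.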
