
Commit a1486a3

Merge pull request #402 from AInVFX/main
v2.5.20: expanded attention backends (FA2/FA3/SA2/SA3), macOS MPS dtype fixes, bitsandbytes ROCm shim
2 parents 2006fa3 + bbf649d commit a1486a3

File tree

16 files changed

+774
-242
lines changed


README.md

Lines changed: 34 additions & 19 deletions
@@ -4,7 +4,7 @@

 Official release of [SeedVR2](https://github.com/ByteDance-Seed/SeedVR) for ComfyUI that enables high-quality video and image upscaling.

-Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#-run-as-standalone-cli) section.
+Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#-run-as-standalone-cli) section.

 [![SeedVR2 v2.5 Deep Dive Tutorial](https://img.youtube.com/vi/MBtWYXq_r60/maxresdefault.jpg)](https://youtu.be/MBtWYXq_r60)

@@ -14,8 +14,8 @@ Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#
 ## 📋 Quick Access

-- [🆙 Future Releases](#-future-releases)
-- [🚀 Updates](#-updates)
+- [🆙 Future Work](#-future-work)
+- [🚀 Release Notes](#-release-notes)
 - [🎯 Features](#-features)
 - [🔧 Requirements](#-requirements)
 - [📦 Installation](#-installation)
@@ -26,15 +26,24 @@ Can run as **Multi-GPU standalone CLI** too, see [🖥️ Run as Standalone](#
 - [🙏 Credits](#-credits)
 - [📜 License](#-license)

-## 🆙 Future Releases
+## 🆙 Future Work

 We're actively working on improvements and new features. To stay informed:

 - **📌 Track Active Development**: Visit [Issues](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/issues) to see active development, report bugs, and request new features
 - **💬 Join the Community**: Learn from others, share your workflows, and get help in the [Discussions](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/discussions)
 - **🔮 Next Model Survey**: We're looking for community input on the next open-source super-powerful generic restoration model. Share your suggestions in [Issue #164](https://github.com/numz/ComfyUI-SeedVR2_VideoUpscaler/issues/164)

-## 🚀 Updates
+## 🚀 Release Notes
+
+**2025.12.12 - Version 2.5.20**
+
+- **⚡ Expanded attention backends** - Full support for Flash Attention 2 (Ampere+), Flash Attention 3 (Hopper+), SageAttention 2, and SageAttention 3 (Blackwell/RTX 50xx), with automatic fallback chains to PyTorch SDPA when unavailable *(based on PR by [@naxci1](https://github.com/naxci1) - thank you!)*
+- **🍎 macOS/Apple Silicon compatibility** - Replaced MPS autocast with explicit dtype conversion throughout the VAE and DiT pipelines, resolving hangs and crashes on M-series Macs. BlockSwap now auto-disables with a warning (unified memory makes it unnecessary)
+- **🛡️ Flash Attention graceful fallback** - Added compatibility shims for corrupted or partially installed flash_attn/xformers DLLs, preventing startup crashes
+- **🛡️ AMD ROCm: bitsandbytes conflict fix** - Prevent kernel registration errors when diffusers attempts to re-import broken bitsandbytes installations
+- **📦 ComfyUI Manager: macOS classifier fix** - Removed the NVIDIA CUDA classifier that caused false "GPU not supported" warnings on macOS
+- **📚 Documentation updates** - Updated README with attention backend details, BlockSwap macOS notes, and clarified model caching descriptions

 **2025.12.10 - Version 2.5.19**
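The "automatic fallback chains" mentioned in the release notes can be sketched as a small lookup: each requested backend degrades to the next candidate until an installed one is found, ending at PyTorch SDPA. This is a hypothetical illustration, not the extension's actual code; the backend names come from this release, but the chain ordering and function name are assumptions.

```python
# Hypothetical fallback-chain sketch. SDPA ships with PyTorch, so it is the
# terminal fallback for every chain; SageAttention 3 degrades through
# SageAttention 2, Flash Attention 3 through Flash Attention 2 (assumed order).
FALLBACK_CHAIN = {
    "sageattn_3": "sageattn_2",
    "sageattn_2": "sdpa",
    "flash_attn_3": "flash_attn_2",
    "flash_attn_2": "sdpa",
}

def resolve_backend(requested: str, available: set) -> str:
    """Walk the chain until an available backend is reached."""
    backend = requested
    while backend not in available:
        backend = FALLBACK_CHAIN.get(backend, "sdpa")
        if backend == "sdpa":
            break  # always-available terminal fallback
    return backend
```

On a Hopper machine with only `flash-attn` installed, for example, a `sageattn_3` request would walk down to `sdpa` unless `sageattention` is also present.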

@@ -232,7 +241,7 @@ We're actively working on improvements and new features. To stay informed:
 **2025.07.03**

-- 🛠️ Can run as **standalone mode** with **Multi GPU** see [🖥️ Run as Standalone](#️-run-as-standalone-cli)
+- 🛠️ Can run as **standalone mode** with **Multi GPU** see [🖥️ Run as Standalone](#run-as-standalone-cli)

 **2025.06.30**

@@ -279,8 +288,8 @@ We're actively working on improvements and new features. To stay informed:
 ### Performance Features
 - **torch.compile Integration**: Optional 20-40% DiT speedup and 15-25% VAE speedup with PyTorch 2.0+ compilation
 - **Multi-GPU CLI**: Distribute workload across multiple GPUs with automatic temporal overlap blending
-- **Model Caching**: Keep models loaded in memory for faster batch processing
-- **Flexible Attention Backends**: Choose between PyTorch SDPA (stable, always available) or Flash Attention 2 (faster on supported hardware)
+- **Model Caching**: Keep models loaded between generations for single-GPU directory processing or multi-GPU streaming
+- **Flexible Attention Backends**: Choose between PyTorch SDPA (stable, always available), Flash Attention 2/3, or SageAttention 2/3 for faster computation on supported hardware

 ### Quality Control
 - **Advanced Color Correction**: Five methods including LAB (recommended for highest fidelity), wavelet, wavelet adaptive, HSV, and AdaIN
@@ -309,7 +318,7 @@ With the current optimizations (tiling, BlockSwap, GGUF quantization), SeedVR2 c
 - **Python**: 3.12+ (Python 3.12 and 3.13 tested and recommended)
 - **PyTorch**: 2.0+ for torch.compile support (optional but recommended)
 - **Triton**: Required for torch.compile with inductor backend (optional)
-- **Flash Attention 2**: Provides faster attention computation on supported hardware (optional, falls back to PyTorch SDPA)
+- **Flash Attention / SageAttention**: Flash Attention 2 (Ampere+), Flash Attention 3 (Hopper+), SageAttention 2 or SageAttention 3 (Blackwell) provide faster attention computation on supported hardware (optional, falls back to PyTorch SDPA)

 ## 📦 Installation

@@ -424,14 +433,21 @@ Configure the DiT (Diffusion Transformer) model for video upscaling.
   - Requires offload_device to be set and different from device

 - **attention_mode**: Attention computation backend
-  - `sdpa`: PyTorch scaled_dot_product_attention (default, stable, always available)
-  - `flash_attn`: Flash Attention 2 (faster on supported hardware, requires flash-attn package)
+  - `sdpa`: PyTorch scaled_dot_product_attention (default, always available)
+  - `flash_attn_2`: Flash Attention 2 (Ampere+, requires flash-attn package)
+  - `flash_attn_3`: Flash Attention 3 (Hopper+, requires flash-attn with FA3 support)
+  - `sageattn_2`: SageAttention 2 (requires sageattention package)
+  - `sageattn_3`: SageAttention 3 (Blackwell/RTX 50xx, requires sageattn3 package)

 - **torch_compile_args**: Connect to SeedVR2 Torch Compile Settings node for 20-40% speedup

 **BlockSwap Explained:**

-BlockSwap enables running large models on GPUs with limited VRAM by dynamically swapping transformer blocks between GPU and CPU memory during inference. Here's how it works:
+BlockSwap enables running large models on GPUs with limited VRAM by dynamically swapping transformer blocks between GPU and CPU memory during inference.
+
+> **Note:** BlockSwap is not available on macOS. Apple Silicon Macs use a unified memory architecture where the GPU and CPU share the same memory pool, so there is nothing to swap. The option is automatically disabled with a warning if requested on macOS.
+
+Here's how it works:

 - **What it does**: Keeps only the currently-needed transformer blocks on the GPU, while storing the rest on CPU or another device
 - **When to use it**: When you get OOM (Out of Memory) errors during the upscaling phase
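To make the swapping idea concrete, here is a deliberately simplified, framework-free sketch. The class name, the transfer counter, and the bookkeeping are invented for illustration; the real implementation moves actual transformer blocks between devices with `.to()` calls rather than counting simulated moves.

```python
class ToyBlockSwap:
    """Toy model of BlockSwap: the first `blocks_to_swap` blocks live off-GPU
    when idle and are brought onto the GPU only while they execute."""

    def __init__(self, blocks, blocks_to_swap):
        self.blocks = blocks
        self.swapped = set(range(blocks_to_swap))  # idle location: CPU
        self.transfers = 0                         # simulated device moves

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if i in self.swapped:
                self.transfers += 1  # real code: move block to GPU before use
            x = block(x)
            if i in self.swapped:
                self.transfers += 1  # real code: move block back to free VRAM
        return x
```

The trade-off is visible in the counter: every swapped block costs two transfers per forward pass, which is why BlockSwap trades speed for VRAM headroom.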
@@ -867,9 +883,8 @@ python inference_cli.py media_folder/ \
 **Memory Management:**
 - `--dit_offload_device`: Device to offload DiT model: 'none' (keep on GPU), 'cpu', or 'cuda:X' (default: none)
 - `--vae_offload_device`: Device to offload VAE model: 'none', 'cpu', or 'cuda:X' (default: none)
-- `--blocks_to_swap`: Number of transformer blocks to swap (0=disabled, 3B: 0-32, 7B: 0-36). Requires dit_offload_device (default: 0)
-- `--swap_io_components`: Offload I/O components for additional VRAM savings. Requires dit_offload_device
-- `--use_non_blocking`: Use non-blocking memory transfers for BlockSwap (recommended)
+- `--blocks_to_swap`: Number of transformer blocks to swap (0=disabled, 3B: 0-32, 7B: 0-36). Requires dit_offload_device (default: 0). Not available on macOS.
+- `--swap_io_components`: Offload I/O components for additional VRAM savings. Requires dit_offload_device. Not available on macOS.

 **VAE Tiling:**
 - `--vae_encode_tiled`: Enable VAE encode tiling to reduce VRAM during encoding
@@ -882,7 +897,7 @@ python inference_cli.py media_folder/ \

 **Performance Optimization:**
 - `--allow_vram_overflow`: Allow VRAM overflow to system RAM. Prevents OOM but may cause severe slowdown
-- `--attention_mode`: Attention backend: 'sdpa' (default, stable) or 'flash_attn' (faster, requires package)
+- `--attention_mode`: Attention backend: 'sdpa' (default), 'flash_attn_2' (Ampere+), 'flash_attn_3' (Hopper+), 'sageattn_2', or 'sageattn_3' (Blackwell)
 - `--compile_dit`: Enable torch.compile for DiT model (20-40% speedup, requires PyTorch 2.0+ and Triton)
 - `--compile_vae`: Enable torch.compile for VAE model (15-25% speedup, requires PyTorch 2.0+ and Triton)
 - `--compile_backend`: Compilation backend: 'inductor' (full optimization) or 'cudagraphs' (lightweight) (default: inductor)
@@ -893,8 +908,8 @@ python inference_cli.py media_folder/ \
 - `--compile_dynamo_recompile_limit`: Max recompilation attempts before fallback (default: 128)

 **Model Caching (batch processing):**
-- `--cache_dit`: Cache DiT model between files (single GPU only, speeds up directory processing)
-- `--cache_vae`: Cache VAE model between files (single GPU only, speeds up directory processing)
+- `--cache_dit`: Keep DiT model in memory between generations. Works with single-GPU directory processing or multi-GPU streaming (`--chunk_size`). Requires `--dit_offload_device`
+- `--cache_vae`: Keep VAE model in memory between generations. Works with single-GPU directory processing or multi-GPU streaming (`--chunk_size`). Requires `--vae_offload_device`

 **Multi-GPU:**
 - `--cuda_device`: CUDA device id(s). Single id (e.g., '0') or comma-separated list '0,1' for multi-GPU
@@ -997,7 +1012,7 @@ For detailed contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).

 This ComfyUI implementation is a collaborative project by **[NumZ](https://github.com/numz)** and **[AInVFX](https://www.youtube.com/@AInVFX)** (Adrien Toupet), based on the original [SeedVR2](https://github.com/ByteDance-Seed/SeedVR) by ByteDance Seed Team.

-Special thanks to our community contributors including [benjaminherb](https://github.com/benjaminherb), [cmeka](https://github.com/cmeka), [FurkanGozukara](https://github.com/FurkanGozukara), [JohnAlcatraz](https://github.com/JohnAlcatraz), [lihaoyun6](https://github.com/lihaoyun6), [Luchuanzhao](https://github.com/Luchuanzhao), [Luke2642](https://github.com/Luke2642), [naxci1](https://github.com/naxci1), [q5sys](https://github.com/q5sys), and many others for their improvements, bug fixes, and testing.
+Special thanks to our community contributors including [naxci1](https://github.com/naxci1), [benjaminherb](https://github.com/benjaminherb), [cmeka](https://github.com/cmeka), [FurkanGozukara](https://github.com/FurkanGozukara), [JohnAlcatraz](https://github.com/JohnAlcatraz), [lihaoyun6](https://github.com/lihaoyun6), [Luchuanzhao](https://github.com/Luchuanzhao), [Luke2642](https://github.com/Luke2642), [proxyid](https://github.com/proxyid), [q5sys](https://github.com/q5sys), and many others for their improvements, bug fixes, and testing.

 ## 📜 License

inference_cli.py

Lines changed: 9 additions & 24 deletions
@@ -1327,9 +1327,10 @@ def parse_arguments() -> argparse.Namespace:
     blockswap_group = parser.add_argument_group('Memory optimization (BlockSwap)')
     blockswap_group.add_argument("--blocks_to_swap", type=int, default=0,
                                  help="Transformer blocks to swap for VRAM savings. 0-32 (3B) or 0-36 (7B). "
-                                      "Requires --dit_offload_device. Default: 0 (disabled)")
+                                      "Requires --dit_offload_device. Not available on macOS. Default: 0 (disabled)")
     blockswap_group.add_argument("--swap_io_components", action="store_true",
-                                 help="Offload DiT I/O layers for extra VRAM savings. Requires --dit_offload_device")
+                                 help="Offload DiT I/O layers for extra VRAM savings. Requires --dit_offload_device. "
+                                      "Not available on macOS")

     # VAE Tiling
     vae_group = parser.add_argument_group('VAE tiling (for high resolution upscale)')
@@ -1351,8 +1352,8 @@ def parse_arguments() -> argparse.Namespace:
     # Performance
     perf_group = parser.add_argument_group('Performance optimization')
     perf_group.add_argument("--attention_mode", type=str, default="sdpa",
-                            choices=["sdpa", "flash_attn"],
-                            help="Attention backend: 'sdpa' (default, always available) or 'flash_attn' (faster, requires package)")
+                            choices=["sdpa", "flash_attn_2", "flash_attn_3", "sageattn_2", "sageattn_3"],
+                            help="Attention backend: 'sdpa' (default), 'flash_attn_2', 'flash_attn_3', 'sageattn_2', or 'sageattn_3' (Blackwell GPUs)")
     perf_group.add_argument("--compile_dit", action="store_true",
                             help="Enable torch.compile for DiT model (20-40%% speedup, requires PyTorch 2.0+ and Triton)")
     perf_group.add_argument("--compile_vae", action="store_true",
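The expanded `choices` list does the validation work by itself: argparse rejects any value outside it before the program runs. A minimal standalone sketch, rebuilding just the `--attention_mode` option with the names from the diff above:

```python
import argparse

# Rebuild only the --attention_mode option to show how the expanded choices
# are validated; argparse raises SystemExit for any value not in the list.
parser = argparse.ArgumentParser()
parser.add_argument("--attention_mode", type=str, default="sdpa",
                    choices=["sdpa", "flash_attn_2", "flash_attn_3",
                             "sageattn_2", "sageattn_3"])

args = parser.parse_args(["--attention_mode", "flash_attn_2"])
print(args.attention_mode)  # flash_attn_2
```

Passing the old value `flash_attn` now fails fast with a usage error listing the valid choices, which surfaces outdated scripts immediately.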
@@ -1374,9 +1375,11 @@ def parse_arguments() -> argparse.Namespace:
     # Model Caching (for batch processing)
    cache_group = parser.add_argument_group('Model caching (batch processing)')
     cache_group.add_argument("--cache_dit", action="store_true",
-                             help="Cache DiT model between files (single GPU only, speeds up directory processing)")
+                             help="Keep DiT model in memory between generations. Works with single-GPU directory processing "
+                                  "or multi-GPU streaming (--chunk_size). Requires --dit_offload_device")
     cache_group.add_argument("--cache_vae", action="store_true",
-                             help="Cache VAE model between files (single GPU only, speeds up directory processing)")
+                             help="Keep VAE model in memory between generations. Works with single-GPU directory processing "
+                                  "or multi-GPU streaming (--chunk_size). Requires --vae_offload_device")

     # Debugging
     debug_group = parser.add_argument_group('Debugging')
@@ -1435,24 +1438,6 @@ def main() -> None:
            debug.log(f"VAE decode tile overlap ({args.vae_decode_tile_overlap}) must be smaller than tile size ({args.vae_decode_tile_size})", level="ERROR", category="vae", force=True)
            sys.exit(1)

-    # Validate BlockSwap configuration - either blocks_to_swap or swap_io_components requires dit_offload_device
-    blockswap_enabled = args.blocks_to_swap > 0 or args.swap_io_components
-    if blockswap_enabled and args.dit_offload_device == "none":
-        config_details = []
-        if args.blocks_to_swap > 0:
-            config_details.append(f"blocks_to_swap={args.blocks_to_swap}")
-        if args.swap_io_components:
-            config_details.append("swap_io_components=True")
-
-        debug.log(
-            f"BlockSwap enabled ({', '.join(config_details)}) but dit_offload_device='none'. "
-            "BlockSwap requires dit_offload_device to be set (typically 'cpu'). "
-            "Either set --dit_offload_device cpu or disable BlockSwap "
-            "(--blocks_to_swap 0 and do not use --swap_io_components)",
-            level="ERROR", category="blockswap", force=True
-        )
-        sys.exit(1)
-
     # Inform about caching defaults
     if args.cache_dit and args.dit_offload_device == "none":
         offload_target = "system memory (CPU)" if get_gpu_backend() != "mps" else "unified memory"

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -1,15 +1,14 @@
 [project]
 name = "seedvr2_videoupscaler"
 description = "SeedVR2 official ComfyUI integration: ByteDance-Seed's one-step diffusion-based video/image upscaling with memory-efficient inference"
-version = "2.5.19"
+version = "2.5.20"
 authors = [
     {name = "numz"},
     {name = "adrientoupet"}
 ]
 license = {file = "LICENSE"}
 classifiers = [
-    "Operating System :: OS Independent",
-    "Environment :: GPU :: NVIDIA CUDA"
+    "Operating System :: OS Independent"
 ]
 dependencies = [
     "torch",

src/core/generation_phases.py

Lines changed: 3 additions & 2 deletions
@@ -700,17 +700,18 @@ def _add_noise(x, aug_noise):
     )
     conditions = [condition]

-    # Detect DiT model dtype (handle FP8CompatibleDiT wrapper)
+    # Detect DiT model dtype (handle CompatibleDiT wrapper)
     dit_model = runner.dit.dit_model if hasattr(runner.dit, 'dit_model') else runner.dit
     try:
         dit_dtype = next(dit_model.parameters()).dtype
     except StopIteration:
         dit_dtype = ctx['compute_dtype']  # Fallback for meta device or empty model

     # Use autocast if DiT dtype differs from compute dtype
+    # Skip autocast on MPS (CompatibleDiT already handles dtype conversion)
     debug.start_timer(f"dit_inference_{upscale_idx+1}")
     with torch.no_grad():
-        if dit_dtype != ctx['compute_dtype']:
+        if dit_dtype != ctx['compute_dtype'] and ctx['dit_device'].type != 'mps':
             with torch.autocast(ctx['dit_device'].type, ctx['compute_dtype'], enabled=True):
                 upscaled_latents = runner.inference(
                     noises=noises,

src/core/generation_utils.py

Lines changed: 4 additions & 4 deletions
@@ -457,7 +457,7 @@ def prepare_runner(
        decode_tile_size: Tile size for decoding (height, width)
        decode_tile_overlap: Tile overlap for decoding (height, width)
        tile_debug: Tile visualization mode (false/encode/decode)
-       attention_mode: Attention computation backend ('sdpa' or 'flash_attn')
+       attention_mode: Attention computation backend ('sdpa', 'flash_attn_2', 'flash_attn_3', 'sageattn_2', or 'sageattn_3')
        torch_compile_args_dit: Optional torch.compile configuration for DiT model
        torch_compile_args_vae: Optional torch.compile configuration for VAE model

@@ -529,8 +529,8 @@ def load_text_embeddings(script_directory: str, device: torch.device,
     - Memory-efficient embedding preparation
     - Consistent movement logging
     """
-    text_pos_embeds = torch.load(os.path.join(script_directory, 'pos_emb.pt'))
-    text_neg_embeds = torch.load(os.path.join(script_directory, 'neg_emb.pt'))
+    text_pos_embeds = torch.load(os.path.join(script_directory, 'pos_emb.pt'), weights_only=True)
+    text_neg_embeds = torch.load(os.path.join(script_directory, 'neg_emb.pt'), weights_only=True)

     text_pos_embeds = manage_tensor(
         tensor=text_pos_embeds,

@@ -819,4 +819,4 @@ def ensure_precision_initialized(
         debug.log(f"Model precision: {', '.join(parts)}", category="precision")

     except Exception as e:
-        debug.log(f"Could not log model dtypes: {e}", level="WARNING", category="precision", force=True)
+        debug.log(f"Could not log model dtypes: {e}", level="WARNING", category="precision", force=True)
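The `weights_only=True` switch in the hunk above hardens `torch.load` against untrusted checkpoints: the restricted unpickler reconstructs only tensors and allow-listed types instead of executing arbitrary pickled code. A minimal round-trip sketch (the file name is a placeholder, not one of the repo's embedding files):

```python
import os
import tempfile

import torch

# Save a tensor, then load it back with the restricted unpickler enabled.
path = os.path.join(tempfile.mkdtemp(), "pos_emb_demo.pt")
torch.save(torch.ones(2, 3), path)

emb = torch.load(path, weights_only=True)
print(emb.shape)  # torch.Size([2, 3])
```

Plain tensor files like the embeddings here load unchanged; only checkpoints that pickle arbitrary Python objects would be rejected under the restricted mode.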
