
Commit 4ad5a50

Add AMD ROCm GPU support (#661)
# AMD ROCm Support for LightX2V

## Summary

This PR adds AMD ROCm support for LightX2V, enabling high-performance video generation on AMD GPUs. The implementation leverages the `aiter` library for optimized kernels.

## End-to-End Performance

### Wan2.1-T2V-1.3B (33 frames, 480×848, 20 steps)

| Configuration | Time | Speedup |
|---------------|------|---------|
| After PR (`flash_attn2`) | 13.96s | 1.00x |
| **After PR (`aiter_attn`)** | **6.31s** | **2.21x** |

### Qwen-Image-Edit-2511 (1024×1024, 8 steps)

| Configuration | Time | Speedup |
|---------------|------|---------|
| After PR (`flash_attn2`) | 4.83s | 1.00x |
| **After PR (`aiter_attn`)** | **2.86s** | **1.69x** |

> **Note**: Before this PR, LightX2V did not work on AMD ROCm at all due to missing `sgl_kernel` support. This PR enables AMD support and provides significant performance optimizations.

## Optimizations Added

| Optimization | Description | How to Enable |
|--------------|-------------|---------------|
| **sgl_kernel replacement** | Uses `aiter` for RMSNorm and FP8/INT8 GEMM | Automatic (required on AMD) |
| **VAE cudnn disabled** | Disables cudnn for faster convolution | Automatic |
| **aiter_attn** | Optimized Flash Attention using aiter FA3 | `attn_mode="aiter_attn"` |

### Kernel-Level Performance

| Kernel | Speedup (vs. baseline) |
|--------|------------------------|
| Flash Attention (aiter vs. flash_attn) | 5.5x - 5.7x |
| RMSNorm (aiter vs. torch) | 3x - 4.2x |
| VAE Conv (cudnn disabled) | ~1.2x |

## Usage

### Environment Setup

Set the platform environment variable for AMD ROCm:

```bash
export PLATFORM=amd_rocm
```

### Wan Models (Text-to-Video)

```python
from lightx2v import LightX2VPipeline

# Initialize pipeline
pipe = LightX2VPipeline(
    model_path="/path/to/Wan2.1-T2V-14B",
    model_cls="wan2.1",
    task="t2v",
)

# Create generator with aiter_attn for AMD ROCm optimization
pipe.create_generator(
    attn_mode="aiter_attn",  # Use aiter_attn for 2.21x speedup on AMD
    infer_steps=50,
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
    sample_shift=5.0,
)

# Generate video
pipe.generate(
    seed=42,
    prompt="Two anthropomorphic cats in comfy boxing gear fight on a spotlighted stage.",
    negative_prompt="",
    save_result_path="output.mp4",
)
```

### Qwen-Image-Edit Models (Image-to-Image)

```python
from lightx2v import LightX2VPipeline

# Initialize pipeline
pipe = LightX2VPipeline(
    model_path="/path/to/Qwen-Image-Edit-2511",
    model_cls="qwen-image-edit-2511",
    task="i2i",
)

# Create generator with aiter_attn for AMD ROCm optimization
pipe.create_generator(
    attn_mode="aiter_attn",  # Use aiter_attn for 1.69x speedup on AMD
    auto_resize=True,
    infer_steps=8,
    guidance_scale=1,
)

# Generate image
pipe.generate(
    seed=42,
    image_path="input.png",
    prompt="Replace the shirt with a light blue shirt.",
    negative_prompt="",
    save_result_path="output.png",
)
```

## Installation

### Prerequisites

For the AMD ROCm platform, `aiter` is **required**:

```bash
git clone https://github.com/ROCm/aiter.git /tmp/aiter && \
    cd /tmp/aiter && \
    git checkout a7d3bf8cd47afbaf6a6133c1f12e3b01d2c27b0e && \
    pip install -e .
```

### Docker Support

Build the AMD ROCm image:

```bash
docker build -f dockerfiles/Dockerfile.mi350 -t lightx2v:rocm .
```

Run the container:

```bash
docker run --device=/dev/kfd --device=/dev/dri \
    --group-add video --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v /path/to/models:/models \
    lightx2v:rocm python your_script.py
```

## Platform Detection

AMD ROCm is automatically detected:

```python
IS_AMD_ROCM = hasattr(torch.version, "hip") and torch.version.hip is not None
```

When AMD ROCm is detected:

1. `aiter` is injected as an `sgl_kernel` replacement (required for AMD support)
2. `cudnn` is automatically disabled (VAE optimization)
3. Users can set `attn_mode="aiter_attn"` for additional Flash Attention optimization
1 parent e831d84 commit 4ad5a50

File tree

7 files changed, +328 / -4 lines changed


dockerfiles/Dockerfile.mi350

Lines changed: 48 additions & 0 deletions
```dockerfile
# Dockerfile for LightX2V on AMD ROCm platform
# Base image: SGLang with ROCm 7.0.0 for MI300X
FROM lmsysorg/sglang:v0.5.6.post2-rocm700-mi35x

LABEL maintainer="LightX2V Contributors"
LABEL description="LightX2V video generation framework with AMD ROCm support"

# Set working directory
WORKDIR /workspace

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    ffmpeg \
    libsm6 \
    libxext6 \
    && rm -rf /var/lib/apt/lists/*

# Install aiter (AMD ROCm optimized kernels)
# Commit: a7d3bf8cd47afbaf6a6133c1f12e3b01d2c27b0e
ARG AITER_COMMIT=a7d3bf8cd47afbaf6a6133c1f12e3b01d2c27b0e
RUN git clone https://github.com/ROCm/aiter.git /tmp/aiter && \
    cd /tmp/aiter && \
    git checkout ${AITER_COMMIT} && \
    pip install --no-cache-dir -e . && \
    rm -rf /tmp/aiter/.git

# Install flash-attn for ROCm
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Copy LightX2V source
COPY . /workspace/LightX2V

# Install LightX2V dependencies
WORKDIR /workspace/LightX2V
RUN pip install --no-cache-dir -r requirements.txt

# Install LightX2V
RUN pip install --no-cache-dir -e .

# Set environment variables for AMD ROCm
ENV HIP_VISIBLE_DEVICES=0
ENV ROCM_PATH=/opt/rocm
ENV HSA_FORCE_FINE_GRAIN_PCIE=1

# Default command
CMD ["python", "-c", "from lightx2v import LightX2VPipeline; print('LightX2V AMD ROCm ready!')"]
```

lightx2v/pipeline.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -203,7 +203,7 @@ def set_infer_config(
             self.self_attn_1_type = attn_mode
             self.cross_attn_1_type = attn_mode
             self.cross_attn_2_type = attn_mode
-        elif self.model_cls in ["hunyuan_video_1.5", "hunyuan_video_1.5_distill"]:
+        elif self.model_cls in ["hunyuan_video_1.5", "hunyuan_video_1.5_distill", "qwen_image"]:
             self.attn_type = attn_mode
 
     def set_infer_config_json(self, config_json):
```

lightx2v_platform/base/__init__.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,7 +1,8 @@
 from lightx2v_platform.base.base import check_ai_device, init_ai_device
+from lightx2v_platform.base.amd_rocm import AmdRocmDevice
 from lightx2v_platform.base.cambricon_mlu import MluDevice
 from lightx2v_platform.base.hygon_dcu import HygonDcuDevice
 from lightx2v_platform.base.metax import MetaxDevice
 from lightx2v_platform.base.nvidia import CudaDevice
 
-__all__ = ["init_ai_device", "check_ai_device", "CudaDevice", "MluDevice", "MetaxDevice", "HygonDcuDevice"]
+__all__ = ["init_ai_device", "check_ai_device", "CudaDevice", "MluDevice", "MetaxDevice", "HygonDcuDevice", "AmdRocmDevice"]
```

lightx2v_platform/base/amd_rocm.py

Lines changed: 161 additions & 0 deletions
```python
"""
AMD ROCm Device implementation for LightX2V.

AMD ROCm provides CUDA-compatible APIs through HIP (Heterogeneous-computing Interface for Portability).
This module handles AMD-specific optimizations including:
- Disabling cudnn for faster VAE convolution
- sgl_kernel compatibility layer using aiter library (required on AMD)
"""

import sys
import torch
import torch.distributed as dist

from loguru import logger

from lightx2v_platform.registry_factory import PLATFORM_DEVICE_REGISTER

# Detect AMD ROCm platform
IS_AMD_ROCM = hasattr(torch.version, "hip") and torch.version.hip is not None

# aiter installation info
AITER_REPO = "https://github.com/ROCm/aiter.git"
AITER_COMMIT = "a7d3bf8cd47afbaf6a6133c1f12e3b01d2c27b0e"
AITER_INSTALL_CMD = f"""
# One-line install command for aiter (AMD ROCm optimized kernels):
git clone {AITER_REPO} /tmp/aiter && \\
cd /tmp/aiter && \\
git checkout {AITER_COMMIT} && \\
pip install -e .
"""


class AiterSglKernelCompat:
    """
    Compatibility layer to use aiter with sgl_kernel interface.

    This class wraps aiter functions to match sgl_kernel's API,
    allowing existing code to work seamlessly on AMD GPUs.

    Note: This is REQUIRED on AMD ROCm as the original sgl_kernel
    does not support AMD GPUs.
    """

    def __init__(self, aiter_module):
        self._aiter = aiter_module
        self._gemm_a8w8 = aiter_module.gemm_a8w8_CK
        self._pertoken_quant = aiter_module.pertoken_quant
        self._dtypes = aiter_module.dtypes
        self._rms_norm = aiter_module.rms_norm
        logger.info("Using aiter as sgl_kernel backend (AMD ROCm optimized)")

    def rmsnorm(self, input, weight, eps):
        """RMSNorm compatible with sgl_kernel.rmsnorm(input, weight, eps)"""
        return self._rms_norm(input, weight, eps)

    def fp8_scaled_mm(self, input_quant, weight, input_scale, weight_scale, dtype, bias=None):
        """FP8 GEMM compatible with sgl_kernel.fp8_scaled_mm"""
        return self._gemm_a8w8(input_quant, weight, input_scale, weight_scale, bias, dtype)

    def int8_scaled_mm(self, input_quant, weight, input_scale, weight_scale, dtype, bias=None):
        """INT8 GEMM compatible with sgl_kernel.int8_scaled_mm"""
        return self._gemm_a8w8(input_quant, weight, input_scale, weight_scale, bias, dtype)

    def sgl_per_token_quant_fp8(self, x, out, scale):
        """Per-token FP8 quantization compatible with sgl_kernel.sgl_per_token_quant_fp8"""
        q, s = self._pertoken_quant(x, quant_dtype=self._dtypes.fp8)
        out.copy_(q)
        scale.copy_(s)

    def sgl_per_token_group_quant_fp8(self, x, out, scale, group_size=128, eps=1e-10, fp8_min=-448.0, fp8_max=448.0):
        """Per-token per-group FP8 quantization compatible with sgl_kernel.sgl_per_token_group_quant_fp8"""
        m, k = x.shape
        x_view = x.view(m, -1, group_size)
        x_amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(eps)
        q = (x_view * (fp8_max / x_amax.unsqueeze(2))).to(torch.float8_e4m3fn).view(m, k)
        s = (x_amax / fp8_max).view(m, -1)
        out.copy_(q)
        scale.copy_(s)


def _get_aiter_sgl_kernel():
    """Get aiter-based sgl_kernel compatibility layer."""
    try:
        import aiter
        return AiterSglKernelCompat(aiter)
    except ImportError:
        logger.error(
            f"\n{'='*60}\n"
            f"ERROR: AMD ROCm detected but aiter is not installed.\n"
            f"aiter is REQUIRED for LightX2V to work on AMD GPUs.\n"
            f"\nPlease install aiter:\n"
            f"{AITER_INSTALL_CMD}\n"
            f"{'='*60}\n"
        )
        raise ImportError(
            "aiter is required for AMD ROCm support. "
            f"Please install: pip install git+{AITER_REPO}@{AITER_COMMIT}"
        )


@PLATFORM_DEVICE_REGISTER("amd_rocm")
class AmdRocmDevice:
    """
    AMD ROCm Device implementation for LightX2V.

    AMD ROCm uses CUDA-compatible APIs through HIP.
    This class provides AMD-specific optimizations.
    """

    name = "amd_rocm"

    @staticmethod
    def init_device_env():
        """
        Initialize AMD ROCm optimizations.

        This is called from lightx2v_platform.set_ai_device when platform is amd_rocm.
        1. Disable cudnn for faster VAE convolution
        2. Inject aiter as sgl_kernel compatibility layer (REQUIRED on AMD)
        """
        logger.info("AMD ROCm platform detected, initializing optimizations...")

        # Disable cudnn for faster VAE conv computation
        torch.backends.cudnn.enabled = False
        logger.info(" - cudnn disabled for faster VAE convolution")

        # Inject aiter as sgl_kernel compatibility layer (REQUIRED)
        sgl_kernel = _get_aiter_sgl_kernel()
        sys.modules["sgl_kernel"] = sgl_kernel
        # Update any module that already imported sgl_kernel
        for mod_name, mod in list(sys.modules.items()):
            if mod is not None and hasattr(mod, 'sgl_kernel'):
                setattr(mod, 'sgl_kernel', sgl_kernel)
        logger.info(" - aiter sgl_kernel compatibility layer enabled (RMSNorm, GEMM)")

    @staticmethod
    def is_available() -> bool:
        """Check if AMD ROCm is available."""
        return IS_AMD_ROCM and torch.cuda.is_available()

    @staticmethod
    def get_device() -> str:
        """Get the device type string. Returns 'cuda' for ROCm compatibility."""
        return "cuda"

    @staticmethod
    def init_parallel_env():
        """Initialize distributed parallel environment for AMD ROCm."""
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(dist.get_rank())


# Export constants
__all__ = [
    "IS_AMD_ROCM",
    "AITER_REPO",
    "AITER_COMMIT",
    "AITER_INSTALL_CMD",
    "AiterSglKernelCompat",
    "AmdRocmDevice",
]
```
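To make the injection concrete, here is a small sketch of what call sites see after `init_device_env` runs: the name `sgl_kernel` now resolves to the `AiterSglKernelCompat` instance, so existing `sgl_kernel.rmsnorm(...)` calls hit aiter's `rms_norm` on AMD. The tensor sizes and dtype below are illustrative assumptions, and this only runs on a ROCm machine with `aiter` installed:

```python
import torch
from lightx2v_platform.base.amd_rocm import AmdRocmDevice

AmdRocmDevice.init_device_env()  # installs the aiter-backed compat object under "sgl_kernel"

import sgl_kernel  # binds the injected AiterSglKernelCompat instance from sys.modules

x = torch.randn(4, 1024, device="cuda", dtype=torch.bfloat16)  # illustrative shape
w = torch.ones(1024, device="cuda", dtype=torch.bfloat16)

y = sgl_kernel.rmsnorm(x, w, 1e-6)  # dispatched to aiter.rms_norm
print(y.shape)
```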

lightx2v_platform/ops/__init__.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -6,6 +6,8 @@
     from .attn.cambricon_mlu import *
     from .mm.cambricon_mlu import *
 elif AI_DEVICE == "cuda":
-    # Check if running on Hygon DCU platform
-    if os.getenv("PLATFORM") == "hygon_dcu":
+    platform = os.getenv("PLATFORM")
+    if platform == "hygon_dcu":
         from .attn.hygon_dcu import *
+    elif platform == "amd_rocm":
+        from .attn.amd_rocm import *
```
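Because the `amd_rocm` branch above only runs when the environment variable is visible at import time, `PLATFORM=amd_rocm` has to be set before `lightx2v_platform.ops` is first imported. A minimal sketch of that ordering (setting the variable in-process here merely mirrors the documented `export PLATFORM=amd_rocm`):

```python
import os

# Must be set before lightx2v_platform.ops is imported; otherwise the CUDA branch
# is taken without pulling in the amd_rocm attention registration.
os.environ["PLATFORM"] = "amd_rocm"

import lightx2v_platform.ops  # noqa: F401  (triggers `from .attn.amd_rocm import *`)
```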
lightx2v_platform/ops/attn/amd_rocm/__init__.py

Lines changed: 2 additions & 0 deletions

```python
from .flash_attn import *
```
lightx2v_platform/ops/attn/amd_rocm/flash_attn.py

Lines changed: 110 additions & 0 deletions

```python
"""
AMD ROCm optimized attention using aiter library.
Provides significantly faster attention computation on AMD GPUs (2.5x-6x speedup).
Internally uses FA3 (fmha_v3) when conditions are met.
"""

import torch
from loguru import logger

from lightx2v_platform.ops.attn.template import AttnWeightTemplate
from lightx2v_platform.registry_factory import PLATFORM_ATTN_WEIGHT_REGISTER

# Detect AMD ROCm platform
IS_AMD_ROCM = hasattr(torch.version, "hip") and torch.version.hip is not None

# aiter installation info
AITER_REPO = "https://github.com/ROCm/aiter.git"
AITER_COMMIT = "a7d3bf8cd47afbaf6a6133c1f12e3b01d2c27b0e"
AITER_INSTALL_CMD = f"""
# One-line install command for aiter (AMD ROCm optimized kernels):
git clone {AITER_REPO} /tmp/aiter && \\
cd /tmp/aiter && \\
git checkout {AITER_COMMIT} && \\
pip install -e .
"""

# Try to import aiter (AMD ROCm optimized)
aiter_flash_attn_varlen_func = None
AITER_AVAILABLE = False
AITER_IMPORT_ERROR = None

try:
    from aiter import flash_attn_varlen_func as aiter_flash_attn_varlen_func

    AITER_AVAILABLE = True
    logger.info("aiter flash_attn_varlen_func found (AMD ROCm optimized)")
except ImportError as e:
    AITER_IMPORT_ERROR = str(e)
    if IS_AMD_ROCM:
        logger.warning(
            f"aiter not found on AMD ROCm platform. "
            f"For optimal performance, please install aiter:\n{AITER_INSTALL_CMD}"
        )
    else:
        logger.debug("aiter not found (only available on AMD ROCm platform)")


@PLATFORM_ATTN_WEIGHT_REGISTER("aiter_attn")
class AiterAttnWeight(AttnWeightTemplate):
    """
    AMD ROCm optimized attention using aiter library.

    Performance:
    - 2.5x-6x faster than flash_attn package on AMD GPUs
    - Automatically uses FA3 (fmha_v3) when conditions are met

    Requirements:
    - aiter library (AMD ROCm)
    - AMD GPU with ROCm support
    """

    def __init__(self):
        self.config = {}

        # Check platform first
        if not IS_AMD_ROCM:
            raise RuntimeError(
                "aiter_attn is only available on AMD ROCm platform.\n"
                "Current platform is not AMD ROCm (torch.version.hip is not set).\n"
                "For NVIDIA GPUs, please use 'flash_attn2' or 'flash_attn3' instead."
            )

        # Check aiter availability
        if not AITER_AVAILABLE:
            raise ImportError(
                f"aiter is not installed on AMD ROCm platform.\n"
                f"Import error: {AITER_IMPORT_ERROR}\n"
                f"Please install aiter for optimal performance:\n{AITER_INSTALL_CMD}"
            )

    def apply(
        self,
        q,
        k,
        v,
        cu_seqlens_q=None,
        cu_seqlens_kv=None,
        max_seqlen_q=None,
        max_seqlen_kv=None,
        model_cls=None,
    ):
        if len(q.shape) == 3:
            bs = 1
        elif len(q.shape) == 4:
            bs = q.shape[0]
            q = q.reshape(-1, q.shape[-2], q.shape[-1])
            k = k.reshape(-1, k.shape[-2], k.shape[-1])
            v = v.reshape(-1, v.shape[-2], v.shape[-1])

        x = aiter_flash_attn_varlen_func(
            q,
            k,
            v,
            cu_seqlens_q,
            cu_seqlens_kv,
            max_seqlen_q,
            max_seqlen_kv,
        ).reshape(bs * max_seqlen_q, -1)
        return x
```
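For reference, a hedged sketch of driving `AiterAttnWeight.apply` directly with a single packed sequence; the import path follows the inferred file location above, the sequence length, head count, and head dimension are made-up values, and it only runs on an AMD GPU with `aiter` installed:

```python
import torch
from lightx2v_platform.ops.attn.amd_rocm.flash_attn import AiterAttnWeight

attn = AiterAttnWeight()  # raises RuntimeError/ImportError off-platform or without aiter

seqlen, num_heads, head_dim = 128, 8, 64  # illustrative sizes
q = torch.randn(seqlen, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
cu_seqlens = torch.tensor([0, seqlen], device="cuda", dtype=torch.int32)

out = attn.apply(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=seqlen,
    max_seqlen_kv=seqlen,
)
print(out.shape)  # (seqlen, num_heads * head_dim) after the final reshape
```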
