Commit a18e994

[ops] feat: add Quack GEMM backend for fused MoE & upgrade fa4 (#546)
1 parent 083873c commit a18e994

File tree: 15 files changed, +1303 −396 lines

docs/design/kernel_selection.md

Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
# Kernel Selection in VeOmni

VeOmni selects optimized kernel implementations for attention, cross-entropy
loss, Liger fused ops, and MoE at different points in the lifecycle. This
document describes every selection mechanism, when it fires, and how to
configure it.

## Quick Reference

| Kernel | Config field | Env var | Default | Selection time |
|--------|-------------|---------|---------|----------------|
| Attention | `attn_implementation` | — | `"flash_attention_2"` | Config `__post_init__` + `build_foundation_model` |
| Cross-entropy loss | — | `VEOMNI_USE_LIGER_KERNEL` | `"1"` | Import time |
| Liger fused ops (RMSNorm, RoPE, SwiGLU) | — | `VEOMNI_USE_LIGER_KERNEL` | `"1"` | Model registration (import time) |
| MoE implementation | `moe_implementation` | — | `None` | `build_foundation_model` |

All config fields live in `OpsImplementationConfig` (`veomni/arguments/arguments_types.py`),
accessible via `model.ops_implementation.*` in YAML.
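
The two config-driven fields above can be set together under `model.ops_implementation`. A minimal YAML sketch — the field names come from the table above, the values are just illustrative choices:

```yaml
model:
  ops_implementation:
    attn_implementation: flash_attention_2   # rewritten to the SP-aware variant at config time
    moe_implementation: fused                # Triton group-gemm backend
```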

---

## Lifecycle Overview

```
import veomni                              # (1) import time
 └─ apply_ops_patch()
     ├─ apply_veomni_attention_patch()     # register FA2/3/4 with SP
     ├─ apply_veomni_loss_patch()          # bind cross-entropy kernel
     └─ (MoE patch is NOT applied here)

MODELING_REGISTRY.register()               # (2) model class registration
 └─ gpu_patch files                        # Liger RMSNorm/RoPE/SwiGLU

OpsImplementationConfig.__post_init__()    # (3) config parse time
 └─ rewrite attn_implementation for SP

build_foundation_model(...)                # (4) model build time
 ├─ apply_veomni_fused_moe_patch(backend=) # bind MoE GEMM kernel
 ├─ config._moe_implementation = ...
 └─ model init + weight loading

model.forward()                            # (5) runtime
 ├─ attention: ALL_ATTENTION_FUNCTIONS[config._attn_implementation]
 ├─ loss: _cross_entropy(...)
 └─ MoE: fused_moe_forward(...) or eager loop
```

---

## 1. Attention

### Config

```yaml
model:
  ops_implementation:
    attn_implementation: flash_attention_2  # default
```

**Field:** `OpsImplementationConfig.attn_implementation`

### Available implementations

| Value | Kernel | Sequence Parallel | Requirements |
|-------|--------|:-:|---|
| `eager` | PyTorch | No | — |
| `sdpa` | `F.scaled_dot_product_attention` | No | — |
| `flash_attention_2` | Flash Attention v2 | Yes | `flash-attn` |
| `flash_attention_3` | Flash Attention v3 | Yes | `flash-attn-interface` |
| `flash_attention_4` | Flash Attention v4 | Yes | `flash-attn.cute` |
| `native-sparse` | Sparse attention | No | — |

When `MODELING_BACKEND=veomni` (the default), `__post_init__` automatically
rewrites `flash_attention_2/3/4` to VeOmni SP-aware variants
(`veomni_flash_attention_2_with_sp`, etc.), which wrap the underlying kernel
with DeepSpeed Ulysses sequence parallelism gather/scatter. This is why FA2/3/4
support SP — the rewrite is transparent to the user.

### Selection flow

1. **Config `__post_init__`** — `flash_attention_2` → `veomni_flash_attention_2_with_sp`
2. **`build_foundation_model`** — passed to HuggingFace `AutoModel.from_config(attn_implementation=...)`, stored as `config._attn_implementation`
3. **Import-time registration** — `apply_veomni_attention_patch()` registers the VeOmni names in `ALL_ATTENTION_FUNCTIONS`
4. **Forward** — Transformers dispatches to `flash_attention_forward()` via `ALL_ATTENTION_FUNCTIONS[config._attn_implementation]`
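
The steps above can be condensed into a small, self-contained sketch. The registry name `ALL_ATTENTION_FUNCTIONS` and the implementation names come from this doc; `register_attention`, `post_init_rewrite`, and the function bodies are illustrative stand-ins, not VeOmni's actual code:

```python
# A name -> function registry, standing in for Transformers'
# ALL_ATTENTION_FUNCTIONS mapping.
ALL_ATTENTION_FUNCTIONS = {}

def register_attention(name):
    """Register an attention forward under a string name (sketch)."""
    def wrap(fn):
        ALL_ATTENTION_FUNCTIONS[name] = fn
        return fn
    return wrap

@register_attention("veomni_flash_attention_2_with_sp")
def veomni_fa2_with_sp(*args, **kwargs):
    # Real impl: Ulysses all-to-all gather -> flash-attn kernel -> scatter.
    return "fa2+sp"

# The config-time rewrite: plain FA names become SP-aware VeOmni names.
_SP_REWRITE = {
    "flash_attention_2": "veomni_flash_attention_2_with_sp",
    "flash_attention_3": "veomni_flash_attention_3_with_sp",
    "flash_attention_4": "veomni_flash_attention_4_with_sp",
}

def post_init_rewrite(attn_implementation, backend="veomni"):
    # Mirrors OpsImplementationConfig.__post_init__: only the veomni
    # backend rewrites FA names; eager/sdpa pass through untouched.
    if backend == "veomni":
        return _SP_REWRITE.get(attn_implementation, attn_implementation)
    return attn_implementation

name = post_init_rewrite("flash_attention_2")
print(name)                              # veomni_flash_attention_2_with_sp
print(ALL_ATTENTION_FUNCTIONS[name]())   # fa2+sp
```

At forward time, dispatch is then just a dictionary lookup on the rewritten name stored in `config._attn_implementation`.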

### Key files

- Config: `veomni/arguments/arguments_types.py` — `OpsImplementationConfig`
- Registration: `veomni/ops/flash_attn/__init__.py` — `apply_veomni_attention_patch()`, `flash_attention_forward()`
- Plumbing: `veomni/models/auto.py` — `build_foundation_model(attn_implementation=...)`

---

## 2. Cross-Entropy Loss

### Config

No config field. Controlled by environment variables.

| Env var | Default | Values |
|---------|---------|--------|
| `VEOMNI_USE_LIGER_KERNEL` | `"1"` | `"0"` / `"1"` |
| `VEOMNI_ENABLE_CHUNK_LOSS` | `"0"` | `"0"` / `"1"` (NPU only) |

### Available implementations

| Implementation | When selected |
|---|---|
| `fused_liger_kernel_cross_entropy` | GPU + Liger installed + `VEOMNI_USE_LIGER_KERNEL=1` |
| `eager_cross_entropy` | GPU fallback, or NPU |
| `chunk_loss_function` | NPU + `VEOMNI_ENABLE_CHUNK_LOSS=1` |

### Selection flow

`apply_veomni_loss_patch()` runs at import time and sets the global
`_cross_entropy` function pointer:

1. NPU → `eager_cross_entropy` (+ optional `chunk_loss_function` for `LOSS_MAPPING`)
2. GPU + Liger + env `"1"` → `fused_liger_kernel_cross_entropy`
3. Fallback → `eager_cross_entropy`
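
The precedence above can be sketched as a tiny selector. The helper names (`is_npu`, `is_liger_kernel_available`, `get_env`) are assumptions standing in for VeOmni's real utilities, and the stub implementations exist only to make the sketch runnable:

```python
import os

def is_npu():
    return False  # assume a GPU host for this sketch

def is_liger_kernel_available():
    return False  # pretend liger_kernel is not installed

def get_env(name, default="1"):
    # Stand-in for VeOmni's env registry; real defaults live in
    # veomni/utils/env.py.
    return os.environ.get(name, default)

def eager_cross_entropy(*args, **kwargs): ...
def fused_liger_kernel_cross_entropy(*args, **kwargs): ...

def select_cross_entropy():
    if is_npu():
        return eager_cross_entropy               # (1) NPU is always eager
    if is_liger_kernel_available() and get_env("VEOMNI_USE_LIGER_KERNEL") == "1":
        return fused_liger_kernel_cross_entropy  # (2) GPU + Liger + env "1"
    return eager_cross_entropy                   # (3) fallback

_cross_entropy = select_cross_entropy()
print(_cross_entropy.__name__)  # eager_cross_entropy (Liger unavailable here)
```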

### Key files

- Selection: `veomni/ops/fused_cross_entropy/__init__.py` — `apply_veomni_loss_patch()`
- Eager impl: `veomni/ops/fused_cross_entropy/eager.py`
- Liger impl: `veomni/ops/fused_cross_entropy/liger_kernel.py`

---

## 3. Liger Fused Ops (RMSNorm, RoPE, SwiGLU MLP)

### Config

No config field. Same environment variable as cross-entropy.

| Env var | Default |
|---------|---------|
| `VEOMNI_USE_LIGER_KERNEL` | `"1"` |

### What gets patched

When `VEOMNI_USE_LIGER_KERNEL=1` and the `liger_kernel` package is installed,
each model's `gpu_patch.py` replaces HuggingFace module classes:

| Component | Original | Liger replacement |
|---|---|---|
| RMSNorm | `{Model}RMSNorm` | `LigerRMSNorm` |
| Rotary embedding | `apply_rotary_pos_emb` | `liger_rotary_pos_emb` |
| SwiGLU MLP | `{Model}MLP` | `LigerSwiGLUMLP` |

### Selection flow

Patching happens at model class registration time (import of the model
module). Each model's `gpu_patch.py` checks:

```python
if is_liger_kernel_available() and get_env("VEOMNI_USE_LIGER_KERNEL") == "1":
    hf_module.apply_rotary_pos_emb = liger_rotary_pos_emb
    hf_module.ModelRMSNorm = LigerRMSNorm
    hf_module.ModelMLP = LigerSwiGLUMLP
```

### Models with Liger support

Qwen2, Qwen3, Qwen3-MoE, Qwen2-VL, DeepSeek-V3, Llama, Seed-OSS.

### Key files

- `veomni/models/transformers/{model}/gpu_patch.py` (7 model-specific files)

---

## 4. MoE Kernel

MoE kernel selection is controlled by a single `moe_implementation` field:

```yaml
model:
  ops_implementation:
    moe_implementation: fused          # Triton group-gemm (default fused path)
    # moe_implementation: fused_quack  # Quack CUTLASS/CuTe kernels (SM90+)
    # moe_implementation: eager        # Reference PyTorch loop (very slow, debug only)
```

**Field:** `OpsImplementationConfig.moe_implementation`
**Default:** `None` (falls back to `"eager"` per model config)

| Value | Kernel | Hardware | EP support |
|-------|--------|----------|:----------:|
| `eager` | PyTorch expert loop | Any | No |
| `fused` | Triton group-gemm (`group_gemm_same_nk`) | SM70+ (V100+) | Yes |
| `fused_quack` | Quack CUTLASS/CuTe (`quack.gemm_interface.gemm`) | SM90+ (H100+) | No |
| *(NPU auto)* | NPU group-gemm | Ascend NPU | Yes |

Models only see `_moe_implementation` as `"eager"` or `"fused"` — the
`fused_quack` variant is mapped to `"fused"` on the config, with the kernel
backend selected separately via `apply_veomni_fused_moe_patch`.

On NPU devices, the backend parameter is ignored — the NPU kernel is always
selected.

### Selection flow

Unlike attention and loss, the MoE patch is **not** applied at import time.
It is applied inside `build_foundation_model()`:

```python
def build_foundation_model(..., moe_implementation="fused_quack"):
    config._moe_implementation = "fused"
    apply_veomni_fused_moe_patch(moe_implementation="fused_quack")
```

This deferred approach allows the config to drive kernel selection without
env vars.

### Usage

**Via config (YAML):**

```yaml
model:
  ops_implementation:
    moe_implementation: fused_quack
```

**Via `build_foundation_model` (standalone scripts):**

```python
model = build_foundation_model(
    config_path="...",
    moe_implementation="fused_quack",
)
```

**Direct patch (tests / benchmarks):**

```python
from veomni.ops.fused_moe import apply_veomni_fused_moe_patch
apply_veomni_fused_moe_patch(moe_implementation="fused_quack")
```
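
One way to picture the split between the config-visible value and the kernel backend is a small resolver. `resolve_moe_implementation` is a hypothetical helper written for this doc, not a VeOmni API; it only encodes the mapping rules stated above:

```python
def resolve_moe_implementation(moe_implementation):
    """Return (config_value, backend) for a user-facing setting (sketch)."""
    if moe_implementation is None or moe_implementation == "eager":
        # Reference loop: models run the eager expert loop, no patch needed.
        return "eager", None
    if moe_implementation in ("fused", "fused_quack"):
        # Models only ever see "fused" on the config; the Triton-vs-Quack
        # choice travels separately into
        # apply_veomni_fused_moe_patch(moe_implementation=...).
        return "fused", moe_implementation
    raise ValueError(f"unknown moe_implementation: {moe_implementation!r}")

print(resolve_moe_implementation("fused_quack"))  # ('fused', 'fused_quack')
print(resolve_moe_implementation(None))           # ('eager', None)
```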

### Key files

- Config: `veomni/arguments/arguments_types.py` — `OpsImplementationConfig`
- Dispatch: `veomni/ops/fused_moe/__init__.py` — `apply_veomni_fused_moe_patch()`
- Triton impl: `veomni/ops/fused_moe/group_gemm.py`
- Quack impl: `veomni/ops/fused_moe/quack_gemm.py`
- NPU impl: `veomni/ops/fused_moe/npu_group_gemm.py`
- Plumbing: `veomni/models/auto.py` — `build_foundation_model(moe_implementation=...)`

---

## Environment Variables Summary

| Env var | Default | Scope | Notes |
|---------|---------|-------|-------|
| `MODELING_BACKEND` | `"veomni"` | Global | `"veomni"` or `"hf"` — controls whether VeOmni ops patches are applied |
| `VEOMNI_USE_LIGER_KERNEL` | `"1"` | Global | Controls Liger kernels for RMSNorm/RoPE/SwiGLU + cross-entropy loss |
| `USE_GROUP_GEMM` | `"1"` | MoE | Gate for Triton group-gemm availability; set `"0"` to force fallback |
| `VEOMNI_ENABLE_CHUNK_LOSS` | `"0"` | NPU only | Enable chunked loss computation |

All env vars are registered in `veomni/utils/env.py` with defaults and can be
overridden by setting the corresponding shell environment variable.
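
For example, to disable the Liger kernels and the Triton group-gemm gate for a single run (the env var names come from the table above; how you launch training is up to your setup):

```shell
# Force the eager cross-entropy / unpatched module path for this shell.
export VEOMNI_USE_LIGER_KERNEL=0
# Disable the Triton group-gemm gate so fused MoE falls back.
export USE_GROUP_GEMM=0
echo "$VEOMNI_USE_LIGER_KERNEL $USE_GROUP_GEMM"   # 0 0
```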

docs/index.md

Lines changed: 8 additions & 0 deletions

@@ -47,6 +47,7 @@ hardware_support/get_started_npu.md
 :caption: Examples

 examples/qwen3.md
+examples/qwen3_5.md
 examples/qwen3_moe.md
 examples/qwen3_vl.md
 examples/qwen3_omni_moe.md
@@ -64,6 +65,13 @@ key_features/ulysses.md

 ```

+```{toctree}
+:maxdepth: 1
+:caption: Design
+
+design/kernel_selection.md
+```
+
 ```{toctree}
 :maxdepth: 1
 :caption: Transformers v5 Updates

docs/transformers_v5/veomni_flash_attention_kernel_adapter.md

Lines changed: 10 additions & 6 deletions

@@ -79,9 +79,13 @@ After `import veomni`:
   no-op for those two in practice, but is kept for safety.
 - FA4 (`veomni_flash_attention_4_with_sp`) has no such branch in `_lazy_imports` and
   always falls through to the hub-kernel path in Transformers v5. The adapter is the
-  **critical** component that makes FA4 usable. FA4 is not supported on Transformers v4.
-- FA4 requires the `flash-attn-cute` package (`flash_attn.cute`). To install Transformers v5
-  and FA4 together, run:
-  ```
-  uv sync --extra gpu --extra fa4 --extra transformers5-exp --no-group transformers-stable
-  ```
+  **critical** component that makes FA4 usable on v5.
+- On Transformers v4, FA4 is supported via the VeOmni SP variant
+  (`veomni_flash_attention_4_with_sp`). Instead of the string name, VeOmni passes
+  a `SimpleNamespace` object (from `_load_veomni_local_flash_kernel`) directly to
+  `_lazy_imports`, which v4 accepts in its kernels-fallback branch via `getattr()`.
+  The bare `flash_attention_4` name still requires Transformers v5; for Transformers v4,
+  use `attn_implementation="veomni_flash_attention_4_with_sp"`.
+- FA4 requires the `flash-attn-cute` package (`flash_attn.cute`). To install FA4:
+  - **Transformers v5**: `uv sync --extra gpu --extra fa4 --extra transformers5-exp --no-group transformers-stable`
+  - **Transformers v4**: `uv sync --extra gpu --extra fa4`

docs/usage/arguments.md

Lines changed: 1 addition & 1 deletion

@@ -122,7 +122,7 @@ Root config — assembles `model`, `data`, and `train`.
 | Field | Type | Default | Description |
 | --- | --- | --- | --- |
 | attn_implementation | `Optional[Literal["eager", "sdpa", "flash_attention_2", "flash_attention_3", "flash_attention_4", "native-sparse"]]` | `"flash_attention_2"` | Attention implementation to use. |
-| moe_implementation | `Optional[Literal["eager", "fused"]]` | `None` | MoE implementation to use. |
+| moe_implementation | `Optional[Literal["eager", "fused", "fused_quack"]]` | `None` | MoE implementation: `eager` (reference loop), `fused` (Triton), `fused_quack` (Quack CUTLASS, SM90+). |

 ### DataArguments

pyproject.toml

Lines changed: 4 additions & 8 deletions

@@ -93,6 +93,7 @@ gpu = [
     "torch-c-dlpack-ext",
     # For models with linear attention like Qwen 3.5
     "flash-linear-attention",
+    "quack-kernels==0.3.2",
 ]
 megatron = [
     "megatron-energon>=7.2.1"
@@ -102,7 +103,7 @@ trl = [
 ]

 fa4 = [
-    "flash-attn-cute",
+    "flash-attn-4==0.1.0",
     "nvidia-cutlass-dsl>=4.4.0"
 ]

@@ -210,11 +211,6 @@ conflicts = [
         { group = "transformers-stable" },
         { extra = "transformers5-exp" },
     ],
-    # FA4 only works for transformers v5
-    [
-        { group = "transformers-stable" },
-        { extra = "fa4" },
-    ],
 ]

 [tool.uv.sources]
@@ -244,8 +240,8 @@ flash-attn-3 = [
     { url = "https://github.com/windreamer/flash-attention3-wheels/releases/download/2026.01.12-6b9e0bf/flash_attn_3-3.0.0b1%2B20260112.cu129torch291cxx11abitrue.ea8f73-cp39-abi3-linux_x86_64.whl", marker = "extra == 'gpu'"},
 ]
 # FlashAttention 4 is developed under flash-attention/flash-attn/cute folder as a standalone python project.
-# Pinned to 02/20/2026 latest main commit.
-flash-attn-cute = { git = "https://github.com/Dao-AILab/flash-attention", subdirectory = "flash_attn/cute", rev = "6079a9bf4cfd7af8e7586afea6c49a97ebddf46e" }
+# Pinned to 03/10/2026 latest main commit.
+flash-attn-4 = { git = "https://github.com/Dao-AILab/flash-attention", subdirectory = "flash_attn/cute", rev = "7fd16f28bffe71c9ab6b7eecc5dd14bf87c1dc9e" }

 # Download av wheel directly to avoid FFmpeg build dependency issues in CI.
 av = { url = "https://files.pythonhosted.org/packages/f8/9a/8ffabfcafb42154b4b3a67d63f9b69e68fa8c34cb39ddd5cb813dd049ed4/av-14.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", marker = "extra == 'audio' or extra == 'video'" }
