Commit 075f65d: merge main

2 parents d45865c + ae0ac10

63 files changed (+2465, -174 lines)

README.md

Lines changed: 16 additions & 1 deletion
@@ -21,7 +21,9 @@ LightLLM is a Python-based LLM (Large Language Model) inference and serving fram
 [English Docs](https://lightllm-en.readthedocs.io/en/latest/) | [中文文档](https://lightllm-cn.readthedocs.io/en/latest/) | [Blogs](https://modeltc.github.io/lightllm-blog/)
 
 ## News
-- [2025/05] LightLLM paper on constrained decoding accepted by [ACL25](https://arxiv.org/pdf/2506.03887) (Pre $^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation). For a more accessible overview of the research with key insights and examples, check out our blog post: [LightLLM Blog](https://www.light-ai.top/lightllm-blog/2025/06/15/pre3.html)
+- [2025/09] 🔥 LightLLM [v1.1.0](https://www.light-ai.top/lightllm-blog/2025/09/03/lightllm.html) release!
+- [2025/08] Pre $^3$ achieves the outstanding paper award of [ACL2025](https://2025.aclweb.org/program/awards/).
+- [2025/05] LightLLM paper on constrained decoding accepted by [ACL2025](https://arxiv.org/pdf/2506.03887) (Pre $^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation). For a more accessible overview of the research with key insights and examples, check out our blog post: [LightLLM Blog](https://www.light-ai.top/lightllm-blog/2025/06/15/pre3.html)
 - [2025/04] LightLLM paper on request scheduler published in [ASPLOS’25](https://dl.acm.org/doi/10.1145/3676641.3716011) (Past-Future Scheduler for LLM Serving under SLA Guarantees)
 - [2025/02] 🔥 LightLLM v1.0.0 release, achieving the **fastest DeepSeek-R1** serving performance on single H200 machine.
@@ -90,6 +92,19 @@ We learned a lot from the following projects when developing LightLLM.
 
 We have published a number of papers around components or features of LightLLM, if you use LightLLM in your work, please consider citing the relevant paper.
 
+**constrained decoding**: accepted by [ACL2025](https://arxiv.org/pdf/2506.03887) and achieved the outstanding paper award.
+```bibtex
+@inproceedings{
+anonymous2025pre,
+title={Pre\${\textasciicircum}3\$: Enabling Deterministic Pushdown Automata for Faster Structured {LLM} Generation},
+author={Anonymous},
+booktitle={Submitted to ACL Rolling Review - February 2025},
+year={2025},
+url={https://openreview.net/forum?id=g1aBeiyZEi},
+note={under review}
+}
+```
+
 **Request scheduler**: accepted by [ASPLOS’25](https://dl.acm.org/doi/10.1145/3676641.3716011):
 ```bibtex
 @inproceedings{gong2025past,

docker/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -39,8 +39,8 @@ RUN pip install -r /lightllm/requirements.txt --no-cache-dir
 
 RUN pip install --no-cache-dir vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 
-RUN git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v . && \
-    cd flash-attention/hopper/ && python setup.py install
+RUN pip install https://github.com/ModelTC/LightKernel/releases/download/v1.0.1/lightllm_kernel-0.1.0-cp310-cp310-linux_x86_64.whl && \
+    pip install https://github.com/ModelTC/LightKernel/releases/download/v1.0.1/flash_attn_3-3.0.0b1-cp39-abi3-linux_x86_64.whl
 
 RUN apt-get update && apt-get install -y libnuma-dev # for sgl_kernel

docker/Dockerfile.deepep

Lines changed: 2 additions & 2 deletions
@@ -39,8 +39,8 @@ RUN pip install -r /lightllm/requirements.txt --no-cache-dir
 
 RUN pip install --no-cache-dir vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 
-RUN git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v . && \
-    cd flash-attention/hopper/ && python setup.py install
+RUN pip install https://github.com/ModelTC/LightKernel/releases/download/v1.0.1/lightllm_kernel-0.1.0-cp310-cp310-linux_x86_64.whl && \
+    pip install https://github.com/ModelTC/LightKernel/releases/download/v1.0.1/flash_attn_3-3.0.0b1-cp39-abi3-linux_x86_64.whl
 
 RUN apt-get update && apt-get install -y libnuma-dev wget devscripts debhelper dh-make build-essential dkms
 RUN apt-get install -y ibverbs-providers infiniband-diags perftest rdma-core libibverbs-dev librdmacm-dev
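Both Dockerfiles now install prebuilt LightKernel wheels instead of building from source. Below is a minimal sketch, not part of the commit, for checking whether the wheel tags above match the interpreter inside the image; the filenames are copied from the diff, and the availability of the `packaging` helpers in the build environment is an assumption:

```python
# Hedged sanity check: do the prebuilt LightKernel wheel tags match this interpreter?
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

WHEELS = [
    "lightllm_kernel-0.1.0-cp310-cp310-linux_x86_64.whl",
    "flash_attn_3-3.0.0b1-cp39-abi3-linux_x86_64.whl",
]

supported = set(sys_tags())  # all tags the running interpreter accepts
for filename in WHEELS:
    _, _, _, tags = parse_wheel_filename(filename)
    status = "compatible" if tags & supported else "NOT compatible"
    print(f"{filename}: {status}")
```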

lightllm/common/basemodel/basemodel.py

Lines changed: 8 additions & 9 deletions
@@ -7,6 +7,7 @@
 import torch
 import torch.nn.functional as F
 from typing import final
+from tqdm import tqdm
 
 from lightllm.common.basemodel.layer_weights.hf_load_utils import load_hf_weights
 from lightllm.common.basemodel.infer_struct import InferStateInfo
@@ -24,8 +25,10 @@
 from lightllm.utils.envs_utils import get_env_start_args
 from lightllm.distributed.communication_op import dist_group_manager
 from lightllm.common.basemodel.batch_objs import ModelInput, ModelOutput
+from lightllm.common.triton_utils.autotuner import AutotuneLevel
 from lightllm.utils.custom_kernel_utis import pad2dim_tensor_to_new_batch
-from lightllm.utils.envs_utils import set_model_init_status, is_triton_autotune_enabled, disable_triton_autotune
+from lightllm.utils.envs_utils import set_model_init_status
+from lightllm.common.triton_utils.autotuner import Autotuner
 from lightllm.utils.infer_utils import post_empty_cache
 
 logger = init_logger(__name__)
@@ -731,12 +734,10 @@ def autotune_layers(self):
     @torch.no_grad()
     @post_empty_cache
     def _autotune_warmup(self):
-        if not is_triton_autotune_enabled():
-            return
-
+        Autotuner.start_autotune_warmup()
         torch.distributed.barrier()
 
-        warmup_lengths = [1, 8, 16, 64, 128, 256, 1024, 2048, 4096]
+        warmup_lengths = [1, 8, 16, 32, 64, 100, 128, 256, 1024, 2048, 4096]
 
         if self.batch_max_tokens not in warmup_lengths:
             warmup_lengths.append(self.batch_max_tokens)
@@ -747,9 +748,8 @@ def _autotune_warmup(self):
 
         layer_num_bak = self.layers_num
         self.layers_num = self.autotune_layers()
-        for input_len in warmup_lengths:
+        for input_len in tqdm(warmup_lengths, desc="warming up"):
             try:
-                logger.info(f"autotune warmup for length {input_len}")
                 rand_gen = torch.Generator(device="cuda")
                 rand_gen.manual_seed(input_len)
                 dummy_input_ids = torch.randint(
@@ -784,7 +784,6 @@ def _autotune_warmup(self):
                 self.mem_manager.free_all()
                 gc.collect()
                 torch.cuda.empty_cache()
-                logger.info(f"autotune warmup for length {input_len} ok")
             except Exception as e:
                 logger.warning(f"autotune warmup for length {input_len} failed: {str(e)}")
                 logger.exception(str(e))
@@ -794,7 +793,7 @@ def _autotune_warmup(self):
         torch.cuda.empty_cache()
         self.layers_num = layer_num_bak
         torch.distributed.barrier()
-        disable_triton_autotune()
+        Autotuner.end_autotune_warmup()
 
     @final
     @torch.no_grad()
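A minimal sketch of the new warmup bracket, assuming only the Autotuner classmethods visible in this commit (start_autotune_warmup, is_autotune_warmup, end_autotune_warmup); dummy_forward below is a hypothetical stand-in for the randint-based dummy prefill that _autotune_warmup actually builds:

```python
from lightllm.common.triton_utils.autotuner import Autotuner


def dummy_forward(input_len: int) -> None:
    # hypothetical placeholder for the dummy prefill pass run at each warmup length
    pass


def autotune_warmup(warmup_lengths):
    Autotuner.start_autotune_warmup()  # presumably switches Triton kernels into tuning mode
    try:
        for input_len in warmup_lengths:
            dummy_forward(input_len)
    finally:
        Autotuner.end_autotune_warmup()  # kernels fall back to their cached configs afterwards
```

Kernel-side code can branch on `Autotuner.is_autotune_warmup()`, as `fused_moe_weight_ep.py` does below.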

lightllm/common/basemodel/layer_weights/meta_weights/fused_moe_weight_ep.py

Lines changed: 10 additions & 12 deletions
@@ -4,7 +4,11 @@
 from typing import Optional, Tuple, List, Dict, Any
 from lightllm.utils.dist_utils import get_global_world_size, get_global_rank, get_current_device_id
 from .base_weight import BaseWeight
-from lightllm.common.fused_moe.grouped_fused_moe_ep import fused_experts_impl, masked_group_gemm
+from lightllm.common.fused_moe.grouped_fused_moe_ep import (
+    fused_experts_impl,
+    masked_group_gemm,
+    _deepgemm_grouped_fp8_nt_contiguous,
+)
 from lightllm.common.fused_moe.moe_silu_and_mul import silu_and_mul_fwd
 from lightllm.distributed import dist_group_manager
 from lightllm.common.fused_moe.topk_select import select_experts
@@ -17,15 +21,11 @@
 )
 from lightllm.common.fused_moe.deepep_scatter_gather import ep_scatter, ep_gather
 from lightllm.common.basemodel.triton_kernel.redundancy_topk_ids_repair import redundancy_topk_ids_repair
-from lightllm.utils.envs_utils import is_triton_autotune_enabled
 from lightllm.utils.log_utils import init_logger
+from lightllm.common.triton_utils.autotuner import Autotuner
 
-logger = init_logger(__name__)
 
-try:
-    import deep_gemm
-except:
-    logger.warning("no deepep or deep_gemm")
+logger = init_logger(__name__)
 
 
 class FusedMoeWeightEP(BaseWeight):
@@ -335,7 +335,7 @@ def prefilled_group_gemm(
         # groupgemm (contiguous layout)
         gemm_out_a = torch.empty((all_tokens, N), device=device, dtype=hidden_dtype)
 
-        deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(input_tensor, (w1, w1_scale), gemm_out_a, m_indices)
+        _deepgemm_grouped_fp8_nt_contiguous(input_tensor, (w1, w1_scale), gemm_out_a, m_indices)
 
         # silu_and_mul_fwd + qaunt
         # TODO fused kernel
@@ -349,16 +349,14 @@ def prefilled_group_gemm(
         # groupgemm (contiguous layout)
         gemm_out_b = torch.empty((all_tokens, K), device=device, dtype=hidden_dtype)
 
-        deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(
-            (qsilu_out, qsilu_out_scale), (w2, w2_scale), gemm_out_b, m_indices
-        )
+        _deepgemm_grouped_fp8_nt_contiguous((qsilu_out, qsilu_out_scale), (w2, w2_scale), gemm_out_b, m_indices)
         # gather and local reduce
         ep_gather(gemm_out_b, recv_topk_idx, recv_topk_weights, output_index, gather_out)
     else:
         ######################################## warning ##################################################
         # here is used to match autotune feature, make moe model run same triton kernel in different rank.
         # in some special case, one rank will recv 0 token, so add a token to make it run triton kernel.
-        if is_triton_autotune_enabled():
+        if Autotuner.is_autotune_warmup():
            _gemm_out_a = torch.zeros((1, N), device=device, dtype=hidden_dtype)
            _silu_out = torch.zeros((1, N // 2), device=device, dtype=hidden_dtype)
            silu_and_mul_fwd(_gemm_out_a.view(-1, N), _silu_out)

lightllm/common/basemodel/triton_kernel/apply_penalty.py

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ def _fwd_kernel_apply_penalty(
     cur_eos_logit_ptr = Logits + cur_batch * stride_logit_b + eos_id
     cur_eos_logit = tl.load(cur_eos_logit_ptr)
     cur_eos_logit = cur_eos_logit + tl.abs(cur_eos_logit) * penalty_scale
-    cur_eos_logit = tl.where(mask_eos, -10000000.0, cur_eos_logit)
+    cur_eos_logit = tl.where(mask_eos != 0, -10000000.0, cur_eos_logit)
     tl.store(cur_eos_logit_ptr, cur_eos_logit)
     return

lightllm/common/basemodel/triton_kernel/apply_penalty_gpu_cache.py

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ def _eos_penalty(
     cur_eos_logit_ptr = Logits + offs * stride_logit_b + eos_id
     cur_eos_logit = tl.load(cur_eos_logit_ptr, mask=mask, other=0.0)
     cur_eos_logit = cur_eos_logit + tl.abs(cur_eos_logit) * penalty_scale
-    cur_eos_logit = tl.where(mask_eos, -10000000.0, cur_eos_logit)
+    cur_eos_logit = tl.where(mask_eos != 0, -10000000.0, cur_eos_logit)
     tl.store(cur_eos_logit_ptr, cur_eos_logit, mask=mask)
     return

lightllm/common/basemodel/triton_kernel/gather_token_id.py

Lines changed: 4 additions & 0 deletions
@@ -16,6 +16,7 @@ def _fwd_kernel_scatter(
     num_size,
     HAS_OUT_IS_NONE: tl.constexpr,
     BLOCK: tl.constexpr,
+    OLD_VERSION_TRITON: tl.constexpr,
 ):
     block_index = tl.program_id(0)
     block_range = block_index * BLOCK + tl.arange(0, BLOCK)
@@ -27,6 +28,8 @@
 
     if not HAS_OUT_IS_NONE:
         cur_has_out = tl.load(b_has_out + block_range, mask=block_mask, other=False)
+        if OLD_VERSION_TRITON:
+            cur_has_out = cur_has_out != 0
     tl.store(
         req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index,
         cur_next_token_id,
@@ -76,6 +79,7 @@
         num_size=batch_size,
         HAS_OUT_IS_NONE=b_has_out is None,
         BLOCK=BLOCK,
+        OLD_VERSION_TRITON=triton.__version__ < "3.2.0",
         num_warps=num_warps,
         num_stages=1,
     )

lightllm/common/basemodel/triton_kernel/gen_sampling_params.py

Lines changed: 4 additions & 0 deletions
@@ -125,6 +125,7 @@ def _token_id_counter_update_kernel(
     batch_size,
     HAS_MASK: tl.constexpr,
     BLOCK: tl.constexpr,
+    OLD_VERSION_TRITON: tl.constexpr,
 ):
 
     block_start_index = tl.program_id(0) * BLOCK
@@ -136,6 +137,8 @@
 
     if HAS_MASK:
         mask = tl.load(mask_ptr + offs, mask=loc_mask, other=False)
+        if OLD_VERSION_TRITON:
+            mask = mask != 0
     tl.atomic_add(
         req_to_out_token_id_counter_ptr + req_idx * counter_stride_m + token_ids * counter_stride_n,
         1,
@@ -170,6 +173,7 @@ def update_req_to_token_id_counter(
         batch_size=batch_size,
         HAS_MASK=has_mask,
         BLOCK=BLOCK,
+        OLD_VERSION_TRITON=triton.__version__ < "3.2.0",
         num_warps=1,
     )
     return
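The OLD_VERSION_TRITON flag added to the two kernels above (and the `mask_eos != 0` change in the two penalty kernels) normalizes values loaded from boolean tensors before they are used as conditions on Triton releases older than 3.2.0. A minimal sketch of the same version-gating pattern; the kernel and wrapper names here are illustrative, not from the repository:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _zero_where_masked_kernel(x_ptr, mask_ptr, out_ptr, n, BLOCK: tl.constexpr, OLD_VERSION_TRITON: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    in_range = offs < n
    m = tl.load(mask_ptr + offs, mask=in_range, other=0)
    if OLD_VERSION_TRITON:
        # older Triton hands back the raw integer storage of a bool tensor here,
        # so normalize it before using it as a condition
        m = m != 0
    x = tl.load(x_ptr + offs, mask=in_range, other=0.0)
    out = tl.where(m, 0.0, x)  # zero out the masked positions, keep the rest
    tl.store(out_ptr + offs, out, mask=in_range)


def zero_where_masked(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    BLOCK = 256
    grid = (triton.cdiv(x.numel(), BLOCK),)
    _zero_where_masked_kernel[grid](
        x, mask, out, x.numel(), BLOCK=BLOCK,
        OLD_VERSION_TRITON=triton.__version__ < "3.2.0",  # same gate as in the diffs above
    )
    return out
```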

lightllm/common/fused_moe/grouped_fused_moe.py

Lines changed: 36 additions & 22 deletions
@@ -332,6 +332,7 @@ grouped_matmul_kernel(
     GROUP_SIZE_M: tl.constexpr,
     MUL_ROUTED_WEIGHT: tl.constexpr = False,
     NEED_K_MASK: tl.constexpr = True,
+    NEED_TRANS: tl.constexpr = False,
 ):
     pid = tl.program_id(0)
 
@@ -367,13 +368,6 @@
             mask=token_mask,
             other=0,
         )
-        if MUL_ROUTED_WEIGHT:
-            a_m_scale = tl.load(
-                expert_to_weights_ptr + expert_id * expert_to_weights_stride0 + offs_am,
-                mask=token_mask,
-                other=0.0,
-            )
-
         offs_bn = (tile_n_idx * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % n
         offs_k = tl.arange(0, BLOCK_SIZE_K)
 
@@ -387,7 +381,7 @@
             b_scale = tl.load(weight_scale_ptr + expert_id, eviction_policy="evict_last")
             ab_scale = a_scale * b_scale
 
-        if use_fp8_w8a8:
+        if NEED_TRANS:
             a_ptrs = token_ptr + (a_m_index // topk_num)[None, :] * token_stride_0 + offs_k[:, None]
             b_ptrs = weights_ptr + weight_stride_0 * expert_id + offs_k[None, :] + offs_bn[:, None] * weight_stride_1
             accumulator = tl.zeros((BLOCK_SIZE_N, BLOCK_SIZE_M), dtype=tl.float32)
@@ -401,16 +395,20 @@
             # tl.multiple_of(a_ptrs, [16, 16])
             # tl.multiple_of(b_ptrs, [16, 16])
 
-            if use_fp8_w8a8:
+            if NEED_TRANS:
                 if NEED_K_MASK:
-                    a = tl.load(a_ptrs, mask=(token_mask[None, :]) & (offs_k[:, None] < k), other=0.0)
+                    a = tl.load(
+                        a_ptrs, mask=(token_mask[None, :]) & (offs_k[:, None] < k - step_k * BLOCK_SIZE_K), other=0.0
+                    )
                     b = tl.load(b_ptrs, mask=(offs_k[None, :] < k), other=0.0)
                 else:
                     a = tl.load(a_ptrs, mask=(token_mask[None, :]), other=0.0)
                     b = tl.load(b_ptrs)
             else:
                 if NEED_K_MASK:
-                    a = tl.load(a_ptrs, mask=(token_mask[:, None]) & (offs_k[None, :] < k), other=0.0)
+                    a = tl.load(
+                        a_ptrs, mask=(token_mask[:, None]) & (offs_k[None, :] < k - step_k * BLOCK_SIZE_K), other=0.0
+                    )
                     b = tl.load(b_ptrs, mask=(offs_k[:, None] < k), other=0.0)
                 else:
                     a = tl.load(a_ptrs, mask=(token_mask[:, None]), other=0.0)
@@ -421,24 +419,34 @@
                     offs_ks = step_k * BLOCK_SIZE_K // block_size_k
                     a_scale = tl.load(a_scale_ptrs + offs_ks, mask=token_mask, other=0.0)
                     b_scale = tl.load(b_scale_ptrs + offs_ks * weight_scale_stride2)
-                    accumulator += tl.dot(b, a) * b_scale[:, None] * a_scale[None, :]
+                    if NEED_TRANS:
+                        accumulator += tl.dot(b, a) * b_scale[:, None] * a_scale[None, :]
+                    else:
+                        accumulator += tl.dot(a, b) * a_scale[:, None] * b_scale[None, :]
                 else:
-                    accumulator = tl.dot(b, a, acc=accumulator)
+                    if NEED_TRANS:
+                        accumulator = tl.dot(b, a, acc=accumulator)
+                    else:
+                        accumulator = tl.dot(a, b, acc=accumulator)
             else:
                 accumulator += tl.dot(a, b)
 
             a_ptrs += BLOCK_SIZE_K
             b_ptrs += BLOCK_SIZE_K
-            offs_k += BLOCK_SIZE_K
+
+        if NEED_TRANS:
+            accumulator = accumulator.T
 
         if use_fp8_w8a8:
-            if block_size_k > 0 and block_size_n > 0:
-                accumulator = accumulator.T
-            else:
-                accumulator = accumulator.T
+            if not (block_size_k > 0 and block_size_n > 0):
                accumulator *= ab_scale
 
        if MUL_ROUTED_WEIGHT:
+            a_m_scale = tl.load(
+                expert_to_weights_ptr + expert_id * expert_to_weights_stride0 + offs_am,
+                mask=token_mask,
+                other=0.0,
+            )
            accumulator *= a_m_scale[:, None]
 
        c = accumulator.to(compute_type)
@@ -478,13 +486,15 @@ def _get_grouped_matmul_configs():
            "GROUP_SIZE_M": gm,
            "num_warps": nw,
            "num_stages": ns,
+            "NEED_TRANS": need_trans,
        }
-        for ns in [1, 2, 3, 4, 5]
-        for gm in [1, 2, 4, 8]
-        for nw in [2, 4, 8]
+        for ns in [2, 3, 4, 5]
+        for gm in [1, 16, 32, 64]
+        for nw in [4, 8]
        for bm in [16, 32, 64, 128]
        for bn in [16, 32, 64, 128]
-        for bk in [16, 32, 64, 128]
+        for bk in [32, 64, 128]
+        for need_trans in [True, False]
    ]
 
 
@@ -559,6 +569,9 @@ def grouped_matmul(
     GROUP_SIZE_M = run_config["GROUP_SIZE_M"]
     num_warps = run_config["num_warps"]
     num_stages = run_config["num_stages"]
+    NEED_TRANS = run_config.get("NEED_TRANS", False)
+    if not use_fp8_w8a8:
+        assert NEED_TRANS is False, "only use_fp8_w8a8 mode can use NEED_TRANS to accelerate"
 
     if block_size_k != 0:
         # 如果使用了 block wise 量化,分块大小不能超过 block size
@@ -638,6 +651,7 @@ def grouped_matmul(
         GROUP_SIZE_M=GROUP_SIZE_M,
         MUL_ROUTED_WEIGHT=mul_routed_weight,
         NEED_K_MASK=NEED_K_MASK,
+        NEED_TRANS=NEED_TRANS,
         num_warps=num_warps,
         num_stages=num_stages,
     )
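The new NEED_TRANS config appears to keep the transposed operand layout previously hard-wired to the fp8 path (tl.dot(b, a) on transposed tiles followed by a final accumulator.T), while the default path uses the plain tl.dot(a, b). A minimal sketch of the identity this relies on, with illustrative shapes only:

```python
import torch

# The swapped-and-transposed product equals the plain product: (B^T @ A^T)^T == A @ B.
M, K, N = 16, 32, 8
a = torch.randn(M, K)  # activation tile (BLOCK_SIZE_M x BLOCK_SIZE_K)
b = torch.randn(K, N)  # weight tile (BLOCK_SIZE_K x BLOCK_SIZE_N)

plain = a @ b            # NEED_TRANS=False path: tl.dot(a, b)
swapped = (b.T @ a.T).T  # NEED_TRANS=True path: tl.dot(b, a), then accumulator.T

assert torch.allclose(plain, swapped, atol=1e-5)
```

The assert added in grouped_matmul restricts NEED_TRANS to the use_fp8_w8a8 mode, matching the fp8-only layout described above.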
