
Commit 40471bf

Update documentation index (#1374)
1 parent f4e1420 commit 40471bf

19 files changed: 938 additions, 437 deletions

docs/api/attention.rst

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
.. _apiattention:

FlashInfer Attention Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

flashinfer.decode
=================

.. currentmodule:: flashinfer.decode

Single Request Decoding
-----------------------

.. autosummary::
   :toctree: ../generated

   single_decode_with_kv_cache

Batch Decoding
--------------

.. autosummary::
   :toctree: ../generated

   cudnn_batch_decode_with_kv_cache
   trtllm_batch_decode_with_kv_cache

.. autoclass:: BatchDecodeWithPagedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__

.. autoclass:: CUDAGraphBatchDecodeWithPagedKVCacheWrapper
   :members:

   .. automethod:: __init__
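For orientation, single-request decoding attends one new query token over the whole KV cache. Below is a minimal NumPy sketch of those semantics; the helper name and tensor layouts are assumptions for illustration, not FlashInfer's API (the real kernels are fused CUDA kernels that also handle grouped heads, paged caches, and custom masks):

```python
import numpy as np

def decode_attention_ref(q, k, v):
    """Illustrative reference for single-query decode attention.

    q: [num_heads, head_dim] -- the one new token's query.
    k, v: [kv_len, num_heads, head_dim] -- the KV cache.
    Returns: [num_heads, head_dim].
    """
    head_dim = q.shape[-1]
    # scores[h, t] = <q[h], k[t, h]> / sqrt(head_dim)
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(head_dim)
    # numerically stable softmax over the kv_len axis
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # attention output: probability-weighted sum of values
    return np.einsum("ht,thd->hd", probs, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))        # 8 heads, head_dim 64
k = rng.standard_normal((128, 8, 64))   # 128 cached tokens
v = rng.standard_normal((128, 8, 64))
out = decode_attention_ref(q, k, v)
```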
flashinfer.prefill
==================

Attention kernels for prefill and append attention, in both single-request and batch-serving settings.

.. currentmodule:: flashinfer.prefill

Single Request Prefill/Append Attention
---------------------------------------

.. autosummary::
   :toctree: ../generated

   single_prefill_with_kv_cache
   single_prefill_with_kv_cache_return_lse

Batch Prefill/Append Attention
------------------------------

.. autosummary::
   :toctree: ../generated

   cudnn_batch_prefill_with_kv_cache
   trtllm_batch_context_with_kv_cache

.. autoclass:: BatchPrefillWithPagedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__

.. autoclass:: BatchPrefillWithRaggedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__
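Unlike decoding, prefill/append processes many query tokens at once under a causal mask; in the append case the queries sit at the end of a longer KV sequence. A NumPy sketch of those semantics (names and layouts are illustrative assumptions, not the library's signatures):

```python
import numpy as np

def prefill_attention_ref(q, k, v, causal=True):
    """Illustrative reference for single-request prefill/append attention.

    q: [qo_len, num_heads, head_dim]; k, v: [kv_len, num_heads, head_dim].
    With causal=True, query i may attend only to kv positions
    j <= i + (kv_len - qo_len), which covers append (qo_len < kv_len).
    """
    qo_len, _, head_dim = q.shape
    kv_len = k.shape[0]
    scores = np.einsum("ihd,jhd->hij", q, k) / np.sqrt(head_dim)
    if causal:
        i = np.arange(qo_len)[:, None]
        j = np.arange(kv_len)[None, :]
        mask = j > i + (kv_len - qo_len)          # future positions
        scores = np.where(mask[None, :, :], -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hij,jhd->ihd", probs, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 2, 16))   # 4 appended tokens, 2 heads
k = rng.standard_normal((10, 2, 16))  # 6 cached + 4 new positions
v = rng.standard_normal((10, 2, 16))
out = prefill_attention_ref(q, k, v)
```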
flashinfer.mla
==============

MLA (Multi-head Latent Attention) is the attention mechanism introduced in the DeepSeek series of models
(`DeepSeek-V2 <https://arxiv.org/abs/2405.04434>`_, `DeepSeek-V3 <https://arxiv.org/abs/2412.19437>`_,
and `DeepSeek-R1 <https://arxiv.org/abs/2501.12948>`_).

.. currentmodule:: flashinfer.mla

PageAttention for MLA
---------------------

.. autoclass:: BatchMLAPagedAttentionWrapper
   :members:

   .. automethod:: __init__
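The core idea of MLA, very roughly sketched below in NumPy (glossing over DeepSeek's exact parametrization, e.g. the RoPE/no-RoPE head split): instead of caching full per-head K/V, cache one small latent vector per token and up-project it to K/V at attention time. All names and sizes here are hypothetical:

```python
import numpy as np

# Hypothetical sizes for illustration only.
hidden, latent, num_heads, head_dim, kv_len = 256, 64, 4, 32, 10

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((hidden, latent)) * 0.05           # down-projection
W_uk = rng.standard_normal((latent, num_heads * head_dim)) * 0.05
W_uv = rng.standard_normal((latent, num_heads * head_dim)) * 0.05

x = rng.standard_normal((kv_len, hidden))
c_kv = x @ W_dkv                   # [kv_len, latent] -- this is what gets cached
k = (c_kv @ W_uk).reshape(kv_len, num_heads, head_dim)
v = (c_kv @ W_uv).reshape(kv_len, num_heads, head_dim)

# The cache stores `latent` floats per token instead of
# 2 * num_heads * head_dim -- here 64 instead of 256.
```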

docs/api/comm.rst

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
.. _apicomm:

flashinfer.comm
===============

.. currentmodule:: flashinfer.comm

This module provides communication primitives and utilities for distributed computing, including CUDA IPC, AllReduce operations, and memory management utilities.

CUDA IPC Utilities
------------------

.. autosummary::
   :toctree: ../generated

   CudaRTLibrary
   create_shared_buffer
   free_shared_buffer

DLPack Utilities
----------------

.. autosummary::
   :toctree: ../generated

   pack_strided_memory

Mapping Utilities
-----------------

.. autosummary::
   :toctree: ../generated

   Mapping

TensorRT-LLM AllReduce
----------------------

Types and Enums
~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   AllReduceFusionOp
   AllReduceFusionPattern
   AllReduceStrategyConfig
   AllReduceStrategyType
   FP4QuantizationSFLayout

Core Operations
~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_allreduce_fusion
   trtllm_custom_all_reduce
   trtllm_moe_allreduce_fusion
   trtllm_moe_finalize_allreduce_fusion

Workspace Management
~~~~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_create_ipc_workspace_for_all_reduce
   trtllm_create_ipc_workspace_for_all_reduce_fusion
   trtllm_destroy_ipc_workspace_for_all_reduce
   trtllm_destroy_ipc_workspace_for_all_reduce_fusion

Initialization and Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_lamport_initialize
   trtllm_lamport_initialize_all
   compute_fp4_swizzled_layout_sf_size

vLLM AllReduce
--------------

.. autosummary::
   :toctree: ../generated

   vllm_all_reduce
   vllm_dispose
   vllm_init_custom_ar
   vllm_register_buffer
   vllm_register_graph_buffers
   vllm_get_graph_buffer_ipc_meta
   vllm_meta_size

MNNVL (Multi-Node NVLink)
-------------------------

.. currentmodule:: flashinfer.comm.mnnvl

Core Classes
~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   MnnvlMemory
   McastGPUBuffer

Utility Functions
~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   create_tensor_from_cuda_memory
   alloc_and_copy_to_cuda

TensorRT-LLM MNNVL AllReduce
----------------------------

.. currentmodule:: flashinfer.comm.trtllm_mnnvl_ar

.. autosummary::
   :toctree: ../generated

   trtllm_mnnvl_all_reduce
   trtllm_mnnvl_fused_allreduce_rmsnorm
   mpi_barrier
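Several of the fused operations above combine a cross-rank sum with a normalization step. A plain-NumPy sketch of what an AllReduce-plus-RMSNorm fusion computes, simulating ranks as list entries (the real ops move data over NVLink/IPC buffers; the function name here is an assumption):

```python
import numpy as np

def allreduce_rmsnorm_ref(shards, weight, eps=1e-6):
    """Illustrative semantics of a fused AllReduce + RMSNorm.

    shards: per-"rank" partial activations, each [tokens, hidden].
    After the sum-allreduce, every rank holds the same normalized tensor.
    """
    x = np.sum(shards, axis=0)                               # allreduce (sum)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight                                # rmsnorm

rng = np.random.default_rng(0)
shards = [rng.standard_normal((3, 8)) for _ in range(4)]     # 4 simulated ranks
weight = np.ones(8)
out = allreduce_rmsnorm_ref(shards, weight)
```

Fusing the two steps saves one round trip through global memory per token, which is why the fused variant exists alongside the plain allreduce.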

docs/api/decode.rst

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/api/fp4_quantization.rst

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
.. _apifp4_quantization:

flashinfer.fp4_quantization
===========================

.. currentmodule:: flashinfer.fp4_quantization

This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.

Core Quantization Functions
---------------------------

.. autosummary::
   :toctree: ../generated

   fp4_quantize
   nvfp4_quantize
   nvfp4_block_scale_interleave
   e2m1_and_ufp8sf_scale_to_float

Matrix Shuffling Utilities
--------------------------

.. autosummary::
   :toctree: ../generated

   shuffle_matrix_a
   shuffle_matrix_sf_a

Types and Enums
---------------

.. autosummary::
   :toctree: ../generated

   SfLayout
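For intuition about what block-scaled FP4 quantization does: the e2m1 format can represent only a handful of magnitudes, so values are quantized in small blocks, each sharing one scale factor chosen so the block's largest value maps to e2m1's maximum (6.0). A NumPy sketch of that idea (illustrative only; it ignores the library's packed storage and swizzled scale-factor layouts, and the helper name is hypothetical):

```python
import numpy as np

# Non-negative magnitudes representable in e2m1 (sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_block_quantize_ref(x, block=16):
    """Quantize a 1-D array in blocks of `block` elements: one scale per
    block, magnitudes snapped to the nearest e2m1 grid point. Returns the
    dequantized result so the rounding error is easy to inspect."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0   # amax -> 6.0
    scale = np.where(scale == 0, 1.0, scale)             # all-zero blocks
    scaled = x / scale
    # snap |scaled| to the nearest representable magnitude
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q * scale                                     # dequantize

x = np.linspace(-1.0, 1.0, 32)
xq = fp4_block_quantize_ref(x)
```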

docs/api/fused_moe.rst

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
.. _apifused_moe:

flashinfer.fused_moe
====================

.. currentmodule:: flashinfer.fused_moe

This module provides fused Mixture-of-Experts (MoE) operations optimized for different backends and data types.

Types and Enums
---------------

.. autosummary::
   :toctree: ../generated

   RoutingMethodType
   WeightLayout

Utility Functions
-----------------

.. autosummary::
   :toctree: ../generated

   convert_to_block_layout
   reorder_rows_for_gated_act_gemm

CUTLASS Fused MoE
-----------------

.. autosummary::
   :toctree: ../generated

   cutlass_fused_moe

TensorRT-LLM Fused MoE
----------------------

.. autosummary::
   :toctree: ../generated

   trtllm_fp4_block_scale_moe
   trtllm_fp8_block_scale_moe
   trtllm_fp8_per_tensor_scale_moe
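A fused MoE op bundles routing, per-expert matmuls, and the weighted combine into one kernel. A simplified NumPy reference of the overall computation (softmax top-k routing over a naive ReLU MLP per expert; real models typically use gated activations, and all names here are illustrative):

```python
import numpy as np

def moe_ref(x, w1, w2, gate, top_k=2):
    """Illustrative MoE semantics: route each token to its top_k experts,
    run each expert's MLP, and combine with softmax routing weights.

    x: [tokens, hidden]; gate: [hidden, num_experts];
    w1: [num_experts, hidden, inner]; w2: [num_experts, inner, hidden].
    """
    logits = x @ gate                                  # [tokens, experts]
    top = np.argsort(-logits, axis=-1)[:, :top_k]      # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                   # softmax over top_k
        for weight, e in zip(w, top[t]):
            h = np.maximum(x[t] @ w1[e], 0.0)          # expert MLP (ReLU)
            out[t] += weight * (h @ w2[e])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
gate = rng.standard_normal((16, 4))          # 4 experts
w1 = rng.standard_normal((4, 16, 32)) * 0.1
w2 = rng.standard_normal((4, 32, 16)) * 0.1
y = moe_ref(x, w1, w2, gate)
```

The fused kernels avoid this per-token loop by grouping tokens per expert and issuing grouped GEMMs, which is where the layout utilities above come in.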

docs/api/gemm.rst

Lines changed: 23 additions & 5 deletions
@@ -7,18 +7,36 @@ flashinfer.gemm
 
 This module provides a set of GEMM operations.
 
-FP8 Batch GEMM
---------------
+FP4 GEMM
+--------
 
 .. autosummary::
    :toctree: ../generated
 
+   mm_fp4
+
+FP8 GEMM
+--------
+
+.. autosummary::
+   :toctree: ../generated
+
+   bmm_fp8
    gemm_fp8_nt_groupwise
    group_gemm_fp8_nt_groupwise
-   bmm_fp8
+   group_deepgemm_fp8_nt_groupwise
+   batch_deepgemm_fp8_nt_groupwise
+
+Mixed Precision GEMM (fp8 x fp4)
+--------------------------------
+
+.. autosummary::
+   :toctree: ../generated
+
+   group_gemm_mxfp4_nt_groupwise
 
-Grouped GEMM
-------------
+Grouped GEMM (Ampere/Hopper)
+----------------------------
 
 .. autoclass:: SegmentGEMMWrapper
    :members:
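The low-precision GEMMs listed in this file share one pattern: multiply quantized operands and fold the scale factors back in. A NumPy sketch of scaled NT-layout GEMM semantics with per-row scales (illustrative only; the real kernels consume packed fp8/fp4 tensors, and the per-group scale granularity varies by function):

```python
import numpy as np

def scaled_gemm_ref(a_q, a_scale, b_q, b_scale):
    """Illustrative scaled NT GEMM: C = (a_q * a_scale) @ (b_q * b_scale)^T.

    a_q: [m, k] quantized A with per-row scales a_scale: [m, 1].
    b_q: [n, k] quantized B (transposed layout) with b_scale: [n, 1].
    """
    return (a_q * a_scale) @ (b_q * b_scale).T

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((3, 8))
# emulate quantization: one scale per row, coarse integer rounding
a_scale = np.abs(a).max(axis=1, keepdims=True) / 8.0
b_scale = np.abs(b).max(axis=1, keepdims=True) / 8.0
a_q = np.round(a / a_scale)
b_q = np.round(b / b_scale)
c = scaled_gemm_ref(a_q, a_scale, b_q, b_scale)   # approximates a @ b.T
```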

docs/api/logits_processor.rst

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Types
    TaggedTensor
 
 Customization Features
--------------
+----------------------
 
 Custom Logits Processor
 ^^^^^^^^^^^^^^^^^^^^^^^
