
Commit 40471bf

Update documentation index (#1374)
1 parent f4e1420 commit 40471bf

19 files changed: 938 additions, 437 deletions

docs/api/attention.rst

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
.. _apiattention:

FlashInfer Attention Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

flashinfer.decode
=================

.. currentmodule:: flashinfer.decode

Single Request Decoding
-----------------------

.. autosummary::
   :toctree: ../generated

   single_decode_with_kv_cache

Batch Decoding
--------------

.. autosummary::
   :toctree: ../generated

   cudnn_batch_decode_with_kv_cache
   trtllm_batch_decode_with_kv_cache

.. autoclass:: BatchDecodeWithPagedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__

.. autoclass:: CUDAGraphBatchDecodeWithPagedKVCacheWrapper
   :members:

   .. automethod:: __init__
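For orientation, single-request decoding attends one new query token over the whole KV cache. Below is a minimal NumPy sketch of those semantics; the helper name and tensor layouts are assumptions for illustration, not FlashInfer's API (the real kernels are fused CUDA kernels that also handle grouped heads, paged caches, and custom masks):

```python
import numpy as np

def decode_attention_ref(q, k, v):
    """Illustrative reference for single-query decode attention.

    q: [num_heads, head_dim] -- the one new token's query.
    k, v: [kv_len, num_heads, head_dim] -- the KV cache.
    Returns: [num_heads, head_dim].
    """
    head_dim = q.shape[-1]
    # scores[h, t] = <q[h], k[t, h]> / sqrt(head_dim)
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(head_dim)
    # numerically stable softmax over the kv_len axis
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    # attention output: probability-weighted sum of values
    return np.einsum("ht,thd->hd", probs, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))        # 8 heads, head_dim 64
k = rng.standard_normal((128, 8, 64))   # 128 cached tokens
v = rng.standard_normal((128, 8, 64))
out = decode_attention_ref(q, k, v)
```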
flashinfer.prefill
==================

Attention kernels for prefill and append attention, in both single-request and batch-serving settings.

.. currentmodule:: flashinfer.prefill

Single Request Prefill/Append Attention
---------------------------------------

.. autosummary::
   :toctree: ../generated

   single_prefill_with_kv_cache
   single_prefill_with_kv_cache_return_lse

Batch Prefill/Append Attention
------------------------------

.. autosummary::
   :toctree: ../generated

   cudnn_batch_prefill_with_kv_cache
   trtllm_batch_context_with_kv_cache

.. autoclass:: BatchPrefillWithPagedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__

.. autoclass:: BatchPrefillWithRaggedKVCacheWrapper
   :members:
   :exclude-members: begin_forward, end_forward, forward, forward_return_lse

   .. automethod:: __init__
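Unlike decoding, prefill/append processes many query tokens at once under a causal mask; in the append case the queries sit at the end of a longer KV sequence. A NumPy sketch of those semantics (names and layouts are illustrative assumptions, not the library's signatures):

```python
import numpy as np

def prefill_attention_ref(q, k, v, causal=True):
    """Illustrative reference for single-request prefill/append attention.

    q: [qo_len, num_heads, head_dim]; k, v: [kv_len, num_heads, head_dim].
    With causal=True, query i may attend only to kv positions
    j <= i + (kv_len - qo_len), which covers append (qo_len < kv_len).
    """
    qo_len, _, head_dim = q.shape
    kv_len = k.shape[0]
    scores = np.einsum("ihd,jhd->hij", q, k) / np.sqrt(head_dim)
    if causal:
        i = np.arange(qo_len)[:, None]
        j = np.arange(kv_len)[None, :]
        mask = j > i + (kv_len - qo_len)          # future positions
        scores = np.where(mask[None, :, :], -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hij,jhd->ihd", probs, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 2, 16))   # 4 appended tokens, 2 heads
k = rng.standard_normal((10, 2, 16))  # 6 cached + 4 new positions
v = rng.standard_normal((10, 2, 16))
out = prefill_attention_ref(q, k, v)
```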
flashinfer.mla
==============

MLA (Multi-head Latent Attention) is the attention mechanism introduced in the DeepSeek series of models
(`DeepSeek-V2 <https://arxiv.org/abs/2405.04434>`_, `DeepSeek-V3 <https://arxiv.org/abs/2412.19437>`_,
and `DeepSeek-R1 <https://arxiv.org/abs/2501.12948>`_).

.. currentmodule:: flashinfer.mla

PageAttention for MLA
---------------------

.. autoclass:: BatchMLAPagedAttentionWrapper
   :members:

   .. automethod:: __init__
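The core idea of MLA, very roughly sketched below in NumPy (glossing over DeepSeek's exact parametrization, e.g. the RoPE/no-RoPE head split): instead of caching full per-head K/V, cache one small latent vector per token and up-project it to K/V at attention time. All names and sizes here are hypothetical:

```python
import numpy as np

# Hypothetical sizes for illustration only.
hidden, latent, num_heads, head_dim, kv_len = 256, 64, 4, 32, 10

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((hidden, latent)) * 0.05           # down-projection
W_uk = rng.standard_normal((latent, num_heads * head_dim)) * 0.05
W_uv = rng.standard_normal((latent, num_heads * head_dim)) * 0.05

x = rng.standard_normal((kv_len, hidden))
c_kv = x @ W_dkv                   # [kv_len, latent] -- this is what gets cached
k = (c_kv @ W_uk).reshape(kv_len, num_heads, head_dim)
v = (c_kv @ W_uv).reshape(kv_len, num_heads, head_dim)

# The cache stores `latent` floats per token instead of
# 2 * num_heads * head_dim -- here 64 instead of 256.
```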

docs/api/comm.rst

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
.. _apicomm:

flashinfer.comm
===============

.. currentmodule:: flashinfer.comm

This module provides communication primitives and utilities for distributed computing, including CUDA IPC, AllReduce operations, and memory management utilities.

CUDA IPC Utilities
------------------

.. autosummary::
   :toctree: ../generated

   CudaRTLibrary
   create_shared_buffer
   free_shared_buffer

DLPack Utilities
----------------

.. autosummary::
   :toctree: ../generated

   pack_strided_memory

Mapping Utilities
-----------------

.. autosummary::
   :toctree: ../generated

   Mapping

TensorRT-LLM AllReduce
----------------------

Types and Enums
~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   AllReduceFusionOp
   AllReduceFusionPattern
   AllReduceStrategyConfig
   AllReduceStrategyType
   FP4QuantizationSFLayout

Core Operations
~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_allreduce_fusion
   trtllm_custom_all_reduce
   trtllm_moe_allreduce_fusion
   trtllm_moe_finalize_allreduce_fusion

Workspace Management
~~~~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_create_ipc_workspace_for_all_reduce
   trtllm_create_ipc_workspace_for_all_reduce_fusion
   trtllm_destroy_ipc_workspace_for_all_reduce
   trtllm_destroy_ipc_workspace_for_all_reduce_fusion

Initialization and Utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   trtllm_lamport_initialize
   trtllm_lamport_initialize_all
   compute_fp4_swizzled_layout_sf_size

vLLM AllReduce
--------------

.. autosummary::
   :toctree: ../generated

   vllm_all_reduce
   vllm_dispose
   vllm_init_custom_ar
   vllm_register_buffer
   vllm_register_graph_buffers
   vllm_get_graph_buffer_ipc_meta
   vllm_meta_size

MNNVL (Multi-Node NVLink)
-------------------------

.. currentmodule:: flashinfer.comm.mnnvl

Core Classes
~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   MnnvlMemory
   McastGPUBuffer

Utility Functions
~~~~~~~~~~~~~~~~~

.. autosummary::
   :toctree: ../generated

   create_tensor_from_cuda_memory
   alloc_and_copy_to_cuda

TensorRT-LLM MNNVL AllReduce
----------------------------

.. currentmodule:: flashinfer.comm.trtllm_mnnvl_ar

.. autosummary::
   :toctree: ../generated

   trtllm_mnnvl_all_reduce
   trtllm_mnnvl_fused_allreduce_rmsnorm
   mpi_barrier
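Several of the fused operations above combine a cross-rank sum with a normalization step. A plain-NumPy sketch of what an AllReduce-plus-RMSNorm fusion computes, simulating ranks as list entries (the real ops move data over NVLink/IPC buffers; the function name here is an assumption):

```python
import numpy as np

def allreduce_rmsnorm_ref(shards, weight, eps=1e-6):
    """Illustrative semantics of a fused AllReduce + RMSNorm.

    shards: per-"rank" partial activations, each [tokens, hidden].
    After the sum-allreduce, every rank holds the same normalized tensor.
    """
    x = np.sum(shards, axis=0)                               # allreduce (sum)
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight                                # rmsnorm

rng = np.random.default_rng(0)
shards = [rng.standard_normal((3, 8)) for _ in range(4)]     # 4 simulated ranks
weight = np.ones(8)
out = allreduce_rmsnorm_ref(shards, weight)
```

Fusing the two steps saves one round trip through global memory per token, which is why the fused variant exists alongside the plain allreduce.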

docs/api/decode.rst

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/api/fp4_quantization.rst

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
.. _apifp4_quantization:

flashinfer.fp4_quantization
===========================

.. currentmodule:: flashinfer.fp4_quantization

This module provides FP4 quantization operations for LLM inference, supporting various scale factor layouts and quantization formats.

Core Quantization Functions
---------------------------

.. autosummary::
   :toctree: ../generated

   fp4_quantize
   nvfp4_quantize
   nvfp4_block_scale_interleave
   e2m1_and_ufp8sf_scale_to_float

Matrix Shuffling Utilities
--------------------------

.. autosummary::
   :toctree: ../generated

   shuffle_matrix_a
   shuffle_matrix_sf_a

Types and Enums
---------------

.. autosummary::
   :toctree: ../generated

   SfLayout
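For intuition about what block-scaled FP4 quantization does: the e2m1 format can represent only a handful of magnitudes, so values are quantized in small blocks, each sharing one scale factor chosen so the block's largest value maps to e2m1's maximum (6.0). A NumPy sketch of that idea (illustrative only; it ignores the library's packed storage and swizzled scale-factor layouts, and the helper name is hypothetical):

```python
import numpy as np

# Non-negative magnitudes representable in e2m1 (sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_block_quantize_ref(x, block=16):
    """Quantize a 1-D array in blocks of `block` elements: one scale per
    block, magnitudes snapped to the nearest e2m1 grid point. Returns the
    dequantized result so the rounding error is easy to inspect."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0   # amax -> 6.0
    scale = np.where(scale == 0, 1.0, scale)             # all-zero blocks
    scaled = x / scale
    # snap |scaled| to the nearest representable magnitude
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q * scale                                     # dequantize

x = np.linspace(-1.0, 1.0, 32)
xq = fp4_block_quantize_ref(x)
```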

docs/api/fused_moe.rst

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
.. _apifused_moe:

flashinfer.fused_moe
====================

.. currentmodule:: flashinfer.fused_moe

This module provides fused Mixture-of-Experts (MoE) operations optimized for different backends and data types.

Types and Enums
---------------

.. autosummary::
   :toctree: ../generated

   RoutingMethodType
   WeightLayout

Utility Functions
-----------------

.. autosummary::
   :toctree: ../generated

   convert_to_block_layout
   reorder_rows_for_gated_act_gemm

CUTLASS Fused MoE
-----------------

.. autosummary::
   :toctree: ../generated

   cutlass_fused_moe

TensorRT-LLM Fused MoE
----------------------

.. autosummary::
   :toctree: ../generated

   trtllm_fp4_block_scale_moe
   trtllm_fp8_block_scale_moe
   trtllm_fp8_per_tensor_scale_moe
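A fused MoE op bundles routing, per-expert matmuls, and the weighted combine into one kernel. A simplified NumPy reference of the overall computation (softmax top-k routing over a naive ReLU MLP per expert; real models typically use gated activations, and all names here are illustrative):

```python
import numpy as np

def moe_ref(x, w1, w2, gate, top_k=2):
    """Illustrative MoE semantics: route each token to its top_k experts,
    run each expert's MLP, and combine with softmax routing weights.

    x: [tokens, hidden]; gate: [hidden, num_experts];
    w1: [num_experts, hidden, inner]; w2: [num_experts, inner, hidden].
    """
    logits = x @ gate                                  # [tokens, experts]
    top = np.argsort(-logits, axis=-1)[:, :top_k]      # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                   # softmax over top_k
        for weight, e in zip(w, top[t]):
            h = np.maximum(x[t] @ w1[e], 0.0)          # expert MLP (ReLU)
            out[t] += weight * (h @ w2[e])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
gate = rng.standard_normal((16, 4))          # 4 experts
w1 = rng.standard_normal((4, 16, 32)) * 0.1
w2 = rng.standard_normal((4, 32, 16)) * 0.1
y = moe_ref(x, w1, w2, gate)
```

The fused kernels avoid this per-token loop by grouping tokens per expert and issuing grouped GEMMs, which is where the layout utilities above come in.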

docs/api/gemm.rst

Lines changed: 23 additions & 5 deletions
@@ -7,18 +7,36 @@ flashinfer.gemm
 
 This module provides a set of GEMM operations.
 
-FP8 Batch GEMM
---------------
+FP4 GEMM
+--------
 
 .. autosummary::
    :toctree: ../generated
 
+   mm_fp4
+
+FP8 GEMM
+--------
+
+.. autosummary::
+   :toctree: ../generated
+
+   bmm_fp8
    gemm_fp8_nt_groupwise
    group_gemm_fp8_nt_groupwise
-   bmm_fp8
+   group_deepgemm_fp8_nt_groupwise
+   batch_deepgemm_fp8_nt_groupwise
+
+Mixed Precision GEMM (fp8 x fp4)
+--------------------------------
+
+.. autosummary::
+   :toctree: ../generated
+
+   group_gemm_mxfp4_nt_groupwise
 
-Grouped GEMM
-------------
+Grouped GEMM (Ampere/Hopper)
+----------------------------
 
 .. autoclass:: SegmentGEMMWrapper
    :members:
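The low-precision GEMMs listed in this file share one pattern: multiply quantized operands and fold the scale factors back in. A NumPy sketch of scaled NT-layout GEMM semantics with per-row scales (illustrative only; the real kernels consume packed fp8/fp4 tensors, and the per-group scale granularity varies by function):

```python
import numpy as np

def scaled_gemm_ref(a_q, a_scale, b_q, b_scale):
    """Illustrative scaled NT GEMM: C = (a_q * a_scale) @ (b_q * b_scale)^T.

    a_q: [m, k] quantized A with per-row scales a_scale: [m, 1].
    b_q: [n, k] quantized B (transposed layout) with b_scale: [n, 1].
    """
    return (a_q * a_scale) @ (b_q * b_scale).T

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((3, 8))
# emulate quantization: one scale per row, coarse integer rounding
a_scale = np.abs(a).max(axis=1, keepdims=True) / 8.0
b_scale = np.abs(b).max(axis=1, keepdims=True) / 8.0
a_q = np.round(a / a_scale)
b_q = np.round(b / b_scale)
c = scaled_gemm_ref(a_q, a_scale, b_q, b_scale)   # approximates a @ b.T
```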

docs/api/logits_processor.rst

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Types
    TaggedTensor
 
 Customization Features
--------------
+----------------------
 
 Custom Logits Processor
 ^^^^^^^^^^^^^^^^^^^^^^^
