
Commit 1543081

[Docs] Update docs of graph opt backend (#3442)

* Update docs of graph opt backend
* update best_practices

1 parent 5703d7a commit 1543081

12 files changed: +243 −174 lines

docs/best_practices/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ Add the following lines to the startup parameters
  --use-cudagraph
  ```
  Notes:
- 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../parameters.md)
+ 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
  2. When CUDAGraph is enabled, if running with multi-GPU TP>1, `--enable-custom-all-reduce` must be specified at the same time.
  3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.
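For reference, a hypothetical multi-GPU launch combining the two flags from the notes above (the server entry point and model path are placeholders, not taken from this commit):

```
python -m fastdeploy.entrypoints.openai.api_server \
    --model <your-model-path> \
    --tensor-parallel-size 2 \
    --use-cudagraph \
    --enable-custom-all-reduce
```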

docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ Add the following lines to the startup parameters
  --use-cudagraph
  ```
  Notes:
- 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../parameters.md)
+ 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
  2. When CUDAGraph is enabled, if running with multi-GPU TP>1, `--enable-custom-all-reduce` must be specified at the same time.
  3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.

docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ Add the following lines to the startup parameters
  --enable-custom-all-reduce
  ```
  Notes:
- 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../parameters.md)
+ 1. Usually, no additional parameters need to be set, but CUDAGraph incurs some additional memory overhead, which may need to be adjusted in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
  2. When CUDAGraph is enabled, if running with multi-GPU TP>1, `--enable-custom-all-reduce` must be specified at the same time.
  3. When CUDAGraph is enabled, the scenario of `max-model-len > 32768` is not currently supported.

docs/features/graph_optimization.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Graph optimization technology in FastDeploy

FastDeploy's `GraphOptimizationBackend` integrates a variety of graph optimization technologies:

+ **CUDA Graph**: a mechanism that launches multiple GPU operations with a single CPU operation, reducing launch overhead and improving performance
+ **Dynamic-to-static graph conversion**: converts dynamic graphs to static graphs, using global graph-structure information to optimize the computation graph and improve execution efficiency
+ **CINN neural network compiler**: based on the static graph, performs IR conversion, kernel fusion, kernel generation, and other computation-graph compilation optimizations for comprehensive optimization

Any dynamic behavior, such as data-dependent control flow, Host-Device synchronization, changes in the address or shape of model inputs, or dynamic kernel execution configurations, will cause CUDAGraph capture or replay to fail. LLM inference involves dynamic input lengths, dynamic batch sizes, flexible Attention implementations, and multi-device communication, all of which make CUDAGraph difficult to apply.

Mainstream open-source solutions implement CUDA Graph on top of static graphs, which involves a deep technology stack. FastDeploy not only supports combining static graphs, the neural network compiler, and CUDAGraph, but also supports applying CUDAGraph directly to dynamic graphs, which has lower development costs but faces more complex dynamic behavior.

FastDeploy's `GraphOptimizationBackend` design architecture is shown below. **Some functions are still under development, so it is recommended to carefully read the usage restrictions in the first chapter.**

![](./images/GraphOptBackendArch.svg)
## 1. GraphOptimizationBackend current usage restrictions

### 1.1 Custom all-reduce must be enabled for multi-device inference
In CUDAGraph multi-device inference tasks, the Custom all-reduce operator must be used to perform the multi-card all-reduce.

Before version 2.2, neither CUDAGraph nor the Custom all-reduce operator was enabled by default; you need to add `--enable-custom-all-reduce` to the startup command to enable them manually.

### 1.2 Dynamic kernel execution configuration in releases before 2.2
The `FLAGS_max_partition_size` environment variable controls the `gridDim` execution configuration of the kernel in CascadeAppend Attention, and a dynamic execution configuration causes CUDAGraph execution to fail.
[PR#3223](https://github.com/PaddlePaddle/FastDeploy/pull/3223) fixed this issue, but it still exists in Release versions before 2.2.

**Problem self-check:**
+ Compute `div_up(max_model_len, FLAGS_max_partition_size)` from the value of `FLAGS_max_partition_size` (default 32K) and the `max_model_len` startup parameter. The failure can occur when the result is greater than `1`; it runs normally when the result equals `1`, as the sketch below illustrates.
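A minimal self-check sketch in Python (a hypothetical helper, assuming the default 32K `FLAGS_max_partition_size`; substitute your actual startup values):

```python
# div_up mirrors the ceiling division used when dispatching the attention kernel.
def div_up(a: int, b: int) -> int:
    return (a + b - 1) // b

FLAGS_max_partition_size = 32 * 1024  # default; override with your environment setting
max_model_len = 131072                # example startup parameter

if div_up(max_model_len, FLAGS_max_partition_size) > 1:
    print("At risk: dynamic kernel execution configuration can break CUDAGraph.")
else:
    print("OK: the execution configuration stays static and CUDAGraph runs normally.")
```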
**Solutions:**
1. Adjust the values of `FLAGS_max_partition_size` and `max_model_len` so that the dynamic execution configuration is not triggered.
2. Disable CUDAGraph.
## 2. GraphOptimizationBackend related configuration parameters

Currently, only user configuration of the following parameters is supported:
+ `use_cudagraph` : bool = False
+ `graph_optimization_config` : Dict[str, Any]
  + `graph_opt_level` : int = 0
  + `use_cudagraph` : bool = False
  + `cudagraph_capture_sizes` : List[int] = None

CudaGraph can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Setting it through both methods at the same time may cause conflicts.

The `graph_opt_level` parameter within `--graph-optimization-config` configures the graph optimization level, with the following options:
+ `0`: use the dynamic compute graph (the default)
+ `1`: use the static compute graph; during the initialization phase, the Paddle API is used to convert the dynamic graph into a static graph
+ `2`: on top of the static compute graph, use Paddle's compiler (CINN, Compiler Infrastructure for Neural Networks) to compile and optimize
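For example, dynamic-to-static conversion could be requested as follows (a sketch reusing the flag syntax shown above):

```
--graph-optimization-config '{"graph_opt_level": 1}'
```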
In general, static graphs have lower kernel launch overhead than dynamic graphs, and using static graphs is recommended.
For adapted models, FastDeploy's CudaGraph *supports both dynamic and static graphs*.
When CudaGraph is enabled with the default configuration, the list of batch sizes that CudaGraph needs to capture is set automatically based on the `max_num_seqs` parameter. The logic for generating the capture list is as follows:

1. Generate a candidate list of batch sizes covering the range [1, 1024].

```python
# Batch sizes [1, 2, 4, 8, 16, ..., 120, 128]
candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
# Batch sizes (128, 144, ..., 240, 256]
candidate_capture_sizes += [16 * i for i in range(9, 17)]
# Batch sizes (256, 288, ..., 992, 1024]
candidate_capture_sizes += [32 * i for i in range(17, 33)]
```

2. Crop the candidate list based on the user-set `max_num_seqs` to obtain a capture list covering the range [1, `max_num_seqs`], as sketched below.
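A minimal sketch of the cropping step (an illustration of the described logic, not FastDeploy's actual implementation):

```python
# Candidate list from step 1.
candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
candidate_capture_sizes += [16 * i for i in range(9, 17)]
candidate_capture_sizes += [32 * i for i in range(17, 33)]

def crop_capture_sizes(candidates: list[int], max_num_seqs: int) -> list[int]:
    """Step 2: keep only candidate sizes that do not exceed max_num_seqs."""
    return [size for size in candidates if size <= max_num_seqs]

# e.g. max_num_seqs=100 keeps [1, 2, 4, 8, ..., 96]
print(crop_capture_sizes(candidate_capture_sizes, max_num_seqs=100))
```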
Users can also customize the list of batch sizes for CudaGraph to capture through the `cudagraph_capture_sizes` parameter in `--graph-optimization-config`:

```
--graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}'
```
### 2.1 CudaGraph related parameters

Using CudaGraph incurs some additional memory overhead, which falls into two categories in FastDeploy:
+ additional input buffer overhead
+ CudaGraph uses a dedicated memory pool and therefore holds some intermediate activation memory in isolation from the main framework

FastDeploy's initialization sequence first uses the `gpu_memory_utilization` parameter to calculate the memory available for `KVCache`; after `KVCache` is initialized, the remaining memory is used to initialize CudaGraph. Since CudaGraph is not enabled by default yet, using the default startup parameters may lead to `Out of memory` errors; the following solutions may help:
+ Lower the `gpu_memory_utilization` value to reserve more memory for CudaGraph.
+ Lower `max_num_seqs` to decrease the maximum concurrency.
+ Customize the list of batch sizes that CudaGraph captures via `cudagraph_capture_sizes` in `graph_optimization_config` to reduce the number of captured graphs; a combined example follows.
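For instance, a memory-constrained deployment might combine these adjustments as follows (a hypothetical command line; the entry point, model path, and values are placeholders following the parameters documented above):

```
python -m fastdeploy.entrypoints.openai.api_server \
    --model <your-model-path> \
    --use-cudagraph \
    --gpu-memory-utilization 0.8 \
    --max-num-seqs 64 \
    --graph-optimization-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]}'
```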
+ Before use, you must ensure the loaded model is properly decorated with ```@support_graph_optimization```.

```python
# 1. import the decorator
from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
...

# 2. add the decorator
@support_graph_optimization
class Ernie4_5_Model(nn.Layer):  # Note: the decorator is added to the nn.Layer subclass
    ...

# 3. modify parameter passing in the ModelForCasualLM subclass's self.model() call
class Ernie4_5_MoeForCausalLM(ModelForCasualLM):
    ...
    def forward(
        self,
        ids_remove_padding: paddle.Tensor,
        forward_meta: ForwardMeta,
    ):
        hidden_states = self.model(ids_remove_padding=ids_remove_padding,  # specify the parameter name when passing
                                   forward_meta=forward_meta)
        return hidden_states
```

docs/features/images/GraphOptBackendArch.svg

Lines changed: 1 addition & 0 deletions

docs/parameters.md

Lines changed: 2 additions & 85 deletions
@@ -35,8 +35,8 @@ When using FastDeploy to deploy models (including offline inference and service
  | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
  | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
  | ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
- | ```use_cudagraph``` | `bool` | Whether to use CUDA Graph, default: False |
- | ```graph_optimization_config``` | `str` | Parameters related to graph optimization can be configured, with default value of '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }' |
+ | ```use_cudagraph``` | `bool` | Whether to use CUDA Graph, default: False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. Custom all-reduce must be enabled at the same time in multi-card scenarios. |
+ | ```graph_optimization_config``` | `dict` | Parameters related to computation graph optimization can be configured; the default value is '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }'. For a detailed description, see [graph_optimization.md](./features/graph_optimization.md) |
  | ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
  | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
  | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
@@ -70,86 +70,3 @@ In actual inference, it's difficult for users to know how to properly configure
  When `enable_chunked_prefill` is enabled, the service processes long input sequences through dynamic chunking, significantly improving GPU resource utilization. In this mode, the original `max_num_batched_tokens` parameter no longer constrains the batch token count in prefill phase (limiting single prefill token count), thus introducing `max_num_partial_prefills` parameter specifically to limit concurrently processed partial batches.

  To optimize scheduling priority for short requests, new `max_long_partial_prefills` and `long_prefill_token_threshold` parameter combination is added. The former limits the number of long requests in single prefill batch, the latter defines the token threshold for long requests. The system will prioritize batch space for short requests, thereby reducing short request latency in mixed workload scenarios while maintaining stable throughput.
-
- ## 4. GraphOptimizationBackend related configuration parameters
- Currently, only user configuration of the following parameters is supported:
- - `use_cudagraph` : bool = False
- - `graph_optimization_config` : Dict[str, Any]
-   - `graph_opt_level`: int = 0
-   - `use_cudagraph`: bool = False
-   - `cudagraph_capture_sizes` : List[int] = None
-
- CudaGrpah can be enabled by setting `--use-cudagraph` or `--graph-optimization-config '{"use_cudagraph":true}'`. Using two different methods to set the use graph simultaneously may cause conflicts.
-
- The `graph_opt_level` parameter within `--graph-optimization-config` is used to configure the graph optimization level, with the following available options:
- - `0`: Use Dynamic compute graph, default to 0
- - `1`: Use Static compute graph, during the initialization phase, Paddle API will be used to convert the dynamic image into a static image
- - `2`: Base on Static compute graph, use the complier(CINN, Compiler Infrastructure for Neural Networks) of Paddle to compile and optimize
-
- In general, static graphs have lower Kernel Launch overhead than dynamic graphs, and it is recommended to use static graphs.
- For adapted models, FastDeploy's CudaGraph *can support both dynamic and static graphs* simultaneously.
-
- When CudaGraph is enabled in the default configuration, a list of Batch Sizes that CudaGraph needs to capture will be automatically set based on the 'max_num_deqs' parameter. The logic for generating the list of Batch Sizes that need to be captured is as follows:
-
- 1. Generate a candidate list with a range of [1,1024] Batch Size.
-
- ```
- # Batch Size [1, 2, 4, 8, 16, ... 120, 128]
- candidate_capture_sizes = [1, 2, 4] + [8 * i for i in range(1, 17)]
- # Batch Size (128, 144, ... 240, 256]
- candidate_capture_sizes += [16 * i for i in range(9, 17)]
- # Batch Size (256, 288, ... 992, 1024]
- candidate_capture_sizes += [32 * i for i in range(17, 33)]
- ```
-
- 2. Crop the candidate list based on the user set 'max_num_deqs' to obtain a CudaGraph capture list with a range of [1, 'max_num_deqs'].
-
- Users can also customize the batch size list that needs to be captured by CudaGraph through the parameter `cudagraph_capture_sizes` in `--graph-optimization-config`:
-
- ```
- --graph-optimization-config '{"cudagraph_capture_sizes": [1, 3, 5, 7, 9]}'
- ```
-
- ### CudaGraph related parameters
-
- Using CudaGraph incurs some additional memory overhead, divided into two categories in FastDeploy:
- - Additional input Buffer overhead
- - CudaGraph uses dedicated memory pool, thus holding some intermediate activation memory isolated from main framework
-
- FastDeploy initialization sequence first uses `gpu_memory_utilization` parameter to calculate available memory for `KVCache`, after initializing `KVCache` then uses remaining memory to initialize CudaGraph. Since CudaGraph is not enabled by default currently, using default startup parameters may encounter `Out of memory` errors, can try following solutions:
- - Lower `gpu_memory_utilization` value, reserve more memory for CudaGraph.
- - Lower `max_num_seqs` to decrease the maximum concurrency.
- - Customize the batch size list that CudaGraph needs to capture through `graph_optimization_config`, and reduce the number of captured graphs by using `cudagraph_capture_sizes`
-
- - Before use, must ensure loaded model is properly decorated with ```@support_graph_optimization```.
-
- ```python
- # 1. import decorator
- from fastdeploy.model_executor.graph_optimization.decorator import support_graph_optimization
- ...
-
- # 2. add decorator
- @support_graph_optimization
- class Ernie4_5_Model(nn.Layer): # Note decorator is added to nn.Layer subclass
-     ...
-
- # 3. modify parameter passing in ModelForCasualLM subclass's self.model()
- class Ernie4_5_MoeForCausalLM(ModelForCasualLM):
-     ...
-     def forward(
-         self,
-         ids_remove_padding: paddle.Tensor,
-         forward_meta: ForwardMeta,
-     ):
-         hidden_states = self.model(ids_remove_padding=ids_remove_padding, # specify parameter name when passing
-                                    forward_meta=forward_meta)
-         return hidden_states
- ```
-
- - When ```use_cudagraph``` is enabled, currently only supports single-GPU inference, i.e. ```tensor_parallel_size``` set to 1.
- - When ```use_cudagraph``` is enabled, cannot enable ```enable_prefix_caching``` or ```enable_chunked_prefill```.

docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ CUDAGraph is a GPU compute acceleration technology provided by NVIDIA that works by capturing CUDA op…
  --use-cudagraph
  ```
  Notes:
- 1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../parameters.md)
+ 1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
  2. When CUDAGraph is enabled, for multi-GPU inference with TP>1, `--enable-custom-all-reduce` must also be specified.
  3. When CUDAGraph is enabled, `max-model-len > 32768` is not supported for now.

docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ CUDAGraph is a GPU compute acceleration technology provided by NVIDIA that works by capturing CUDA op…
  --use-cudagraph
  ```
  Notes:
- 1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../parameters.md)
+ 1. Usually no additional parameters need to be set, but CUDAGraph incurs some extra GPU memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter tuning, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md)
  2. When CUDAGraph is enabled, for multi-GPU inference with TP>1, `--enable-custom-all-reduce` must also be specified.
  3. When CUDAGraph is enabled, `max-model-len > 32768` is not supported for now.
