docs/en/features/async_schedule.md (5 additions & 4 deletions)
@@ -13,16 +13,17 @@ In the overall architecture, stages 1 and 3 on the CPU side are handled by diffe
## Usage
- xLLM provides the gflags parameter enable_schedule_overlap, which defaults to false. To enable this feature, simply set it to true in xLLM's service startup script, as follows:
+ xLLM provides the gflags parameter enable_schedule_overlap, which defaults to true. To disable this feature, simply set it to false in xLLM's service startup script, as follows:
```shell
- --enable_schedule_overlap=true
+ --enable_schedule_overlap=false
```
## Performance
- - With asynchronous scheduling enabled, the device idle time between two steps is approximately 200μs, comparable to a single kernel launch duration.
+ - With asynchronous scheduling enabled, the device idle time between two steps is approximately 200us, comparable to a single kernel launch duration.
- On the DeepSeek-R1-Distill-Qwen-1.5B model with TPOT constrained to 50ms, this achieves a 17% throughput improvement.
## Notice
- The asynchronous scheduling feature requires the server to compute one additional step. For use cases involving limited output tokens (e.g., few-token generation) or single-output scenarios like embedding models, enabling this feature is not recommended as it may reduce server-side throughput.
+ The asynchronous scheduling feature requires the server to compute one additional step. For use cases involving limited output tokens (e.g., few-token generation) or single-output scenarios like embedding models, enabling this feature is not recommended as it may reduce server-side throughput; it is therefore hard-disabled internally for these cases.
+ VLM models are still being adapted, so this feature is temporarily disabled for them.
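
For readers applying this change, a minimal sketch of where the flag sits in a full startup command may help. Only `--enable_schedule_overlap` is taken from the document above; the binary name, model path, and port flag are illustrative assumptions, not from this PR.

```shell
# Hypothetical xLLM launch command: only --enable_schedule_overlap is
# documented above; the binary name and the other flags are assumptions.
./xllm \
  --model=/path/to/DeepSeek-R1-Distill-Qwen-1.5B \
  --port=8080 \
  --enable_schedule_overlap=false
```

Since this change flips the default from false to true, the explicit `=false` is only needed to opt out; per the Notice above, embedding-style workloads are hard-disabled internally regardless of the flag.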