Commit 132a8ef
luukunn, XieYunshen, ming1753, zoooo0820, and EmmonsCurse authored

Release/2.1 (#3414)

* Pre ce modified (#3335) (#3360)
  * Pre ce modified (#3335)
  * update
  * update
  * fix
  * fix
  * update
  * update
  * update
  * fix
  * update
  * update
  * update
  * add ut fix pr(3367)
* [Bug Fix] Fix V1 video bug (#3387)
* fix stopseq error info (#3342)
  Co-authored-by: YuBaoku <[email protected]>
* [BugFix] Fix default log level of paddleformers (#3377)
  Co-authored-by: YuBaoku <[email protected]>
* [Polish Code] Remove useless notes
* feat(log): add_request_and_response_log (#3392)
* Optimize CI execution workflow. (#3371) (#3384)
* fix
* [BugFix] fix control signal release failed (#3374)
  * [BugFix]
  * [BugFix]
  * [BugFix]
  * [BugFix]
  * fix
  * fix
  Co-authored-by: YuBaoku <[email protected]>
  Co-authored-by: Jiang-Jia-Jun <[email protected]>
* Revert "Merge branch 'feature/online/vs_think_20250813' into release/2.1"
  This reverts commit 02596fc, reversing changes made to 0334762.
* [XPU] Fixed the performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER (#3393)
  * fix v1 schedule oom bug
  * fix v1 schedule oom bug
* [BugFix] fix ErnieProcessor not set raw_prediction (#3401)
* [Doc] Release fastdeploy-xpu 2.1.0 (#3407)
  * fix v1 schedule oom bug
  * fix v1 schedule oom bug
  * update release note
* [Doc] Release fastdeploy-xpu 2.0.3 (#3408)
  * fix v1 schedule oom bug
  * fix v1 schedule oom bug
  * update release note
  * update info

---------

Co-authored-by: YUNSHEN XIE <[email protected]>
Co-authored-by: ming1753 <[email protected]>
Co-authored-by: JYChen <[email protected]>
Co-authored-by: YuBaoku <[email protected]>
Co-authored-by: Jiang-Jia-Jun <[email protected]>
Co-authored-by: Jiang-Jia-Jun <[email protected]>
Co-authored-by: xiaolei373 <[email protected]>
Co-authored-by: ltd0924 <[email protected]>
Co-authored-by: yinwei <[email protected]>
Co-authored-by: memoryCoderC <[email protected]>
1 parent e113319 commit 132a8ef

30 files changed: +132 −1068 lines

docs/get_started/installation/kunlunxin_xpu.md
Lines changed: 7 additions & 7 deletions

@@ -5,7 +5,7 @@
 - OS: Linux
 - Python: 3.10
 - XPU Model: P800
-- XPU Driver Version: ≥ 5.0.21.10
+- XPU Driver Version: ≥ 5.0.21.26
 - XPU Firmware Version: ≥ 1.31
 
 Verified platform:
@@ -15,7 +15,7 @@ Verified platform:
 - OS: CentOS release 7.6 (Final)
 - Python: 3.10
 - XPU Model: P800 (OAM Edition)
-- XPU Driver Version: 5.0.21.10
+- XPU Driver Version: 5.0.21.26
 - XPU Firmware Version: 1.31
 
 **Note:** Currently, only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified. Other CPU types and P800 (PCIe Edition) servers have not been tested yet.
@@ -25,9 +25,9 @@ Verified platform:
 ```bash
 mkdir Work
 cd Work
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.3
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0
 docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
-  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.3 \
+  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0 \
   /bin/bash
 docker exec -it fastdeploy-xpu /bin/bash
 ```
@@ -37,7 +37,7 @@ docker exec -it fastdeploy-xpu /bin/bash
 ### Install PaddlePaddle
 
 ```bash
-python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
 ```
 
 Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
@@ -49,7 +49,7 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**Do NOT install via PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.3 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+python -m pip install fastdeploy-xpu==2.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (Not recommended)
@@ -63,7 +63,7 @@ python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/pa
 ### Install PaddlePaddle
 
 ```bash
-python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
 ```
 
 Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
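
A quick way to confirm the upgraded wheels took effect is from Python. This is an illustrative check, not part of the documented install flow; `paddle.is_compiled_with_xpu()` is the same build probe this commit starts using in the engine code:

```python
# Illustrative post-install sanity check (not part of the official docs).
import paddle

print(paddle.__version__)             # expect 3.1.1 after the upgrade above
print(paddle.is_compiled_with_xpu())  # expect True on the xpu-p800 wheel
```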

docs/usage/kunlunxin_xpu_deployment.md
Lines changed: 7 additions & 1 deletion

@@ -5,8 +5,14 @@
 |ERNIE-4.5-300B-A47B|32K|WINT4|4 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
 |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
 |ERNIE-4.5-300B-A47B|128K|WINT4|8 (recommend)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
+|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|32K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
 |ERNIE-4.5-0.3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
-|ERNIE-4.5-0.3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="x" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
+|ERNIE-4.5-0.3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
 |ERNIE-4.5-0.3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
 |ERNIE-4.5-0.3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
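
Every launch command in this table starts an OpenAI-compatible server on the given port. A minimal client sketch, assuming a server from one of the 21B-A3B rows is running on localhost:8188 and the third-party `openai` package is installed (both are assumptions, not part of this commit):

```python
# Minimal client against the api_server launched above (assumes the
# third-party `openai` package; the API key is a local placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8188/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```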

docs/zh/get_started/installation/kunlunxin_xpu.md
Lines changed: 7 additions & 7 deletions

@@ -5,7 +5,7 @@
 - OS: Linux
 - Python: 3.10
 - XPU Model: P800
-- XPU Driver Version: ≥ 5.0.21.10
+- XPU Driver Version: ≥ 5.0.21.26
 - XPU Firmware Version: ≥ 1.31
 
 Verified platforms:
@@ -15,7 +15,7 @@
 - OS: CentOS release 7.6 (Final)
 - Python: 3.10
 - XPU Model: P800 (OAM Edition)
-- XPU Driver Version: 5.0.21.10
+- XPU Driver Version: 5.0.21.26
 - XPU Firmware Version: 1.31
 
 **Note:** Currently only INTEL or Hygon CPU-based P800 (OAM Edition) servers have been verified; other CPUs and P800 (PCIe Edition) servers have not been tested yet.
@@ -25,9 +25,9 @@
 ```bash
 mkdir Work
 cd Work
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.3
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0
 docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \
-  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.3 \
+  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.1.0 \
   /bin/bash
 docker exec -it fastdeploy-xpu /bin/bash
 ```
@@ -37,7 +37,7 @@ docker exec -it fastdeploy-xpu /bin/bash
 ### Install PaddlePaddle
 
 ```bash
-python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
 ```
 
 Alternatively, you can install the latest version of PaddlePaddle (Not recommended)
@@ -49,7 +49,7 @@ python -m pip install --pre paddlepaddle-xpu -i https://www.paddlepaddle.org.cn/
 ### Install FastDeploy (**Do NOT install via the PyPI source**)
 
 ```bash
-python -m pip install fastdeploy-xpu==2.0.3 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+python -m pip install fastdeploy-xpu==2.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/fastdeploy-xpu-p800/ --extra-index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
 ```
 
 Alternatively, you can install the latest version of FastDeploy (Not recommended)
@@ -63,7 +63,7 @@ python -m pip install --pre fastdeploy-xpu -i https://www.paddlepaddle.org.cn/pa
 ### Install PaddlePaddle
 
 ```bash
-python -m pip install paddlepaddle-xpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
+python -m pip install paddlepaddle-xpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/xpu-p800/
 ```
 
 Alternatively, you can install the latest version of PaddlePaddle (Not recommended)

docs/zh/usage/kunlunxin_xpu_deployment.md
Lines changed: 6 additions & 0 deletions

@@ -5,6 +5,12 @@
 |ERNIE-4.5-300B-A47B|32K|WINT4|4 (recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
 |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
 |ERNIE-4.5-300B-A47B|128K|WINT4|8 (recommended)|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.0.0|
+|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|32K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
+|ERNIE-4.5-21B-A3B|128K|WINT4|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|>=2.1.0|
 |ERNIE-4.5-0.3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
 |ERNIE-4.5-0.3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="x" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|>=2.0.3|
 |ERNIE-4.5-0.3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|>=2.0.3|

fastdeploy/engine/args_utils.py
Lines changed: 6 additions & 22 deletions

@@ -20,6 +20,8 @@
 from dataclasses import fields as dataclass_fields
 from typing import Any, Dict, List, Optional
 
+import paddle
+
 from fastdeploy.config import (
     CacheConfig,
     EarlyStopConfig,
@@ -93,14 +95,6 @@ class EngineArgs:
     """
     specifies the reasoning parser to use for extracting reasoning content from the model output
     """
-    tool_call_parser: str = None
-    """
-    specifies the tool call parser to use for extracting tool call from the model output
-    """
-    tool_parser_plugin: str = None
-    """
-    tool parser plugin used to register user defined tool parsers
-    """
     enable_mm: bool = False
     """
     Flags to enable multi-modal model
@@ -429,18 +423,6 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
         help="Flag specifies the reasoning parser to use for extracting "
         "reasoning content from the model output",
     )
-    model_group.add_argument(
-        "--tool-call-parser",
-        type=str,
-        default=EngineArgs.tool_call_parser,
-        help="Flag specifies the tool call parser to use for extracting" "tool call from the model output",
-    )
-    model_group.add_argument(
-        "--tool-parser-plugin",
-        type=str,
-        default=EngineArgs.tool_parser_plugin,
-        help="tool parser plugin used to register user defined tool parsers",
-    )
     model_group.add_argument(
         "--speculative-config",
         type=json.loads,
@@ -889,7 +871,10 @@ def create_engine_config(self) -> Config:
         if not int(os.getenv("ENABLE_V1_KVCACHE_SCHEDULER", "0")):
             self.max_num_batched_tokens = self.max_model_len
         else:
-            self.max_num_batched_tokens = 8192  # if set to max_model_len, it's easy to be OOM
+            if paddle.is_compiled_with_xpu():
+                self.max_num_batched_tokens = self.max_model_len
+            else:
+                self.max_num_batched_tokens = 8192
 
         all_dict = asdict(self)
         all_dict["model_cfg"] = model_cfg
@@ -928,7 +913,6 @@ def create_engine_config(self) -> Config:
             mm_processor_kwargs=self.mm_processor_kwargs,
             enable_mm=self.enable_mm,
             reasoning_parser=self.reasoning_parser,
-            tool_parser=self.tool_call_parser,
             splitwise_role=self.splitwise_role,
            innode_prefill_ports=self.innode_prefill_ports,
             max_num_partial_prefills=self.max_num_partial_prefills,
fastdeploy/engine/config.py
Lines changed: 4 additions & 3 deletions

@@ -85,7 +85,6 @@ def __init__(
         max_long_partial_prefills: int = 1,
         long_prefill_token_threshold: int = 0,
         reasoning_parser: str = None,
-        tool_parser: str = None,
         guided_decoding_backend: Optional[str] = None,
         disable_any_whitespace: bool = False,
         enable_logprob: bool = False,
@@ -166,7 +165,6 @@ def __init__(
         self.max_long_partial_prefills = max_long_partial_prefills
         self.long_prefill_token_threshold = long_prefill_token_threshold
         self.reasoning_parser = reasoning_parser
-        self.tool_parser = tool_parser
         self.graph_optimization_config = graph_optimization_config
         self.early_stop_config = early_stop_config
         self.guided_decoding_backend = guided_decoding_backend
@@ -241,7 +239,10 @@ def postprocess(self):
         if not int(os.getenv("ENABLE_V1_KVCACHE_SCHEDULER", "0")):
             self.max_num_batched_tokens = self.max_model_len
         else:
-            self.max_num_batched_tokens = 8192  # if set to max_model_len, it's easy to be OOM
+            if paddle.is_compiled_with_xpu():
+                self.max_num_batched_tokens = self.max_model_len
+            else:
+                self.max_num_batched_tokens = 8192
 
         if self.long_prefill_token_threshold == 0:
             self.long_prefill_token_threshold = int(self.max_model_len * 0.04)

fastdeploy/engine/engine.py
Lines changed: 0 additions & 1 deletion

@@ -106,7 +106,6 @@ def __init__(self, cfg):
             cfg.limit_mm_per_prompt,
             cfg.mm_processor_kwargs,
             cfg.enable_mm,
-            cfg.tool_parser,
         )
 
         self.start_queue_service()

fastdeploy/engine/request.py
Lines changed: 0 additions & 2 deletions

@@ -24,7 +24,6 @@
 import numpy as np
 
 from fastdeploy.engine.sampling_params import SamplingParams
-from fastdeploy.entrypoints.openai.protocol import ToolCall
 from fastdeploy.utils import data_processor_logger
 from fastdeploy.worker.output import LogprobsLists, SampleLogprobs
 
@@ -250,7 +249,6 @@ class CompletionOutput:
     draft_token_ids: list[int] = None
     text: Optional[str] = None
     reasoning_content: Optional[str] = None
-    tool_calls: Optional[ToolCall] = None
 
     def to_dict(self):
         """

fastdeploy/engine/sched/resource_manager_v1.py
Lines changed: 1 addition & 1 deletion

@@ -289,7 +289,7 @@ def schedule(self):
         while self.waiting and token_budget > 0:
             if len(self.running) == self.max_num_seqs:
                 break
-            if self.config.enable_mm and self.exist_prefill(scheduled_reqs):
+            if (self.config.enable_mm or paddle.is_compiled_with_xpu()) and self.exist_prefill(scheduled_reqs):
                 break
             request = self.waiting[0]
             if request.status == RequestStatus.WAITING:
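
Together with the batching change above, XPU builds now also wait for an in-flight prefill to finish before scheduling another, a constraint previously applied only when `enable_mm` was set. A hypothetical extraction of the gate:

```python
import paddle


def defer_new_prefill(config, scheduled_reqs, exist_prefill) -> bool:
    # Hypothetical extraction of the patched condition: serialize prefill
    # for multimodal models (as before) and, now, for XPU builds as well.
    return (config.enable_mm or paddle.is_compiled_with_xpu()) and exist_prefill(scheduled_reqs)
```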
