19 changes: 9 additions & 10 deletions docs/features/data_parallel_service.md
@@ -15,9 +15,9 @@ The scheduling flow is shown below - users randomly request IP and port, obtain
```python
prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"我要采访一位科幻作家,创建一个包含5个问题的列表"
]

@@ -83,9 +83,9 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
```

### Parameter Description
- num-servers: Number of API servers to launch
- ports: Ports for API servers
- args: Arguments for API servers
- num-servers: Number of API servers to launch
- ports: Ports for API servers
- args: Arguments for API servers
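
For illustration, a minimal launch sketch combining these parameters (the port list format and the engine flags passed through `--args` are assumptions for this example, not values taken from the repository):

```bash
# Hypothetical example: start 2 API servers on ports 8180 and 8181,
# forwarding the remaining engine arguments to each server via --args.
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --num-servers 2 \
    --ports "8180,8181" \
    --args "--model baidu/ERNIE-4.5-300B-A47B-Paddle --tensor-parallel-size 1 --data-parallel-size 2"
```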

### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment)
@@ -94,9 +94,8 @@ Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated
For multi-machine deployment, ensure network cards support RDMA and all cluster nodes are interconnected.

**Note**:
* `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine, multiple cards should be separated by commas.
* The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where <device> can be `cpu` or `gpu`.

- `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine, multiple cards should be separated by commas.
- The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where <device> can be `cpu` or `gpu`.
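
A minimal sketch of how these two notes combine in practice, assuming the detection script prints a comma-separated NIC list suitable for `KVCACHE_RDMA_NICS`:

```bash
# Detect the RDMA NICs visible to the GPUs on this node and export them
# before launching the prefill/decode instances (output format is assumed).
export KVCACHE_RDMA_NICS=$(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS=${KVCACHE_RDMA_NICS}"   # e.g. mlx5_0,mlx5_1
```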

**Prefill Instance**
```bash
@@ -148,4 +147,4 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-ttl 9000
--scheduler-topic "test" \
--splitwise-role "decode"
```
```
6 changes: 3 additions & 3 deletions docs/features/disaggregated.md
@@ -73,10 +73,10 @@ Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/dem

#### Prerequisite: Redis

> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> Versions below this may not support the required commands.
>
>
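
A quick way to confirm the running server meets this requirement, using standard Redis tooling:

```bash
# Prints the server version; it should report 6.2.0 or newer.
redis-cli INFO server | grep redis_version
```
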
* Installation via `conda`

```bash
114 changes: 57 additions & 57 deletions docs/features/multi-node_deployment.md
@@ -1,71 +1,71 @@
# Multi-Node Deployment

## Overview
## Overview
Multi-node deployment addresses scenarios where a single machine's GPU memory is insufficient to support deployment of large models by enabling tensor parallelism across multiple machines.

## Environment Preparation
#### Network Requirements
1. All nodes must be within the same local network
2. Ensure bidirectional connectivity between all nodes (test using `ping` and `nc -zv`)
## Environment Preparation
### Network Requirements
1. All nodes must be within the same local network
2. Ensure bidirectional connectivity between all nodes (test using `ping` and `nc -zv`)

#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)
#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)

## Tensor Parallel Deployment
## Tensor Parallel Deployment

### Recommended Launch Method
We recommend using mpirun for one-command startup without manually starting each node.
### Recommended Launch Method
We recommend using mpirun for one-command startup without manually starting each node.

### Usage Instructions
1. Execute the same command on all machines
2. The IP order in the `ips` parameter determines the node startup sequence
3. The first IP will be designated as the master node
4. Ensure all nodes can resolve each other's hostnames
### Usage Instructions
1. Execute the same command on all machines
2. The IP order in the `ips` parameter determines the node startup sequence
3. The first IP will be designated as the master node
4. Ensure all nodes can resolve each other's hostnames

* Online inference startup example:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```
* Online inference startup example:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```

* Offline startup example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"

sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* Offline startup example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

* Notes:
- Only the master node can receive completion requests
- Always send requests to the master node (the first IP in the ips list)
- The master node will distribute workloads across all nodes
model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"

### Parameter Description
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```

#### `ips` Parameter
- **Type**: `string`
- **Format**: Comma-separated IPv4 addresses
- **Description**: Specifies the IP addresses of all nodes in the deployment group
- **Required**: Only for multi-node deployments
- **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`
* Notes:
* Only the master node can receive completion requests
* Always send requests to the master node (the first IP in the ips list)
* The master node will distribute workloads across all nodes

#### `tensor_parallel_size` Parameter
- **Type**: `integer`
- **Description**: Total number of GPUs across all nodes
- **Required**: Yes
- **Example**: For 2 nodes with 8 GPUs each, set to 16
### Parameter Description

#### `ips` Parameter
* **Type**: `string`
* **Format**: Comma-separated IPv4 addresses
* **Description**: Specifies the IP addresses of all nodes in the deployment group
* **Required**: Only for multi-node deployments
* **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`

#### `tensor_parallel_size` Parameter
* **Type**: `integer`
* **Description**: Total number of GPUs across all nodes
* **Required**: Yes
* **Example**: For 2 nodes with 8 GPUs each, set to 16
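
Since only the master node accepts completion requests (see the notes above), here is a request sketch against the online example's master node, assuming the server exposes the OpenAI-compatible chat endpoint on the `--port 8180` used earlier:

```bash
# Send the request to the first IP in --ips (the master node), port 8180.
curl -s http://192.168.1.101:8180/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
         "messages": [{"role": "user", "content": "Who are you?"}]}'
```
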
16 changes: 5 additions & 11 deletions docs/zh/features/data_parallel_service.md
@@ -12,15 +12,14 @@ FastDeploy 提供了splitwise scheduler,可以感知各个DP的负载状态,
具体调度流程如下图,用户随机请求ip 与端口,通过redis获取负载状态,将数据分发到负载较低的DP进行推理。
![数据调度架构图](./images/scheduler_img.png)


#### 离线推理
```python

prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"我要采访一位科幻作家,创建一个包含5个问题的列表"
]

@@ -65,11 +64,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-ttl 9000
```


### 用户自行调度
FastDeploy 提供了multi_api_server,用户可以拉起多个api server,用户自行选择dp 进行请求,在该种情况下用户可以自行添加负载均衡模型进行调度。(目前该种方式只支持在线推理)


#### 在线推理

![数据调度架构图](./images/no_scheduler_img.png)
@@ -95,8 +92,6 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
- ports: 指定拉起的api server 的端口
- args: 指定拉起的api server 的参数



### 数据并行 + 分离式部署

具体可以参考[分离式部署](disaggregated.md#多机分离式部署)
@@ -106,8 +101,8 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
多机部署时需要确认当前网卡是否支持RDMA,并且需要集群中所有节点网络互通。

**注意**:
* `KVCACHE_RDMA_NICS` 指定当前机器的RDMA网卡,多个网卡用逗号隔开。
* 仓库中提供了自动检测RDMA网卡的脚本 `bash scripts/get_rdma_nics.sh <device>`, 其中 <device> 可以是 `cpu` 或 `gpu`。
- `KVCACHE_RDMA_NICS` 指定当前机器的RDMA网卡,多个网卡用逗号隔开。
- 仓库中提供了自动检测RDMA网卡的脚本 `bash scripts/get_rdma_nics.sh <device>`, 其中 <device> 可以是 `cpu` 或 `gpu`。

**prefill 实例**

@@ -163,4 +158,3 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-topic "test" \
--splitwise-role "decode"
```

4 changes: 2 additions & 2 deletions docs/zh/features/disaggregated.md
@@ -75,8 +75,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### 前置依赖 Redis
* 使用`conda`安装

> **⚠️ 注意**
> **Redis 版本要求:6.2.0 及以上**
> **⚠️ 注意**
> **Redis 版本要求:6.2.0 及以上**
> 低于此版本可能不支持所需的命令。

```bash
28 changes: 13 additions & 15 deletions docs/zh/features/multi-node_deployment.md
@@ -4,11 +4,10 @@
多节点部署旨在解决单个机器GPU显存不足时,支持跨多台机器的张量并行执行。

## 环境准备
#### 网络要求
### 网络要求
1. 所有节点必须在同一本地网络中
2. 确保所有节点之间双向连通(可使用`ping`和`nc -zv`测试)


#### 软件要求
1. 所有节点安装相同版本的FastDeploy
2. [建议安装]安装并配置MPI(OpenMPI或MPICH)
@@ -52,22 +51,21 @@
```

* 注意:
- 只有主节点可以接收完成请求
- 请始终将请求发送到主节点(ips列表中的第一个IP)
- 主节点将在所有节点间分配工作负载
* 只有主节点可以接收完成请求
* 请始终将请求发送到主节点(ips列表中的第一个IP)
* 主节点将在所有节点间分配工作负载

### 参数说明

#### `ips`参数
- **类型**: `字符串`
- **格式**: 逗号分隔的IPv4地址
- **描述**: 指定部署组中所有节点的IP地址
- **必填**: 仅多节点部署时需要
- **示例**: `"192.168.1.101,192.168.1.102,192.168.1.103"`
* **类型**: `字符串`
* **格式**: 逗号分隔的IPv4地址
* **描述**: 指定部署组中所有节点的IP地址
* **必填**: 仅多节点部署时需要
* **示例**: `"192.168.1.101,192.168.1.102,192.168.1.103"`

#### `tensor_parallel_size`参数
- **类型**: `整数`
- **描述**: 所有节点上的GPU总数
- **必填**: 是
- **示例**: 对于2个节点各8个GPU,设置为16

* **类型**: `整数`
* **描述**: 所有节点上的GPU总数
* **必填**: 是
* **示例**: 对于2个节点各8个GPU,设置为16
25 changes: 18 additions & 7 deletions fastdeploy/entrypoints/engine_client.py
@@ -19,6 +19,7 @@
import time
import traceback
import uuid
from copy import copy

import numpy as np

@@ -210,26 +211,36 @@ async def add_requests(self, task):

self.valid_parameters(task)
api_server_logger.debug(f"Receive task: {task}")
n = task.get("n", 1)
try:
if not self.enable_mm:
self.zmq_client.send_json(task)
request_id_idx = task.get("request_id")
parts = request_id_idx.rsplit("_", 1)
if len(parts) == 1:
self._send_task(task)
else:
self.zmq_client.send_pyobj(task)
request_id = parts[0]
index = int(parts[1])
for i in range(index * n, (index + 1) * n):
child_task = copy(task)
child_task["request_id"] = f"{request_id}_{i}"
self._send_task(child_task)
except Exception as e:
api_server_logger.error(f"zmq_client send task error: {e}, {str(traceback.format_exc())}")
raise EngineError(str(e), error_code=400)

def _send_task(self, task):
if not self.enable_mm:
self.zmq_client.send_json(task)
else:
self.zmq_client.send_pyobj(task)

def valid_parameters(self, data):
"""
Validate stream options
超参数(top_p、seed、frequency_penalty、temperature、presence_penalty)的校验逻辑
前置到了ChatCompletionRequest/CompletionRequest中
"""

if data.get("n") is not None:
if data["n"] != 1:
raise ParameterError("n", "n only support 1.")

if data.get("max_tokens") is not None:
if data["max_tokens"] < 1 or data["max_tokens"] >= self.max_model_len:
raise ParameterError("max_tokens", f"max_tokens can be defined [1, {self.max_model_len}).")
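
The reworked `add_requests` above fans one incoming request out into `n` child tasks whenever its `request_id` carries an index suffix. A standalone sketch of the index mapping it implements (the id values are illustrative):

```python
from copy import copy

# Fan-out illustration: request "chatcmpl-abc_2" with n=3 maps to child
# indices 2*3 .. 2*3+2, i.e. suffixes "_6", "_7" and "_8".
task = {"request_id": "chatcmpl-abc_2", "n": 3}
n = task["n"]
base, idx = task["request_id"].rsplit("_", 1)
children = []
for i in range(int(idx) * n, (int(idx) + 1) * n):
    child = copy(task)
    child["request_id"] = f"{base}_{i}"
    children.append(child["request_id"])
print(children)  # ['chatcmpl-abc_6', 'chatcmpl-abc_7', 'chatcmpl-abc_8']
```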