19 changes: 9 additions & 10 deletions docs/features/data_parallel_service.md
@@ -15,9 +15,9 @@ The scheduling flow is shown below - users randomly request IP and port, obtain
```python
prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"我要采访一位科幻作家,创建一个包含5个问题的列表"
]

@@ -83,9 +83,9 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
```

### Parameter Description
- num-servers: Number of API servers to launch
- ports: Ports for API servers
- args: Arguments for API servers
- num-servers: Number of API servers to launch
- ports: Ports for API servers
- args: Arguments for API servers
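
For illustration, a minimal launch sketch combining these parameters (the port list format and the engine flags passed through `--args` are assumptions for this example, not values taken from the repository):

```bash
# Hypothetical example: start 2 API servers on ports 8180 and 8181,
# forwarding the remaining engine arguments to each server via --args.
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --num-servers 2 \
    --ports "8180,8181" \
    --args "--model baidu/ERNIE-4.5-300B-A47B-Paddle --tensor-parallel-size 1 --data-parallel-size 2"
```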

### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment)
@@ -94,9 +94,8 @@ Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated
For multi-machine deployment, ensure network cards support RDMA and all cluster nodes are interconnected.

**Note**:
* `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine, multiple cards should be separated by commas.
* The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where <device> can be `cpu` or `gpu`.

- `KVCACHE_RDMA_NICS` specifies RDMA network cards for the current machine, multiple cards should be separated by commas.
- The repository provides an automatic RDMA network card detection script `bash scripts/get_rdma_nics.sh <device>`, where <device> can be `cpu` or `gpu`.
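
A minimal sketch of how these two notes combine in practice, assuming the detection script prints a comma-separated NIC list suitable for `KVCACHE_RDMA_NICS`:

```bash
# Detect the RDMA NICs visible to the GPUs on this node and export them
# before launching the prefill/decode instances (output format is assumed).
export KVCACHE_RDMA_NICS=$(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS=${KVCACHE_RDMA_NICS}"   # e.g. mlx5_0,mlx5_1
```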

**Prefill Instance**
```bash
@@ -148,4 +147,4 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-ttl 9000
--scheduler-topic "test" \
--splitwise-role "decode"
```
```
6 changes: 3 additions & 3 deletions docs/features/disaggregated.md
@@ -73,10 +73,10 @@ Refer to the example code `offline_disaggregated_demo.py` in the `fastdeploy/dem

#### Prerequisite: Redis

> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> **⚠️ NOTE**
> **Redis requirement: version 6.2.0 or higher**
> Versions below this may not support the required commands.
>
>
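
A quick way to confirm the running server meets this requirement, using standard Redis tooling:

```bash
# Prints the server version; it should report 6.2.0 or newer.
redis-cli INFO server | grep redis_version
```
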
* Installation via `conda`

```bash
114 changes: 57 additions & 57 deletions docs/features/multi-node_deployment.md
@@ -1,71 +1,71 @@
# Multi-Node Deployment

## Overview
## Overview
Multi-node deployment addresses scenarios where a single machine's GPU memory is insufficient to support deployment of large models by enabling tensor parallelism across multiple machines.

## Environment Preparation
#### Network Requirements
1. All nodes must be within the same local network
2. Ensure bidirectional connectivity between all nodes (test using `ping` and `nc -zv`)
## Environment Preparation
### Network Requirements
1. All nodes must be within the same local network
2. Ensure bidirectional connectivity between all nodes (test using `ping` and `nc -zv`)

#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)
#### Software Requirements
1. Install the same version of FastDeploy on all nodes
2. [Recommended] Install and configure MPI (OpenMPI or MPICH)

## Tensor Parallel Deployment
## Tensor Parallel Deployment

### Recommended Launch Method
We recommend using mpirun for one-command startup without manually starting each node.
### Recommended Launch Method
We recommend using mpirun for one-command startup without manually starting each node.

### Usage Instructions
1. Execute the same command on all machines
2. The IP order in the `ips` parameter determines the node startup sequence
3. The first IP will be designated as the master node
4. Ensure all nodes can resolve each other's hostnames
### Usage Instructions
1. Execute the same command on all machines
2. The IP order in the `ips` parameter determines the node startup sequence
3. The first IP will be designated as the master node
4. Ensure all nodes can resolve each other's hostnames

* Online inference startup example:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```
* Online inference startup example:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-300B-A47B-Paddle \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--max-model-len 32768 \
--max-num-seqs 32 \
--tensor-parallel-size 16 \
--ips 192.168.1.101,192.168.1.102
```

* Offline startup example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"

sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```
* Offline startup example:
```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

* Notes:
- Only the master node can receive completion requests
- Always send requests to the master node (the first IP in the ips list)
- The master node will distribute workloads across all nodes
model_name_or_path = "baidu/ERNIE-4.5-300B-A47B-Paddle"

### Parameter Description
sampling_params = SamplingParams(temperature=0.1, max_tokens=30)
llm = LLM(model=model_name_or_path, tensor_parallel_size=16, ips="192.168.1.101,192.168.1.102")
if llm._check_master():
output = llm.generate(prompts="Who are you?", use_tqdm=True, sampling_params=sampling_params)
print(output)
```

#### `ips` Parameter
- **Type**: `string`
- **Format**: Comma-separated IPv4 addresses
- **Description**: Specifies the IP addresses of all nodes in the deployment group
- **Required**: Only for multi-node deployments
- **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`
* Notes:
* Only the master node can receive completion requests
* Always send requests to the master node (the first IP in the ips list)
* The master node will distribute workloads across all nodes

#### `tensor_parallel_size` Parameter
- **Type**: `integer`
- **Description**: Total number of GPUs across all nodes
- **Required**: Yes
- **Example**: For 2 nodes with 8 GPUs each, set to 16
### Parameter Description

#### `ips` Parameter
* **Type**: `string`
* **Format**: Comma-separated IPv4 addresses
* **Description**: Specifies the IP addresses of all nodes in the deployment group
* **Required**: Only for multi-node deployments
* **Example**: `"192.168.1.101,192.168.1.102,192.168.1.103"`

#### `tensor_parallel_size` Parameter
* **Type**: `integer`
* **Description**: Total number of GPUs across all nodes
* **Required**: Yes
* **Example**: For 2 nodes with 8 GPUs each, set to 16
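
Since only the master node accepts completion requests (see the notes above), here is a request sketch against the online example's master node, assuming the server exposes the OpenAI-compatible chat endpoint on the `--port 8180` used earlier:

```bash
# Send the request to the first IP in --ips (the master node), port 8180.
curl -s http://192.168.1.101:8180/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "baidu/ERNIE-4.5-300B-A47B-Paddle",
         "messages": [{"role": "user", "content": "Who are you?"}]}'
```
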
16 changes: 5 additions & 11 deletions docs/zh/features/data_parallel_service.md
@@ -12,15 +12,14 @@ FastDeploy 提供了splitwise scheduler,可以感知各个DP的负载状态,
具体调度流程如下图,用户随机请求ip 与端口,通过redis获取负载状态,将数据分发到负载较低的DP进行推理。
![数据调度架构图](./images/scheduler_img.png)


#### 离线推理
```python

prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"我要采访一位科幻作家,创建一个包含5个问题的列表"
]

@@ -65,11 +64,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-ttl 9000
```


### 用户自行调度
FastDeploy 提供了multi_api_server,用户可以拉起多个api server,用户自行选择dp 进行请求,在该种情况下用户可以自行添加负载均衡模型进行调度。(目前该种方式只支持在线推理)


#### 在线推理

![数据调度架构图](./images/no_scheduler_img.png)
@@ -95,8 +92,6 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
- ports: 指定拉起的api server 的端口
- args: 指定拉起的api server 的参数



### 数据并行 + 分离式部署

具体可以参考[分离式部署](disaggregated.md#多机分离式部署)
@@ -106,8 +101,8 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
多机部署时需要确认当前网卡是否支持RDMA,并且需要集群中所有节点网络互通。

**注意**:
* `KVCACHE_RDMA_NICS` 指定当前机器的RDMA网卡,多个网卡用逗号隔开。
* 仓库中提供了自动检测RDMA网卡的脚本 `bash scripts/get_rdma_nics.sh <device>`, 其中 <device> 可以是 `cpu` 或 `gpu`。
- `KVCACHE_RDMA_NICS` 指定当前机器的RDMA网卡,多个网卡用逗号隔开。
- 仓库中提供了自动检测RDMA网卡的脚本 `bash scripts/get_rdma_nics.sh <device>`, 其中 <device> 可以是 `cpu` 或 `gpu`。

**prefill 实例**

@@ -163,4 +158,3 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-topic "test" \
--splitwise-role "decode"
```

4 changes: 2 additions & 2 deletions docs/zh/features/disaggregated.md
@@ -75,8 +75,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
#### 前置依赖 Redis
* 使用`conda`安装

> **⚠️ 注意**
> **Redis 版本要求:6.2.0 及以上**
> **⚠️ 注意**
> **Redis 版本要求:6.2.0 及以上**
> 低于此版本可能不支持所需的命令。

```bash
28 changes: 13 additions & 15 deletions docs/zh/features/multi-node_deployment.md
@@ -4,11 +4,10 @@
多节点部署旨在解决单个机器GPU显存不足时,支持跨多台机器的张量并行执行。

## 环境准备
#### 网络要求
### 网络要求
1. 所有节点必须在同一本地网络中
2. 确保所有节点之间双向连通(可使用`ping`和`nc -zv`测试)


#### 软件要求
1. 所有节点安装相同版本的FastDeploy
2. [建议安装]安装并配置MPI(OpenMPI或MPICH)
@@ -52,22 +51,21 @@
```

* 注意:
- 只有主节点可以接收完成请求
- 请始终将请求发送到主节点(ips列表中的第一个IP)
- 主节点将在所有节点间分配工作负载
* 只有主节点可以接收完成请求
* 请始终将请求发送到主节点(ips列表中的第一个IP)
* 主节点将在所有节点间分配工作负载

### 参数说明

#### `ips`参数
- **类型**: `字符串`
- **格式**: 逗号分隔的IPv4地址
- **描述**: 指定部署组中所有节点的IP地址
- **必填**: 仅多节点部署时需要
- **示例**: `"192.168.1.101,192.168.1.102,192.168.1.103"`
* **类型**: `字符串`
* **格式**: 逗号分隔的IPv4地址
* **描述**: 指定部署组中所有节点的IP地址
* **必填**: 仅多节点部署时需要
* **示例**: `"192.168.1.101,192.168.1.102,192.168.1.103"`

#### `tensor_parallel_size`参数
- **类型**: `整数`
- **描述**: 所有节点上的GPU总数
- **必填**: 是
- **示例**: 对于2个节点各8个GPU,设置为16

* **类型**: `整数`
* **描述**: 所有节点上的GPU总数
* **必填**: 是
* **示例**: 对于2个节点各8个GPU,设置为16
25 changes: 18 additions & 7 deletions fastdeploy/entrypoints/engine_client.py
@@ -19,6 +19,7 @@
import time
import traceback
import uuid
from copy import copy

import numpy as np

@@ -210,26 +211,36 @@ async def add_requests(self, task):

self.valid_parameters(task)
api_server_logger.debug(f"Receive task: {task}")
n = task.get("n", 1)
try:
if not self.enable_mm:
self.zmq_client.send_json(task)
request_id_idx = task.get("request_id")
parts = request_id_idx.rsplit("_", 1)
if len(parts) == 1:
self._send_task(task)
else:
self.zmq_client.send_pyobj(task)
request_id = parts[0]
index = int(parts[1])
for i in range(index * n, (index + 1) * n):
child_task = copy(task)
child_task["request_id"] = f"{request_id}_{i}"
self._send_task(child_task)
except Exception as e:
api_server_logger.error(f"zmq_client send task error: {e}, {str(traceback.format_exc())}")
raise EngineError(str(e), error_code=400)

def _send_task(self, task):
if not self.enable_mm:
self.zmq_client.send_json(task)
else:
self.zmq_client.send_pyobj(task)

def valid_parameters(self, data):
"""
Validate stream options
超参数(top_p、seed、frequency_penalty、temperature、presence_penalty)的校验逻辑
前置到了ChatCompletionRequest/CompletionRequest中
"""

if data.get("n") is not None:
if data["n"] != 1:
raise ParameterError("n", "n only support 1.")

if data.get("max_tokens") is not None:
if data["max_tokens"] < 1 or data["max_tokens"] >= self.max_model_len:
raise ParameterError("max_tokens", f"max_tokens can be defined [1, {self.max_model_len}).")
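
The reworked `add_requests` above fans one incoming request out into `n` child tasks whenever its `request_id` carries an index suffix. A standalone sketch of the index mapping it implements (the id values are illustrative):

```python
from copy import copy

# Fan-out illustration: request "chatcmpl-abc_2" with n=3 maps to child
# indices 2*3 .. 2*3+2, i.e. suffixes "_6", "_7" and "_8".
task = {"request_id": "chatcmpl-abc_2", "n": 3}
n = task["n"]
base, idx = task["request_id"].rsplit("_", 1)
children = []
for i in range(int(idx) * n, (int(idx) + 1) * n):
    child = copy(task)
    child["request_id"] = f"{base}_{i}"
    children.append(child["request_id"])
print(children)  # ['chatcmpl-abc_6', 'chatcmpl-abc_7', 'chatcmpl-abc_8']
```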