# Quickstart: Single Node Deployment Guide

## Environment Preparation
1. Purchase a machine and install GPU drivers at the same time
2. Connect remotely to the GPU instance and access the machine terminal
3. Install the Docker environment and NVIDIA Container Toolkit
```shell
# Install Docker and the NVIDIA Container Toolkit in one step.
# Script source: https://github.com/alibaba/ROLL/blob/main/examples/quick_start/install_docker_nvidia_container_toolkit.sh
curl -fsSL https://your-domain.com/install_docker_aliyun.sh | sudo bash
```


## Environment Configuration
```shell
# 1. Pull Docker image
sudo docker pull <image_address>
# torch2.5.1 + SGlang0.4.3: roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-sglang043
# torch2.5.1 + vLLM0.7.3: roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073
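# Example (the choice of the vLLM build and the IMAGE variable name are ours,
# not prescribed by the guide): pin one of the images above in a variable so
# later commands can reuse it instead of a raw <image_id>
IMAGE=roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073
echo "$IMAGE"   # → roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073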
# 2. Start a Docker container with GPU support, expose the port, and keep the container running
sudo docker images
sudo docker run -dit \
--gpus all \
-p 9001:22 \
--ipc=host \
--shm-size=10gb \
<image_id> \
/bin/bash

nvidia-smi
apt update && apt install git -y
git clone https://github.com/alibaba/ROLL.git

# If GitHub is not accessible, download the zip file directly and unzip it
wget https://github.com/alibaba/ROLL/archive/refs/heads/main.zip
unzip main.zip

pip install -r requirements_torch260_sglang.txt -i https://mirrors.aliyun.com/py

## Pipeline Execution
```shell
# If you encounter "ModuleNotFoundError: No module named 'roll'", set the PYTHONPATH environment variable first
export PYTHONPATH="/workspace/ROLL-main:$PYTHONPATH"
bash examples/quick_start/run_agentic_pipeline_frozen_lake_single_node_demo.sh
```
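The export above fixes the import error by putting the repo root on Python's module search path. A minimal sanity check, assuming the default unzip location `/workspace/ROLL-main` used earlier in this guide:

```shell
# PYTHONPATH entries are prepended to sys.path by the interpreter at startup
export PYTHONPATH="/workspace/ROLL-main:$PYTHONPATH"
python3 -c "import sys; print('/workspace/ROLL-main' in sys.path)"   # → True
```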

Example Log Screenshots during Pipeline Execution:
![log1](../../../static/img/log_1.png)

![log2](../../../static/img/log_2.png)

![log3](../../../static/img/log_3.png)

## Reference: V100 Single-GPU Memory Configuration Optimization
```yaml
# Reduce the expected number of GPUs from 8 to the single V100 you actually have
num_gpus_per_node: 1
val_env_manager.tags: [SimpleSokoban, FrozenLake]
# Reduce the total number of training steps for quicker full pipeline runs, useful for rapid debugging.
max_steps: 100
```
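The overrides above can also be applied non-interactively instead of editing the YAML by hand. A sketch using `sed`, run here against a stand-in file that this demo writes itself (in a real checkout you would point it at `examples/qwen2.5-0.5B-agentic_ds/agent_val_frozen_lake.yaml`, and each key is assumed to appear once at the start of a line):

```shell
# Write a tiny stand-in config so the demo is self-contained
cfg=/tmp/agent_val_frozen_lake_demo.yaml
printf 'num_gpus_per_node: 8\nmax_steps: 1000\n' > "$cfg"

# Apply the single-GPU overrides in place
sed -i 's/^num_gpus_per_node:.*/num_gpus_per_node: 1/' "$cfg"
sed -i 's/^max_steps:.*/max_steps: 100/' "$cfg"

cat "$cfg"
# → num_gpus_per_node: 1
# → max_steps: 100
```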

# Quickstart: Single Node Deployment Guide

## Environment Preparation
1. Purchase a machine and install GPU drivers at the same time
2. Connect remotely to the GPU instance and access the machine terminal
3. Install the Docker environment and NVIDIA Container Toolkit
```shell
# Install Docker and the NVIDIA Container Toolkit in one step.
# Script source: https://github.com/alibaba/ROLL/blob/main/examples/quick_start/install_docker_nvidia_container_toolkit.sh
curl -fsSL https://your-domain.com/install_docker_aliyun.sh | sudo bash
```


## Environment Configuration
```shell
# 1. Pull the Docker image
sudo docker pull <image_address>
# torch2.5.1 + SGlang0.4.3: roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-sglang043
# torch2.5.1 + vLLM0.7.3: roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073
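# Example (the vLLM build and the ROLL_IMAGE variable name are our assumptions):
# keep the chosen image in a variable so later commands can reuse it
ROLL_IMAGE=roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073
echo "$ROLL_IMAGE"   # → roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch251-vllm073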

# 2. Start a Docker container with GPU support, expose the container port, and keep the container running
sudo docker images
sudo docker run -dit \
--gpus all \
-p 9001:22 \
--ipc=host \
--shm-size=10gb \
<image_id> \
/bin/bash

pip install -r requirements_torch260_sglang.txt -i https://mirrors.aliyun.com/py

## Pipeline Execution
```shell
# If you encounter "ModuleNotFoundError: No module named 'roll'", set the PYTHONPATH environment variable first
export PYTHONPATH="/workspace/ROLL-main:$PYTHONPATH"
bash examples/quick_start/run_agentic_pipeline_frozen_lake_single_node_demo.sh
```

Example log screenshots during pipeline execution:
![log1](../../../static/img/log_1.png)

![log2](../../../static/img/log_2.png)

![log3](../../../static/img/log_3.png)
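The PYTHONPATH export used above works because the interpreter prepends each PYTHONPATH entry to `sys.path`. A quick confirmation, assuming the default unzip location `/workspace/ROLL-main`:

```shell
export PYTHONPATH="/workspace/ROLL-main:$PYTHONPATH"
python3 -c "import sys; print('/workspace/ROLL-main' in sys.path)"   # → True
```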

## Reference: V100 Single-GPU Memory Configuration Optimization
```yaml
# Reduce the expected number of GPUs from 8 to the single V100 you actually have
num_gpus_per_node: 1
val_env_manager.tags: [SimpleSokoban, FrozenLake]
max_steps: 100
```
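These overrides can be scripted rather than edited by hand. A `sed` sketch on a stand-in file written by the demo itself (in a real checkout, point it at `examples/qwen2.5-0.5B-agentic_ds/agent_val_frozen_lake.yaml`; each key is assumed to start a line):

```shell
# Self-contained demo config
cfg=/tmp/agent_val_frozen_lake_cn_demo.yaml
printf 'num_gpus_per_node: 8\nmax_steps: 1000\n' > "$cfg"

# Single-GPU overrides
sed -i 's/^num_gpus_per_node:.*/num_gpus_per_node: 1/' "$cfg"
sed -i 's/^max_steps:.*/max_steps: 100/' "$cfg"

cat "$cfg"
# → num_gpus_per_node: 1
# → max_steps: 100
```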
