Note: You might encounter issues with the current model conversion script on AMD GPUs. You can download the converted models [here](https://huggingface.co/zyzshishui0627/models).
+
Note: We implemented a dedicated AMD conversion script that forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
⚠️ If you encounter an issue where slime cannot be found, please run `pip install -e .` in the slime directory.
### Example: Qwen3-4B
We provide an example using [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B); please refer to:
-
- [Example: Qwen3-4B Model](scripts/run-qwen3-4B-amd.sh): Just run `scripts/run-qwen3-4B-amd.sh`
+
- [Example: Qwen3-4B Model](../../../scripts/run-qwen3-4B-amd.sh): Just run `scripts/run-qwen3-4B-amd.sh`
-
⚠️ TODO: The [ROCm-version torch_memory_saver](https://github.com/yushengsu-thu/torch_memory_saver.git) does not seem to clear memory properly; thus, we set `--sglang-mem-fraction-static` as `0.4` currently. We will continue investigating and focus on ROCm's virtual memory management for further modifications.
-
-
⚠️ TODO: ROCM seems to not support `apex` yet. Thus, we need to disable `--no-gradient-accumulation-fusion` currently. We will continue investigating how to enable this.
+
⚠️ TODO: ROCm does not seem to support `apex` yet, so we currently disable gradient accumulation fusion by adding the `--no-gradient-accumulation-fusion` flag to the training script. We will continue investigating how to enable it.
⚠️ Note: The main difference between ROCm's training script and NVIDIA's script is that you need to set `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` for ray to function properly on AMD GPUs.
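
As a minimal sketch of the note above, a preflight check could confirm both variables before Ray starts. The function name `check_amd_ray_env` is hypothetical and not part of slime, and the comma-separated list of integer device indices is an assumption based on the default value used in the training script.

```bash
#!/bin/bash
# Hypothetical helper (not part of slime): verify the two AMD-specific
# environment variables before launching Ray.
check_amd_ray_env() {
  # The tutorial states this variable must be set to 1.
  if [ "${RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES:-}" != "1" ]; then
    echo "RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES must be set to 1" >&2
    return 1
  fi
  # Assumption: HIP_VISIBLE_DEVICES is a comma-separated list of GPU indices,
  # e.g. "0,1,2,3". Reject empty values, non-digit characters, and stray commas.
  case "${HIP_VISIBLE_DEVICES:-}" in
    ""|*[!0-9,]*|,*|*,|*,,*)
      echo "HIP_VISIBLE_DEVICES must look like 0,1,2,3" >&2
      return 1
      ;;
  esac
  return 0
}
```

Calling `check_amd_ray_env || exit 1` near the top of a training script would stop early with a readable message instead of failing later inside Ray.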
-
- We show the training script below:
+
- We show the training script below:
```bash
#!/bin/bash
-
####clear before training
+
# Clean up leftover processes so the task can be rerun
pkill -9 sglang
sleep 3
ray stop --force
@@ -107,34 +106,35 @@ sleep 3
pkill -9 ray
pkill -9 python
+
set -euxo pipefail
-
### ROCm Support ###
-
SLIME_DIR="/home/yushensu/projects/slime" # Need to change to your own path
-
export SLIME_DIR=$SLIME_DIR
-
MODEL_DIR="/home/yushensu/projects/model" # Need to change to your own path
-
export MODEL_DIR=$MODEL_DIR
+
### AMD Support ###
+
SLIME_DIR="${SLIME_DIR:-/home/yushensu/projects/slime}" # Default path if not set in environment
+
export SLIME_DIR
+
+
MODEL_DIR="${MODEL_DIR:-/home/yushensu/projects/model}" # Default path if not set in environment
+
export MODEL_DIR
-
DATA_DIR="/home/yushensu/projects/data" # Need to change to your own path
-
export DATA_DIR=$DATA_DIR
+
DATA_DIR="${DATA_DIR:-/home/yushensu/projects/data}" # Default path if not set in environment
+
export DATA_DIR
# For AMD GPU
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=${RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES:-"1"} # Must be set to 1
export HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES:-"0,1,2,3,4,5,6,7"} # Choose which GPUs to use