Commit 44212d7

Update README.md (ROCm#309)

* Update README.md: update the model names and the BKC version
* Update README.md: fix a spelling error; add Llama 3.3 support under the "What is New" section

1 parent 8663822 commit 44212d7

File tree: 1 file changed (+21, -23 lines)

docs/dev-docker/README.md: 21 additions and 23 deletions
```diff
@@ -22,7 +22,7 @@ The performance data below was measured on a server with MI300X accelerators wit
 
 | System | MI300X with 8 GPUs |
 |---|---|
-| BKC | 24.11 |
+| BKC | 24.13 |
 | ROCm | version ROCm 6.2.2 |
 | amdgpu | build 2009461 |
 | OS | Ubuntu 22.04 |
```
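
For readers checking their own host against this table, the listed components can be inspected with standard ROCm and Ubuntu tooling. The commands below are a minimal sketch under the assumption of a default ROCm package install; they are not part of the README.

```bash
# Sketch: confirm the host roughly matches the configuration table above
# (assumes a default ROCm install on Ubuntu; paths may differ on other setups).
cat /opt/rocm/.info/version      # installed ROCm version, expected to report 6.2.x
dpkg -l | grep -i amdgpu         # amdgpu driver packages and their build numbers
lsb_release -ds                  # OS release string, e.g. "Ubuntu 22.04.x LTS"
rocm-smi --showhw                # lists the visible GPUs; expect 8 MI300X entries
```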
```diff
@@ -41,12 +41,13 @@ The performance data below was measured on a server with MI300X accelerators wit
 
 ## Pull latest
 
-You can pull the image with `docker pull rocm/vllm-dev:20241114-tuned`
+You can pull the image with `docker pull rocm/vllm-dev:main`
 
 ### What is New
 
 - MoE optimizations for Mixtral 8x22B, FP16
 - Llama 3.2 stability improvements
+- Llama 3.3 support
 
 
 Gemms are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md)
```
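
For anyone who wants to re-tune rather than rely on the configurations shipped in the image, the linked Tunable Ops README documents the controlling environment variables. The snippet below is only an illustration of those knobs; the output path is a placeholder, not something defined by this commit.

```bash
# Sketch of PyTorch TunableOp controls (see the linked README for the full list).
export PYTORCH_TUNABLEOP_ENABLED=1                    # turn TunableOp on
export PYTORCH_TUNABLEOP_TUNING=1                     # 1 = tune missing GEMMs, 0 = only reuse stored results
export PYTORCH_TUNABLEOP_FILENAME=/tmp/tunableop.csv  # placeholder path for the tuning results file
```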
```diff
@@ -58,9 +59,9 @@ The gemms are automatically enabled in the docker image, and all stored gemm co
 
 To make it easier to run fp8 Llama 3.1 models on MI300X, the quantized checkpoints are available on AMD Huggingface space as follows
 
-- https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-FP8-KV
-- https://huggingface.co/amd/Meta-Llama-3.1-70B-Instruct-FP8-KV
-- https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
 - https://huggingface.co/amd/grok-1-FP8-KV
 
 Currently these models are private. Please join https://huggingface.co/amd to access.
```
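
Once membership in the AMD organization is approved, the gated FP8-KV checkpoints can be pulled with the Hugging Face CLI. The commands below are a sketch; the token variable and the local directory are placeholders, not values from this commit.

```bash
# Sketch: download one of the gated AMD FP8-KV checkpoints after access is granted.
huggingface-cli login --token "$HF_TOKEN"   # a read token whose account has been granted access
huggingface-cli download amd/Llama-3.1-70B-Instruct-FP8-KV \
  --local-dir /data/llm/Llama-3.1-70B-Instruct-FP8-KV
```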
```diff
@@ -72,7 +73,7 @@ These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For
 ### Quantize your own models
 This step is optional for you to use quantized models on your own. Take Llama 3.1 405B as an example.
 
-Download the Model View the Meta-Llama-3.1-405B model at https://huggingface.co/meta-llama/Meta-Llama-3.1-405B. Ensure that you have been granted access, and apply for it if you do not have access.
+Download the Model View the Llama-3.1-405B model at https://huggingface.co/meta-llama/Llama-3.1-405B. Ensure that you have been granted access, and apply for it if you do not have access.
 
 If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token.
 
```
```diff
@@ -92,18 +93,18 @@ Create the directory for Llama 3.1 models (if it doesn't already exist)
 
 Download the model
 
-huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Meta-Llama-3.1-405B-Instruct
+huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Llama-3.1-405B-Instruct
 
-Similarly, you can download Meta-Llama-3.1-70B and Meta-Llama-3.1-8B.
+Similarly, you can download Llama-3.1-70B and Llama-3.1-8B.
 
 [Download and install Quark](https://quark.docs.amd.com/latest/install.html)
 
 Run the quantization script in the example folder using the following command line:
-export MODEL_DIR = [local model checkpoint folder] or meta-llama/Meta-Llama-3.1-405B-Instruct
+export MODEL_DIR = [local model checkpoint folder] or meta-llama/Llama-3.1-405B-Instruct
 #### single GPU
 python3 quantize_quark.py \
 --model_dir $MODEL_DIR \
---output_dir Meta-Llama-3.1-405B-Instruct-FP8-KV \
+--output_dir Llama-3.1-405B-Instruct-FP8-KV \
 --quant_scheme w_fp8_a_fp8 \
 --kv_cache_dtype fp8 \
 --num_calib_data 128 \
```
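
One caveat when copying the export line above: a POSIX shell does not allow spaces around the `=`, and the bracketed text is a placeholder to be substituted. A hedged rendering of how that line would actually be typed:

```bash
# Either point MODEL_DIR at the local checkpoint folder downloaded earlier...
export MODEL_DIR=/data/llama-3.1/Llama-3.1-405B-Instruct
# ...or at the Hugging Face repo id directly.
export MODEL_DIR=meta-llama/Llama-3.1-405B-Instruct
```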
```diff
@@ -113,7 +114,7 @@ export MODEL_DIR = [local model checkpoint folder] or meta-llama/Meta-Llama-3.1-
 #### If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
 --model_dir $MODEL_DIR \
---output_dir Meta-Llama-3.1-405B-Instruct-FP8-KV \
+--output_dir Llama-3.1-405B-Instruct-FP8-KV \
 --quant_scheme w_fp8_a_fp8 \
 --kv_cache_dtype fp8 \
 --num_calib_data 128 \
```
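
The hunk does not show how the multi-GPU run selects its devices (any additional flags fall outside the changed lines). One common approach on ROCm, offered purely as an assumption rather than something stated in this README, is to restrict the visible GPUs through the environment before launching quantize_quark.py with the flags shown above:

```bash
# Sketch (assumption, not from the README): limit which accelerators the
# multi-GPU quantization run can see; ROCm's analogue of CUDA_VISIBLE_DEVICES.
export HIP_VISIBLE_DEVICES=0,1,2,3
```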
```diff
@@ -131,7 +132,7 @@ Download and launch the docker,
 --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
 --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
 -v /data/llama-3.1:/data/llm \
-docker pull rocm/vllm-dev:20241114-tuned
+rocm/vllm-dev:main
 
 ### Benchmark with AMD vLLM Docker
 
```
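
The hunk shows only the tail of the launch command, so the flags before `--cap-add` are not visible here. As a sketch of how the pieces might fit together, assuming a typical `docker run` preamble for ROCm images (everything before `--cap-add` is an assumption; the remaining flags and the image tag come from the hunk above):

```bash
# Sketch: one plausible full launch command. The first two flag lines are
# assumed; the rest is taken from the diff above.
docker run -it --rm \
  --network=host --ipc=host --group-add video \
  --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
  --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
  -v /data/llama-3.1:/data/llm \
  rocm/vllm-dev:main
```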

```diff
@@ -176,7 +177,7 @@ Below is a list of options which are useful:
 - **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance.
 - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. It's recommended to set this to 0.99 to increase KV cache space.
 
-Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments. However, vLLM's benchmark_latency and benchmark_throughput command lines may not include all of these flags as command line arguments. In that case, it might be necessary to add these parameters to the LLMEngine instance constructor inside the benchmark script.
+Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments.
 
 ##### Online Gemm Tuning
 Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8
```
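
Since the note now points only at `vllm serve`, a short example of passing the two options discussed above on the server command line may be useful. The model path, parallelism, and context length below are illustrative placeholders; the flag names are the ones described in the list.

```bash
# Sketch: the two options above passed to vllm serve (values are placeholders).
vllm serve /data/llm/Llama-3.1-70B-Instruct-FP8-KV \
  --tensor-parallel-size 8 \
  --max-seq-len-to-capture 131072 \
  --gpu-memory-utilization 0.99
```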
```diff
@@ -268,16 +269,18 @@ If you want to run Meta-Llama-3.1-405B FP16, please run
 python /app/vllm/benchmarks/benchmark_throughput.py \
 --model /data/llm/Meta-Llama-3.1-405B-Instruct \
 --dtype float16 \
---gpu-memory-utilization 0.99 \
+--gpu-memory-utilization 0.9 \
 --num-prompts 2000 \
 --distributed-executor-backend mp \
 --num-scheduler-steps 10 \
 --tensor-parallel-size 8 \
 --input-len 128 \
 --output-len 128 \
---swapspace 16 \
---max-model-length 8192 \
+--swap-space 16 \
+--max-model-len 8192 \
 --max-num-batched-tokens 65536 \
+--swap-space
+--max-model-len
 --gpu-memory-utilization 0.99
 
 For fp8 quantized Llama3.18B/70B models:
```
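
Read as a single invocation, the post-change text still carries a duplicated `--gpu-memory-utilization` and two bare `--swap-space`/`--max-model-len` lines without values. A hedged reconstruction of the command as it appears intended, keeping one value per flag and the final `--gpu-memory-utilization`, would be:

```bash
# Sketch: the corrected FP16 throughput benchmark as one runnable command,
# assuming the valueless duplicate flag lines are leftovers.
python /app/vllm/benchmarks/benchmark_throughput.py \
  --model /data/llm/Meta-Llama-3.1-405B-Instruct \
  --dtype float16 \
  --num-prompts 2000 \
  --distributed-executor-backend mp \
  --num-scheduler-steps 10 \
  --tensor-parallel-size 8 \
  --input-len 128 \
  --output-len 128 \
  --swap-space 16 \
  --max-model-len 8192 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.99
```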
````diff
@@ -410,19 +413,14 @@ Please refer to the MLPerf instructions for recreating the MLPerf numbers.
 
 Updated:
 
-vLLM: https://github.com/ROCm/vllm/commit/5362727ec366c1542b2be7a520e7c44e5cc3ce30
+vLLM: https://github.com/ROCm/vllm/commit/2c60adc83981ada77a77b2adda78ef109d2e2e2b
 ### Docker Manifest
 
 To reproduce the release docker:
 
 ```
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout 5362727ec366c1542b2be7a520e7c44e5cc3ce30
+git checkout 2c60adc83981ada77a77b2adda78ef109d2e2e2b
 docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
 ```
-
-For details on all the dependencies, please refer to: https://github.com/ROCm/vllm/blob/5362727ec366c1542b2be7a520e7c44e5cc3ce30/Dockerfile.rocm
-
-
-
````
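
After the build completes, a quick smoke test of the rebuilt image is a reasonable follow-up. The check below is an illustration and not part of the README; `<your_tag>` is whatever tag was passed to `docker build`.

```bash
# Sketch: confirm the rebuilt image can import vLLM and report its version.
docker run --rm --device=/dev/kfd --device=/dev/dri <your_tag> \
  python3 -c "import vllm; print(vllm.__version__)"
```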
