Commit b75a79e

Merge branch 'meta-llama:main' into set-numpy-seed-in-finetuning

2 parents f521c93 + 4e6e7e4
15 files changed: +2026 −2 lines

.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 17 additions & 0 deletions
```diff
@@ -1466,3 +1466,20 @@ OCRVQA
 OCRVQADataCollator
 ocrvqa
 langchain
+GiB
+Terraform
+gb
+TPOT
+ctrl
+finetunes
+llmcompressor
+prefill
+qps
+terraform
+tf
+tmux
+tpot
+ttft
+uv
+8xL40S
+xL
```
Lines changed: 11 additions & 0 deletions
Below are recipes for deploying common Llama workflows on [Crusoe's](https://crusoe.ai) high-performance, sustainable cloud. Each workflow corresponds to a subfolder with its own README and supplemental materials. Please refer to the table below for hardware requirements.

| Workflow | Model(s) | VM type | Storage |
| :----: | :----: | :----: | :----: |
| [Serving Llama3.1 in FP8 with vLLM](vllm-fp8/) | [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct), [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | l40s-48gb.8x | 256 GiB Persistent Disk |

# Requirements

First, ensure that you have a Crusoe account (you can sign up [here](https://console.crusoecloud.com/)). We will provision resources using Terraform; please ensure that your environment is configured, and refer to the Crusoe [docs](https://github.com/crusoecloud/terraform-provider-crusoe?tab=readme-ov-file#getting-started) for guidance.

# Serving Models

Some recipes in this repo require firewall rules to expose ports so that the inference server can be reached. To manage firewall rules, please refer to our [networking documentation](https://docs.crusoecloud.com/networking/firewall-rules/managing-firewall-rules).
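For example, once a rule allows the port your inference server listens on (the vLLM recipe below serves on port 8000), you can check reachability from your local machine with a plain HTTP request. A minimal sketch, with the VM's public IP as a placeholder:

```bash
# replace <VM_PUBLIC_IP> with the address output by `terraform apply`
curl -s http://<VM_PUBLIC_IP>:8000/v1/models
```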
Lines changed: 85 additions & 0 deletions
In this article, we will show how to benchmark FP8 models on L40S GPUs using the vLLM inference engine. By the end, you should understand how to use `llm-compressor` to quantize existing higher-precision Llama3 finetunes to FP8, how to benchmark throughput and latency to compare performance, and how to serve models using `vllm`.
# Provisioning Resources

First, navigate to this repository from your local machine. Update the corresponding variables in `locals` inside `main.tf` to match your environment (e.g. the path to your SSH key), then initialize the Terraform project with `terraform init` and provision resources with `terraform apply`. Note that this will create a VM equipped with 8xL40S GPUs and a 256 GiB persistent disk. After the VM has been created, Terraform will output its public IP address.
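For reference, the provisioning step boils down to two commands, run from your local machine (`terraform apply` shows a plan and asks for confirmation before creating anything):

```bash
# run from the directory containing main.tf (this recipe's folder) on your local machine
terraform init    # download the Crusoe provider and initialize local state
terraform apply   # review the plan, confirm, and wait for the VM and disk to be created
```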
## Mount Storage

`ssh` into your VM. Then, run the commands below to format the attached disk and mount it at `/scratch`.

```bash
mkfs.ext4 /dev/vdb                # format the persistent disk (this erases any existing data on it)
mkdir /scratch                    # create the mount point
mount -t ext4 /dev/vdb /scratch   # mount the disk
cd /scratch
```
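Optionally, if you want the mount to survive a reboot, you can record it in `/etc/fstab` as well (a sketch, assuming the disk stays at `/dev/vdb`):

```bash
echo "/dev/vdb /scratch ext4 defaults 0 2" >> /etc/fstab
```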
# Install Dependencies

We'll use [uv](https://github.com/astral-sh/uv) to manage dependencies. First, install `curl` and `tmux`, then uv itself:

```bash
apt-get update && apt-get install -y curl          # curl fetches the uv installer
apt-get install -y tmux                            # tmux gives us detachable sessions
curl -LsSf https://astral.sh/uv/install.sh | sh    # install uv
source $HOME/.cargo/env                            # put uv on the PATH for the current shell
```
23+
24+
Now, clone the recipes and navigate to this tutorial. Initialize the virtual environment and install dependencies:
25+
```bash
26+
git clone https://github.com/meta-llama/llama-recipes.git
27+
cd llama-recipes/recipes/3p_integrations/crusoe/vllm-fp8/
28+
uv add vllm setuptools
29+
```
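As an optional sanity check, you can confirm that vLLM is importable inside the project environment:

```bash
uv run python -c "import vllm; print(vllm.__version__)"
```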
# Run Benchmarks

Before starting the vLLM server, we'll configure Hugging Face to cache models on our shared disk, specify the model tag, and set the tensor parallelism degree to 1.

```bash
export HF_HOME=/scratch/                                         # cache model downloads on the persistent disk
export MODEL=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic  # model to serve
export TP_SIZE=1                                                 # tensor parallel degree (number of GPUs)
```
Now, we'll use tmux to run our server inside a detachable session.

```bash
tmux new -s server      # start a tmux session named "server"
uv run vllm serve $MODEL --enable-chunked-prefill --disable-log-requests --tensor-parallel-size $TP_SIZE
```

vLLM will download the model from Hugging Face and serve it on port 8000. Now, detach from the tmux session (`ctrl+b` then `d`) and we'll simulate a client.
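Before launching the benchmark, you can optionally confirm the server is responding. A minimal sketch against the OpenAI-compatible chat endpoint that `vllm serve` exposes (run it from any shell on the VM once the model has finished loading):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"model\": \"$MODEL\",
        \"messages\": [{\"role\": \"user\", \"content\": \"Say hello in one sentence.\"}],
        \"max_tokens\": 64
      }"
```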
```bash
tmux new -s client
chmod +x run_benchmark.sh
./run_benchmark.sh
```
Let's inspect the benchmark script to see what's going on.

```bash
TOTAL_SECONDS=120
QPS_RATES=("1" "3" "5" "7" "9")

for QPS in ${QPS_RATES[@]}; do
    NUM_PROMPTS=$((TOTAL_SECONDS * QPS))
    echo "===== RUNNING NUM_PROMPTS = $NUM_PROMPTS QPS = $QPS ====="

    uv run benchmarks/benchmark_serving.py \
        --model $MODEL \
        --dataset-name sonnet --sonnet-input-len 550 --sonnet-output-len 150 --dataset-path benchmarks/sonnet.txt \
        --num-prompts $NUM_PROMPTS --request-rate $QPS --save-result
done
```
This is a convenience wrapper that re-runs vLLM's `benchmarks/benchmark_serving.py` with the queries-per-second (QPS) rate gradually increasing from 1 to 9 and saves the results. After each run completes, a JSON file with inference statistics will appear in the same directory.
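For a quick look at the saved metrics, something like the following works (a sketch; the exact field names in the result JSON can vary between vLLM versions):

```bash
apt-get install -y jq
for f in *.json; do
  echo "== $f =="
  jq '{request_throughput, mean_ttft_ms, mean_tpot_ms}' "$f"
done
```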
# Results

We repeated the above benchmark across the fp8 and fp16 versions of both Llama3.1 8B and 70B.

![TPOT vs QPS](assets/tpot_vs_qps_chart.png "TPOT vs QPS")

The chart above compares time-per-output-token (TPOT) across different QPS rates. The fp16 70B model runs across 8 GPUs, while the fp8 version uses only 4 yet maintains the same TPOT range. The 8B models each run on a single GPU, with fp8 noticeably faster.

![TTFT vs QPS](assets/ttft_vs_qps_chart.png "TTFT vs QPS")

Looking at time-to-first-token (TTFT), we observe the same trends. Even though the fp8 70B model runs on half as many GPUs, its TTFT is roughly the same as the fp16 version on 8.
# Converting Llama3 models to FP8

If you wish to convert your existing finetunes to FP8, you can do so with [llmcompressor](https://github.com/vllm-project/llm-compressor).

```bash
uv add llmcompressor
uv run convert_hf_to_fp8.py NousResearch/Hermes-3-Llama-3.1-70B
```

To use the converted model, update `$MODEL` to the absolute path of the converted version, then rerun `uv run vllm serve $MODEL --enable-chunked-prefill --disable-log-requests --tensor-parallel-size $TP_SIZE`. Now we have a vLLM server running our converted finetune and can rerun the previous benchmarks to verify performance.
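For example (the output path below is hypothetical; point `$MODEL` at whatever directory `convert_hf_to_fp8.py` actually wrote the converted weights to):

```bash
export MODEL=/scratch/Hermes-3-Llama-3.1-70B-FP8   # hypothetical output directory for the converted checkpoint
export TP_SIZE=4                                   # the fp8 70B results above were measured on 4 GPUs
uv run vllm serve $MODEL --enable-chunked-prefill --disable-log-requests --tensor-parallel-size $TP_SIZE
```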
# Cleaning up

To clean up the resources we've provisioned, simply run `terraform destroy` from within this repository on your local machine.
