Commit 8cfe955

Merge branch 'tokasaurus_blog' of github.com:ScalingIntelligence/scalingintelligence.github.io into tokasaurus_blog
2 parents: a91cd68 + 0822936

File tree

1 file changed (+2, -2)

_blogs/tokasaurus.md

Lines changed: 2 additions & 2 deletions
@@ -87,7 +87,7 @@ Tokasaurus can also efficiently serve bigger models across multiple GPUs! Here,
 
 ### Pipeline Parallelism for the GPU Poor
 
-One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s pipeline implementation using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 3x:
+One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s and SGLang’s pipeline parallel implementations using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 3x:
 
 <div style="display: flex; gap: 16px; align-items: center;">
 <img src="/imgs/blog/tokasaurus/pipeline.png" alt="Tokasaurus small models" style="max-width: 98%; height: auto; display: block;">
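
The paragraph changed in this hunk argues that PP only runs efficiently with a large batch, because the manager's batch is subdivided into microbatches that are staggered across pipeline stages. A back-of-envelope sketch of that effect (a toy GPipe-style fill/drain model, not Tokasaurus's actual scheduler; the stage count and batch sizes are made up for illustration):

```python
# Toy model of why pipeline parallelism wants a large batch:
# with S stages and M microbatches, a simple fill/drain schedule takes
# (M + S - 1) steps, so each stage is idle for (S - 1) of them.

def pipeline_utilization(num_stages: int, batch_size: int, microbatch_size: int) -> float:
    num_microbatches = batch_size // microbatch_size
    total_steps = num_microbatches + num_stages - 1  # pipeline fill + steady state + drain
    busy_steps = num_microbatches                    # steps where a stage does real work
    return busy_steps / total_steps

for batch in (64, 512, 4096):
    util = pipeline_utilization(num_stages=8, batch_size=batch, microbatch_size=64)
    print(f"batch={batch:5d}: per-stage utilization ~ {util:.0%}")
```

With eight stages, a batch of 64 sequences keeps each stage busy only ~12% of the time in this toy model, while a batch of 4096 pushes utilization to ~90%, which is why running the largest batch that fits in memory pairs naturally with PP.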
@@ -119,7 +119,7 @@ Tokasaurus is written in pure Python (although we do use attention and sampling
 
 The commands for reproducing our benchmarks are available [here](https://github.com/ScalingIntelligence/tokasaurus/blob/main/logs/blog_commands.md). For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.
 
-We use this script from [SGLang](https://github.com/sgl-project/sglang/blob/7e257cd666c0d639626487987ea8e590da1e9395/python/sglang/bench_serving.py) for our ShareGPT benchmarks and [this custom script](https://github.com/ScalingIntelligence/tokasaurus/blob/a0155181f09c0cf40783e01a625b041985667a92/tokasaurus/benchmarks/monkeys_gsm8k.py) for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase (thanks to the vLLM team for the tip!).
+We use [this script](https://github.com/sgl-project/sglang/blob/7e257cd666c0d639626487987ea8e590da1e9395/python/sglang/bench_serving.py) from SGLang for our ShareGPT benchmarks and [this custom script](https://github.com/ScalingIntelligence/tokasaurus/blob/a0155181f09c0cf40783e01a625b041985667a92/tokasaurus/benchmarks/monkeys_gsm8k.py) for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase (thanks to the vLLM team for the tip!).
 
 ## Acknowledgements
 
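
The paragraph changed in this hunk says every engine is benchmarked through the OpenAI API, with vLLM's in-process `LLM.generate()` also tried on one benchmark. A minimal sketch of those two request paths (the base URL, port, and model name are placeholders; this is illustrative code, not the linked benchmark scripts):

```python
# Path 1: HTTP requests through an engine's OpenAI-compatible endpoint,
# the standardized interface used for all engines in the benchmarks.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server
resp = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model name
    prompt="Q: What is 2 + 2?\nA:",
    max_tokens=64,
)
print(resp.choices[0].text)

# Path 2: vLLM's in-process Python API, the `LLM.generate()` variant the
# post reports as roughly 5% faster for vLLM on the Llama-1B benchmark.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(["Q: What is 2 + 2?\nA:"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

The in-process path avoids the HTTP server round trip entirely, which is consistent with the small throughput gain the post mentions for vLLM.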
