Commit 8cfe955

Merge branch 'tokasaurus_blog' of github.com:ScalingIntelligence/scalingintelligence.github.io into tokasaurus_blog
2 parents: a91cd68 + 0822936

File tree

1 file changed (+2, -2)

_blogs/tokasaurus.md

Lines changed: 2 additions & 2 deletions
@@ -87,7 +87,7 @@ Tokasaurus can also efficiently serve bigger models across multiple GPUs! Here,
 
 ### Pipeline Parallelism for the GPU Poor
 
-One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s pipeline implementation using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 3x:
+One of our original goals with Tokasaurus was to efficiently run multi-GPU inference on our lab’s L40S GPUs, which don’t have fast inter-GPU NVLink connections. Without NVLink, the communication costs incurred running TP across a node of eight GPUs are substantial. Therefore, efficient support for PP (which requires much less inter-GPU communication) was a high priority. PP needs a large batch in order to run efficiently, since batches from the manager are subdivided into microbatches that are spread out across pipeline stages. When optimizing for throughput, we’re generally already using the largest batch size that fits in GPU memory, so PP is often a natural fit for throughput-focused workloads. When benchmarking against vLLM’s and SGLang’s pipeline parallel implementations using Llama-3.1-70B on eight L40S GPUs, Tokasaurus improves throughput by over 3x:
 
 <div style="display: flex; gap: 16px; align-items: center;">
 <img src="/imgs/blog/tokasaurus/pipeline.png" alt="Tokasaurus small models" style="max-width: 98%; height: auto; display: block;">
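
The paragraph changed in this hunk argues that PP only runs efficiently with a large batch, because the manager's batch is subdivided into microbatches that are staggered across pipeline stages. A back-of-envelope sketch of that effect (a toy GPipe-style fill/drain model, not Tokasaurus's actual scheduler; the stage count and batch sizes are made up for illustration):

```python
# Toy model of why pipeline parallelism wants a large batch:
# with S stages and M microbatches, a simple fill/drain schedule takes
# (M + S - 1) steps, so each stage is idle for (S - 1) of them.

def pipeline_utilization(num_stages: int, batch_size: int, microbatch_size: int) -> float:
    num_microbatches = batch_size // microbatch_size
    total_steps = num_microbatches + num_stages - 1  # pipeline fill + steady state + drain
    busy_steps = num_microbatches                    # steps where a stage does real work
    return busy_steps / total_steps

for batch in (64, 512, 4096):
    util = pipeline_utilization(num_stages=8, batch_size=batch, microbatch_size=64)
    print(f"batch={batch:5d}: per-stage utilization ~ {util:.0%}")
```

With eight stages, a batch of 64 sequences keeps each stage busy only ~12% of the time in this toy model, while a batch of 4096 pushes utilization to ~90%, which is why running the largest batch that fits in memory pairs naturally with PP.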
@@ -119,7 +119,7 @@ Tokasaurus is written in pure Python (although we do use attention and sampling
 
 The commands for reproducing our benchmarks are available [here](https://github.com/ScalingIntelligence/tokasaurus/blob/main/logs/blog_commands.md). For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.
 
-We use this script from [SGLang](https://github.com/sgl-project/sglang/blob/7e257cd666c0d639626487987ea8e590da1e9395/python/sglang/bench_serving.py) for our ShareGPT benchmarks and [this custom script](https://github.com/ScalingIntelligence/tokasaurus/blob/a0155181f09c0cf40783e01a625b041985667a92/tokasaurus/benchmarks/monkeys_gsm8k.py) for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase (thanks to the vLLM team for the tip!).
+We use [this script](https://github.com/sgl-project/sglang/blob/7e257cd666c0d639626487987ea8e590da1e9395/python/sglang/bench_serving.py) from SGLang for our ShareGPT benchmarks and [this custom script](https://github.com/ScalingIntelligence/tokasaurus/blob/a0155181f09c0cf40783e01a625b041985667a92/tokasaurus/benchmarks/monkeys_gsm8k.py) for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase (thanks to the vLLM team for the tip!).
 
 ## Acknowledgements
 
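
The paragraph changed in this hunk says every engine is benchmarked through the OpenAI API, with vLLM's in-process `LLM.generate()` also tried on one benchmark. A minimal sketch of those two request paths (the base URL, port, and model name are placeholders; this is illustrative code, not the linked benchmark scripts):

```python
# Path 1: HTTP requests through an engine's OpenAI-compatible endpoint,
# the standardized interface used for all engines in the benchmarks.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server
resp = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model name
    prompt="Q: What is 2 + 2?\nA:",
    max_tokens=64,
)
print(resp.choices[0].text)

# Path 2: vLLM's in-process Python API, the `LLM.generate()` variant the
# post reports as roughly 5% faster for vLLM on the Llama-1B benchmark.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(["Q: What is 2 + 2?\nA:"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

The in-process path avoids the HTTP server round trip entirely, which is consistent with the small throughput gain the post mentions for vLLM.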
