This release brings several new features and improvements to vLLM TPU Inference.
## Highlights
### Ironwood Support
All relevant dependencies have been rolled up to support Ironwood (v7x), and CI/CD has been updated to reflect this change.
For details on the build requirements for v7x compared to previous TPU generations (v6e and earlier), see the following documentation:
- QuickStart
- TPU Setup
### P/D Disaggregated Serving over DCN
Ray-based prefill/decode disaggregation with KV cache transfer over DCN.
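To give a sense of what a disaggregated deployment looks like, the sketch below uses vLLM's generic `--kv-transfer-config` flag. The connector name, model, and ports are placeholders, and the actual TPU/Ray launch flow may differ; consult the project's disaggregated serving docs for the exact configuration.

```shell
# Illustrative sketch only: connector name, model, and ports are placeholders.

# Prefill instance (KV cache producer)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8100 \
  --kv-transfer-config '{"kv_connector": "ExampleTPUConnector", "kv_role": "kv_producer"}'

# Decode instance (KV cache consumer)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8200 \
  --kv-transfer-config '{"kv_connector": "ExampleTPUConnector", "kv_role": "kv_consumer"}'
```

A proxy or router in front of the two instances forwards each request first to the prefill instance and then to the decode instance.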
### Multi-LoRA for PyTorch Models
Multi-LoRA support has landed for PyTorch model definitions from vLLM. A JAX-native solution will follow shortly.
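As a minimal sketch of serving multiple LoRA adapters through vLLM's standard CLI flags (the model, adapter names, and paths below are placeholders):

```shell
# Sketch, assuming vLLM's standard LoRA serving flags; adapter names/paths are placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --lora-modules sql-adapter=/path/to/sql_adapter chat-adapter=/path/to/chat_adapter
```

Clients then select an adapter per request by passing its name (e.g. `sql-adapter`) as the `model` field in the OpenAI-compatible API.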
### Run:AI Model Streaming
The Run:AI Model Streamer is a direct Google Cloud Storage model download accelerator. It has been demonstrated to be the easiest and fastest way to pull models from GCS into GPU memory, and we now provide the same experience on TPUs.
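A sketch of streaming a model directly from GCS, assuming vLLM's `runai_streamer` load format and a placeholder bucket path:

```shell
# Sketch: bucket path is a placeholder; --load-format runai_streamer selects the
# Run:AI Model Streamer instead of downloading the checkpoint to local disk first.
vllm serve gs://my-bucket/llama-3.1-8b \
  --load-format runai_streamer
```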
## What's Changed
- [Misc] Update tpu-info by @kyuyeunk in #1214
- fix: update nightly date format to YYYYMMDD by @ylangtsou in #1213
- [Kernel] Remove KV masking by performing full bkv fetches in the first 2 steps by @yaochengji in #1240
- Refactor moe codebase by @kyuyeunk in #1199
- [multihost] Add NEW_MODEL_DESIGN to additional_env_vars by @Lumosis in #1236
- [DP] Add model DP support for JAX GPT-OSS by @wenxindongwork in #1247
- Fix circular reference that caused tpu_platform to fail to import by @mrjunwan-lang in #1251
- [Bug fix] Fix DP + Hybrid KV cache numerics by @wenxindongwork in #1249
- add get_kv_connector_handshake_metadata in tpu_worker by @mrjunwan-lang in #1254
- Integrate MLA v1 into DeepSeek-v3 by @gpolovets1 in #1190
- Fix bug where PP assigns the wrong rank in distributed TP by @mrjunwan-lang in #1256
- [Disagg] local disagg e2e test by @sixiang-google in #1237
- Fix image tests. by @QiliangCui in #1253
- Fixing a few failures in tests/test_quantization.py. by @gpolovets1 in #1258
- [RPA] Revert previous changes due to numeric issue by @kyuyeunk in #1242
- [Misc] Update torchax with fp4 support by @kyuyeunk in #1257
- Update support matrices by @boe20211 in #1232
- Update request_distribution in DP input preparation by @wenxindongwork in #1211
- Fix FP8 dtype type mismatch issue by @helloworld1 in #1235
- Add disagg test to v6e-8 queue by @sixiang-google in #1259
- Add an argument to TpuPlatform.get_attn_backend_cls to adopt interfac… by @QiliangCui in #1263
- Update README.md by @bvrockwell in #1197
- Backward compatibility for NEW_MODEL_DESIGN=True by @wenxindongwork in #1267
- Delete b/ from PR template. by @QiliangCui in #1268
- [Disagg] Refined e2e test cleanup by @sixiang-google in #1265
- Remove a branch with pl.when in fetching bkv by @rupengliu-meta in #1239
- Add a lora perf test by @vanbasten23 in #1272
- Fix moe layer from upstream change by @kyuyeunk in #1274
- [RPA] Pipeline flash attention in default kernel by @jrplatin in #1203
- First check-in to add ci/cd test on tpuv7x by @QiliangCui in #1270
- clear xla compilation cache before each disagg server launch by @sixiang-google in #1271
- Reduce image size and enhance caching by @wdhongtw in #1245
- [Kernel][FusedMoE] Fix MoE crash and hang issues by @bythew3i in #1252
- [Quantization] Add option to bypass quantized matmul kernel for W8A8-FP8 Compressed Tensors by @jrplatin in #1273
- Replacing bit_width() with itemized_bits(). by @aman2930 in #1264
- Enable All Tests on TPUv7 by @QiliangCui in #1279
- add github action for check ready label by @boe20211 in #1269
- [Bugfix][Deprecate] Update for vllm v0.13 by @kyuyeunk in #1284
- Add default 'auto' MODEL_IMPL_TYPE that resolves based on architecture by @xingliu14 in #1255
- [Misc] Fix how model dtype is being configured by @kyuyeunk in #1286
- [Bugfix][Refactor] Fix compressed tensor moe init by @kyuyeunk in #1283
- update run_in_docker script for running on local env by @ernie-chang in #1243
- Remove pip install from setup_docker_env.sh. by @QiliangCui in #1292
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1295
- Revert "Update multihost disagg sh to prepare integrate with buildkit… by @QiliangCui in #1297
- [Misc] Disable torchax.tensor logger warning by @kyuyeunk in #1301
- Support overriding logic for hybrid kv cache padding by @kyuyeunk in #1285
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1304
- Update disagg multi host script health check logic by @mrjunwan-lang in #1306
- [Misc][RPA] Update to use logger in kernel_hd64.py by @kyuyeunk in #1302
- Update libtpu version for tpuv7. by @QiliangCui in #1305
- Fix a test pipeline bug and add TODO. by @QiliangCui in #1309
- Avoid installing CUDA related stuff by @wdhongtw in #1246
- [Kernel][Misc] Remove jax.named_scope by @kyuyeunk in #1278
- Use 50-bit UUID for KV transfer key to avoid Go truncating the int in GKE by @mrjunwan-lang in #1310
- Use AttentionSelectorConfig in get_attn_backend_cls by @karan in #1313
- Add Quantized Weights Support for MoE Layers by @kyuyeunk in #1300
- Fix for vLLM's benchmarking case change. by @patemotter in #1316
- Enable Pipeline Parallelism on Jax models by @Chenyaaang in #1077
- Restrict PP size to either 1 or host size in ray by @Chenyaaang in #1318
- Fix the lora column_parallel_packed test on v7x by @vanbasten23 in #1314
- Add dummy placeholder for unsupported models in the support matrix by @boe20211 in #1291
- Fix lora layer unit tests for v7x2. by @vanbasten23 in #1319
- Use vllm models when PP is enabled by @Chenyaaang in #1321
- Fix model loader unit test by @Chenyaaang in #1324
- Integrate the E2E multi-host disagg serving into buildkite by @mrjunwan-lang in #1323
- support fp8 compressed-tensors moe by @coolkp in #1320
- [RPA] Optimize masking and sliding window by @kyuyeunk in #1325
- Fix scale sharding in ep case by @coolkp in #1326
- [Torchax] fp8 quantization skeleton by @xingliu14 in #1307
- Update tuned block size by considering the sliding window. by @vanbasten23 in #1328
- Add Apache license. by @QiliangCui in #1339
- Add pre-commit for adding license. by @QiliangCui in #1344
- [Misc] Fix tpu platform init failure when vllm_config is not fully initialized by @sixiang-google in #1335
- Precompile functions with large vocab_size tensors before allocating KV cache to avoid OOM by @wenxindongwork in #1341
- Use Topology Order to map KV cache P/D mapping by @mrjunwan-lang in #1338
- [DP] Add correctness and performance tests by @wenxindongwork in #1348
- update llama3 pp test before pp is submitted in vllm by @Chenyaaang in #1345
- [CI] Fix TPU7x e2e multi modality test by @kwang3939 in #1347
- Allow pytest to correctly discover all tests by @wdhongtw in #1303
- [DP] Reduce DP scheduling overhead via multiprocessing by @wenxindongwork in #1340
- [Misc] Remove outdated flags in buildkite command by @sixiang-google in #1346
- [Refactoring] use `jax.shard_map` instead of experimental one by @lk-chen in #1334
- Consolidate quantization logics into a single file by @kyuyeunk in #1350
- [Bug fix] Use tpu_v6e_8_queue for DP CI tests by @wenxindongwork in #1353
- [DeepSeek] Support TPU-Friendly Checkpoints + Add DeepSeek Testing by @jrplatin in #1332
- [Bug fix] update pipeline metadata parsing to skip license headers by @boe20211 in #1355
- [JAX Pre-Compilation] Skip pre-compilation for `_precompile_sampling`/`_precompile_gather_logprobs` in TPU Worker by @jrplatin in #1352
- add pp missing args default value for qwix jit by @sixiang-google in #1357
- change naming of get_input_embedding according to upstream by @sixiang-google in #1358
- Set tensor_parallel_size in speculative_decoding_test to mitigate the call recursive issue. by @QiliangCui in #1360
- [Refactor] Remove redundant KV cache quantization logic by @jrplatin in #1361
- Lower expectation for v7. by @QiliangCui in #1362
- change naming of get_multimodal_embeddings based on vllm upstream by @kwang3939 in #1363
- Add tests for hybrid kv cache by @Chenyaaang in #1359
- Attention DP for Torchax backend by @wenxindongwork in #1322
- [Kernel] Simplify the MLA bkv loading logic by @yaochengji in #1331
- [Misc] Fix spacing of log message by @kyuyeunk in #1364
- Correct the type hint of KV cache in TPU runner by @wdhongtw in #1365
- Use v6e8 for hybrid kv cache e2e test by @Chenyaaang in #1367
- [DeepSeek] Use FP4 checkpoint instead of FP8 for E2E tests + add more clear warnings by @jrplatin in #1366
## New Contributors
- @helloworld1 made their first contribution in #1235
- @wdhongtw made their first contribution in #1245
- @ernie-chang made their first contribution in #1243
- @coolkp made their first contribution in #1320
- @lk-chen made their first contribution in #1334
**Full Changelog**: v0.12.0...v0.13.2