This release brings several new features and improvements to vLLM TPU Inference.
## Highlights
### Ironwood Support
All relevant dependencies have been rolled up to support Ironwood (v7x), and CI/CD has been updated to reflect this change.
For details on the build requirements for v7x compared to previous TPU generations (v6e and earlier), see the following documentation:
- QuickStart
- TPU Setup
### P/D Disaggregated Serving over DCN
Ray-based prefill/decode disaggregation with KV cache transfer over DCN.
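To give a sense of what a disaggregated deployment looks like, the sketch below uses vLLM's generic `--kv-transfer-config` flag. The connector name, model, and ports are placeholders, and the actual TPU/Ray launch flow may differ; consult the project's disaggregated serving docs for the exact configuration.

```shell
# Illustrative sketch only: connector name, model, and ports are placeholders.

# Prefill instance (KV cache producer)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8100 \
  --kv-transfer-config '{"kv_connector": "ExampleTPUConnector", "kv_role": "kv_producer"}'

# Decode instance (KV cache consumer)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8200 \
  --kv-transfer-config '{"kv_connector": "ExampleTPUConnector", "kv_role": "kv_consumer"}'
```

A proxy or router in front of the two instances forwards each request first to the prefill instance and then to the decode instance.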
### Multi-LoRA for PyTorch Models
Multi-LoRA support has landed for PyTorch model definitions from vLLM. A JAX-native solution will follow shortly.
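As a minimal sketch of serving multiple LoRA adapters through vLLM's standard CLI flags (the model, adapter names, and paths below are placeholders):

```shell
# Sketch, assuming vLLM's standard LoRA serving flags; adapter names/paths are placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --lora-modules sql-adapter=/path/to/sql_adapter chat-adapter=/path/to/chat_adapter
```

Clients then select an adapter per request by passing its name (e.g. `sql-adapter`) as the `model` field in the OpenAI-compatible API.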
### Run:AI Model Streaming
The Run:AI Model Streamer is a direct Google Cloud Storage model download accelerator. It has been demonstrated to be the easiest and fastest way to pull models from GCS into GPU memory, and we now provide the same experience on TPUs.
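A sketch of streaming a model directly from GCS, assuming vLLM's `runai_streamer` load format and a placeholder bucket path:

```shell
# Sketch: bucket path is a placeholder; --load-format runai_streamer selects the
# Run:AI Model Streamer instead of downloading the checkpoint to local disk first.
vllm serve gs://my-bucket/llama-3.1-8b \
  --load-format runai_streamer
```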
## What's Changed
- [Misc] Update tpu-info by @kyuyeunk in #1214
- fix: update nightly date format to YYYYMMDD by @ylangtsou in #1213
- [Kernel] Remove KV masking by performing full bkv fetches in the first 2 steps by @yaochengji in #1240
- Refactor moe codebase by @kyuyeunk in #1199
- [multihost] Add NEW_MODEL_DESIGN to additional_env_vars by @Lumosis in #1236
- [DP] Add model DP support for JAX GPT-OSS by @wenxindongwork in #1247
- Fix circular reference that caused tpu_platform to fail to import by @mrjunwan-lang in #1251
- [Bug fix] Fix DP + Hybrid KV cache numerics by @wenxindongwork in #1249
- add get_kv_connector_handshake_metadata in tpu_worker by @mrjunwan-lang in #1254
- Integrate MLA v1 into DeepSeek-v3 by @gpolovets1 in #1190
- Fix bug where PP assigns the wrong rank in distributed TP by @mrjunwan-lang in #1256
- [Disagg] local disagg e2e test by @sixiang-google in #1237
- Fix image tests. by @QiliangCui in #1253
- Fixing a few failures in tests/test_quantization.py. by @gpolovets1 in #1258
- [RPA] Revert previous changes due to numeric issue by @kyuyeunk in #1242
- [Misc] Update torchax with fp4 support by @kyuyeunk in #1257
- Update support matrices by @boe20211 in #1232
- Update request_distribution in DP input preparation by @wenxindongwork in #1211
- Fix FP8 dtype type mismatch issue by @helloworld1 in #1235
- Add disagg test to v6e-8 queue by @sixiang-google in #1259
- Add an argument to TpuPlatform.get_attn_backend_cls to adopt interfac… by @QiliangCui in #1263
- Update README.md by @bvrockwell in #1197
- Backward compatibility for NEW_MODEL_DESIGN=True by @wenxindongwork in #1267
- Delete b/ from PR template. by @QiliangCui in #1268
- [Disagg] Refined e2e test cleanup by @sixiang-google in #1265
- Remove a branch with pl.when in fetching bkv by @rupengliu-meta in #1239
- Add a lora perf test by @vanbasten23 in #1272
- Fix moe layer from upstream change by @kyuyeunk in #1274
- [RPA] Pipeline flash attention in default kernel by @jrplatin in #1203
- First check-in to add ci/cd test on tpuv7x by @QiliangCui in #1270
- clear xla compilation cache before each disagg server launch by @sixiang-google in #1271
- Reduce image size and enhance caching by @wdhongtw in #1245
- [Kernel][FusedMoE] Fix MoE crash and hang issues by @bythew3i in #1252
- [Quantization] Add option to bypass quantized matmul kernel for W8A8-FP8 Compressed Tensors by @jrplatin in #1273
- Replacing bit_width() with itemized_bits(). by @aman2930 in #1264
- Enable All Tests on TPUv7 by @QiliangCui in #1279
- add github action for check ready label by @boe20211 in #1269
- [Bugfix][Deprecate] Update for vllm v0.13 by @kyuyeunk in #1284
- Add default 'auto' MODEL_IMPL_TYPE that resolves based on architecture by @xingliu14 in #1255
- [Misc] Fix how model dtype is being configured by @kyuyeunk in #1286
- [Bugfix][Refactor] Fix compressed tensor moe init by @kyuyeunk in #1283
- update run_in_docker script for running on local env by @ernie-chang in #1243
- Remove pip install from setup_docker_env.sh. by @QiliangCui in #1292
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1295
- Revert "Update multihost disagg sh to prepare integrate with buildkit… by @QiliangCui in #1297
- [Misc] Disable torchax.tensor logger warning by @kyuyeunk in #1301
- Support overriding logic for hybrid kv cache padding by @kyuyeunk in #1285
- Update multihost disagg sh to prepare integrate with buildkite by @mrjunwan-lang in #1304
- Update disagg multi host script health check logic by @mrjunwan-lang in #1306
- [Misc][RPA] Update to use logger in kernel_hd64.py by @kyuyeunk in #1302
- Update libtpu version for tpuv7. by @QiliangCui in #1305
- Fix a test pipeline bug and add TODO. by @QiliangCui in #1309
- Avoid installing CUDA related stuff by @wdhongtw in #1246
- [Kernel][Misc] Remove jax.named_scope by @kyuyeunk in #1278
- Use 50-bit UUID for KV transfer key to avoid Go truncating the int in GKE by @mrjunwan-lang in #1310
- Use AttentionSelectorConfig in get_attn_backend_cls by @karan in #1313
- Add Quantized Weights Support for MoE Layers by @kyuyeunk in #1300
- Fix for vLLM's benchmarking case change. by @patemotter in #1316
- Enable Pipeline Parallelism on Jax models by @Chenyaaang in #1077
- Restrict PP size to either 1 or host size in ray by @Chenyaaang in #1318
- Fix the lora column_parallel_packed test on v7x by @vanbasten23 in #1314
- Add dummy placeholder for unsupported models in the support matrix by @boe20211 in #1291
- Fix lora layer unit tests for v7x2. by @vanbasten23 in #1319
- Use vllm models when PP is enabled by @Chenyaaang in #1321
- Fix model loader unit test by @Chenyaaang in #1324
- Integrate the E2E multi-host disagg serving into buildkite by @mrjunwan-lang in #1323
- support fp8 compressed-tensors moe by @coolkp in #1320
- [RPA] Optimize masking and sliding window by @kyuyeunk in #1325
- Fix scale sharding in ep case by @coolkp in #1326
- [Torchax] fp8 quantization skeleton by @xingliu14 in #1307
- Update tuned block size by considering the sliding window. by @vanbasten23 in #1328
- Add Apache license. by @QiliangCui in #1339
- Add pre-commit for adding license. by @QiliangCui in #1344
- [Misc] Fix tpu platform init failure when vllm_config is not fully initialized by @sixiang-google in #1335
- Precompile functions with large vocab_size tensors before allocating KV cache to avoid OOM by @wenxindongwork in #1341
- Use Topology Order to map KV cache P/D mapping by @mrjunwan-lang in #1338
- [DP] Add correctness and performance tests by @wenxindongwork in #1348
- update llama3 pp test before pp is submitted in vllm by @Chenyaaang in #1345
- [CI] Fix TPU7x e2e multi modality test by @kwang3939 in #1347
- Allow pytest to correctly discover all tests by @wdhongtw in #1303
- [DP] Reduce DP scheduling overhead via multiprocessing by @wenxindongwork in #1340
- [Misc] Remove outdated flags in buildkite command by @sixiang-google in #1346
- [Refactoring] use `jax.shard_map` instead of experimental one by @lk-chen in #1334
- Consolidate quantization logics into a single file by @kyuyeunk in #1350
- [Bug fix] Use tpu_v6e_8_queue for DP CI tests by @wenxindongwork in #1353
- [DeepSeek] Support TPU-Friendly Checkpoints + Add DeepSeek Testing by @jrplatin in #1332
- [Bug fix] update pipeline metadata parsing to skip license headers by @boe20211 in #1355
- [JAX Pre-Compilation] Skip pre-compilation for `_precompile_sampling`/`_precompile_gather_logprobs` in TPU Worker by @jrplatin in #1352
- add pp missing args default value for qwix jit by @sixiang-google in #1357
- change naming of get_input_embedding according to upstream by @sixiang-google in #1358
- Set tensor_parallel_size in speculative_decoding_test to mitigate the call recursive issue. by @QiliangCui in #1360
- [Refactor] Remove redundant KV cache quantization logic by @jrplatin in #1361
- Lower expectation for v7. by @QiliangCui in #1362
- change naming of get_multimodal_embeddings based on vllm upstream by @kwang3939 in #1363
- Add tests for hybrid kv cache by @Chenyaaang in #1359
- Attention DP for Torchax backend by @wenxindongwork in #1322
- [Kernel] Simplify the MLA bkv loading logic by @yaochengji in #1331
- [Misc] Fix spacing of log message by @kyuyeunk in #1364
- Correct the type hint of KV cache in TPU runner by @wdhongtw in #1365
- Use v6e8 for hybrid kv cache e2e test by @Chenyaaang in #1367
- [DeepSeek] Use FP4 checkpoint instead of FP8 for E2E tests + add more clear warnings by @jrplatin in #1366
## New Contributors
- @helloworld1 made their first contribution in #1235
- @wdhongtw made their first contribution in #1245
- @ernie-chang made their first contribution in #1243
- @coolkp made their first contribution in #1320
- @lk-chen made their first contribution in #1334
**Full Changelog**: v0.12.0...v0.13.2