
v0.13.2


@CienetStingLin released this 30 Dec 09:05

This release brings several new features and improvements for vLLM TPU Inference.

Highlights

Ironwood Support: All relevant dependencies have been rolled up to support Ironwood (v7x), and CI/CD has been updated to reflect this change.

For details on how build requirements for v7x differ from previous TPU generations (v6e and earlier), see the following documentation:
QuickStart
TPU Setup

P/D Disaggregated Serving over DCN: Ray-based prefill/decode disaggregation is now supported, with KV cache transfer between instances over DCN (Google's data center network). A configuration sketch follows below.
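As a rough illustration, the sketch below configures the producer (prefill) side using upstream vLLM's KVTransferConfig interface. The connector name "DCNConnector", the model, and the rank layout are illustrative assumptions, not this project's confirmed API for the TPU/Ray path; check the repository's examples for the actual connector and launch steps.

```python
# A minimal sketch of the prefill (producer) side of P/D disaggregation,
# assuming upstream vLLM's KV-transfer interface.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill instance: computes the KV cache and ships it to the decoder.
prefill_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    kv_transfer_config=KVTransferConfig(
        kv_connector="DCNConnector",   # hypothetical connector name
        kv_role="kv_producer",         # this process runs the prefill phase
        kv_rank=0,
        kv_parallel_size=2,            # one producer + one consumer
    ),
)

# The decode instance runs the mirror configuration in a separate process
# (kv_role="kv_consumer", kv_rank=1), receiving the KV cache over DCN and
# generating tokens from it.
```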

Multi-LoRA for PyTorch Models: Multi-LoRA support has landed for PyTorch model definitions from vLLM; a JAX-native solution will follow shortly. A usage sketch appears below.
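For context, here is a minimal sketch using vLLM's standard offline multi-LoRA API. The model name and adapter paths are placeholders, and any TPU-specific flags are not shown.

```python
# A minimal sketch of serving multiple LoRA adapters with vLLM's offline API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative base model
    enable_lora=True,
    max_loras=2,  # serve up to two adapters concurrently
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can target a different adapter; vLLM batches them together.
out_a = llm.generate(
    "Summarize this report: ...", sampling,
    lora_request=LoRARequest("adapter_a", 1, "/path/to/adapter_a"),
)
out_b = llm.generate(
    "Translate to French: ...", sampling,
    lora_request=LoRARequest("adapter_b", 2, "/path/to/adapter_b"),
)
```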

Run:AI Model Streaming: The Run:AI Model Streamer is a direct Google Cloud Storage model download accelerator, demonstrated to be the easiest and fastest way to pull models from GCS into GPU memory. This release brings the same experience to TPUs; a usage sketch follows below.
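A minimal usage sketch, assuming vLLM's standard runai_streamer load format extends to gs:// paths as described above; the bucket path is a placeholder.

```python
# A minimal sketch of streaming a model directly from GCS at load time.
from vllm import LLM

llm = LLM(
    model="gs://my-bucket/models/Llama-3.1-8B-Instruct",  # hypothetical GCS path
    load_format="runai_streamer",  # stream weights instead of downloading first
)
```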

What's Changed

New Contributors

Full Changelog: v0.12.0...v0.13.2