Releases: aws-neuron/upstreaming-to-vllm
Neuron 2.26.1
Release: 2.26.1
Neuron SDK 2.26.1 Inference + vLLM 0.9.0 Integration (V0 Architecture)
NxD Inference (NxDI) v2.26.1 is now supported on branch neuron-2.26.1 in this fork.
What's New
- Minor bug fixes
Neuron 2.26.0
Release: 2.26.0
Neuron SDK 2.26.0 Inference + vLLM 0.9.0 Integration (V0 Architecture)
NxD Inference (NxDI) v2.26.0 is now supported on branch neuron-2.26 in this fork.
What's New
- Beta support for Llama 4, including Scout and Maverick models. Currently, users must compile the model outside of vLLM and specify the compiled model path using the NEURON_COMPILED_ARTIFACTS environment variable (see the sketch after this list); this limitation will be addressed in a future release.
- Other minor fixes and improvements.
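A minimal offline-inference sketch of this flow. The artifacts path, model ID, and sizing parameters below are illustrative placeholders; only the NEURON_COMPILED_ARTIFACTS environment variable itself comes from the release notes.

```python
import os

# Point vLLM at model artifacts compiled ahead of time, outside of vLLM
# (the path and model ID below are placeholders).
os.environ["NEURON_COMPILED_ARTIFACTS"] = "/path/to/compiled/llama4-scout"
os.environ.setdefault("VLLM_USE_V1", "0")  # this integration targets the V0 engine

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    tensor_parallel_size=32,  # choose to match the compiled artifacts
    max_model_len=8192,
    max_num_seqs=4,
)
outputs = llm.generate(["Hello, Neuron!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```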
Contributors
@aws-bowencc @aws-yishanm @sssrijan-amazon @aarondou @elaineyz @aws-satyajith @aws-aymahg @kannakAWS @feiwx-cloud @aws-luof @yahavb @rohis06-aws
Neuron 2.25.0
Release: 2.25.0
Neuron SDK 2.25.0 Inference + vLLM 0.9.0 Integration (V0 Architecture)
NxD Inference (NxDI) v2.25.0 is now supported on branch neuron-2.25 in this fork.
What's New
- Added support for Qwen3 dense models (0.6B to 32B parameters)
- Added Disaggregated Inference support for multiple prefill and multiple decode workers (xPyD)
- Other minor fixes and improvements.
Contributors
@aws-bowencc @aws-yishanm @sssrijan-amazon @aarondou @aws-navyadhara @elaineyz @aws-satyajith @chongmni-aws @ethanqh-aws @rohis06-aws @aws-aymahg @shawnzxf
Neuron 2.24.0
Release: 2.24.0
Neuron SDK 2.24.0 Inference + vLLM 0.7.2 Integration (V0 Architecture)
NxD Inference (NxDI) v2.24.0 is now supported on branch neuron-2.24-vllm-v0.7.2 in this fork.
What's New
- Expanded model support for Qwen2.5 text models
- Automatic Prefix Caching (APC) support (see the sketch after this list). For more information, see the NxDI Prefix Caching feature guide and tutorial
- Disaggregated inference (DI) support (Beta).
- Other minor fixes and improvements.
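A minimal sketch of enabling APC through vLLM's standard prefix-caching flag. The model ID and sizing values are placeholders, and any Neuron-specific configuration should follow the NxDI Prefix Caching feature guide.

```python
from vllm import LLM, SamplingParams

# Enable Automatic Prefix Caching so requests that share a prompt prefix
# (for example, a common system prompt) reuse previously computed KV cache.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    enable_prefix_caching=True,
    max_model_len=4096,
    max_num_seqs=8,
)

shared_prefix = "You are a concise assistant. Answer in one sentence.\n"
prompts = [shared_prefix + "What is AWS Trainium?",
           shared_prefix + "What is AWS Inferentia?"]
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```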
Other changes
- Starting in release 2.24, the Neuron initialization code in vLLM no longer enables sequence parallelism by default. This ensures better compatibility with models and configurations where sequence parallelism is not well supported. If you previously relied on the Neuron vLLM code to enable sequence parallelism, you may now see increased TTFT. To re-enable it, pass
--override-neuron-config "{\"sequence_parallel_enabled\": true}" (see the example below).
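The same override expressed through the offline Python API, as a sketch: the override_neuron_config engine argument is assumed to be available as in recent vLLM releases, and the model ID is a placeholder.

```python
from vllm import LLM

# Re-enable sequence parallelism for the Neuron backend; equivalent to passing
# --override-neuron-config '{"sequence_parallel_enabled": true}' on the CLI.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    override_neuron_config={"sequence_parallel_enabled": True},
)
```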
Contributors
@shubhamchandak94 @aws-bowencc @AakashShetty-aws @shawnzxf @rohis06-aws @aws-yishanm @sssrijan-amazon @aws-luof @aarondou @aws-navyadhara @elaineyz @aws-satyajith @aws-cph @Zha0q1 (emeritus)
Neuron 2.23.0
Release: 2.23.0
Neuron SDK 2.23.0 Inference + vLLM 0.9.0 Integration (V0 Architecture)
This release marks full support for the Neuron SDK 2.23.0 inference libraries with vLLM 0.9.0 (V0 Architecture). Neuronx Distributed (NxD) Inference is the recommended path for multi-chip inference on AWS Trainium and Inferentia.
Highlights
- NxD Inference (NxDI) v2.23.0 is now fully compatible with vLLM 0.9.0 when the environment variable VLLM_USE_V1=0 is set (see the sketch after this list).
- Support for speculative decoding and dynamic on-device sampling for latency-optimized generation.
- Expanded model support, including Llama 3.2 multi-modal models and Multi-LoRA inference.
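A minimal sketch of selecting the V0 engine before constructing the LLM. The model ID is a placeholder; speculative-decoding and on-device-sampling options should be configured per the NxDI documentation and are not shown here.

```python
import os

# Force the vLLM V0 engine, which is what this release targets.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")  # placeholder model ID
outputs = llm.generate(["Hello from Neuron"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```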
Contributors
@aarondou @AakashShetty-aws @aws-navyadhara @aws-satyajith @aws-tailinpa (emeritus) @aws-yishanm @chongmni-aws @elaineyz @liangfu @mrinalks @sssrijan-amazon
nxd-v0.1.0
This release adds support for the DBRX model as a new feature for NxD + vLLM.
It is based on vLLM v0.3.2.
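A minimal offline-inference sketch against the vLLM 0.3.2-era Python API, assuming the fork exposes DBRX through the standard LLM entry point; the model ID and parallelism degree are placeholders.

```python
from vllm import LLM, SamplingParams

# DBRX is a large MoE model, so a multi-chip tensor-parallel setup is assumed.
llm = LLM(
    model="databricks/dbrx-instruct",  # placeholder model ID
    tensor_parallel_size=8,            # adjust to your Neuron device count
)
outputs = llm.generate(["Write a haiku about inference."],
                       SamplingParams(temperature=0.7, max_tokens=48))
print(outputs[0].outputs[0].text)
```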
Full Changelog: https://github.com/aws-neuron/upstreaming-to-vllm/commits/nxd-v0.1.0