
Commit d7cb694

Merge branch 'vllm-project:main' into serialize-multimodal-kwargs
2 parents bce2f07 + 966c742

File tree: 18 files changed (+1323, -31 lines)


.buildkite/test-pipeline.yaml

Lines changed: 6 additions & 1 deletion
@@ -400,8 +400,9 @@ steps:
  - pytest -v -s models/test_transformers.py
  - pytest -v -s models/test_registry.py
  # V1 Test: https://github.com/vllm-project/vllm/issues/14531
- - VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4'
+ - VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4 and not plamo2'
  - VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'llama4'
+ - VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'plamo2'

 - label: Language Models Test (Standard) # 32min
   #mirror_hardwares: [amd]
@@ -411,6 +412,8 @@ steps:
  - tests/models/embedding/language
  - tests/models/encoder_decoder/language
  commands:
+ # Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
+ - pip install causal-conv1d
  - pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
  - pytest -v -s models/embedding/language -m core_model

@@ -422,6 +425,8 @@ steps:
  - tests/models/embedding/language
  - tests/models/encoder_decoder/language
  commands:
+ # Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
+ - pip install causal-conv1d
  - pytest -v -s models/decoder_only/language -m 'not core_model and not quant_model'
  - pytest -v -s models/embedding/language -m 'not core_model'
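Both hunks above add the same dependency step to the language-model test jobs. As a rough local equivalent (a sketch only, assuming a vLLM development checkout with the test requirements installed and the working directory set to `tests/`, as in CI):

```bash
# Sketch: reproduce the plamo2-related CI steps locally.
pip install -r requirements/test.txt   # pulls in mamba_ssm, pinned below
pip install causal-conv1d              # installed separately; not compatible with pip-compile
cd tests
VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'plamo2'
```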

docker/Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -240,6 +240,7 @@ if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
     uv pip install --system https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.1.post2/flashinfer_python-0.2.1.post2+cu124torch2.6-cp38-abi3-linux_x86_64.whl ; \
 fi
 COPY examples examples
+COPY benchmarks benchmarks

 # Although we build Flashinfer with AOT mode, there's still
 # some issues w.r.t. JIT compilation. Therefore we need to

docs/source/features/lora.md

Lines changed: 56 additions & 5 deletions
@@ -106,19 +106,18 @@ curl http://localhost:8000/v1/completions \

 ## Dynamically serving LoRA Adapters

-In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
-LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
-to change models on-the-fly is needed.
+In addition to serving LoRA adapters at server startup, the vLLM server supports dynamically configuring LoRA adapters at runtime through dedicated API endpoints and plugins. This feature can be particularly useful when the flexibility to change models on-the-fly is needed.

 Note: Enabling this feature in production environments is risky as users may participate in model adapter management.

-To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
-is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
+To enable dynamic LoRA configuration, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
+is set to `True`.

 ```bash
 export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
 ```

+### Using API Endpoints
 Loading a LoRA Adapter:

 To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
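The hunk is cut off here by the diff context. For reference, the load request follows the same shape as the unload request shown in the next hunk; a rough sketch, with the adapter name and path as placeholders:

```bash
# Sketch: load an adapter at runtime; "sql_adapter" and the path are placeholders.
curl -X POST http://localhost:8000/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "sql_adapter",
        "lora_path": "/path/to/sql-lora-adapter"
    }'
```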
@@ -153,6 +152,58 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \
 }'
 ```

+### Using Plugins
+Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.
+
+You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.
+
+You can either install existing plugins or implement your own.
+
+Steps to implement your own LoRAResolver plugin:
+1. Implement the LoRAResolver interface.
+
+Example of a simple S3 LoRAResolver implementation:
+
+```python
+import os
+import s3fs
+from vllm.lora.request import LoRARequest
+from vllm.lora.resolver import LoRAResolver
+
+class S3LoRAResolver(LoRAResolver):
+    def __init__(self):
+        self.s3 = s3fs.S3FileSystem()
+        self.s3_path_format = os.getenv("S3_PATH_TEMPLATE")
+        self.local_path_format = os.getenv("LOCAL_PATH_TEMPLATE")
+
+    async def resolve_lora(self, base_model_name, lora_name):
+        s3_path = self.s3_path_format.format(base_model_name=base_model_name, lora_name=lora_name)
+        local_path = self.local_path_format.format(base_model_name=base_model_name, lora_name=lora_name)
+
+        # Download the LoRA from S3 to the local path
+        await self.s3._get(
+            s3_path, local_path, recursive=True, maxdepth=1
+        )
+
+        lora_request = LoRARequest(
+            lora_name=lora_name,
+            lora_path=local_path,
+            lora_int_id=abs(hash(lora_name))
+        )
+        return lora_request
+```
+
+2. Register LoRAResolver plugin.
+
+```python
+from vllm.lora.resolver import LoRAResolverRegistry
+
+s3_resolver = S3LoRAResolver()
+LoRAResolverRegistry.register_resolver("s3_resolver", s3_resolver)
+```
+
+For more details, refer to the [vLLM's Plugins System](../design/plugin_system.md).
+
 ## New format for `--lora-modules`

 In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
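The resolver added in this hunk reads its path templates from the environment, and resolution is triggered by the first request that names an adapter the server has not loaded yet. A usage sketch (bucket, template values, and the adapter name are placeholders; the request shape follows the completions example earlier in the same file):

```bash
# Sketch: exercise the S3 resolver; templates are read by S3LoRAResolver above.
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export S3_PATH_TEMPLATE="s3://my-bucket/{base_model_name}/{lora_name}/"
export LOCAL_PATH_TEMPLATE="/tmp/lora/{base_model_name}/{lora_name}"

# Naming a not-yet-loaded adapter makes the registered resolver fetch it first.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "my_lora_adapter",
        "prompt": "Hello",
        "max_tokens": 16
    }'
```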

docs/source/models/supported_models.md

Lines changed: 5 additions & 0 deletions
@@ -497,6 +497,11 @@ See [this page](#generative-models) for more information on how to use generativ
   * `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc.
   *
   * ✅︎
+- * `Plamo2ForCausalLM`
+  * PLaMo2
+  * `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc.
+  *
+  *
 - * `QWenLMHeadModel`
   * Qwen
   * `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.
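With this row added, the listed PLaMo2 checkpoints can be pointed at the OpenAI-compatible server. A minimal sketch; the `--trust-remote-code` flag is an assumption about the checkpoint's config, so drop it if the model loads without it:

```bash
# Sketch: serve one of the newly listed PLaMo2 checkpoints.
vllm serve pfnet/plamo-2-1b --trust-remote-code
```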

requirements/test.in

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ torch==2.6.0
 torchaudio==2.6.0
 torchvision==0.21.0
 transformers_stream_generator # required for qwen-vl test
+mamba_ssm # required for plamo2 test
 matplotlib # required for qwen-vl test
 mistral_common[opencv] >= 1.5.4 # required for pixtral test
 num2words # required for smolvlm test

requirements/test.txt

Lines changed: 9 additions & 0 deletions
@@ -111,6 +111,7 @@ einops==0.8.0
     # via
     #   -r requirements/test.in
     #   encodec
+    #   mamba-ssm
     #   vector-quantize-pytorch
     #   vocos
 einx==0.3.0
@@ -233,6 +234,8 @@ lxml==5.3.0
     # via
     #   blobfile
     #   sacrebleu
+mamba-ssm==2.2.4
+    # via -r requirements/test.in
 markdown-it-py==3.0.0
     # via rich
 markupsafe==3.0.2
@@ -268,6 +271,8 @@ mypy-extensions==1.0.0
     # via black
 networkx==3.2.1
     # via torch
+ninja==1.11.1.3
+    # via mamba-ssm
 nltk==3.9.1
     # via rouge-score
 num2words==0.5.14
@@ -360,6 +365,7 @@ packaging==24.1
     #   fastparquet
     #   huggingface-hub
     #   lazy-loader
+    #   mamba-ssm
     #   matplotlib
     #   peft
     #   plotly
@@ -571,6 +577,7 @@ sentencepiece==0.2.0
     # via mistral-common
 setuptools==75.8.0
     # via
+    #   mamba-ssm
     #   pytablewriter
     #   torch
 shellingham==1.5.4
@@ -627,6 +634,7 @@ torch==2.6.0
     #   encodec
     #   fastsafetensors
     #   lm-eval
+    #   mamba-ssm
     #   peft
     #   runai-model-streamer
     #   sentence-transformers
@@ -664,6 +672,7 @@ transformers==4.51.1
     #   -r requirements/test.in
     #   genai-perf
     #   lm-eval
+    #   mamba-ssm
     #   peft
     #   sentence-transformers
     #   transformers-stream-generator
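Because `causal-conv1d` is not compatible with pip-compile, only `mamba_ssm` and its transitive pins land in this lockfile; `causal-conv1d` is installed directly in the CI steps above. As a hedged sketch of regenerating the lockfile after editing `requirements/test.in` (a plain pip-tools invocation; the project's actual tooling and flags may differ):

```bash
# Sketch: regenerate the pinned test requirements from test.in with pip-tools.
pip install pip-tools
pip-compile requirements/test.in -o requirements/test.txt
# causal-conv1d stays out of the lockfile and is installed separately:
pip install causal-conv1d
```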
