Commit ccafe5e

Merge branch 'main' into gemma3n-lora
2 parents dec277b + 5438967

50 files changed, +1884 −300 lines

.buildkite/test-pipeline.yaml

Lines changed: 6 additions & 2 deletions
@@ -566,8 +566,7 @@ steps:
     - tests/models/multimodal
   commands:
     - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
-    - pytest -v -s models/multimodal/processing --ignore models/multimodal/processing/test_tensor_schema.py
-    - pytest -v -s models/multimodal/processing/test_tensor_schema.py
+    - pytest -v -s models/multimodal/processing

 - label: Multi-Modal Models Test (Standard)
   mirror_hardwares: [amdexperimental]
@@ -770,6 +769,11 @@ steps:
     - pytest -v -s plugins_tests/test_platform_plugins.py
     - pip uninstall vllm_add_dummy_platform -y
     # end platform plugin tests
+    # begin io_processor plugins test, all the code in between uses the prithvi_io_processor plugin
+    - pip install -e ./plugins/prithvi_io_processor_plugin
+    - pytest -v -s plugins_tests/test_io_processor_plugins.py
+    - pip uninstall prithvi_io_processor_plugin -y
+    # end io_processor plugins test
     # other tests continue here:
     - pytest -v -s plugins_tests/test_scheduler_plugins.py
     - pip install -e ./plugins/vllm_add_dummy_model
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# IO Processor Plugins

IO Processor plugins allow pre- and post-processing of the model input and output for pooling models. The idea is that users can pass a custom input to vLLM, which is converted into one or more model prompts and fed to the model's `encode` method. One potential use case for such plugins is generating multi-modal data with vLLM: for example, a user feeds an image to vLLM and gets an image back as output.

When performing inference with IO Processor plugins, the prompt type is defined by the plugin, and the same applies to the final request output. vLLM does not perform any validation of input/output data; it is up to the plugin to ensure the correct data is being fed to the model and returned to the user. As of now, these plugins support only pooling models and can be triggered via the `encode` method in `LLM` and `AsyncLLM`, or in online serving mode via the `/pooling` endpoint.
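
As a condensed preview of the offline example added later in this commit, an `encode` call with a plugin loaded looks roughly like the sketch below. The `prithvi_to_tiff_india` plugin name and the prompt fields are defined by that example plugin, not by vLLM itself:

```python
from vllm import LLM
from vllm.pooling_params import PoolingParams

# The plugin name and prompt format below come from the example Prithvi
# plugin; they are plugin-specific, not part of vLLM's API.
llm = LLM(
    model="christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM",
    skip_tokenizer_init=True,
    trust_remote_code=True,
    io_processor_plugin="prithvi_to_tiff_india",
)

# The plugin decides what a valid prompt looks like; here it is a dict
# pointing at a GeoTIFF image by URL.
img_prompt = dict(
    data="https://huggingface.co/christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM/resolve/main/India_900498_S2Hand.tif",
    data_format="url",
    image_format="tiff",
    out_data_format="b64_json",
)

# The result carried by the output is also plugin-defined
# (a base64-encoded TIFF in this case).
outputs = llm.encode(img_prompt, pooling_params=PoolingParams(task="encode", softmax=False))
```

The full runnable version, including decoding the base64 output back into a `.tiff` file, is in the offline inference example included in this commit.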

## Writing an IO Processor Plugin

IO Processor plugins implement the `IOProcessor` interface (<gh-file:vllm/plugins/io_processors/interface.py>):

```python
IOProcessorInput = TypeVar('IOProcessorInput')
IOProcessorOutput = TypeVar('IOProcessorOutput')

class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):

    def __init__(self, vllm_config: VllmConfig):
        self.vllm_config = vllm_config

    @abstractmethod
    def pre_process(
        self,
        prompt: IOProcessorInput,
        request_id: Optional[str] = None,
        **kwargs,
    ) -> Union[PromptType, Sequence[PromptType]]:
        raise NotImplementedError

    async def pre_process_async(
        self,
        prompt: IOProcessorInput,
        request_id: Optional[str] = None,
        **kwargs,
    ) -> Union[PromptType, Sequence[PromptType]]:
        return self.pre_process(prompt, request_id, **kwargs)

    @abstractmethod
    def post_process(self,
                     model_output: Sequence[PoolingRequestOutput],
                     request_id: Optional[str] = None,
                     **kwargs) -> IOProcessorOutput:
        raise NotImplementedError

    async def post_process_async(
        self,
        model_output: AsyncGenerator[tuple[int, PoolingRequestOutput]],
        request_id: Optional[str] = None,
        **kwargs,
    ) -> IOProcessorOutput:
        collected_output = [item async for i, item in model_output]
        return self.post_process(collected_output, request_id, **kwargs)

    @abstractmethod
    def parse_request(self, request: Any) -> IOProcessorInput:
        raise NotImplementedError

    @abstractmethod
    def output_to_response(
            self, plugin_output: IOProcessorOutput) -> IOProcessorResponse:
        raise NotImplementedError
```

The `parse_request` method validates the user prompt and converts it into the input expected by the `pre_process`/`pre_process_async` methods.
The `pre_process*` methods take the validated plugin input and generate vLLM's model prompts for regular inference.
The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output.

The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API server. The implementation of the `/io_processor_pooling` serving endpoint is [here](../../vllm/entrypoints/openai/serving_pooling_with_io_plugin.py).

An example implementation of a plugin that generates GeoTIFF images with the PrithviGeospatialMAE model is available [here](https://github.com/christian-pinto/prithvi_io_processor_plugin). Please also refer to our [online](../../examples/online_serving/prithvi_geospatial_mae.py) and [offline](../../examples/offline_inference/prithvi_geospatial_mae_io_processor.py) inference examples.
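
For orientation, a minimal skeleton of such a plugin might look like the sketch below. The class, module, and field names are hypothetical, and the import paths are assumed from the interface file referenced above; the linked Prithvi plugin remains the authoritative reference for a complete implementation:

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Any, Optional, Union

from vllm.inputs import PromptType
from vllm.outputs import PoolingRequestOutput
from vllm.plugins.io_processors.interface import IOProcessor


@dataclass
class MyImageInput:
    # Validated plugin input, e.g. a URL pointing at the input image.
    image_url: str


@dataclass
class MyImageOutput:
    # Plugin output, e.g. a base64-encoded image.
    data: str


class MyImagePlugin(IOProcessor[MyImageInput, MyImageOutput]):

    def parse_request(self, request: Any) -> MyImageInput:
        # Validate the raw user request and convert it to the plugin input.
        return MyImageInput(image_url=request["image_url"])

    def pre_process(
        self,
        prompt: MyImageInput,
        request_id: Optional[str] = None,
        **kwargs,
    ) -> Union[PromptType, Sequence[PromptType]]:
        # Turn the plugin input into one or more vLLM model prompts,
        # e.g. by fetching the image and building multi-modal data.
        ...

    def post_process(
        self,
        model_output: Sequence[PoolingRequestOutput],
        request_id: Optional[str] = None,
        **kwargs,
    ) -> MyImageOutput:
        # Convert the pooling output back into the plugin output type.
        ...

    def output_to_response(self, plugin_output: MyImageOutput):
        # Wrap the plugin output into an IOProcessorResponse for online serving.
        ...
```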

## Using an IO Processor plugin

IO Processor plugins are loaded at engine startup, and there are two methods for specifying the name of the plugin to be loaded:

1. Via vLLM's `EngineArgs`: setting the `io_processor_plugin` argument in the `EngineArgs` used to initialize the `AsyncLLM`. The same can be achieved by passing the `io_processor_plugin` argument to `LLM` in offline mode, or by passing the `--io-processor-plugin` argument in serving mode.
2. Via the model HF configuration: adding an `io_processor_plugin` field to the model config (config.json).

The order above also determines the priority: i.e., setting the plugin name via `EngineArgs` overrides any plugin name specified in the model HF config (config.json).
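
Before a plugin can be loaded by name, it has to be discoverable through vLLM's general plugin mechanism (see the `docs/design/plugin_system.md` change later in this commit): the plugin package exposes an entry point in the `vllm.io_processor_plugins` group, and the entry-point function returns the fully qualified name of the `IOProcessor` subclass. A rough sketch, with hypothetical package and module names:

```python
# setup.py of a hypothetical plugin package
from setuptools import setup

setup(
    name="my_image_plugin",
    version="0.1",
    packages=["my_image_plugin"],
    entry_points={
        "vllm.io_processor_plugins": [
            "my_image_plugin = my_image_plugin:register",
        ],
    },
)
```

```python
# my_image_plugin/__init__.py (hypothetical)
def register() -> str:
    # Return the fully qualified name of the IOProcessor subclass.
    return "my_image_plugin.processor.MyImagePlugin"
```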

docs/design/plugin_system.md

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,8 @@ Every plugin has three parts:
 
 - **Platform plugins** (with group name `vllm.platform_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree platforms into vLLM. The plugin function should return `None` when the platform is not supported in the current environment, or the platform class's fully qualified name when the platform is supported.
 
+- **IO Processor plugins** (with group name `vllm.io_processor_plugins`): The primary use case for these plugins is to register custom pre/post processing of the model prompt and model output for pooling models. The plugin function returns the IOProcessor class's fully qualified name.
+
 ## Guidelines for Writing Plugins
 
 - **Being re-entrant**: The function specified in the entry point should be re-entrant, meaning it can be called multiple times without causing issues. This is necessary because the function might be called multiple times in some processes.
examples/offline_inference/prithvi_geospatial_mae_io_processor.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import base64
import os

import torch

from vllm import LLM
from vllm.pooling_params import PoolingParams

# This example shows how to perform an offline inference that generates
# multimodal data. In this specific case this example will take a geotiff
# image as input, process it using the multimodal data processor, and
# perform inference.
# Requirement - install plugin at:
# https://github.com/christian-pinto/prithvi_io_processor_plugin


def main():
    torch.set_default_dtype(torch.float16)
    image_url = "https://huggingface.co/christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM/resolve/main/India_900498_S2Hand.tif"  # noqa: E501

    img_prompt = dict(
        data=image_url,
        data_format="url",
        image_format="tiff",
        out_data_format="b64_json",
    )

    llm = LLM(
        model="christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM",
        skip_tokenizer_init=True,
        trust_remote_code=True,
        enforce_eager=True,
        # Limit the maximum number of parallel requests
        # to avoid the model going OOM.
        # The maximum number depends on the available GPU memory.
        max_num_seqs=32,
        io_processor_plugin="prithvi_to_tiff_india",
    )

    pooling_params = PoolingParams(task="encode", softmax=False)
    pooler_output = llm.encode(
        img_prompt,
        pooling_params=pooling_params,
    )
    output = pooler_output[0].outputs

    print(output)
    decoded_data = base64.b64decode(output.data)

    file_path = os.path.join(os.getcwd(), "offline_prediction.tiff")
    with open(file_path, "wb") as f:
        f.write(decoded_data)

    print(f"Output file path: {file_path}")


if __name__ == "__main__":
    main()

examples/online_serving/kv_events_subscriber.py

Lines changed: 2 additions & 0 deletions
@@ -27,10 +27,12 @@ class BlockStored(KVCacheEvent):
     token_ids: list[int]
     block_size: int
     lora_id: Optional[int]
+    medium: Optional[str]
 
 
 class BlockRemoved(KVCacheEvent):
     block_hashes: list[int]
+    medium: Optional[str]
 
 
 class AllBlocksCleared(KVCacheEvent):
examples/online_serving/prithvi_geospatial_mae.py

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import base64
import os

import requests

# This example shows how to perform an online inference that generates
# multimodal data. In this specific case this example will take a geotiff
# image as input, process it using the multimodal data processor, and
# perform inference.
# Requirements:
# - install plugin at:
#   https://github.com/christian-pinto/prithvi_io_processor_plugin
# - start vllm in serving mode with the below args
#   --model='christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM'
#   --task embed --trust-remote-code
#   --skip-tokenizer-init --enforce-eager
#   --io-processor-plugin prithvi_to_tiff_india


def main():
    image_url = "https://huggingface.co/christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM/resolve/main/India_900498_S2Hand.tif"  # noqa: E501
    server_endpoint = "http://localhost:8000/pooling"

    request_payload_url = {
        "data": {
            "data": image_url,
            "data_format": "url",
            "image_format": "tiff",
            "out_data_format": "b64_json",
        },
        "priority": 0,
        "model": "christian-pinto/Prithvi-EO-2.0-300M-TL-VLLM",
    }

    ret = requests.post(server_endpoint, json=request_payload_url)

    print(f"response.status_code: {ret.status_code}")
    print(f"response.reason: {ret.reason}")

    response = ret.json()

    decoded_image = base64.b64decode(response["data"]["data"])

    out_path = os.path.join(os.getcwd(), "online_prediction.tiff")

    with open(out_path, "wb") as f:
        f.write(decoded_image)


if __name__ == "__main__":
    main()

tests/conftest.py

Lines changed: 3 additions & 0 deletions
@@ -1120,6 +1120,9 @@ def _apply_model(self):
 
         return self.llm.llm_engine.collective_rpc(_apply_model)
 
+    def get_llm(self) -> LLM:
+        return self.llm
+
     def __enter__(self):
         return self