
Commit 4050bae

[Doc] Update plugin doc (#28532)
Signed-off-by: wangxiyuan <[email protected]>
1 parent f1805db commit 4050bae

File tree: 3 files changed, +101 -4 lines changed
docs/design/plugin_system.md

Lines changed: 96 additions & 2 deletions
@@ -4,7 +4,7 @@ The community frequently requests the ability to extend vLLM with custom feature

## How Plugins Work in vLLM

- Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
+ Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_plugins_by_group][vllm.plugins.load_plugins_by_group] function in the `vllm.plugins` module.

## How vLLM Discovers Plugins

@@ -57,6 +57,100 @@ Every plugin has three parts:

- **Being re-entrant**: The function specified in the entry point should be re-entrant, meaning it can be called multiple times without causing issues. This is necessary because the function might be called multiple times in some processes.

### Platform plugin guidelines

1. Create a platform plugin project, for example, `vllm_add_dummy_platform`. The project structure should look like this:

    ```shell
    vllm_add_dummy_platform/
    ├── vllm_add_dummy_platform/
    │   ├── __init__.py
    │   ├── my_dummy_platform.py
    │   ├── my_dummy_worker.py
    │   ├── my_dummy_attention.py
    │   ├── my_dummy_device_communicator.py
    │   ├── my_dummy_custom_ops.py
    ├── setup.py
    ```

2. In the `setup.py` file, add the following entry point:

    ```python
    setup(
        name="vllm_add_dummy_platform",
        ...
        entry_points={
            "vllm.platform_plugins": [
                "my_dummy_platform = vllm_add_dummy_platform:register"
            ]
        },
        ...
    )
    ```

    Make sure `vllm_add_dummy_platform:register` is a callable that returns the platform class's fully qualified name, for example:

    ```python
    def register():
        return "vllm_add_dummy_platform.my_dummy_platform.MyDummyPlatform"
    ```
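
    As a quick sanity check after installing the package, you can confirm the entry point is discoverable. This snippet is illustrative and not part of vLLM; it uses only the standard library and assumes Python 3.10+:

    ```python
    # Illustrative only: list the installed entry points in the
    # "vllm.platform_plugins" group (requires Python 3.10+).
    from importlib.metadata import entry_points

    found = entry_points(group="vllm.platform_plugins")
    print([ep.name for ep in found])  # expect "my_dummy_platform" here
    ```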

3. Implement the platform class `MyDummyPlatform` in `my_dummy_platform.py`. The platform class should inherit from `vllm.platforms.interface.Platform` and implement its interface function by function. At a minimum, the following properties and functions should be implemented (a minimal sketch follows the list):

    - `_enum`: The device enumeration from [PlatformEnum][vllm.platforms.interface.PlatformEnum]. Usually this should be `PlatformEnum.OOT`, which marks the platform as out-of-tree.
    - `device_type`: The device type that PyTorch uses, e.g. `"cpu"`, `"cuda"`, etc.
    - `device_name`: Usually set to the same value as `device_type`; it's mainly used for logging.
    - `check_and_update_config`: Called very early in vLLM's initialization. Plugins use it to update the vLLM configuration, e.g. the block size, graph mode config, etc. Most importantly, **worker_cls** must be set in this function so vLLM knows which worker class to use for the worker process.
    - `get_attn_backend_cls`: Should return the attention backend class's fully qualified name.
    - `get_device_communicator_cls`: Should return the device communicator class's fully qualified name.
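
    A minimal sketch of such a platform class is shown below. Treat it as illustrative only: the exact method signatures vary across vLLM versions, so verify them against `vllm.platforms.interface.Platform` in your target version, and the dotted paths simply follow this example's hypothetical module layout.

    ```python
    # Minimal sketch, not a drop-in implementation: check the signatures
    # against vllm.platforms.interface.Platform in your target version.
    from vllm.platforms.interface import Platform, PlatformEnum


    class MyDummyPlatform(Platform):
        _enum = PlatformEnum.OOT  # out-of-tree platform
        device_type = "cpu"       # the torch device type this plugin drives
        device_name = "cpu"       # mainly used for logging

        @classmethod
        def check_and_update_config(cls, vllm_config) -> None:
            # Called early in initialization; the crucial step is telling
            # vLLM which worker class the worker processes should use.
            vllm_config.parallel_config.worker_cls = (
                "vllm_add_dummy_platform.my_dummy_worker.MyDummyWorker"
            )

        @classmethod
        def get_attn_backend_cls(cls, *args, **kwargs) -> str:
            return "vllm_add_dummy_platform.my_dummy_attention.MyDummyAttentionBackend"

        @classmethod
        def get_device_communicator_cls(cls) -> str:
            return "vllm_add_dummy_platform.my_dummy_device_communicator.MyDummyDeviceCommunicator"
    ```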

4. Implement the worker class `MyDummyWorker` in `my_dummy_worker.py`. The worker class should inherit from [WorkerBase][vllm.v1.worker.worker_base.WorkerBase] and implement its interface function by function. In practice, all interfaces in the base class should be implemented, since they are called throughout vLLM. To make sure a model can be executed, implement at least these basic functions:

    - `init_device`: Sets up the device for the worker.
    - `initialize_cache`: Sets the cache config for the worker.
    - `load_model`: Loads the model weights onto the device.
    - `get_kv_cache_spaces`: Generates the KV cache spaces for the model.
    - `determine_available_memory`: Profiles the peak memory usage of the model to determine how much memory can be used for the KV cache without OOM.
    - `initialize_from_config`: Allocates the device KV cache with the specified kv_cache_config.
    - `execute_model`: Called every step to run inference on the model.

    Additional functions that can be implemented are:

    - To support the sleep mode feature, implement the `sleep` and `wakeup` functions.
    - To support the graph mode feature, implement the `compile_or_warm_up_model` function.
    - To support the speculative decoding feature, implement the `take_draft_token_ids` function.
    - To support the LoRA feature, implement the `add_lora`, `remove_lora`, `list_loras`, and `pin_lora` functions.
    - To support the data parallelism feature, implement the `execute_dummy_batch` function.

    See the worker base class [WorkerBase][vllm.v1.worker.worker_base.WorkerBase] for more functions that can be implemented. A skeleton is sketched below.
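
    The skeleton below shows the rough shape of such a worker. It is a sketch with bodies elided and signatures simplified; consult [WorkerBase][vllm.v1.worker.worker_base.WorkerBase] in your target vLLM version for the authoritative interface.

    ```python
    # Skeleton only: bodies elided, signatures simplified.
    from vllm.v1.worker.worker_base import WorkerBase


    class MyDummyWorker(WorkerBase):
        def init_device(self) -> None:
            ...  # bind this worker process to its device

        def load_model(self) -> None:
            ...  # move the model weights onto the device

        def determine_available_memory(self) -> int:
            ...  # profile peak usage; return bytes left for the KV cache

        def execute_model(self, scheduler_output):
            ...  # run one inference step and return the model output
    ```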

5. Implement the attention backend class `MyDummyAttention` in `my_dummy_attention.py`. The attention backend class should inherit from [AttentionBackend][vllm.attention.backends.abstract.AttentionBackend]. It's used to compute attention on your device. See `vllm.v1.attention.backends` for examples; it contains many attention backend implementations. A hedged sketch follows.
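
    The sketch below assumes the `get_name`/`get_impl_cls`/`get_kv_cache_shape` static methods of [AttentionBackend][vllm.attention.backends.abstract.AttentionBackend]; the impl class is a stand-in for a real implementation.

    ```python
    # Sketch only: the AttentionBackend interface evolves between releases.
    from vllm.attention.backends.abstract import AttentionBackend


    class MyDummyAttentionImpl:
        """Stand-in for a real attention implementation class."""


    class MyDummyAttentionBackend(AttentionBackend):
        @staticmethod
        def get_name() -> str:
            return "MY_DUMMY_ATTN"

        @staticmethod
        def get_impl_cls():
            return MyDummyAttentionImpl

        @staticmethod
        def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
            # Layout of the KV cache tensor this backend expects.
            return (2, num_blocks, block_size, num_kv_heads, head_size)
    ```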

6. Implement custom ops for high performance. Most ops can be run by PyTorch's native implementation, but the performance may not be good. In that case, you can implement device-specific custom ops for your plugin. Currently, vLLM supports these kinds of custom ops:

    - PyTorch ops. There are three kinds:
        - `communicator ops`: Device communicator ops, such as all-reduce, all-gather, etc. Implement the device communicator class `MyDummyDeviceCommunicator` in `my_dummy_device_communicator.py`; it should inherit from [DeviceCommunicatorBase][vllm.distributed.device_communicators.base_device_communicator.DeviceCommunicatorBase].
        - `common ops`: Common ops, such as matmul, softmax, etc. Implement them by registering them out-of-tree; see the [CustomOp][vllm.model_executor.custom_op.CustomOp] class for details and the sketch after this list.
        - `csrc ops`: C++ ops, implemented in C++ and registered as torch custom ops. Follow the csrc module and `vllm._custom_ops` to implement your own.
    - Triton ops. The custom-op registration mechanism doesn't work for Triton ops yet.

7. (optional) Implement other pluggable modules, such as LoRA, graph backend, quantization, mamba attention backend, etc.
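
For the `common ops` case above, out-of-tree registration typically looks like the sketch below. This assumes `CustomOp.register_oot` as found in recent vLLM versions; verify it against the [CustomOp][vllm.model_executor.custom_op.CustomOp] class in your target version. `SiluAndMul` is just one example op.

```python
# Hedged sketch: confirm CustomOp.register_oot exists in your version.
from vllm.model_executor.custom_op import CustomOp
from vllm.model_executor.layers.activation import SiluAndMul


@CustomOp.register_oot(name="SiluAndMul")
class MyDummySiluAndMul(SiluAndMul):
    def forward_oot(self, x):
        # A device-specific fused implementation would go here; falling
        # back to the native path keeps the sketch runnable.
        return self.forward_native(x)
```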

## Compatibility Guarantee

- vLLM guarantees the interface of documented plugins, such as `ModelRegistry.register_model`, will always be available for plugins to register models. However, it is the responsibility of plugin developers to ensure their plugins are compatible with the version of vLLM they are targeting. For example, `"vllm_add_dummy_model.my_llava:MyLlava"` should be compatible with the version of vLLM that the plugin targets. The interface for the model may change during vLLM's development.
+ vLLM guarantees the interface of documented plugins, such as `ModelRegistry.register_model`, will always be available for plugins to register models. However, it is the responsibility of plugin developers to ensure their plugins are compatible with the version of vLLM they are targeting. For example, `"vllm_add_dummy_model.my_llava:MyLlava"` should be compatible with the version of vLLM that the plugin targets.

The interface for the model/module may change during vLLM's development. If you see a deprecation message in the logs, please upgrade your plugin to the latest version.

## Deprecation announcement

!!! warning "Deprecations"
    - The `use_v1` parameter in `Platform.get_attn_backend_cls` is deprecated. It will be removed in v0.13.0 or v1.0.0.
    - `_Backend` in `vllm.attention` is deprecated. It will be removed in v0.13.0 or v1.0.0. Please use `vllm.attention.backends.registry.register_backend` to add new attention backends to `AttentionBackendEnum` instead.

vllm/plugins/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -17,6 +17,9 @@
# Platform plugins group will be loaded in all processes when
# `vllm.platforms.current_platform` is called and the value not initialized,
PLATFORM_PLUGINS_GROUP = "vllm.platform_plugins"
+ # Stat logger plugins group is loaded in process 0 only when serving vLLM
+ # in async mode.
+ STAT_LOGGER_PLUGINS_GROUP = "vllm.stat_logger_plugins"

# make sure one process only loads plugins once
plugins_loaded = False

vllm/v1/metrics/loggers.py

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@
    KVConnectorPrometheus,
)
from vllm.logger import init_logger
- from vllm.plugins import load_plugins_by_group
+ from vllm.plugins import STAT_LOGGER_PLUGINS_GROUP, load_plugins_by_group
from vllm.v1.engine import FinishReason
from vllm.v1.metrics.prometheus import unregister_vllm_metrics
from vllm.v1.metrics.stats import (
@@ -67,7 +67,7 @@ def record_sleep_state(self, is_awake: int, level: int): # noqa
def load_stat_logger_plugin_factories() -> list[StatLoggerFactory]:
    factories: list[StatLoggerFactory] = []

-   for name, plugin_class in load_plugins_by_group("vllm.stat_logger_plugins").items():
+   for name, plugin_class in load_plugins_by_group(STAT_LOGGER_PLUGINS_GROUP).items():
        if not isinstance(plugin_class, type) or not issubclass(
            plugin_class, StatLoggerBase
        ):
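
For reference, a stat logger plugin would be packaged under this group roughly as sketched below. The package and class names are hypothetical, and, per the check above, the registered object must be a `StatLoggerBase` subclass:

```python
# Hypothetical setup.py for a stat logger plugin; names are made up.
from setuptools import setup

setup(
    name="vllm_add_dummy_stat_logger",
    entry_points={
        "vllm.stat_logger_plugins": [
            # Must resolve to a StatLoggerBase subclass.
            "my_stats = vllm_add_dummy_stat_logger.loggers:MyStatLogger"
        ]
    },
)
```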
