---
layout: post
title: "Introducing vLLM Hardware Plugin and Best Practice with Ascend NPU"
author: "vLLM Ascend Team"
image: /assets/logos/vllm-logo-only-light.png
---

Since December 2024, through the joint efforts of the vLLM community and the vLLM Ascend team, we have completed the **Hardware Pluggable** RFC. This proposal allows hardware to be integrated into vLLM in a decoupled manner, enabling rapid and modular support for different hardware platforms. The RFC has now taken initial shape. This blog post explains how the vLLM Hardware Plugin works and shares best practices for supporting the Ascend NPU through the plugin mechanism.

---

## Why vLLM Hardware Plugin?

Currently, vLLM already supports multiple backends. However, as the number of vLLM backends continues to grow, several challenges have emerged:

- **Increased Code Complexity**: Each hardware backend has its own `Executor`, `Worker`, `Runner`, and `Attention` components. This has made the vLLM codebase more complex, with non-generic backend-specific code scattered throughout the project.
- **High Maintenance Costs**: The cost of maintaining backends is high, not only for the backend developers but also for the vLLM community. When backend maintainers are unavailable, the limited bandwidth of community contributors makes it difficult to add new features efficiently.
- **Lack of Extensibility**: While vLLM follows a well-structured layered design by implementing backends through `Executor`, `Worker`, `Runner`, and `Attention`, supporting new hardware often requires invasive modifications or patching rather than dynamic registration. This makes adding new backends cumbersome.

Recognizing the need for a flexible and modular approach to integrating hardware backends, we identified hardware pluginization as a feasible solution:

- **Decoupled Codebase**: The hardware backend plugin code remains independent, making the vLLM core code cleaner and more maintainable.
- **Reduced Maintenance Burden**: vLLM developers can focus on generic features without being overwhelmed by the differences caused by backend-specific implementations.
- **Faster Expansion and Iteration**: Each backend can be maintained independently to ensure stability, and new backends can be integrated quickly.

---

## What is the vLLM Hardware Plugin?

Before introducing the vLLM Hardware Plugin, let's first look at two prerequisite RFCs:

- [[RFC] vLLM Plugin System](https://github.com/vllm-project/vllm/issues/7131): This RFC introduces a plugin-based approach to support various customization requirements, allowing users to define custom models, executors, schedulers, etc.
- [[RFC] Make vLLM Device-Agnostic for Diverse Hardware Support](https://github.com/vllm-project/vllm/issues/9268) (implemented in [vllm-project/vllm#6080](https://github.com/vllm-project/vllm/pull/6080)): This RFC introduces the **platform** submodule, which centralizes hardware-related implementations to reduce conditional logic in the main codebase and lays the foundation for modularization.

Building on these RFCs, we proposed [[RFC] Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162), which integrates the `Platform` module into vLLM as a plugin. Additionally, we refactored `Executor`, `Worker`, `ModelRunner`, `AttentionBackend`, and `Communicator` to support hardware plugins more flexibly.

Currently, the vLLM team, in collaboration with vLLM Ascend developers, has implemented the initial version of this RFC and validated its functionality through the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) project. Using this plugin mechanism, we successfully integrated vLLM with the Ascend NPU backend.

---

## How to Add Backend Support with vLLM Hardware Plugin

This section walks through integrating a new backend via the Hardware Plugin, from both the developer and user perspectives.

### Developer Perspective

To integrate a new backend into vLLM using the Hardware Plugin, follow these steps:

#### Step 1: Create a New Project and Initialize the Platform

Start by creating a Python project for the new backend and adding a `platform.py` file. Then, import the `Platform` class from `vllm.platforms` and implement the required attributes and methods.

You can refer to [`platform.py`](https://github.com/vllm-project/vllm-ascend/blob/72a43a61d8d2193dddbfcc60578fd642008225a5/vllm_ascend/platform.py#L52) in the vLLM Ascend project for an example.

#### Step 2: Implement Custom Worker, Model Runner, Attention Backend, and Communicator Modules

Depending on the new backend's requirements, implement the following modules by subclassing the corresponding base classes in vLLM:

```python
from vllm.worker.worker_base import WorkerBase                  # device worker lifecycle
from vllm.worker.model_runner_base import ModelRunnerBase       # model execution loop
from vllm.attention.backends.abstract import AttentionBackend   # attention implementation
from vllm.distributed.device_communicators.base_communicator import CommunicatorBase  # collective communication
```

Again, you can refer to [vLLM Ascend's implementation](https://github.com/vllm-project/vllm-ascend/tree/main/vllm_ascend) for an example.

#### Step 3: Register the Plugin

Register the plugin in `setup.py` using Python's entry point mechanism:

```python
setup(
    entry_points={'vllm.platform_plugins': ["{your_platform_name} = {code_path}:{register_function}"]}
)
```

- `{your_platform_name}`: The name of the new backend (can be arbitrary).
- `{code_path}`: The path to the main Python module.
- `{register_function}`: The register function, which returns the full import path of the `Platform` class defined in Step 1.

Refer to [`setup.py`](https://github.com/vllm-project/vllm-ascend/blob/72a43a61d8d2193dddbfcc60578fd642008225a5/setup.py#L102) in vLLM Ascend for a practical example.

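Concretely, the register function is just a plain function that returns the dotted import path of the `Platform` subclass. A minimal sketch, assuming a hypothetical backend package `my_backend` with a `MyBackendPlatform` class (these names are illustrative, not from any real project):

```python
# Hypothetical register function for a backend package named "my_backend".
# An entry point spec like "my_backend = my_backend:register" would point here.
def register() -> str:
    # Return the full import path of the Platform subclass from Step 1.
    return "my_backend.platform.MyBackendPlatform"
```

vLLM calls this function when it discovers the entry point and then imports the returned class as the active platform.
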
#### Step 4 (Optional): Implement Custom Quantization Algorithms and Models

vLLM supports dynamic registration of both quantization algorithms and models. New backends can implement them on demand.

**Registering a Custom Quantization Algorithm**

Custom quantization algorithms can be registered by applying the `register_quantization_config` decorator to a new quantization config class. For example:

```python
from vllm.model_executor.layers.quantization import register_quantization_config
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig

@register_quantization_config("my_quantization_method")
class MyQuantizationConfig(QuantizationConfig):
    ...
```

**Registering a Custom Model**

New models can be dynamically registered with vLLM via `ModelRegistry`:

```python
from vllm import ModelRegistry

if "MyLlava" not in ModelRegistry.get_supported_archs():
    ModelRegistry.register_model("MyLlava", "vllm_add_dummy_model.my_llava:MyLlava")
```

---

### User Perspective

Taking vLLM Ascend as an example, users only need to install `vllm` and `vllm-ascend`:

```bash
pip install vllm vllm-ascend
```

On startup, you will observe the following logs, which indicate that the backend plugin is working properly:

```bash
INFO 02-06 15:49:01 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-06 15:49:01 __init__.py:32] name=ascend, value=vllm_ascend:register
… …
INFO 02-06 15:49:01 __init__.py:44] plugin ascend loaded.
INFO 02-06 15:49:01 __init__.py:181] Platform plugin ascend is activated
```
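
Under the hood, the `name=ascend, value=vllm_ascend:register` line in these logs is a standard Python entry point. How such a spec is parsed can be illustrated with only the standard library (constructing the entry point by hand here, rather than reading installed package metadata):

```python
from importlib.metadata import EntryPoint

# The spec vLLM Ascend registers under the "vllm.platform_plugins" group.
ep = EntryPoint(name="ascend", value="vllm_ascend:register",
                group="vllm.platform_plugins")

# A "module:attr" value resolves to the module to import and the
# attribute (here, the register function) to load from it.
print(ep.module)  # vllm_ascend
print(ep.attr)    # register
```

For an installed plugin, vLLM enumerates this entry-point group, imports the module, and calls the named function to activate the platform.
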

---

## What's Next?

Moving forward, we will continue collaborating with developers in the vLLM community to enhance the following aspects:

1. Continuous enhancements to the V1 Engine.
2. Expanding plugin support for more modules and features, such as the scheduler and custom operators.
3. Better user experience and higher performance.

We encourage everyone to try out this new feature! If you have any questions, join the [vLLM Slack](https://inviter.co/vllm-slack) and participate in the **#sig-extensible-hardware** channel for discussions. 🚀