Description
Motivation.
vllm-ascend is a Python project, with only a small number of custom operators written in the csrc directory. The general process for calling these operators is vllm-ascend -> PTA -> CANN (ATB/aclnn).
Adding or modifying operators in PTA and CANN requires prior planning, and the integration cycle is relatively long. This often leads to situations where the versions of PTA and CANN cannot keep up with the rapid iteration development needs of vllm-ascend.
Currently, vllm-ascend is coupled to specific versions of PTA and CANN, so a direct operator invocation path needs to be established.
- Scenario 1: No operator implementation exists in CANN, causing a blockage due to the CANN version.
- Scenario 2: An operator implementation exists in CANN, but the PTA has not yet integrated the operator, causing a blockage due to the PTA version.
Proposed Change.
Scenario 1
Reuse vllm-ascend's existing custom-operator framework. The operator implementation is carried directly in vllm-ascend and placed uniformly under the csrc directory. If an operator has its own build script, it must be adapted into the project's build.
The general process is as follows:
- Trigger the CMake build command in setup.py.
- Control dependency management, build rules, etc., in CMakeLists.txt.
- csrc directory:
  - The operator implementations live in the kernels and mla_preprocess directories.
  - torch_binding.cpp registers the custom operators.
  - torch_binding_meta.cpp registers the meta (shape inference) implementations.
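The setup.py-to-CMake handoff above can be sketched as follows. This is a hedged illustration, not vllm-ascend's actual setup.py: `cmake_commands`, `CMakeExtension`, and `CMakeBuild` are illustrative names for the common pattern where setuptools delegates the native build to CMake.

```python
import os
import subprocess
from setuptools import Extension
from setuptools.command.build_ext import build_ext


def cmake_commands(source_dir, out_dir):
    """Return the configure and build command lines for a CMake project."""
    configure = [
        "cmake", source_dir,
        f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={out_dir}",
        "-DCMAKE_BUILD_TYPE=Release",
    ]
    build = ["cmake", "--build", ".", "--parallel"]
    return configure, build


class CMakeExtension(Extension):
    """Placeholder extension: the real sources are compiled by CMake."""
    def __init__(self, name, source_dir="csrc"):
        super().__init__(name, sources=[])  # setuptools compiles nothing itself
        self.source_dir = os.path.abspath(source_dir)


class CMakeBuild(build_ext):
    def build_extension(self, ext):
        out_dir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        configure, build = cmake_commands(ext.source_dir, out_dir)
        subprocess.check_call(configure, cwd=self.build_temp)  # configure step
        subprocess.check_call(build, cwd=self.build_temp)      # compile csrc/
```

setup() would then pass ext_modules=[CMakeExtension(...)] and cmdclass={"build_ext": CMakeBuild}; dependency management and build rules stay in CMakeLists.txt.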
The functionality is already available; refer to the PRs:
- add mla_preprocess kernel #3226
- adapt the mla_v1 with the mla_preprocess kernel #3397
Scenario 2
Solution 1:
Follow the code framework of PTA's op_plugin for calling aclnn operators, and port PTA's encapsulated operator-invocation logic into vllm-ascend. PTA wraps the two-stage operator call in macros, so developers do not need to deal with the first-stage interface aclxxXxxGetWorkspaceSize.
Objective: If a new aclnn operator needs to be integrated, the integration process will be basically the same as that of PTA. Developers only need to focus on implementing the invocation of the aclnn operator and do not need to modify any code related to the engineering part.
Advantages: The PTA framework is already mature and stable. If it can be fully migrated to vllm_ascend, the probability of encountering issues is relatively low.
Disadvantages: The PTA framework is complex, and the migration and adaptation process will depend heavily on many basic files in PTA. Previous attempts at simple adaptation have encountered many dependency issues. If any dependencies are missed, problems may arise later. Therefore, experts familiar with the PTA framework will be needed to support the adaptation process.
Solution 2:
Do not migrate the PTA framework; instead, directly call the two-stage interface of the operator in vllm-ascend.
- Call the first stage interface.
- Allocate device memory based on the workspace size calculated by the first stage interface.
- Call the second stage interface.
This approach may lead to two issues:
- Task queue invocation order problem: operator calls must go through the task queue. If a call bypasses the task queue, the execution order of operators may become chaotic.
- Memory management problem: if aclrtMalloc is used to allocate the tiling space, occasional issues may occur. This problem does not occur when the memory is allocated through PyTorch.
Advantages: Simple implementation, no need to migrate the framework in PTA, thus avoiding various dependency issues. The current process of integrating the MLAPO operator into vllm-ascend follows this approach.
Disadvantages: There is some uncertainty, as it is unclear whether other issues might exist.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response