Description
Motivation.
vllm-ascend is a Python project, with only a small number of custom operators written in the csrc directory. The general process for calling these operators is vllm-ascend -> PTA -> CANN (ATB/aclnn).
Adding or modifying operators in PTA and CANN requires prior planning, and the integration cycle is relatively long. This often leads to situations where the versions of PTA and CANN cannot keep up with the rapid iteration development needs of vllm-ascend.
Currently, vllm-ascend is coupled to specific versions of PTA and CANN, so a direct operator invocation path needs to be established.
- Scenario 1: No operator implementation exists in CANN, causing a blockage due to the CANN version.
- Scenario 2: An operator implementation exists in CANN, but the PTA has not yet integrated the operator, causing a blockage due to the PTA version.
Proposed Change.
Scenario 1
Reuse vllm-ascend's existing custom-operator framework. The operator implementation is carried directly in vllm-ascend and placed uniformly under the csrc directory. If an operator has its own build script, it must be adapted into the project's build.
The general process is as follows:
- Trigger the CMake build command in setup.py.
- Control dependency management, build rules, etc., in CMakeLists.txt.
- csrc directory:
  - The operator implementations live in the kernels and mla_preprocess directories.
  - torch_binding.cpp registers the custom operators.
  - torch_binding_meta.cpp registers the meta (shape inference) implementations.
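The setup.py-to-CMake handoff above can be sketched as follows. This is a hedged illustration, not vllm-ascend's actual setup.py: `cmake_commands`, `CMakeExtension`, and `CMakeBuild` are illustrative names for the common pattern where setuptools delegates the native build to CMake.

```python
import os
import subprocess
from setuptools import Extension
from setuptools.command.build_ext import build_ext


def cmake_commands(source_dir, out_dir):
    """Return the configure and build command lines for a CMake project."""
    configure = [
        "cmake", source_dir,
        f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={out_dir}",
        "-DCMAKE_BUILD_TYPE=Release",
    ]
    build = ["cmake", "--build", ".", "--parallel"]
    return configure, build


class CMakeExtension(Extension):
    """Placeholder extension: the real sources are compiled by CMake."""
    def __init__(self, name, source_dir="csrc"):
        super().__init__(name, sources=[])  # setuptools compiles nothing itself
        self.source_dir = os.path.abspath(source_dir)


class CMakeBuild(build_ext):
    def build_extension(self, ext):
        out_dir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))
        os.makedirs(self.build_temp, exist_ok=True)
        configure, build = cmake_commands(ext.source_dir, out_dir)
        subprocess.check_call(configure, cwd=self.build_temp)  # configure step
        subprocess.check_call(build, cwd=self.build_temp)      # compile csrc/
```

setup() would then pass ext_modules=[CMakeExtension(...)] and cmdclass={"build_ext": CMakeBuild}; dependency management and build rules stay in CMakeLists.txt.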
The functionality is already available; refer to the PRs:
- add mla_preprocess kernel #3226
- adapt the mla_v1 with the mla_preprocess kernel #3397
Scenario 2
Solution 1:
Follow the code framework of PTA's op_plugin for calling aclnn operators, and port PTA's encapsulated operator-invocation logic into vllm-ascend. PTA wraps the two-stage operator call in macros, so developers do not need to deal with the first-stage interface aclxxXxxGetWorkspaceSize.
Objective: If a new aclnn operator needs to be integrated, the integration process will be basically the same as that of PTA. Developers only need to focus on implementing the invocation of the aclnn operator and do not need to modify any code related to the engineering part.
Advantages: The PTA framework is already mature and stable. If it can be fully migrated to vllm_ascend, the probability of encountering issues is relatively low.
Disadvantages: The PTA framework is complex, and the migration and adaptation process will depend heavily on many basic files in PTA. Previous attempts at simple adaptation have encountered many dependency issues. If any dependencies are missed, problems may arise later. Therefore, experts familiar with the PTA framework will be needed to support the adaptation process.
Solution 2:
Do not migrate the PTA framework; instead, directly call the two-stage interface of the operator in vllm-ascend.
- Call the first stage interface.
- Allocate device memory based on the workspace size calculated by the first stage interface.
- Call the second stage interface.
This approach may lead to two issues:
- Task queue invocation order problem: operator calls must go through the task queue. If a call bypasses the task queue, the execution order of operators may become chaotic.
- Memory management problem: if aclrtMalloc is used to allocate the tiling space, occasional issues may occur. This problem does not occur when the memory is allocated through PyTorch.
Advantages: Simple implementation, no need to migrate the framework in PTA, thus avoiding various dependency issues. The current process of integrating the MLAPO operator into vllm-ascend follows this approach.
Disadvantages: There is some uncertainty, as it is unclear whether other issues might exist.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response