refine design and migration

SigureMo · SigureMo · commit 09b453bf21bd · 2025-12-03T22:02:35.000+08:00
diff --git a/docs/guides/custom_op/cross_ecosystem_custom_op/design_and_migration_cn.md b/docs/guides/custom_op/cross_ecosystem_custom_op/design_and_migration_cn.md
@@ -4,7 +4,10 @@
 
 To make it easy to bring PyTorch custom operators into the PaddlePaddle framework, we provide the compatibility mechanism shown in the figure below:
 
-![Diagram of the cross-ecosystem custom operator compatibility mechanism](./images/cross-ecosystem-custom-op-compatible.drawio.png)
+<figure align="center">
+    <img src="https://github.com/PaddlePaddle/docs/blob/develop/docs/guides/custom_op/cross_ecosystem_custom_op/images/cross-ecosystem-custom-op-compatible.drawio.png?raw=true" width="700" alt='missing' align="center"/>
+    <figcaption><center>Diagram of the cross-ecosystem custom operator compatibility mechanism</center></figcaption>
+</figure>
 
 As the figure shows, we provide the following layers of support, from the bottom up:
 
@@ -262,23 +265,33 @@ using at::Tensor;
 
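 The compatibility layer is still under active development, however, and some common APIs are not yet covered. Using one of them causes a compilation error, and you can use the error message to locate and fix the affected code.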
 
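-Take `Tensor.reshape` as an example. Suppose a custom operator uses this API but Paddle does not provide a compatible implementation, so a compilation error occurs. In that case we can temporarily extract the `paddle::Tensor` inside the `at::Tensor` and use the equivalent PaddlePaddle API to implement the functionality:
+Take `torch::empty` as an example. Suppose the operator library uses this API but Paddle does not provide a compatible implementation, so a compilation error occurs: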
+
+```text
+/workspace/cross-ecosystem-custom-op-example/csrc/muladd.cc: In function ‘at::Tensor muladd_cpu(at::Tensor, const at::Tensor&, double)’:
+/workspace/cross-ecosystem-custom-op-example/csrc/muladd.cc:54:30: error: ‘empty’ is not a member of ‘torch’
+   54 |   at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
+      |                              ^~~~~
+```
+
+In that case, we can convert the PyTorch structs into Paddle structs and use the equivalent PaddlePaddle API to implement the functionality:
+
+That is, take the following code:
 
 ```cpp
 // Original PyTorch code
-at::IntArrayRef sizes = {2, 3, 4};
-at::Tensor reshaped_tensor = x.reshape(sizes);
+at::Tensor result = torch::empty(a_contig.sizes(), a_contig.options());
 ```
 
 We can replace it with:
 
 ```cpp
 // Replaced with the equivalent PaddlePaddle implementation
-at::IntArrayRef sizes = {2, 3, 4};
-auto paddle_tensor = x._PD_GetInner();  // Get the inner paddle::Tensor
-auto paddle_sizes = sizes._PD_ToPaddleIntArray();  // Convert to paddle::IntArray
-auto paddle_reshaped_tensor = paddle::experimental::reshape(paddle_tensor, paddle_sizes);  // Use the PaddlePaddle reshape API
-at::Tensor reshaped_tensor(paddle_reshaped_tensor);  // Wrap back into an at::Tensor
+auto paddle_size = a_contig.sizes()._PD_ToPaddleIntArray();  // Convert the PyTorch IntArrayRef to a Paddle IntArray
+auto paddle_dtype = compat::_PD_AtenScalarTypeToPhiDataType(a_contig.dtype());  // Convert the PyTorch ScalarType to a Paddle DataType
+auto paddle_place = a_contig.options()._PD_GetPlace();  // Convert the PyTorch Device to a Paddle Place
+auto paddle_result = paddle::experimental::empty(paddle_size, paddle_dtype, paddle_place);  // Call the PaddlePaddle empty API
+at::Tensor result(paddle_result);  // Wrap the Paddle Tensor as a PyTorch Tensor
 ```
 
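 For more on using the PaddlePaddle C++ API, see the [PaddlePaddle C++ custom operator documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/custom_op/new_cpp_op_cn.html). In this way, you can fix compilation errors one by one until the custom operator compiles successfully.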
diff --git a/docs/guides/custom_op/cross_ecosystem_custom_op/user_guide_cn.md b/docs/guides/custom_op/cross_ecosystem_custom_op/user_guide_cn.md
@@ -4,30 +4,32 @@
 
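 With the rapid development of large-model technology, custom operators have become a key means of optimizing model performance and extending framework functionality. The PyTorch ecosystem has accumulated a large number of high-quality custom operator libraries and operator implementations based on kernel DSLs (such as Triton and TileLang). To break down ecosystem barriers and help users migrate these operator resources to the PaddlePaddle framework at low cost, we have introduced a cross-ecosystem custom operator compatibility mechanism. It lets users use PyTorch-ecosystem custom operator libraries and kernel DSLs directly in PaddlePaddle, which greatly reduces migration cost and improves development efficiency.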
 
-## Custom operator libraries
+## External operator libraries
 
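 The PyTorch ecosystem currently contains many high-quality custom operator libraries (such as FlashInfer and FlashMLA), which are usually written in CUDA/C++ and packaged as Python extensions. To reuse these existing libraries, we provide compatibility support so that users can install and use them directly in PaddlePaddle without tedious code porting.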
 
 ### Installation
 
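-For cross-ecosystem custom operator libraries based on the compatibility approach, it is usually enough to clone the repository and install the library with pip. The following uses `FlashInfer` as an example:
+Cross-ecosystem custom operator libraries based on the compatibility approach can generally be installed in one of two ways: from source or from PyPI. Most operator libraries are hosted on GitHub, and users can follow each library's own installation instructions. The following two typical libraries illustrate both approaches: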
 
-```bash
-pip install paddlepaddle_gpu  # Install PaddlePaddle with GPU support, refer to https://www.paddlepaddle.org.cn/install/quick for more details
-git clone https://github.com/PFCCLab/flashinfer.git
-cd flashinfer
-git submodule update --init
-pip install apache-tvm-ffi>=0.1.2  # Use TVM FFI 0.1.2 or above
-pip install filelock jinja2  # Install tools for jit compilation
-# Install FlashInfer
-pip install --no-build-isolation . -v
-```
+- Install from source (using `FlashInfer` as an example):
 
-Some custom operator libraries have already been published to PyPI and can be installed directly with pip. The following uses `TorchCodec` as an example:
+    ```bash
+    pip install paddlepaddle_gpu  # Install PaddlePaddle with GPU support, refer to https://www.paddlepaddle.org.cn/install/quick for more details
+    git clone https://github.com/PFCCLab/flashinfer.git
+    cd flashinfer
+    git submodule update --init
+    pip install "apache-tvm-ffi>=0.1.2"  # Use TVM FFI 0.1.2 or above (quoted so the shell does not treat >= as a redirect)
+    pip install filelock jinja2  # Install tools for jit compilation
+    # Install FlashInfer
+    pip install --no-build-isolation . -v
+    ```
 
-```bash
-pip install paddlecodec
-```
+- Install from PyPI (using `TorchCodec` as an example):
+
+    ```bash
+    pip install paddlecodec
+    ```
 
 Some operator libraries may require special installation steps; refer to the `README.md` in the corresponding repository.
 
@@ -116,6 +118,10 @@ import paddle
 # Limit the effective scope to the TileLang module
 paddle.compat.enable_torch_proxy(scope={"tilelang"})
 
+import tilelang
+import tilelang.language as T
+import numpy as np
+
 # From here on, usage is the same as in the official PyTorch ecosystem
 @tilelang.jit
 def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):