Add Python interface documentation.

gongchensu · gongchensu · commit 9f5f5ceda352 · 2025-11-12T11:30:59.000+08:00
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@
 ## Documentation Index
 
 ### InfiniCore
-- [`Python APIs`]
+- [`Python APIs`](python/README.md)
 
 - [`C++ APIs`]
 
@@ -22,7 +22,7 @@
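 - [`InfiniCCL`]: a unified collective communication library that provides common collective communication primitives, including point-to-point, broadcast, and aggregation.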
 
 
-[`Python APIs`]:README.md
+[`Python APIs`]:python/README.md
 [`C++ APIs`]:README.md
 [`InfiniRT`]:/infinirt/README.md
 [`InfiniOP`]:/infiniop/README.md
diff --git a/python/README.md b/python/README.md
@@ -0,0 +1,112 @@
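+# `infinicore` Python Frontend
+
+*InfiniCore* provides a Python wrapper that mirrors the C++ frontend, located in `python/infinicore/`. The module uses `pybind11` to expose the core tensors, operators, and device context to Python, which makes it easy to integrate into inference frameworks or debugging scripts.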
+
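+## Module Layout
+
+| Symbol | Description |
+| --- | --- |
+| `device` | Device handle class (`python/infinicore/device.py`). Accepts forms such as `"cuda:0"` or `device("cpu", 0)`, or reuses an existing instance. |
+| `dtype` / `float16`, etc. | Data type enumeration (`python/infinicore/dtype.py`). |
+| `Tensor` | Tensor wrapper class (`python/infinicore/tensor.py`) that wraps the underlying `_infinicore` object. |
+| `empty` / `zeros` / `ones` / `empty_like`, etc. | Tensor constructors (`python/infinicore/tensor.py`); by default they require `dtype` and `device` to be passed explicitly. |
+| Top-level operators (`add`, `matmul`, `rearrange`, `attention`) | Exposed in the `infinicore` namespace; the implementations live in `python/infinicore/ops/`. |
+| `infinicore.nn` | Collection of neural-network modules; more components may be added later. |
+| `infinicore.nn.functional` | Collection of functional operators (`python/infinicore/nn/functional.py`). |
+| `use_ntops` / `infinicore.ntops` | If the `ntops` package is installed, `use_ntops` is automatically set to `True` and the original module is exposed. |
+
+All symbols are explicitly exported in the package's `__init__.py`, so they are available directly after `import infinicore as ic`.
+
+See `InfiniCore/python/infinicore/__init__.py` for the export definitions.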
+
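+## Tensors and Constructors
+
+`Tensor` is a Python wrapper around the underlying `_infinicore.Tensor`. Commonly used interfaces include:
+
+- `shape` / `ndim` / `size(dim)` / `stride(dim)`: query dimension and stride information.
+- `dtype` / `device`: return the `dtype` and `device` wrapper classes.
+- `numel()` / `is_contiguous()`: inspect the element count and storage layout.
+- `copy_(src)` / `to(...)`: copy data and move it across devices.
+- `contiguous()` / `permute(dims)` / `view(shape)` / `as_strided(size, stride)`: layout adjustments and view operations.
+- `debug(filename=None)`: print the tensor contents, or write them to a binary file.
+
+Common constructors include `empty`, `strided_empty`, `zeros`, `ones`, `from_blob`, `strided_from_blob`, and `empty_like`: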
+
+```python
+import infinicore as ic
+
+cpu = ic.device("cpu")
+a = ic.empty((4, 8), dtype=ic.float16, device=cpu)
+b = ic.ones((4, 8), dtype=ic.float16, device=cpu)
+a.copy_(b)
+```
+
+> Note: these functions require `dtype` and `device` to be passed explicitly, which avoids implicit inference from PyTorch/TensorFlow objects.
+
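+## Top-Level Operators (`infinicore.*`)
+
+The following functions are exported directly in the `infinicore` namespace. All of them accept an optional `out` keyword argument for buffer reuse:
+
+| Function | Defined in | Description |
+| --- | --- | --- |
+| `add(input, other, *, out=None)` | `python/infinicore/ops/add.py` | Element-wise addition; supports broadcasting and non-contiguous tensors. |
+| `matmul(input, other, *, out=None)` | `python/infinicore/ops/matmul.py` | GEMM wrapper that reuses *InfiniOP* descriptors underneath. |
+| `rearrange(input, other=None, *, out=None)` | `python/infinicore/ops/rearrange.py` | Produces a contiguous copy or rearranges data; the `other` parameter is currently reserved and unused. |
+| `attention(q, k, v, k_cache, v_cache, pos, *, out=None)` | `python/infinicore/ops/attention.py` | Decode-phase attention; manages the KV cache and supports optional output reuse. |
+
+For more detailed parameter descriptions and examples, see the [`infinicore.ops` documentation](ops/README.md).
+
+More examples can be found in the end-to-end example below or in the source comments.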
+
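+## Functional Operators (`infinicore.nn.functional`)
+
+The functional API lives in `infinicore.nn.functional`:
+
+| Function | Defined in | Description |
+| --- | --- | --- |
+| `causal_softmax(input, out=None)` | `python/infinicore/nn/functional.py` | Causally masked softmax over the last dimension; can write in place into `out`. |
+| `rms_norm(input, normalized_shape, weight, eps=1e-5, *, out=None)` | Same as above | Root Mean Square layer normalization; asserts `normalized_shape == weight.shape` before running. |
+| `silu(input, inplace=False, *, out=None)` | Same as above | SiLU activation; with `inplace=True` it overwrites the input, and it may delegate to `ntops` when the conditions are met. |
+| `swiglu(input, other, *, out=None)` | Same as above | SwiGLU forward gating; requires `input` and `other` to match in shape, dtype, and device. |
+
+`silu` prefers the `ntops` backend to reuse existing optimizations, while `swiglu`, `rms_norm`, and the other functions call the `_infinicore` bindings internally. See the [`nn.functional` documentation](nn/functional/README.md) for details and caveats.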
+
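+## Runtime Context
+
+- `_infinicore` maintains runtime state within the process. Pass `device` explicitly when creating tensors, and keep all inputs of an operator on the same device.
+- To force synchronization, call low-level bindings such as `infinicore.lib._infinicore.sync_stream()` or `sync_device()`.
+- Operators called serially on the same execution stream usually need no extra synchronization.
+
+## Working with `ntops`
+
+- After `ntops` is imported successfully, `infinicore.use_ntops` is set to `True`, and the original module is available as `infinicore.ntops`.
+- `nn.functional.silu` delegates to `ntops.torch.silu` when `use_ntops=True`, the device type is `"cuda"` or `"musa"`, and `out` is not passed.
+- To force-disable this, set `infinicore.use_ntops = False` directly.
+
+## End-to-End Example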
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+
+q = ic.empty((8, 1, 128), dtype=ic.float16, device=device)
+k = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+v = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+k_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+v_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+
+out = ic.attention(q, k, v, k_cache, v_cache, pos=0)
+
+if ic.use_ntops:
+    # On some devices, SiLU is delegated to the high-performance ntops implementation
+    out = ic.nn.functional.silu(out)
+
+out.debug()
+```
+
+## Related Links
+
+- [`infinicore.ops` top-level operators](ops/README.md)
+- [`nn.functional` functional documentation](nn/functional/README.md)
+- [`InfiniOP` unified operator library](/infiniop/README.md)
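The `Tensor` section of `python/README.md` above lists `stride(dim)`, `permute(dims)`, and `is_contiguous()` without saying how they relate. The following is a minimal pure-Python sketch of the conventional row-major meaning of those terms, not the `infinicore` implementation. The helper names are hypothetical, and it assumes strides are counted in elements, which the documentation does not state.

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, counted in elements, for a dense tensor."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """Reorder dimensions without moving data: only the metadata changes."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


def is_contiguous(shape, strides):
    """True when the strides match the dense row-major layout for the shape."""
    expected = contiguous_strides(shape)
    # A dimension of extent 1 may carry any stride without affecting the layout.
    return all(e == 1 or s == x for e, s, x in zip(shape, strides, expected))


shape = (4, 8)
strides = contiguous_strides(shape)
# Transposed view, analogous to b.permute([1, 0]) in the ops example.
t_shape, t_strides = permute(shape, strides, [1, 0])
```

Under these assumptions the transposed view reports a non-contiguous layout, which is the situation where a call such as `contiguous()` or `rearrange` would be expected to produce a dense copy.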
diff --git a/python/nn/functional/README.md b/python/nn/functional/README.md
@@ -0,0 +1,71 @@
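+# `infinicore.nn.functional` Functional Interface
+
+`infinicore.nn.functional` collects PyTorch-style functional operator wrappers. The implementation lives in `InfiniCore/python/infinicore/nn/functional.py`; it depends on the `_infinicore` C++ bindings and reuses the runtime context.
+
+## Common Conventions
+
+- Every function returns an `infinicore.Tensor`; when arguments such as `out` or `inplace` are provided, an existing buffer is reused.
+- Input tensors must be created by `infinicore` (or at least carry an `_underlying` pointer); otherwise they cannot interact with the underlying runtime.
+- If a function internally calls an in-place `_infinicore.*_` interface, the output tensor must match the input in shape and dtype.
+
+## API Details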
+
+### `causal_softmax(input: Tensor, out: Optional[Tensor] = None) -> Tensor`
+
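+- Purpose: applies a causally masked softmax over the last dimension; commonly used in autoregressive attention.
+- Parameters:
+  - `input`: a tensor of any shape; the last dimension is treated as the sequence length.
+  - `out`: optional output tensor; if provided, it must have the same shape and dtype as `input`.
+- Behavior:
+  - without `out`, a new tensor is created internally and returned;
+  - with `out`, `_infinicore.causal_softmax_` is called to write in place.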
+
+### `rms_norm(input: Tensor, normalized_shape: list[int], weight: Tensor, eps: float = 1e-5, *, out: Optional[Tensor] = None) -> Tensor`
+
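+- Purpose: implements Root Mean Square layer normalization.
+- Parameters:
+  - `normalized_shape`: a list exactly equal to `weight.shape`; used only for shape validation;
+  - `weight`: scale tensor; supported dtypes are `float16`/`bfloat16`/`float32`;
+  - `eps`: numerical stability term, default `1e-5`;
+  - `out`: optional output tensor.
+- Assertion: `normalized_shape == weight.shape`; a violation raises an exception.
+- Calls `_infinicore.rms_norm` / `_infinicore.rms_norm_` underneath; memory allocation is skipped when `out` is provided.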
+
+### `silu(input: Tensor, inplace: bool = False, *, out: Optional[Tensor] = None) -> Tensor`
+
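+- Purpose: applies SiLU (Sigmoid Linear Unit) element-wise.
+- Optimization:
+  - when `infinicore.use_ntops` is true, the device type is in `{"cuda", "musa"}`, and `out` is not specified, the call is delegated to `ntops.torch.silu` to reuse the existing optimized implementation.
+- Branches:
+  - `inplace=True`: calls `_infinicore.silu_` directly and writes the result back to `input`;
+  - `out` specified: writes the result into `out`;
+  - default path: returns a new tensor.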
+
+### `swiglu(input: Tensor, other: Tensor, *, out: Optional[Tensor] = None) -> Tensor`
+
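+- Purpose: implements the SwiGLU forward pass (`input` and `other` correspond to the activation and gating branches, respectively).
+- Requirement: `input` and `other` must match in shape, dtype, and device.
+- Output:
+  - without `out`, calls `_infinicore.swiglu` and returns a new tensor;
+  - with `out`, calls `_infinicore.swiglu_` to write in place.
+
+## Example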
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+x = ic.empty((4, 1024), dtype=ic.float16, device=ic.device("cuda:0"))
+w = ic.empty((1024,), dtype=ic.float16, device=x.device)
+
+normed = F.rms_norm(x, normalized_shape=list(w.shape), weight=w)
+activated = F.silu(normed)
+gated = F.swiglu(activated, ic.empty_like(activated))
+
+probs = F.causal_softmax(gated, out=ic.empty_like(gated))
+```
+
+## Related Links
+
+- [Python API overview](../../README.md)
+- [`ntops` integration notes](../../README.md#与-ntops-的协作)
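The functional README above names `rms_norm`, `silu`, and `causal_softmax` but does not spell out the arithmetic. The pure-Python sketch below shows the standard textbook definitions on plain lists, as a reference for checking results by hand. It is not the `_infinicore` kernel code. The causal mask convention for non-square inputs (row `i` sees columns `j <= i + (total_len - seq_len)`) is an assumption, since the documentation only says "causal mask over the last dimension".

```python
import math


def rms_norm(row, weight, eps=1e-5):
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i, computed over one row."""
    assert len(row) == len(weight)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]


def silu(row):
    """SiLU(x) = x * sigmoid(x), applied element-wise."""
    return [v / (1.0 + math.exp(-v)) for v in row]


def causal_softmax(scores):
    """Softmax over the last dimension of a (seq_len, total_len) matrix.

    Assumed mask: query row i may see key columns j <= i + (total_len - seq_len).
    """
    seq_len, total_len = len(scores), len(scores[0])
    assert seq_len <= total_len
    out = []
    for i, row in enumerate(scores):
        visible = i + (total_len - seq_len) + 1
        peak = max(row[:visible])  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row[:visible]]
        denom = sum(exps)
        # Masked (future) positions receive probability zero.
        out.append([e / denom for e in exps] + [0.0] * (total_len - visible))
    return out


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
probs = causal_softmax([[0.0] * 3 for _ in range(3)])
```

With all-zero scores, each row of `probs` is uniform over its visible prefix, which makes the mask easy to inspect.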
diff --git a/python/ops/README.md b/python/ops/README.md
@@ -0,0 +1,74 @@
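+# `infinicore.ops` Top-Level Operators
+
+This module uses pybind11 to expose common operators directly in the `infinicore` namespace; the source is in `InfiniCore/python/infinicore/ops/`. Every function accepts an optional `out` argument for reusing the output buffer.
+
+## General Notes
+
+- All arguments must be `infinicore.Tensor` instances (or at least carry an `_underlying` pointer); otherwise they cannot be passed to the underlying `_infinicore`.
+- The `out` tensor, if provided, must exactly match the output in shape, dtype, and device.
+- Operator execution depends on the underlying `_infinicore` runtime; pass the intended `device` when creating tensors and keep all inputs on the same device.
+
+## API List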
+
+### `add(input: Tensor, other: Tensor, *, out: Optional[Tensor] = None) -> Tensor`
+
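+- **Purpose**: element-wise addition with broadcasting support.
+- **Implementation**: `InfiniCore/python/infinicore/ops/add.py`.
+- **Behavior**:
+  - returns a new tensor by default;
+  - with `out`, calls `_infinicore.add_` to write in place.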
+
+### `matmul(input: Tensor, other: Tensor, *, out: Optional[Tensor] = None) -> Tensor`
+
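+- **Purpose**: matrix multiplication / GEMM.
+- **Implementation**: `InfiniCore/python/infinicore/ops/matmul.py`.
+- **Constraints**:
+  - input shapes and dtypes must satisfy the requirements of *InfiniOP* GEMM;
+  - when `out` is provided, the caller is responsible for initializing it before the call (especially when `beta != 0`).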
+
+### `rearrange(input: Tensor, other: Optional[Tensor] = None, *, out: Optional[Tensor] = None) -> Tensor`
+
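+- **Purpose**: produces a contiguous copy or rearranges the tensor layout.
+- **Implementation**: `InfiniCore/python/infinicore/ops/rearrange.py`.
+- **Notes**: the `other` parameter is currently reserved and unused; when `out` is passed, the result is written into the existing tensor.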
+
+### `attention(q: Tensor, k: Tensor, v: Tensor, k_cache: Tensor, v_cache: Tensor, pos: int, *, out: Optional[Tensor] = None) -> Tensor`
+
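+- **Purpose**: decode-phase attention that reads and updates the KV cache.
+- **Implementation**: `InfiniCore/python/infinicore/ops/attention.py`.
+- **Parameter requirements**:
+  - `q`/`k`/`v`: `(n_head, seq_len, head_dim)`;
+  - `k_cache`/`v_cache`: `(n_kv_head, cache_len, head_dim)`, with `pos + seq_len <= cache_len`;
+  - `pos`: index of the write position.
+- **Behavior**:
+  - creates and returns a new tensor by default;
+  - with `out`, matches the underlying `_infinicore.attention_` and reuses the output buffer.
+
+## Example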
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+a = ic.ones((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+ic.add(a, b, out=a)  # in-place accumulation
+c = ic.matmul(a, b.permute([1, 0]))  # (4, 8) @ (8, 4)
+
+contiguous = ic.rearrange(c)
+
+attn_out = ic.attention(
+    q=contiguous,
+    k=contiguous,
+    v=contiguous,
+    k_cache=ic.empty((4, 128, contiguous.shape[-1]), dtype=contiguous.dtype, device=contiguous.device),
+    v_cache=ic.empty((4, 128, contiguous.shape[-1]), dtype=contiguous.dtype, device=contiguous.device),
+    pos=0,
+)
+```
+
+## Related Links
+
+- [Python API overview](../README.md)
+- [`nn.functional` functional interface](../nn/functional/README.md)
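The `attention` entry above states its shape constraints in prose, including `pos + seq_len <= cache_len`. As a sketch, those constraints could be checked on plain shape tuples before calling the operator. `check_attention_shapes` is a hypothetical helper and not part of `infinicore`. It also assumes that `k` and `v` share a shape and that the two caches share a shape, which the documentation implies but does not state. Head counts are deliberately not compared, because the documented shapes and the end-to-end example differ on whether `k`/`v` use `n_head` or `n_kv_head`.

```python
def check_attention_shapes(q, k, v, k_cache, v_cache, pos):
    """Validate shape tuples against the documented attention constraints.

    Hypothetical helper for illustration; returns the number of cache slots
    still free after this step.
    """
    if not all(len(s) == 3 for s in (q, k, v, k_cache, v_cache)):
        raise ValueError("all tensors must be 3-D")
    if k != v or k_cache != v_cache:
        # Assumption: k/v and the two caches match pairwise.
        raise ValueError("k/v and k_cache/v_cache must match pairwise")
    seq_len, head_dim = q[1], q[2]
    if k[1:] != (seq_len, head_dim):
        raise ValueError("q and k/v disagree on seq_len or head_dim")
    cache_len = k_cache[1]
    if k_cache[2] != head_dim:
        raise ValueError("cache head_dim does not match q")
    if pos < 0 or pos + seq_len > cache_len:
        raise ValueError("pos + seq_len must not exceed cache_len")
    return cache_len - (pos + seq_len)


# Shapes taken from the end-to-end example in python/README.md.
free_slots = check_attention_shapes(
    (8, 1, 128), (2, 1, 128), (2, 1, 128), (2, 128, 128), (2, 128, 128), pos=0
)
```

A decode loop would advance `pos` by `seq_len` after each call, so the returned count indicates how many further tokens fit in the cache.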