InfiniTensor · gongchensu · Nov 12, 2025
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@
 ## Documentation Index
 
 ### InfiniCore
-- [`Python APIs`]
+- [`Python APIs`](python/README.md)
 
 - [`C++ APIs`]
 
@@ -22,7 +22,7 @@
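 - [`InfiniCCL`] is a unified collective communication library that provides common collective operations, including point-to-point, broadcast, and aggregation.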
 
 
-[`Python APIs`]:README.md
+[`Python APIs`]:python/README.md
 [`C++ APIs`]:README.md
 [`InfiniRT`]:/infinirt/README.md
 [`InfiniOP`]:/infiniop/README.md

diff --git a/python/README.md b/python/README.md
@@ -0,0 +1,115 @@
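+# `infinicore` Python Frontend
+
+*InfiniCore* provides a Python wrapper consistent with the C++ frontend, located in `python/infinicore/`. The module uses `pybind11` to expose the core tensors, operators, and device context to Python, which makes it easy to integrate into inference frameworks or debugging scripts.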
+
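+## Module Structure
+
+| Symbol | Description |
+| --- | --- |
+| `device` | Device handle class (`python/infinicore/device.py`). Accepts forms such as `"cuda:0"` and `device("cpu", 0)`, or an existing instance to copy. |
+| `dtype` / `float16`, etc. | Data type enumeration (`python/infinicore/dtype.py`). |
+| `Tensor` | Tensor wrapper class (`python/infinicore/tensor.py`) that wraps the underlying `_infinicore` object. |
+| `empty` / `zeros` / `ones` / `empty_like`, etc. | Tensor constructors (`python/infinicore/tensor.py`). By default, `dtype` and `device` must be passed explicitly. |
+| Top-level operators (`add`, `matmul`, `rearrange`, `attention`) | Exposed in the `infinicore` namespace; the implementations live in `python/infinicore/ops/`. |
+| `infinicore.nn` | Collection of neural-network modules; more components may be added later. |
+| `infinicore.nn.functional` | Collection of functional operators (`python/infinicore/nn/functional.py`). |
+| `use_ntops` / `infinicore.ntops` | If the `ntops` package is installed, `use_ntops` is automatically set to `True` and the original module is exposed. |
+
+All symbols are explicitly exported in the package's `__init__.py` and can be used directly after `import infinicore as ic`.
+
+See `InfiniCore/python/infinicore/__init__.py` for the export definitions.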
+
+## API Index
+
+- **Core objects**
+  - [`device`](device/README.md)
+  - [`dtype`](dtype/README.md)
+  - [`Tensor`](tensor/README.md)
+  - Constructors: [`empty`](tensor/README.md#构造函数), [`zeros`](tensor/README.md#构造函数), [`ones`](tensor/README.md#构造函数), [`empty_like`](tensor/README.md#构造函数), [`from_blob`](tensor/README.md#构造函数)
+  - Common methods: [`copy_`](tensor/README.md#tensor-类), [`to`](tensor/README.md#tensor-类), [`permute`](tensor/README.md#tensor-类), [`view`](tensor/README.md#tensor-类), [`debug`](tensor/README.md#tensor-类)
+  - [`use_ntops` interoperation notes](#与-ntops-的协作)
+- **Top-level operators**
+  - [`add`](ops/add/README.md)
+  - [`matmul`](ops/matmul/README.md)
+  - [`rearrange`](ops/rearrange/README.md)
+  - [`attention`](ops/attention/README.md)
+- **Functional operators**
+  - [`causal_softmax`](nn/functional/causal_softmax/README.md)
+  - [`rms_norm`](nn/functional/rms_norm/README.md)
+  - [`silu`](nn/functional/silu/README.md)
+  - [`swiglu`](nn/functional/swiglu/README.md)
+- **Further reference**
+  - [`infinicore.ops` index](ops/README.md)
+  - [`nn` module overview](nn/README.md)
+
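+## Tensors and Constructors
+
+`Tensor` is a Python wrapper around the underlying `_infinicore.Tensor`. Commonly used interfaces include:
+
+- `shape` / `ndim` / `size(dim)` / `stride(dim)`: query the tensor's dimensions and strides.
+- `dtype` / `device`: return the `dtype` and `device` wrapper objects.
+- `numel()` / `is_contiguous()`: query the element count and the storage layout.
+- `copy_(src)` / `to(...)`: copy data and move it across devices.
+- `contiguous()` / `permute(dims)` / `view(shape)` / `as_strided(size, stride)`: layout adjustments and view operations.
+- `debug(filename=None)`: print the tensor contents or write them to a binary file.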
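+
+The relationship between `shape`, `stride`, `permute`, and `is_contiguous()` can be illustrated without the library. The following is a minimal pure-Python sketch, assuming row-major layout with strides counted in elements; `contiguous_strides` and `permute_layout` are hypothetical helpers for illustration and are not part of `infinicore`.
+
+```python
+from typing import Sequence, Tuple
+
+
+def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
+    """Row-major strides for `shape`, counted in elements (an assumption)."""
+    strides = []
+    step = 1
+    for dim in reversed(shape):
+        strides.append(step)
+        step *= dim
+    return tuple(reversed(strides))
+
+
+def permute_layout(shape, strides, dims):
+    """Reorder shape and strides together; no data is moved."""
+    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)
+
+
+shape = (4, 8)
+strides = contiguous_strides(shape)
+p_shape, p_strides = permute_layout(shape, strides, (1, 0))
+
+print(strides)                                   # (8, 1)
+print(p_shape, p_strides)                        # (8, 4) (1, 8)
+print(p_strides == contiguous_strides(p_shape))  # False: the permuted view is not contiguous
+```
+
+Under these assumptions, a permuted view keeps the same storage but no longer has row-major strides, which is the situation `contiguous()` is meant to resolve.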
+
+Commonly used constructors include `empty`, `strided_empty`, `zeros`, `ones`, `from_blob`, `strided_from_blob`, and `empty_like`:
+
+```python
+import infinicore as ic
+
+cpu = ic.device("cpu")
+a = ic.empty((4, 8), dtype=ic.float16, device=cpu)
+b = ic.ones((4, 8), dtype=ic.float16, device=cpu)
+a.copy_(b)
+```
+
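+> Note: these functions require `dtype` and `device` to be passed explicitly, which avoids inferring them implicitly from PyTorch/TensorFlow objects.
+
+## Top-Level Operators (`infinicore.*`)
+
+See the [`ops` documentation index](ops/README.md) and the individual operator documents.
+
+## Functional Operators (`infinicore.nn.functional`)
+
+See the [`nn.functional` documentation](nn/functional/README.md) and its subdirectories.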
+
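+## Runtime Context
+
+- `_infinicore` maintains runtime state within the process. Pass `device` explicitly when creating tensors, and keep all inputs of an operator on the same device.
+- To force synchronization, call low-level bindings such as `infinicore.lib._infinicore.sync_stream()` and `sync_device()`.
+- Operators called serially on the same execution stream usually need no extra synchronization.
+
+## Interoperation with `ntops`
+
+- Once `ntops` is imported successfully, `infinicore.use_ntops` is set to `True`, and the original module is accessible as `infinicore.ntops`.
+- `nn.functional.silu` delegates to `ntops.torch.silu` when `use_ntops=True`, the device type is `"cuda"` or `"musa"`, and `out` is not passed.
+- To force-disable this, set `infinicore.use_ntops = False` directly.
+
+## End-to-End Example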
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+
+q = ic.empty((8, 1, 128), dtype=ic.float16, device=device)
+k = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+v = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+k_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+v_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+
+out = ic.attention(q, k, v, k_cache, v_cache, pos=0)
+
+if ic.use_ntops:
+    # On some devices, SiLU is delegated to the high-performance ntops implementation
+    out = ic.nn.functional.silu(out)
+
+out.debug()
+```
+
+## Related Links
+
+- [`infinicore.ops` top-level operators](ops/README.md)
+- [`nn.functional` documentation](nn/functional/README.md)
+- [`InfiniOP` unified operator library](/infiniop/README.md)
diff --git a/python/device/README.md b/python/device/README.md
@@ -0,0 +1,35 @@
+# `infinicore.device`
+
+Device handle class, defined in `InfiniCore/python/infinicore/device.py`, used on the Python side to select and describe runtime devices.
+
+## Construction
+
+```python
+from infinicore import device
+
+cpu = device()                 # defaults to "cpu"
+cuda0 = device("cuda:0")       # specified as a string
+cuda1 = device("cuda", 1)      # type + index
+clone = device(cuda0)          # copied from an existing instance
+```
+
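+- `type`: supports `"cpu"`, `"cuda"`, `"mlu"`, `"npu"`, `"musa"`, and others, depending on what `_infinicore` was compiled with.
+- `index`: optional integer index; it must not be passed when the string already contains `":"`.
+- Passing an existing `device` instance copies its `type` and `index`.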
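+
+The argument rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not the code in `device.py`: `parse_device` is a hypothetical helper, and returning `None` for a missing index is an assumption.
+
+```python
+from typing import Optional, Tuple
+
+
+def parse_device(spec: str = "cpu", index: Optional[int] = None) -> Tuple[str, Optional[int]]:
+    """Hypothetical helper: split a device spec into (type, index)."""
+    if ":" in spec:
+        if index is not None:
+            # Mirrors the documented rule: no separate index when the string contains ":".
+            raise ValueError("index must not be passed when the string already contains ':'")
+        dev_type, _, idx = spec.partition(":")
+        return dev_type, int(idx)
+    # No ":" in the string: use the separately supplied index (None if omitted; an assumption).
+    return spec, index
+
+
+print(parse_device("cuda:0"))   # ('cuda', 0)
+print(parse_device("cuda", 1))  # ('cuda', 1)
+print(parse_device())           # ('cpu', None)
+```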
+
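+## Attributes and Methods
+
+- `type` / `index`: the public device type and index.
+- `__repr__()` / `__str__()`: print-friendly formats, such as `device(type='cuda', index=0)` or `"cuda:0"`.
+- `_underlying`: the internal `_infinicore.Device` object, for use by low-level APIs.
+
+## Relationship to the Runtime
+
+- `device` instances are used by all tensor constructors and top-level operators to ensure that `_infinicore` executes on the correct device.
+- To convert a low-level `_infinicore.Device` into a Python object, use `device._from_infinicore_device` (an internal method).
+
+## Related Links
+
+- [`Tensor` constructors](../tensor/README.md#构造函数)
+- [Runtime context](../README.md#运行时上下文)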
+
diff --git a/python/dtype/README.md b/python/dtype/README.md
@@ -0,0 +1,30 @@
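+# `infinicore.dtype` and Scalar Types
+
+`InfiniCore/python/infinicore/dtype.py` exports scalar type enumerations consistent with the C++ side. They can be accessed directly via `from infinicore import dtype, float16, int32, ...`.
+
+## Common Types
+
+- `dtype`: enumeration factory; it can be used to create a custom dtype or to restore one from `_underlying`.
+- Floating-point types: `float`, `float16`, `float32`, `float64`, `half`, `bfloat16`, `double`.
+- Complex types: `cfloat`, `cdouble`, `complex32`, `complex64`, `complex128`.
+- Integer types: `int`, `int8`, `int16`, `int32`, `int64`, `short`, `long`, `uint8`.
+- Boolean: `bool`.
+
+All type objects expose an `_underlying` attribute, which is what gets passed when calling low-level `_infinicore` interfaces.
+
+## Example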
+
+```python
+import infinicore as ic
+
+cpu = ic.device("cpu")
+a = ic.empty((4, 8), dtype=ic.float16, device=cpu)
+
+if a.dtype is ic.float16:
+    print("half precision tensor")
+```
+
+## Related Links
+
+- [`Tensor` constructors](../tensor/README.md#构造函数)
+- [`infinicore` module overview](../README.md)
diff --git a/python/nn/README.md b/python/nn/README.md
@@ -0,0 +1,25 @@
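+# `infinicore.nn` Module
+
+`infinicore.nn` gathers helper modules for neural networks. It currently exposes mainly the functional operator collection `infinicore.nn.functional`, and reserves room for other components (such as modular layers and optimizers).
+
+## Module Structure
+
+| Submodule | Description |
+| --- | --- |
+| `functional` | Collection of functional operators; see [`functional/README.md`](functional/README.md) for the interface documentation. |
+
+## Usage Example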
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+x = ic.ones((4, 1024), dtype=ic.float16, device=ic.device("cuda:0"))
+normed = F.rms_norm(x, normalized_shape=[1024], weight=ic.ones((1024,), dtype=x.dtype, device=x.device))
+activated = F.silu(normed)
+```
+
+## Related Links
+
+- [`nn.functional` documentation](functional/README.md)
+- [Python API overview](../README.md)
diff --git a/python/nn/functional/README.md b/python/nn/functional/README.md
@@ -0,0 +1,37 @@
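+# `infinicore.nn.functional` Functional Interface
+
+`infinicore.nn.functional` collects PyTorch-style functional operator wrappers. The implementation is in `InfiniCore/python/infinicore/nn/functional.py`; it depends on the `_infinicore` C++ bindings and reuses the runtime context.
+
+## Common Conventions
+
+- All functions return an `infinicore.Tensor`; when parameters such as `out` or `inplace` are provided, an existing buffer is reused.
+- Input tensors must be created by `infinicore` (or at least carry an `_underlying` pointer); otherwise they cannot interact with the underlying runtime.
+- If a function internally calls an in-place `_infinicore.*_` interface, make sure the output tensor matches the input in shape and dtype.
+
+## API Details
+
+- [`causal_softmax`](causal_softmax/README.md): causal-masked softmax.
+- [`rms_norm`](rms_norm/README.md): Root Mean Square LayerNorm.
+- [`silu`](silu/README.md): SiLU (Sigmoid Linear Unit) activation.
+- [`swiglu`](swiglu/README.md): SwiGLU forward gating.
+
+## Example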
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+x = ic.empty((4, 1024), dtype=ic.float16, device=ic.device("cuda:0"))
+w = ic.empty((1024,), dtype=ic.float16, device=x.device)
+
+normed = F.rms_norm(x, normalized_shape=list(w.shape), weight=w)
+activated = F.silu(normed)
+gated = F.swiglu(activated, ic.empty_like(activated))
+
+probs = F.causal_softmax(gated, out=ic.empty_like(gated))
+```
+
+## Related Links
+
+- [Python API overview](../../README.md)
+- [`ntops` interoperation notes](../../README.md#与-ntops-的协作)
diff --git a/python/nn/functional/causal_softmax/README.md b/python/nn/functional/causal_softmax/README.md
@@ -0,0 +1,32 @@
+# `nn.functional.causal_softmax`
+
+Applies a causal-masked softmax over the last dimension, for autoregressive attention. The function is defined in `InfiniCore/python/infinicore/nn/functional.py`.
+
+## Function Signature
+
+```python
+def causal_softmax(input: Tensor, out: Optional[Tensor] = None) -> Tensor
+```
+
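+- `input`: a tensor of any rank; the last dimension is treated as the sequence dimension, and the causal mask is applied to it.
+- `out`: optional output tensor; if provided, it must match `input` exactly in shape, `dtype`, and `device`.
+
+By default a new tensor is returned; when `out` is provided, `_infinicore.causal_softmax_` is called to write the result in place.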
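+
+For intuition, a causal softmax over a 2-D score matrix can be written in pure Python as below. This is a sketch under stated assumptions rather than the `_infinicore` kernel: it assumes the mask is aligned so that the last row sees every column (row `i` sees columns `0..i + cols - rows`), and `causal_softmax_reference` is a hypothetical name.
+
+```python
+import math
+from typing import List
+
+
+def causal_softmax_reference(scores: List[List[float]]) -> List[List[float]]:
+    """Row-wise softmax where masked (future) positions get probability 0."""
+    rows, cols = len(scores), len(scores[0])
+    assert rows <= cols, "this sketch assumes at least as many columns as rows"
+    offset = cols - rows  # assumed alignment: the last row sees every column
+    out = []
+    for i, row in enumerate(scores):
+        visible = i + offset + 1
+        peak = max(row[:visible])  # subtract the max for numerical stability
+        exps = [math.exp(v - peak) for v in row[:visible]]
+        total = sum(exps)
+        out.append([e / total for e in exps] + [0.0] * (cols - visible))
+    return out
+
+
+print(causal_softmax_reference([[0.0, 0.0], [0.0, 0.0]]))  # [[1.0, 0.0], [0.5, 0.5]]
+```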
+
+## Example
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+logits = ic.empty((4, 128), dtype=ic.float16, device=device)
+
+probs = F.causal_softmax(logits)
+F.causal_softmax(logits, out=logits)  # write back in place
+```
+
+## Related Links
+
+- [`nn.functional` documentation](../README.md)
+- [`infinicore.attention` operator](../../../ops/attention/README.md)
diff --git a/python/nn/functional/rms_norm/README.md b/python/nn/functional/rms_norm/README.md
@@ -0,0 +1,43 @@
+# `nn.functional.rms_norm`
+
+Implements Root Mean Square LayerNorm. The function is defined in `InfiniCore/python/infinicore/nn/functional.py`.
+
+## Function Signature
+
+```python
+def rms_norm(
+    input: Tensor,
+    normalized_shape: list[int],
+    weight: Tensor,
+    eps: float = 1e-5,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
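+- `input`: the tensor to normalize; the last dimension is usually the hidden dimension.
+- `normalized_shape`: list of dimension sizes to normalize over; it is compared strictly against `weight.shape`.
+- `weight`: scale tensor; its dimensions must match `normalized_shape`.
+- `eps`: numerical stability term, default `1e-5`.
+- `out`: optional output tensor; if provided, it must match `input` in shape, `dtype`, and `device`.
+
+The function first asserts `normalized_shape == weight.shape`, then calls `_infinicore.rms_norm` / `_infinicore.rms_norm_` to perform the computation.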
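+
+As a reference for what the operator computes, the standard RMSNorm formula for a single row is `y = x / sqrt(mean(x^2) + eps) * weight`. The pure-Python sketch below follows that textbook definition; `rms_norm_reference` is a hypothetical name, and the exact numerics of the `_infinicore` kernel (for example, the accumulation precision) are not covered here.
+
+```python
+import math
+from typing import List
+
+
+def rms_norm_reference(x: List[float], weight: List[float], eps: float = 1e-5) -> List[float]:
+    """Textbook RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
+    assert len(x) == len(weight), "weight must match the normalized dimension"
+    mean_sq = sum(v * v for v in x) / len(x)
+    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
+    return [v * inv_rms * w for v, w in zip(x, weight)]
+
+
+# With x = [3, 4] and eps = 0, mean(x^2) = 12.5, so each element is divided by sqrt(12.5).
+print(rms_norm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0))
+```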
+
+## Example
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+x = ic.ones((4, 1024), dtype=ic.float16, device=device)
+gamma = ic.ones((1024,), dtype=ic.float16, device=device)
+
+y = F.rms_norm(x, normalized_shape=list(gamma.shape), weight=gamma, eps=1e-5)
+F.rms_norm(x, normalized_shape=[1024], weight=gamma, out=x)  # write back in place
+```
+
+## Related Links
+
+- [`nn.functional` documentation](../README.md)
+- [`Tensor` constructors](../../../README.md#张量与构造函数)
diff --git a/python/nn/functional/silu/README.md b/python/nn/functional/silu/README.md
@@ -0,0 +1,42 @@
+# `nn.functional.silu`
+
+Sigmoid Linear Unit (SiLU) activation function. The implementation is in `InfiniCore/python/infinicore/nn/functional.py`.
+
+## Function Signature
+
+```python
+def silu(
+    input: Tensor,
+    inplace: bool = False,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
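+- `input`: the tensor to activate.
+- `inplace`: whether to write the result back into `input`.
+- `out`: optional output tensor; if provided, it must match `input` in shape, `dtype`, and `device`.
+
+### Behavior
+
+- When `inplace=True`, `_infinicore.silu_` is called directly to write back into `input`, which is then returned.
+- When `out` is not provided, `infinicore.use_ntops=True`, and the device type is `"cuda"` or `"musa"`, the call is delegated to `ntops.torch.silu` to reuse its optimized implementation.
+- In all other cases, `_infinicore.silu` or `_infinicore.silu_` is called to perform the computation.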
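+
+The dispatch order described above can be summarized as a small decision function. This is a sketch of the documented rules only, not the code in `functional.py`: `choose_silu_path` is a hypothetical helper, and mapping "`out` provided" to `_infinicore.silu_` is an assumption based on the in-place naming convention.
+
+```python
+def choose_silu_path(inplace: bool, has_out: bool, use_ntops: bool, device_type: str) -> str:
+    """Return the name of the backend call that the documented rules would select."""
+    if inplace:
+        return "_infinicore.silu_ on input"
+    if not has_out and use_ntops and device_type in ("cuda", "musa"):
+        return "ntops.torch.silu"
+    if has_out:
+        # Assumption: an explicit `out` goes through the in-place binding.
+        return "_infinicore.silu_ on out"
+    return "_infinicore.silu"
+
+
+print(choose_silu_path(False, False, True, "cuda"))  # ntops.torch.silu
+print(choose_silu_path(False, False, True, "cpu"))   # _infinicore.silu
+print(choose_silu_path(True, False, True, "cuda"))   # _infinicore.silu_ on input
+```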
+
+## Example
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+x = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+y = F.silu(x)                 # returns a new tensor
+F.silu(x, inplace=True)       # updates x in place
+```
+
+## Related Links
+
+- [`nn.functional` documentation](../README.md)
+- [`use_ntops` interoperation notes](../../../README.md#与-ntops-的协作)