diff --git a/README.md b/README.md
index f40d463..79a17fc 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@
 ## 文档目录
 
 ### InfiniCore
-- [`Python APIs`]
+- [`Python APIs`](python/README.md)
 
 - [`C++ APIs`]
 
@@ -22,7 +22,7 @@
 - [`InfiniCCL`]：统一集合通信库，提供常用的集合通信功能，包括点对点、广播、聚合等。
 
 
-[`Python APIs`]:README.md
+[`Python APIs`]:python/README.md
 [`C++ APIs`]:README.md
 [`InfiniRT`]:/infinirt/README.md
 [`InfiniOP`]:/infiniop/README.md
diff --git a/python/README.md b/python/README.md
new file mode 100644
index 0000000..d6d51e4
--- /dev/null
+++ b/python/README.md
@@ -0,0 +1,115 @@
+# `infinicore` Python 前端
+
+*InfiniCore* 提供了与 C++ 前端一致的 Python 封装，位于 `python/infinicore/`。该模块通过 `pybind11` 将核心张量、算子与设备上下文暴露给 Python，便于在推理框架或调试脚本中快速集成。
+
+## 模块结构
+
+| 符号 | 说明 |
+| --- | --- |
+| `device` | 设备句柄类 (`python/infinicore/device.py`)，支持 `"cuda:0"`、`device("cpu", 0)` 等写法或复用已有实例。 |
+| `dtype` / `float16` 等 | 数据类型枚举 (`python/infinicore/dtype.py`)。 |
+| `Tensor` | 张量包装类 (`python/infinicore/tensor.py`)，内部封装底层 `_infinicore` 对象。 |
+| `empty` / `zeros` / `ones` / `empty_like` 等 | 张量构造函数（`python/infinicore/tensor.py`），默认要求显式传入 `dtype` 与 `device`。 |
+| 顶层算子（`add`、`matmul`、`rearrange`、`attention`） | 暴露在 `infinicore` 命名空间下，对应实现位于 `python/infinicore/ops/`。 |
+| `infinicore.nn` | 神经网络相关模块集合，未来可扩展更多组件。 |
+| `infinicore.nn.functional` | 函数式算子集合 (`python/infinicore/nn/functional.py`)。 |
+| `use_ntops` / `infinicore.ntops` | 若导入时检测到已安装 `ntops` 包，会自动将 `use_ntops` 设为 `True`，并通过 `infinicore.ntops` 暴露原始模块。 |
+
+所有符号在包的 `__init__.py` 中进行了显式导出，可直接通过 `import infinicore as ic` 后使用。
+
+相关导出定义见 `InfiniCore/python/infinicore/__init__.py`。
+
+## API 索引
+
+- **核心对象**
+  - [`device`](device/README.md)
+  - [`dtype`](dtype/README.md)
+  - [`Tensor`](tensor/README.md)
+  - 构造函数：[`empty`](tensor/README.md#构造函数)、[`zeros`](tensor/README.md#构造函数)、[`ones`](tensor/README.md#构造函数)、[`empty_like`](tensor/README.md#构造函数)、[`from_blob`](tensor/README.md#构造函数)
+  - 常用方法：[`copy_`](tensor/README.md#tensor-类)、[`to`](tensor/README.md#tensor-类)、[`permute`](tensor/README.md#tensor-类)、[`view`](tensor/README.md#tensor-类)、[`debug`](tensor/README.md#tensor-类)
+  - [`use_ntops` 协作说明](#与-ntops-的协作)
+- **顶层算子**
+  - [`add`](ops/add/README.md)
+  - [`matmul`](ops/matmul/README.md)
+  - [`rearrange`](ops/rearrange/README.md)
+  - [`attention`](ops/attention/README.md)
+- **函数式算子**
+  - [`causal_softmax`](nn/functional/causal_softmax/README.md)
+  - [`rms_norm`](nn/functional/rms_norm/README.md)
+  - [`silu`](nn/functional/silu/README.md)
+  - [`swiglu`](nn/functional/swiglu/README.md)
+- **更多参考**
+  - [`infinicore.ops` 索引](ops/README.md)
+  - [`nn` 模块概览](nn/README.md)
+
+## 张量与构造函数
+
+`Tensor` 是对底层 `_infinicore.Tensor` 的 Python 包装，常用接口包括：
+
+- `shape` / `ndim` / `size(dim)` / `stride(dim)`：获取张量维度与步长信息。
+- `dtype` / `device`：返回 `dtype` 与 `device` 包装类。
+- `numel()` / `is_contiguous()`：查看张量元素数量与存储布局。
+- `copy_(src)` / `to(...)`：执行数据拷贝与跨设备搬运。
+- `contiguous()` / `permute(dims)` / `view(shape)` / `as_strided(size, stride)`：布局调整与视图操作。
+- `debug(filename=None)`：将张量内容打印或输出到二进制文件。
+
+常用构造函数包括 `empty`、`strided_empty`、`zeros`、`ones`、`from_blob`、`strided_from_blob`、`empty_like` 等：
+
+```python
+import infinicore as ic
+
+cpu = ic.device("cpu")
+a = ic.empty((4, 8), dtype=ic.float16, device=cpu)
+b = ic.ones((4, 8), dtype=ic.float16, device=cpu)
+a.copy_(b)
+```
+
+> 注意：这些函数要求显式传入 `dtype` 与 `device`，以避免从 PyTorch/TensorFlow 对象隐式推断。
+
+## 顶层算子 (`infinicore.*`)
+
+详见 [`ops` 文档索引](ops/README.md) 及各算子文档。
+
+## 函数式算子 (`infinicore.nn.functional`)
+
+详见 [`nn.functional` 文档](nn/functional/README.md) 及子目录。
+
+## 运行时上下文
+
+- `_infinicore` 在进程内维护运行时状态；创建张量时请显式传入 `device`，并保持算子的所有输入位于同一设备。
+- 如需强制同步，可调用 `infinicore.lib._infinicore.sync_stream()`、`sync_device()` 等底层绑定。
+- 在同一执行流内串行调用算子通常无需额外同步。
+
+## 与 `ntops` 的协作
+
+- 导入 `ntops` 成功后，`infinicore.use_ntops` 会被设置为 `True`，并可通过 `infinicore.ntops` 访问原始模块。
+- 当 `use_ntops=True`、设备类型为 `"cuda"` 或 `"musa"` 且未传入 `out` 时，`nn.functional.silu` 会委托给 `ntops.torch.silu`。
+- 若想强制禁用，可直接设置 `infinicore.use_ntops = False`。
+
+## 端到端示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+
+q = ic.empty((8, 1, 128), dtype=ic.float16, device=device)
+k = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+v = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+k_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+v_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+
+out = ic.attention(q, k, v, k_cache, v_cache, pos=0)
+
+if ic.use_ntops:
+    # 在部分设备上，SiLU 会委托给 ntops 的高性能实现
+    out = ic.nn.functional.silu(out)
+
+out.debug()
+```
+
+## 相关链接
+
+- [`infinicore.ops` 顶层算子](ops/README.md)
+- [`nn.functional` 函数式文档](nn/functional/README.md)
+- [`InfiniOP` 统一算子库](/infiniop/README.md)
diff --git a/python/device/README.md b/python/device/README.md
new file mode 100644
index 0000000..c7d16fa
--- /dev/null
+++ b/python/device/README.md
@@ -0,0 +1,35 @@
+# `infinicore.device`
+
+设备句柄类，定义于 `InfiniCore/python/infinicore/device.py`，用于在 Python 端选择和描述运行时设备。
+
+## 构造方式
+
+```python
+from infinicore import device
+
+cpu = device()                 # 默认 "cpu"
+cuda0 = device("cuda:0")       # 字符串形式指定
+cuda1 = device("cuda", 1)      # 类型 + index
+clone = device(cuda0)          # 从已有实例拷贝
+```
+
+- `type`：支持 `"cpu"`、`"cuda"`、`"mlu"`、`"npu"`、`"musa"` 等，具体取决于 `_infinicore` 编译时支持。
+- `index`：可选整型索引，字符串中已包含 `":"` 时禁止再传入。
+- 传入已有 `device` 实例时会拷贝其 `type`/`index`。
+
+## 属性与方法
+
+- `type` / `index`：公开的设备类型与序号。
+- `__repr__()` / `__str__()`：打印友好格式，如 `device(type='cuda', index=0)` 或 `"cuda:0"`。
+- `_underlying`：内部 `_infinicore.Device` 对象，供底层 API 使用。
+
+## 与运行时的关系
+
+- `device` 实例用于所有张量构造函数及顶层算子，确保 `_infinicore` 在正确的设备上执行。
+- 若需要从底层 `_infinicore.Device` 转换为 Python 对象，可使用 `device._from_infinicore_device`（内部方法）。
+
+## 相关链接
+
+- [`Tensor` 构造函数](../tensor/README.md#构造函数)
+- [运行时上下文](../README.md#运行时上下文)
+
diff --git a/python/dtype/README.md b/python/dtype/README.md
new file mode 100644
index 0000000..2a4b828
--- /dev/null
+++ b/python/dtype/README.md
@@ -0,0 +1,30 @@
+# `infinicore.dtype` 与标量类型
+
+`InfiniCore/python/infinicore/dtype.py` 导出与 C++ 端一致的标量类型枚举。通过 `from infinicore import dtype, float16, int32, ...` 可直接访问。
+
+## 常用类型列表
+
+- `dtype`：枚举工厂，可用于创建自定义 dtype 或从 `_underlying` 还原。
+- 浮点类型：`float`、`float16`、`float32`、`float64`、`half`、`bfloat16`、`double`。
+- 复数类型：`cfloat`、`cdouble`、`complex32`、`complex64`、`complex128`。
+- 整型：`int`、`int8`、`int16`、`int32`、`int64`、`short`、`long`、`uint8`。
+- 布尔：`bool`。
+
+所有类型对象都暴露 `_underlying` 属性，用于在调用底层 `_infinicore` 接口时传递。
+
+## 示例
+
+```python
+import infinicore as ic
+
+cpu = ic.device("cpu")
+a = ic.empty((4, 8), dtype=ic.float16, device=cpu)
+
+if a.dtype == ic.float16:  # 使用 == 而非 is：dtype 包装对象未必是同一实例
+    print("half precision tensor")
+```
+
+## 相关链接
+
+- [`Tensor` 构造函数](../tensor/README.md#构造函数)
+- [`infinicore` 模块概览](../README.md)
diff --git a/python/nn/README.md b/python/nn/README.md
new file mode 100644
index 0000000..1760ab1
--- /dev/null
+++ b/python/nn/README.md
@@ -0,0 +1,25 @@
+# `infinicore.nn` 模块
+
+`infinicore.nn` 聚合了面向神经网络的辅助模块，目前主要暴露函数式算子集合 `infinicore.nn.functional`，并预留位置扩展其他组件（如模块化层、优化器等）。
+
+## 模块结构
+
+| 子模块 | 说明 |
+| --- | --- |
+| `functional` | 函数式算子集合，接口文档见 [`functional/README.md`](functional/README.md)。 |
+
+## 使用示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+x = ic.ones((4, 1024), dtype=ic.float16, device=ic.device("cuda:0"))
+normed = F.rms_norm(x, normalized_shape=[1024], weight=ic.ones((1024,), dtype=x.dtype, device=x.device))
+activated = F.silu(normed)
+```
+
+## 相关链接
+
+- [`nn.functional` 函数式文档](functional/README.md)
+- [Python API 总览](../README.md)
diff --git a/python/nn/functional/README.md b/python/nn/functional/README.md
new file mode 100644
index 0000000..2b0f94d
--- /dev/null
+++ b/python/nn/functional/README.md
@@ -0,0 +1,37 @@
+# `infinicore.nn.functional` 函数式接口
+
+`infinicore.nn.functional` 集中收录 PyTorch 风格的函数式算子封装。实现位于 `InfiniCore/python/infinicore/nn/functional.py`，依赖 `_infinicore` C++ 绑定并复用运行时上下文。
+
+## 公共约定
+
+- 所有函数都返回 `infinicore.Tensor`；当提供 `out`/`inplace` 等参数时会复用已有缓冲区。
+- 输入张量需由 `infinicore` 创建（或至少携带 `_underlying` 指针），否则无法与底层运行时交互。
+- 若函数内部调用 `_infinicore.*_` 原位接口，需确保输出张量与输入形状、dtype 一致。
+
+## API 详情
+
+- [`causal_softmax`](causal_softmax/README.md)：因果掩码 Softmax。
+- [`rms_norm`](rms_norm/README.md)：RMSNorm（Root Mean Square Layer Normalization）。
+- [`silu`](silu/README.md)：SiLU（Sigmoid Linear Unit）激活。
+- [`swiglu`](swiglu/README.md)：SwiGLU 前向门控。
+
+## 示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+x = ic.empty((4, 1024), dtype=ic.float16, device=ic.device("cuda:0"))
+w = ic.empty((1024,), dtype=ic.float16, device=x.device)
+
+normed = F.rms_norm(x, normalized_shape=list(w.shape), weight=w)
+activated = F.silu(normed)
+gated = F.swiglu(activated, ic.empty_like(activated))
+
+probs = F.causal_softmax(gated, out=ic.empty_like(gated))
+```
+
+## 相关链接
+
+- [Python API 总览](../../README.md)
+- [`ntops` 协作接口说明](../../README.md#与-ntops-的协作)
diff --git a/python/nn/functional/causal_softmax/README.md b/python/nn/functional/causal_softmax/README.md
new file mode 100644
index 0000000..b53c458
--- /dev/null
+++ b/python/nn/functional/causal_softmax/README.md
@@ -0,0 +1,32 @@
+# `nn.functional.causal_softmax`
+
+对最后一维应用因果掩码 Softmax，用于自回归注意力场景。函数定义位于 `InfiniCore/python/infinicore/nn/functional.py`。
+
+## 函数签名
+
+```python
+def causal_softmax(input: Tensor, out: Optional[Tensor] = None) -> Tensor
+```
+
+- `input`：任意维度张量，末维视为序列维，将应用因果掩码。
+- `out`：可选输出张量，若提供需与 `input` 形状、`dtype`、`device` 完全一致。
+
+默认返回新张量；提供 `out` 时调用 `_infinicore.causal_softmax_` 原位写入。
+
+## 示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+logits = ic.empty((4, 128), dtype=ic.float16, device=device)
+
+probs = F.causal_softmax(logits)
+F.causal_softmax(logits, out=logits)  # 原位写回
+```
+
+## 相关链接
+
+- [`nn.functional` 文档](../README.md)
+- [`infinicore.attention` 算子](../../../ops/attention/README.md)
diff --git a/python/nn/functional/rms_norm/README.md b/python/nn/functional/rms_norm/README.md
new file mode 100644
index 0000000..c2d459c
--- /dev/null
+++ b/python/nn/functional/rms_norm/README.md
@@ -0,0 +1,43 @@
+# `nn.functional.rms_norm`
+
+实现 RMSNorm（Root Mean Square Layer Normalization）。函数定义位于 `InfiniCore/python/infinicore/nn/functional.py`。
+
+## 函数签名
+
+```python
+def rms_norm(
+    input: Tensor,
+    normalized_shape: list[int],
+    weight: Tensor,
+    eps: float = 1e-5,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
+- `input`：待归一化张量，末维通常为隐藏维度。
+- `normalized_shape`：期望归一化的维度大小列表，会与 `weight.shape` 进行严格比较。
+- `weight`：缩放系数张量，维度需与 `normalized_shape` 匹配。
+- `eps`：数值稳定项，默认 `1e-5`。
+- `out`：可选输出张量；若提供需与 `input` 形状、`dtype`、`device` 一致。
+
+函数首先断言 `normalized_shape == weight.shape`，然后调用 `_infinicore.rms_norm` / `_infinicore.rms_norm_` 完成计算。
+
+## 示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+x = ic.ones((4, 1024), dtype=ic.float16, device=device)
+gamma = ic.ones((1024,), dtype=ic.float16, device=device)
+
+y = F.rms_norm(x, normalized_shape=list(gamma.shape), weight=gamma, eps=1e-5)
+F.rms_norm(x, normalized_shape=[1024], weight=gamma, out=x)  # 原位写回
+```
+
+## 相关链接
+
+- [`nn.functional` 文档](../README.md)
+- [`Tensor` 构造函数](../../../README.md#张量与构造函数)
diff --git a/python/nn/functional/silu/README.md b/python/nn/functional/silu/README.md
new file mode 100644
index 0000000..9403c0d
--- /dev/null
+++ b/python/nn/functional/silu/README.md
@@ -0,0 +1,42 @@
+# `nn.functional.silu`
+
+Sigmoid Linear Unit (SiLU) 激活函数。实现位于 `InfiniCore/python/infinicore/nn/functional.py`。
+
+## 函数签名
+
+```python
+def silu(
+    input: Tensor,
+    inplace: bool = False,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
+- `input`：待激活张量。
+- `inplace`：是否原地写回到 `input`。
+- `out`：可选输出张量，若提供需与 `input` 形状、`dtype`、`device` 一致。
+
+### 行为说明
+
+- 当 `inplace=True` 时，直接调用 `_infinicore.silu_` 写回 `input` 并返回。
+- 当未提供 `out` 且 `infinicore.use_ntops=True` 且设备类型为 `"cuda"` 或 `"musa"` 时，会委托 `ntops.torch.silu` 以复用优化实现。
+- 其他情况下调用 `_infinicore.silu` 或 `_infinicore.silu_` 完成计算。
+
+## 示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+x = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+y = F.silu(x)                 # 返回新张量
+F.silu(x, inplace=True)       # 原地更新
+```
+
+## 相关链接
+
+- [`nn.functional` 文档](../README.md)
+- [`use_ntops` 协作说明](../../../README.md#与-ntops-的协作)
diff --git a/python/nn/functional/swiglu/README.md b/python/nn/functional/swiglu/README.md
new file mode 100644
index 0000000..08a66a8
--- /dev/null
+++ b/python/nn/functional/swiglu/README.md
@@ -0,0 +1,39 @@
+# `nn.functional.swiglu`
+
+SwiGLU（Swish-Gated Linear Unit）函数式实现，定义在 `InfiniCore/python/infinicore/nn/functional.py`。
+
+## 函数签名
+
+```python
+def swiglu(
+    input: Tensor,
+    other: Tensor,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
+- `input`：激活分支张量。
+- `other`：门控分支张量。要求与 `input` 在形状、`dtype`、`device` 完全一致。
+- `out`：可选输出张量，若提供需满足上述一致性条件。
+
+内部调用 `_infinicore.swiglu` / `_infinicore.swiglu_` 完成计算；未提供 `out` 时返回新张量，否则原地写入。
+
+## 示例
+
+```python
+import infinicore as ic
+from infinicore.nn import functional as F
+
+device = ic.device("cuda:0")
+a = ic.ones((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+out = F.swiglu(a, b)
+F.swiglu(a, b, out=a)  # 原位更新
+```
+
+## 相关链接
+
+- [`nn.functional` 文档](../README.md)
+- [`infinicore.ops` 索引](../../../ops/README.md)
diff --git a/python/ops/README.md b/python/ops/README.md
new file mode 100644
index 0000000..1aedb0d
--- /dev/null
+++ b/python/ops/README.md
@@ -0,0 +1,45 @@
+# `infinicore.ops` 顶层算子
+
+该模块通过 pybind11 将常用算子直接暴露在 `infinicore` 命名空间下，对应源码位于 `InfiniCore/python/infinicore/ops/`。所有函数均支持可选的 `out` 参数以复用输出缓冲区。
+
+## 通用注意事项
+
+- 所有参数必须是 `infinicore.Tensor` 实例（或至少携带 `_underlying` 指针），否则无法传递到底层 `_infinicore`。
+- `out` 张量（若提供）需要和输出形状、dtype、设备完全匹配。
+- 算子执行依赖底层 `_infinicore` 运行时；请在创建张量时传入期望的 `device`，保持所有输入处于同一设备上。
+
+## 文档索引
+
+- [`add`](add/README.md)
+- [`matmul`](matmul/README.md)
+- [`rearrange`](rearrange/README.md)
+- [`attention`](attention/README.md)
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+a = ic.ones((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+ic.add(a, b, out=a)  # 原位累加
+c = ic.matmul(a, b.permute([1, 0]))  # (4, 8) @ (8, 4)
+
+contiguous = ic.rearrange(c)
+
+attn_out = ic.attention(
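+    q=contiguous.view([4, 1, 4]),  # (n_q_head, seq_len, head_dim)
+    k=contiguous.view([4, 1, 4]),  # (n_kv_head, seq_len, head_dim)
+    v=contiguous.view([4, 1, 4]),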
+    k_cache=ic.empty((4, 128, contiguous.shape[-1]), dtype=contiguous.dtype, device=contiguous.device),
+    v_cache=ic.empty((4, 128, contiguous.shape[-1]), dtype=contiguous.dtype, device=contiguous.device),
+    pos=0,
+)
+```
+
+## 相关链接
+
+- [Python API 总览](../README.md)
+- [`nn.functional` 函数式接口](../nn/functional/README.md)
diff --git a/python/ops/add/README.md b/python/ops/add/README.md
new file mode 100644
index 0000000..2d3e7da
--- /dev/null
+++ b/python/ops/add/README.md
@@ -0,0 +1,33 @@
+# `infinicore.add`
+
+逐元素加法算子，支持广播与非连续张量。定义于 `InfiniCore/python/infinicore/ops/add.py`。
+
+## 函数签名
+
+```python
+def add(input: Tensor, other: Tensor, *, out: Optional[Tensor] = None) -> Tensor
+```
+
+- `input`：左操作数张量。
+- `other`：右操作数张量，可与 `input` 形状相同或可广播到 `input`。
+- `out`：可选输出张量，若提供需与结果形状、`dtype`、`device` 一致。
+
+若未提供 `out`，函数会创建并返回新张量；提供 `out` 时将调用 `_infinicore.add_` 在原地写入。
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+a = ic.ones((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((1, 8), dtype=ic.float16, device=device)  # 可广播
+
+out = ic.add(a, b)        # 返回新张量
+ic.add(a, b, out=a)       # 原位累加
+```
+
+## 相关链接
+
+- [`infinicore.ops` 索引](../README.md)
+- [`Tensor` 构造函数](../../README.md#张量与构造函数)
diff --git a/python/ops/attention/README.md b/python/ops/attention/README.md
new file mode 100644
index 0000000..8c4261b
--- /dev/null
+++ b/python/ops/attention/README.md
@@ -0,0 +1,53 @@
+# `infinicore.attention`
+
+解码阶段注意力算子，负责在 KV cache 中增量写入并返回当前 step 的输出。实现位置：`InfiniCore/python/infinicore/ops/attention.py`。
+
+## 函数签名
+
+```python
+def attention(
+    q: Tensor,
+    k: Tensor,
+    v: Tensor,
+    k_cache: Tensor,
+    v_cache: Tensor,
+    pos: int,
+    *,
+    out: Optional[Tensor] = None,
+) -> Tensor
+```
+
+- `q`：查询张量，形状一般为 `(n_q_head, seq_len, head_dim)`。
+- `k` / `v`：本 step 新增的 Key/Value，形状 `(n_kv_head, seq_len, head_dim)`。
+- `k_cache` / `v_cache`：缓存张量，形状 `(n_kv_head, cache_len, head_dim)`，需保证 `pos + seq_len <= cache_len`。
+- `pos`：写入位置索引（已填充 token 数）。
+- `out`：可选输出张量，若提供需与输出形状 `(seq_len, n_q_head, head_dim)`、`dtype`、`device` 完全一致。
+
+默认情况下函数会创建新张量并返回；提供 `out` 时调用 `_infinicore.attention_` 原位写入。
+
+## 行为说明
+
+- 输入张量可为非连续布局，底层会自动处理。
+- 支持分组查询注意力（Grouped-Query Attention，GQA）：当 `n_q_head` 为 `n_kv_head` 的整数倍时自动映射。
+- KV cache 在调用期间会写入 `[pos : pos + seq_len)` 区间，调用者需维护 `pos` 的累加。
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+
+q = ic.empty((8, 1, 128), dtype=ic.float16, device=device)
+k = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+v = ic.empty((2, 1, 128), dtype=ic.float16, device=device)
+k_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+v_cache = ic.empty((2, 128, 128), dtype=ic.float16, device=device)
+
+out = ic.attention(q, k, v, k_cache, v_cache, pos=0)
+```
+
+## 相关链接
+
+- [`infinicore.ops` 索引](../README.md)
+- [`nn.functional` 文档](../../nn/functional/README.md)
diff --git a/python/ops/matmul/README.md b/python/ops/matmul/README.md
new file mode 100644
index 0000000..6509c46
--- /dev/null
+++ b/python/ops/matmul/README.md
@@ -0,0 +1,39 @@
+# `infinicore.matmul`
+
+矩阵乘法/GEMM 前端，定义于 `InfiniCore/python/infinicore/ops/matmul.py`，封装底层的 pybind11 绑定。
+
+## 函数签名
+
+```python
+def matmul(input: Tensor, other: Tensor, *, out: Optional[Tensor] = None) -> Tensor
+```
+
+- `input`：左乘矩阵，形状需满足 GEMM 要求，可包含批维。
+- `other`：右乘矩阵，与 `input` 的维度兼容。
+- `out`：可选输出张量；若提供需与结果形状、`dtype`、`device` 完全一致。
+
+默认返回新张量；当提供 `out` 时调用 `_infinicore.matmul_` 原地写入。底层会复用 *InfiniOP* GEMM 描述符完成计算。
+
+## 输入要求
+
+- 支持常见数据类型（如 `float16`、`float32`、`bfloat16`）。
+- 支持批量维度；两个输入的批维需保持一致。
+- 当 `out` 与输入不处于同一设备或数据类型不匹配时，底层会抛出异常。
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+a = ic.ones((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((8, 16), dtype=ic.float16, device=device)
+
+c = ic.matmul(a, b)                 # 创建新张量
+ic.matmul(a, b, out=c)              # 原位复用输出缓冲
+```
+
+## 相关链接
+
+- [`infinicore.ops` 索引](../README.md)
+- [`Tensor` 构造函数](../../README.md#张量与构造函数)
diff --git a/python/ops/rearrange/README.md b/python/ops/rearrange/README.md
new file mode 100644
index 0000000..67c5b44
--- /dev/null
+++ b/python/ops/rearrange/README.md
@@ -0,0 +1,38 @@
+# `infinicore.rearrange`
+
+调整张量布局或生成连续副本的辅助算子，定义于 `InfiniCore/python/infinicore/ops/rearrange.py`。
+
+## 函数签名
+
+```python
+def rearrange(input: Tensor, other: Optional[Tensor] = None, *, out: Optional[Tensor] = None) -> Tensor
+```
+
+- `input`：源张量。
+- `other`：当前保留未使用，作为后续扩展占位。
+- `out`：可选输出张量，若提供需与 `input` 形状、`dtype`、`device` 一致。
+
+默认返回新张量；当提供 `out` 时调用 `_infinicore.rearrange_` 将结果写入既有缓冲区。
+
+## 常见用途
+
+- 将任意步长的张量转换为连续布局。
+- 为跨设备拷贝或下游算子准备符合要求的存储格式。
+- 与 `Tensor.contiguous()` 行为一致，但可选择写入指定输出。
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+x = ic.empty((4, 8), dtype=ic.float16, device=device)
+
+y = ic.rearrange(x)      # 返回连续副本
+ic.rearrange(x, out=x)   # 原位整理（若底层支持）
+```
+
+## 相关链接
+
+- [`infinicore.ops` 索引](../README.md)
+- [`Tensor` 常用方法](../../README.md#张量与构造函数)
diff --git a/python/tensor/README.md b/python/tensor/README.md
new file mode 100644
index 0000000..e928a03
--- /dev/null
+++ b/python/tensor/README.md
@@ -0,0 +1,63 @@
+# `infinicore.tensor` 模块
+
+`Tensor` 类及核心构造函数定义在 `InfiniCore/python/infinicore/tensor.py`，负责在 Python 端封装底层 `_infinicore.Tensor` 指针并提供常用操作。
+
+## `Tensor` 类
+
+### 主要属性与查询方法
+
+- `shape`（`tuple[int, ...]`）/ `ndim` / `size(dim)`：获取张量维度信息。
+- `stride(dim=None)`：返回步长数组或指定维度的步长。
+- `dtype`：对应的标量类型（参见 [`dtype` 文档](../dtype/README.md)）。
+- `device`：张量所在设备（参见 [`device` 文档](../device/README.md)）。
+- `numel()`：元素总数。
+- `is_contiguous()`：判断是否连续存储。
+
+### 常用方法
+
+- `copy_(src)`：将 `src` 的数据拷贝到当前张量。
+- `to(*args, **kwargs)`：执行跨设备/数据类型转换，返回新 `Tensor`。
+- `as_strided(size, stride)`：创建共享存储的视图。
+- `contiguous()`：返回连续副本。
+- `permute(dims)`：重新排列维度。
+- `view(shape)`：改变张量形状（需满足可视条件）。
+- `debug(filename=None)`：打印张量信息或将原始数据写入文件。
+
+## 构造函数
+
+以下函数均为模块级函数并返回 `Tensor`；除 `empty_like` 外，均需显式传入 `dtype` 与 `device`：
+
+```python
+from infinicore import (
+    empty, strided_empty, zeros, ones,
+    from_blob, strided_from_blob, empty_like,
+)
+```
+
+### 说明
+
+- `empty(shape, *, dtype, device, pin_memory=False)`：按给定形状分配未初始化存储。
+- `strided_empty(shape, strides, *, dtype, device, pin_memory=False)`：按指定步长分配存储。
+- `zeros` / `ones`：签名与 `empty` 类似，但当前 Python 层尚未实现填充，返回的张量内容未初始化。
+- `from_blob(data_ptr, shape, *, dtype, device)`：将外部内存包装为 `Tensor`，不接管内存所有权。
+- `strided_from_blob`：同上但可显式指定步长。
+- `empty_like(input, *, dtype=None, device=None)`：按照 `input` 的形状/步长创建新张量，可覆盖 `dtype` 或 `device`。
+
+## 示例
+
+```python
+import infinicore as ic
+
+device = ic.device("cuda:0")
+a = ic.empty((4, 8), dtype=ic.float16, device=device)
+b = ic.ones((4, 8), dtype=ic.float16, device=device)
+
+a.copy_(b)
+c = a.permute([1, 0]).contiguous()
+```
+
+## 相关链接
+
+- [`infinicore` 模块概览](../README.md)
+- [顶层算子](../ops/README.md)
+- [`nn.functional`](../nn/functional/README.md)