diff --git a/docs/ai-ml/dl-library.md b/docs/ai-ml/dl-library.md
new file mode 100644
index 00000000..ed90670b
--- /dev/null
+++ b/docs/ai-ml/dl-library.md
@@ -0,0 +1,519 @@
+# 深度学习框架搭建
+
+## 前言
+
+陈天奇和 J. Zico Kolter 在 CMU 开设了 [Deep Learning Systems](https://dlsyscourse.org/) 课程，讲授如何搭建一个深度学习框架并实现Transformer、RNN、CNN等模型，笔者很大程度上参考了该课程的内容。
+
+
+!!! Info 注意
+      
+    本文主要讲解如何搭建深度学习框架的基本组件，**不涉及**以下内容：
+    - 并行化程序、cuda kernel的编写
+    - 计算图优化等性能优化
+    - 机器学习模型如Transformer、RNN的高效实现
+    
+
+
+## 前置知识
+- 了解机器学习基本概念
+- 掌握Python、C++的基础知识
+- 了解类、继承等面向对象基本概念
+- 有较扎实的微积分、线代等数学基础
+
+## 学习路径
+
+如果对ML没有基础的，可以参考[吴恩达机器学习](https://www.coursera.org/learn/machine-learning)课程，也可以阅读[《动手学深度学习》](https://zh.d2l.ai/)。
+
+有一定基础，可以学习[CMU DLSystems](https://dlsyscourse.org/)课程。本文主要基于这门课程进行介绍。
+
+推荐资料：
+
+[LLMSys-PaperList](https://github.com/AmberLJC/LLMSys-PaperList)：大语言模型系统论文汇总。
+
+[MLsys-Guide](https://github.com/PKU-DAIR/Starter-Guide/blob/main/docs/systems/Readme.md)：深度学习系统入门指南。
+
+
+
+## 知识讲解
+
+### Stage 1. 基本数据类型Tensor
+
+如果你用过PyTorch、TensorFlow等深度学习框架，你肯定写过这个玩意：
+
+```Python
+loss.backward()
+```
+问题来了：调一个backward()函数为什么能够计算前置Tensor的梯度？系统怎么检测loss背后用了什么参数？
+
+实际上，在张量计算的背后，系统悄悄建立了一个**计算图**。
+
+> &nbsp;
+> 计算图是用于表示数学计算过程的一种有向无环图。
+> 在计算图中，节点代表**操作**（如加减乘除等）或**变量**（如输入），
+> 而边代表这些操作或变量之间的**依赖关系**。
+> &nbsp;
+
+
+举个例子：
+
+```Python
+v1 = Tensor([0], dtype="float32")
+v2 = exp(v1)
+v3 = v2 + 1
+v4 = v2 * v3
+```
+
+上述操作建立了以下计算图   [图源](https://github.com/dlsyscourse/lecture5/blob/main/5_automatic_differentiation_implementation.ipynb)
+
+![ ](/static/ai-ml/dl-library/computational-graph.png){height=300}
+
+
+
+换做深度学习框架，在每一次计算的时候，我们都能够得到这个张量需要什么样的前置张量，以及由什么操作得到。这样，反向传播时就可以知道如何将梯度传到前置张量。
+
+因此，张量的定义不仅仅要有张量的数据本身和梯度，还要有**前置节点和对应的操作**：
+
+
+```Python
+class Tensor:
+    
+    def __init__(self,data:NDArray, op: Optional[Op],inputs: Optional[List["Tensor"]]):
+        self.data=data          #  张量存的数据
+        self.grad=None          #  梯度
+        self.op=op            #  张量对应的运算
+        if inputs is not None:
+            self.inputs=inputs         #  张量的输入，如c=a+b中c.op对应加，c.inputs=[a,b]
+        else:
+            self.inputs=[]
+        ...
+
+    
+    def realize_cached_data(self):  #获取当前的data，有跳过，无就根据self.op和self.inputs递归计算
+        if self.data is not None:
+            return self.data
+        self.data = self.op.compute(
+            *[x.realize_cached_data() for x in self.inputs]
+        )
+        return self.data
+
+    @classmethod
+    def make_const(cls, data, *):       #给定输入data生成张量，也可用于detach()方法（从计算图中脱离）
+        value = cls.__init__(
+            data=data,
+            op=None,
+            inputs=[]
+        )
+        return value
+
+    @classmethod
+    def make_from_op(cls, op: Op, inputs: List["Tensor"]):   # 根据运算和输入生成张量
+        value = cls.__init__(None, op, inputs)
+
+        value.realize_cached_data()
+        return value
+```
+
+
+
+### Stage2: 算术运算搭建
+
+`self.op`代表张量对应的运算，需要定义它前向与反向传播的方法。运算的基类定义如下：
+
+```Python
+#所有张量运算的基类
+class Op:
+    def __call__(self,*args):  
+        return Tensor.make_from_op(self,*args)
+    def compute(self,*args:Tuple[NDArray]):
+        raise NotImplementedError()
+    def gradient(self,out_grad,nodes): 
+        raise NotImplementedError()
+
+````
+
+其中，`compute()`方法代表**前向计算**，接受若干个NDArray（即self.data对应的数据类型）作为参数，计算出来当前张量data的值。
+
+`__call__`就利用了`compute()`定义了这个运算符或操作出现的时候的操作：
+- 调用Tensor类的`make_from_op()`方法，将操作本身以及其他参数传进去；
+- 在`make_from_op()`中形成新的张量的输入，并调用`realize_cached_data()`方法，以获取张量data的值；
+- 在`realize_cached_data()`中通过Op的`compute()`计算张量的data，并返回。
+
+
+而`gradient()`方法则是**反向计算**，接收当前梯度（out_grad）和Tensor节点（nodes），根据node.inputs传递相应的梯度。
+
+根据链式法则，
+
+$$\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial f(x,y)} \frac{\partial f(x,y)}{\partial x}$$
+
+对$f(x,y)$而言，`out_grad`代表“最终结果对f的梯度”，即$ \frac{\partial \ell}{\partial f(x,y)}$;`gradient`方法则负责根据`out_grad`以及运算本身，求出最终结果对node各个输入的梯度$\frac{\partial \ell}{\partial x}$,$\frac{\partial \ell}{\partial y}$。
+
+
+
+
+
+例子：张量加法（假定二者维度相同）
+
+
+前向传播时就是简单的c=a+b，反向传播时$\frac{\partial f}{\partial c}=\frac{\partial f}{\partial b}=\frac{\partial f}{\partial a}$，梯度不变。
+
+```Python
+
+class EWiseAdd(Op):
+    def compute(self,a,b):
+        return a+b
+    def gradient(self,out_grad,nodes):
+        return out_grad,out_grad
+
+def add(a, b):
+    return EWiseAdd()(a, b)
+```
+```
+
+在执行`c=add(a,b)`过程中，调用了EWiseAdd的__call__方法（参数为a,b），该方法创建了一个新tensor,操作为EWiseAdd，输入为a和b，值为a+b。
+
+同理，张量逐元素乘法可以如下定义：
+```Python
+
+class EWiseMul(Op):
+    def compute(self,a,b):
+        return a*b
+    def gradient(self,out_grad,nodes):
+        return out_grad*input[1],out_grad*input[0]
+
+def mul(a, b):
+    return EWiseMul()(a, b)
+```
+
+而诸如`a+b`之类的向量运算，直接在Tensor类内部进行运算符重载即可，此处仅给出样例：
+
+```Python
+def __add__(self, other):
+        if isinstance(other, Tensor):
+            return needle.ops.EWiseAdd()(self, other)
+        else:
+            return needle.ops.AddScalar(other)(self)
+```
+
+
+### Stage3： 自动微分求解
+
+
+回到一开始的图：
+
+![ ](/static/ai-ml/dl-library/computational-graph.png){height=300}
+
+问题来了：调用`v4.backward()`后，系统怎么知道先算v2还是先算v3？以及各个节点的梯度如何计算？
+
+对前一个问题，我们需要用到**拓扑排序**。
+
+>   &nbsp;   
+> 拓扑排序的目标是将所有节点排序，使得**排在前面的节点不能依赖于排在后面的节点。**  - OI Wiki
+>   &nbsp;
+
+详细介绍见[链接](https://oi-wiki.org/graph/topo/)。拓扑排序可以通过深度优先搜索(dfs)实现，先处理所有直接input再将自身加入列表。代码实现见下：
+
+```Python
+
+def find_topo_sort(node_list: List[Value]) -> List[Value]:
+    """Given a list of nodes, return a topological sort list of nodes ending in them."""
+    visited={}
+    topo_order=[]
+    for node in node_list:
+        visited,topo_order=topo_sort_dfs(node,visited,topo_order)
+    return topo_order
+
+
+def topo_sort_dfs(node:Value, visited, topo_order):
+    """Post-order DFS
+    first add each unadded node input to the topo-order,
+    then add node itself
+    earlier nodes don't depend on later nodes
+    """
+    if len(node.inputs)==0:
+        pass
+    else:
+        for pred in node.inputs:
+            if pred not in visited:
+                visited,topo_order=topo_sort_dfs(pred,visited,topo_order)
+    visited[node]=1
+    topo_order.append(node)
+    return visited,topo_order
+```
+
+
+对后一个问题，我们需要自动微分求解了。自动微分求解认为，**复杂的计算都可以拆分成有限可微简单算子的组合**。因此，知道简单算子的微分是如何计算的，就可以通过链式法则应用于整个函数。
+
+回到例子：
+$v_2=\exp (v_1)\\
+v_3=v_2+1\\
+v_4=v_2*v_3
+y=v_4$
+
+这些张量的拓扑排序为$v_1,v_2,v_3,v_4,y$,后面的依赖前面的进行计算，因此在反向求梯度时，前面的梯度依赖后面的进行求解。
+
+例：求$\frac{\partial y}{\partial v_2}$
+
+$\frac{\partial y}{\partial v_2}=\frac{\partial y}{\partial v_4}\frac{\partial v_4}{\partial v_2}=\frac{\partial y}{\partial v_4}(v_3+v_2\frac{\partial v_3}{\partial v_2})$
+
+从$v_4$到$v_2$有两条路径：一条是$v_4=v_2*v_3$直接到$v_2$的（对应括号中的$v_3$）,另一条先到$v_3$再到$v_2$（对应括号中的$v_2\frac{\partial v_3}{\partial v_2}$）
+
+因此，自动微分求解的思路就有了：**通过反向拓扑顺序，递归地从后到前计算梯度**。如果某个节点的梯度由多个后继节点直接贡献，那么将这些贡献累加再进行计算。
+
+代码表示见下：
+
+```Python
+
+def compute_gradient_of_variables(output_tensor, out_grad):
+    """Take gradient of output node with respect to each node in node_list.
+    Store the computed result in the grad field of each Variable."""
+    # a map from node to a list of gradient contributions from each output node
+    node_to_output_grads_list: Dict[Tensor, List[Tensor]] = {}
+  
+    node_to_output_grads_list[output_tensor] = [out_grad]
+
+    # Traverse graph in reverse topological order given the output_node that we are taking gradient wrt.
+    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))
+
+    for i in reverse_topo_order:
+        vi_adj=sum(node_to_output_grads_list[i])  #current grad = sum of all incoming grad
+        i.grad=vi_adj
+        if i.op is None:
+            continue
+        grads=i.op.gradient_as_tuple(vi_adj,i)   #calculate input grads from current node
+        for k in range(len(i.inputs)):           #add input grads to list 
+            vki=grads[k]
+            if i.inputs[k] not in node_to_output_grads_list:
+                node_to_output_grads_list[i.inputs[k]]=[]
+            node_to_output_grads_list[i.inputs[k]].append(vki)
+```
+
+相应的，Tensor类内 `backward()`方法也可以如下实现：初始化`out_grad=1`，然后计算前面结点的梯度。
+```Python
+
+def backward(self, out_grad=None):
+        out_grad = (
+            out_grad if out_grad
+            else init.ones(*self.shape, dtype=self.dtype, device=self.device)
+        )
+        compute_gradient_of_variables(self, out_grad)
+```
+
+## Stage4：神经网络模块搭建
+
+深度学习网络具有高度模块化的特性，因此我们可以把“线性层”“LayerNorm层”等模块封装成一个个`Module`。
+
+在正式定义`Module`类前，我们先定义`Parameter`类，以标明哪些是模型参数：
+
+```Python
+class Parameter(Tensor):
+    """A special kind of tensor that represents parameters."""
+```
+
+`Module`类应该有以下接口：
+- `__call__()`:  返回前向传播的值。
+- `parameters()`:  返回模型以及子模型的参数，以方便优化器更新。
+- `train()`及`eval()`:  改变模型及子模型的`training`变量。
+
+`Module`无需定义`backward()`方法，因为反向传播运算通过`Tensor`的`backward()`就可以实现。
+
+实现如下：
+```Python
+
+class Module:
+    def __init__(self):
+        self.training = True
+
+    def parameters(self) -> List[Tensor]:
+        """Return the list of parameters in the module
+        _unpack_params(value: object) -> List[Tensor]:
+        iteratively unpack self's and children's params"""
+        return _unpack_params(self.__dict__)
+
+    def _children(self) -> List["Module"]:
+        return _child_modules(self.__dict__)
+
+    def eval(self):
+        self.training = False
+        for m in self._children():
+            m.training = False
+
+    def train(self):
+        self.training = True
+        for m in self._children():
+            m.training = True
+
+    def __call__(self, *args, **kwargs):
+        return self.forward(*args, **kwargs)
+    
+    def forward(self,*args, **kwargs):
+        raise NotImplementedError()
+
+```
+
+示例：Linear层：
+
+```Python
+
+class Linear(Module):
+    def __init__(self, in_features, out_features, bias=True):
+        super().__init__()
+        self.in_features = in_features
+        self.out_features = out_features
+        self.bias=bias
+        self.weight=Parameter(init.kaiming_uniform(fan_in=in_features,fan_out=out_features))  
+        if bias:
+            self.bias=Parameter(init.kaiming_uniform(out_features,1).transpose())
+
+    def forward(self, X: Tensor) -> Tensor:
+        N=X.shape[0]
+        if self.bias:
+            return X@self.weight + self.bias.broadcast_to((N,self.out_features))
+        else:
+            return X@self.weight 
+```
+
+## Stage5: NDArray实现
+
+到此为止，我们实现了深度学习框架的大部分基本组件。只需要再加入初始化、优化器、DataLoader等组件，就可以基本完成一个神经网络的训练。但是，Tensor内部的数据如何表示？NDArray到底应该怎么样实现？
+
+先来看[NumPy Internals](https://docs.scipy.org/doc/numpy-1.17.0/reference/internals.html) 对 NumPy Array的解释：
+
+> &nbsp;
+>NumPy arrays consist of two major components, the **raw array data** (from now on, referred to as the data buffer), and the **information about the raw array data**. The data buffer is typically what people think of as arrays in C or Fortran, a **contiguous (and fixed) block of memory containing fixed sized data items**. NumPy also contains a significant set of data that describes **how to interpret the data in the data buffer**.
+> &nbsp;
+
+NDArray包括2个元素:**实际存储的数据**和**如何索引并解释存储的数据**。在大部分NDArray操作中，会尽量避免复制原来的数据，而是改变数据的解释方式。这样做节省了空间，同时部分操作无需改变原始数据，只需改变相应的stride,shape和offset即可。
+
+NDArray的变量如下：
+
+- `device` - 类型为 `BackendDevice`,指向相应的设备后端(如CPU、CUDA)。
+- `handle` - device对应库中的`Array`类，指向实际的数据。
+- `shape` - 数组各维度的大小。
+- `strides` - 数组各维度的“步幅”，即步幅相应索引加 1 时实际数据的位置改变多少。如`strides=(3,2)`时，`a[3][2]`相当于实际数据`a[13]`对应的数。
+- `offset` - 指定NDArray数据在`handle`处加上多少偏移开始。
+
+参考实现：
+
+```Python
+
+class NDArray:
+    def __init__(self, other, device=None):
+        """Create by copying another NDArray, or from numpy"""
+        if isinstance(other, NDArray):
+            # create a copy of existing NDArray
+            if device is None:
+                device = other.device
+            self._init(other.to(device) + 0.0)  # this creates a copy
+        elif isinstance(other, np.ndarray):
+            # create copy from numpy array
+            device = device if device is not None else default_device()
+            array = self.make(other.shape, device=device)
+            array.device.from_numpy(np.ascontiguousarray(other), array._handle)
+            self._init(array)
+        else:
+            #create a numpy array from input
+            array = NDArray(np.array(other), device=device)
+            self._init(array)
+    def _init(self, other):
+        """copy other's device,handle,shape,stride,offset to self"""
+        ...
+    def make(shape, strides=None, device=None, handle=None, offset=0):
+        """Create a new NDArray with the given properties."""
+        array = NDArray.__new__(NDArray)
+        array._shape = tuple(shape)
+        array._strides = NDArray.compact_strides(shape) if strides is None else strides
+        array._offset = offset
+        array._device = device if device is not [None else default_device()
+        if handle is None:
+            array._handle = array.device.Array(prod(shape))
+        else:
+            array._handle = handle
+        return array
+    def as_strided(self, shape, strides):
+        """Restride the matrix without copying memory."""
+        assert len(shape) == len(strides)
+        return NDArray.make(
+            shape, strides=strides, device=self.device, handle=self._handle, offset=self._offset
+        )
+```
+
+## Stage6: 设备后端绑定
+
+NDArray的转置、广播、索引等运算只需改变stride/shape/offset即可，在Python即可实现；而算术运算（加减乘除等）需要涉及handle里面的数据，且Python实现相对耗时较多，所以需要绑定对应的设备后端实现。
+
+先定义`backend_device`类型（也就是device的数据类型）：
+
+```Python
+
+class BackendDevice:
+    """A backend device, wraps the implementation module."""
+    def __init__(self, name, mod):
+        self.name = name
+        self.mod = mod
+    def __getattr__(self, name):
+        return getattr(self.mod, name)
+    ...
+```
+
+其中，`self.mod`为BackendDevice对应的库(如`ndarray_backend_cpu`、`ndarray_backend_cuda`等)，`__getattr__`方法则规定了调用device的方法的时候会调用device对应库的方法。
+
+相应的，NDArray涉及底层数据的一部分方法就可以交给设备后端了：
+```Python
+class NDArray:
+    ...
+    def __pow__(self, other):
+        out = NDArray.make(self.shape, device=self.device)
+        self.device.scalar_power(self.compact()._handle, other, out._handle)
+        return out
+    ...
+```
+
+上述程序调用了`self.device`对应库的标量乘方方法。
+
+最后，可以通过pybind11来实现C++源码到python module的转换。具体方式见[文档](https://pybind11.readthedocs.io/en/stable/index.html)。
+
+示例片段：
+```c++
+#include <pybind11/numpy.h>
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+
+// array dtype
+struct AlignedArray {
+  AlignedArray(const size_t size) {
+    int ret = posix_memalign((void**)&ptr, ALIGNMENT, size * ELEM_SIZE);
+    if (ret != 0) throw std::bad_alloc();
+    this->size = size;
+  }
+  ~AlignedArray() { free(ptr); }
+  size_t ptr_as_int() {return (size_t)ptr; }
+  scalar_t* ptr;
+  size_t size;
+};
+
+//...
+void ScalarPower(const AlignedArray& a, scalar_t val, AlignedArray* out) {
+  for (size_t i = 0; i < a.size; i++) {
+    out->ptr[i] = std::pow(a.ptr[i] , val);
+  }
+}
+//...
+
+PYBIND11_MODULE(ndarray_backend_cpu, m) {
+  namespace py = pybind11;
+
+  //define device.Array
+  py::class_<AlignedArray>(m, "Array")
+      .def(py::init<size_t>(), py::return_value_policy::take_ownership)
+      .def("ptr", &AlignedArray::ptr_as_int)
+      .def_readonly("size", &AlignedArray::size);
+
+  //define device.__device_name__
+  m.attr("__device_name__") = "cpu";  
+  //...
+
+  //define device.scalar_power
+  m.def("scalar_power", ScalarPower);
+  //...
+}
+```
\ No newline at end of file
diff --git a/docs/static/ai-ml/dl-library/computational-graph.png b/docs/static/ai-ml/dl-library/computational-graph.png
new file mode 100644
index 00000000..16942efe
Binary files /dev/null and b/docs/static/ai-ml/dl-library/computational-graph.png differ