
Commit bba7376

fix-c8-c9-c16-c18
1 parent cc0397e commit bba7376

7 files changed: +8 -13 lines changed


_typos.toml

Lines changed: 0 additions & 5 deletions
@@ -40,12 +40,7 @@ Successed = "Successed"
 accordding = "accordding"
 accoustic = "accoustic"
 accpetance = "accpetance"
-cantains = "cantains"
 classfy = "classfy"
-cliping = "cliping"
-colunms = "colunms"
-containg = "containg"
-contruction = "contruction"
 contxt = "contxt"
 convertion = "convertion"
 convinience = "convinience"

docs/api/paddle/nn/functional/sparse_attention_cn.rst

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ sparse_attention

 Sparsifies the Attention matrix in the Transformer module, reducing memory consumption and computation.

-The sparse data layout is described in CSR format, which consists of two parameters, ``offset`` and ``colunms``. The computation formula is:
+The sparse data layout is described in CSR format, which consists of two parameters, ``offset`` and ``columns``. The computation formula is:

 .. math::
     result = softmax(\frac{Q * K^T}{\sqrt{d}}) * V
@@ -24,7 +24,7 @@ sparse_attention
 - **key** (Tensor) - An input Tensor representing the ``key`` in the attention module; a 4-D Tensor of shape [batch_size, num_heads, seq_len, head_dim] with dtype float32 or float64.
 - **value** (Tensor) - An input Tensor representing the ``value`` in the attention module; a 4-D Tensor of shape [batch_size, num_heads, seq_len, head_dim] with dtype float32 or float64.
 - **sparse_csr_offset** (Tensor) - An input Tensor describing the sparsity of the attention module in CSR format; ``offset`` records the number of non-zero elements in each row of the matrix. A 3-D Tensor of shape [batch_size, num_heads, seq_len + 1] with dtype int32.
-- **sparse_csr_columns** (Tensor) - An input Tensor describing the sparsity of the attention module in CSR format; ``colunms`` records the column indices of the non-zero elements in each row. A 3-D Tensor of shape [batch_size, num_heads, sparse_nnz] with dtype int32.
+- **sparse_csr_columns** (Tensor) - An input Tensor describing the sparsity of the attention module in CSR format; ``columns`` records the column indices of the non-zero elements in each row. A 3-D Tensor of shape [batch_size, num_heads, sparse_nnz] with dtype int32.

 Returns
 :::::::::
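
For context on the ``columns`` rename above, a minimal sketch of calling ``paddle.nn.functional.sparse_attention`` with a fully dense CSR pattern might look like the following. The shapes follow the parameter descriptions in the diff, and the call assumes a Paddle build and device on which the sparse_attention operator is available (it is typically GPU-only), so treat it as illustrative rather than as part of this change.

```python
# Illustrative sketch: a fully dense CSR pattern for a tiny attention head.
import numpy as np
import paddle
import paddle.nn.functional as F

batch_size, num_heads, seq_len, head_dim = 1, 1, 4, 8

q = paddle.rand([batch_size, num_heads, seq_len, head_dim], dtype="float32")
k = paddle.rand([batch_size, num_heads, seq_len, head_dim], dtype="float32")
v = paddle.rand([batch_size, num_heads, seq_len, head_dim], dtype="float32")

# offset: running count of non-zeros per row -> shape [1, 1, seq_len + 1]
offset = paddle.to_tensor(np.array([[[0, 4, 8, 12, 16]]], dtype=np.int32))
# columns: column indices of the non-zeros  -> shape [1, 1, sparse_nnz]
columns = paddle.to_tensor(
    np.tile(np.arange(seq_len, dtype=np.int32), seq_len).reshape(1, 1, -1))

# result = softmax(Q * K^T / sqrt(d)) * V, evaluated only at the
# positions described by (offset, columns).
out = F.sparse_attention(q, k, v, offset, columns)
print(out.shape)  # [1, 1, 4, 8]
```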

docs/design/concurrent/parallel_do.md

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ We can avoid this step by making each device have a copy of the parameter. This
 1. In the backward, allreduce param@grad at different devices, this requires
 1. `backward.py` add `allreduce` operators at parallel_do_grad
 1. `allreduce` operators need to be called in async mode to achieve maximum throughput
-1. apply gradients related op(i.e. cliping, normalization, decay, sgd) on different devices in parallel
+1. apply gradients related op(i.e. clipping, normalization, decay, sgd) on different devices in parallel

 By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device".
 And the ProgramDesc looks like the following
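
The ProgramDesc referred to in the trailing context line is outside this hunk. Separately, purely as a framework-free sketch of the per-device flow in the fixed list item (hypothetical helper names, not the parallel_do implementation): each replica's param@grad is allreduced, and the gradient-related ops (clipping, decay, the sgd step) then run on every replica in parallel.

```python
# Framework-free sketch of the flow above; not Paddle code.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def allreduce(grads):
    """Sum per-device gradients so every replica sees the same param@grad."""
    total = np.sum(grads, axis=0)
    return [total.copy() for _ in grads]

def apply_update(param, grad, lr=0.01, max_norm=1.0, decay=1e-4):
    norm = np.linalg.norm(grad)
    if norm > max_norm:              # gradient clipping
        grad = grad * (max_norm / norm)
    grad = grad + decay * param      # weight decay
    return param - lr * grad         # sgd step

params = [np.ones(4) for _ in range(2)]          # one parameter replica per device
grads = [np.random.randn(4) for _ in range(2)]   # per-device param@grad

reduced = allreduce(grads)
with ThreadPoolExecutor() as pool:               # updates run per device, in parallel
    params = list(pool.map(apply_update, params, reduced))
```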

docs/design/mkldnn/gru/gru.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ Proof:
 PaddlePaddle allows user to choose activation functions for update/reset gate and output gate. However, oneDNN supports only default `sigmoid` activation for gates and `tanh` for output. Currently oneDNN operator throws an error when user tries to execute it with other activations.

 ## oneDNN GRU operator
-oneDNN `GRU` operator is based on Paddle Paddle `fusion_gru` operator. It uses primitive/memory caching mechanism called `AcquireAPI`. Handler containg 2 caching key, one dependent on sentence length used in caching input/output and primitive. The other key (`memory_key`) depends only on other, not changing during inference, parameters and is used to cache weights and bias memory.
+oneDNN `GRU` operator is based on Paddle Paddle `fusion_gru` operator. It uses primitive/memory caching mechanism called `AcquireAPI`. Handler containing 2 caching key, one dependent on sentence length used in caching input/output and primitive. The other key (`memory_key`) depends only on other, not changing during inference, parameters and is used to cache weights and bias memory.

 ### Dimensions in oneDNN RNN primitives
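
Stepping back to the corrected sentence (the dimensions heading above is only trailing context of this hunk): it describes a two-key caching scheme. A plain-Python illustration of that idea, with hypothetical placeholder objects rather than the actual C++ `AcquireAPI` handler, is sketched below: one key embeds the sentence length and caches the primitive plus input/output memory, while the length-independent `memory_key` caches weights and bias once.

```python
# Conceptual sketch of the two caching keys described above; placeholders only.
cache = {}

def acquire_gru(base_key, sentence_len):
    # Key 1: depends on the sentence length -> primitive and input/output memory.
    primitive_key = f"{base_key}@len{sentence_len}"
    if primitive_key not in cache:
        cache[primitive_key] = {"kind": "gru_primitive", "seq_len": sentence_len}

    # Key 2 (memory_key): independent of sentence length -> weights and bias
    # memory are created once and reused for every sequence length.
    memory_key = f"{base_key}@weights"
    if memory_key not in cache:
        cache[memory_key] = {"kind": "weights_and_bias"}

    return cache[primitive_key], cache[memory_key]

acquire_gru("fusion_gru", sentence_len=12)   # creates both entries
acquire_gru("fusion_gru", sentence_len=30)   # new primitive, reused weights
```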

docs/design/others/graph_survey.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ def get_symbol(num_classes=10, **kwargs):

 Variable here is actually a Symbol. Every basic Symbol will correspond to one Node, and every Node has its own AnyAttr. There is a op field in AnyAttr class, when a Symbol represents Variable(often input data), the op field is null.

-Symbol contains a data member, std::vector<NodeEntry> outputs, and NodeEntry cantains a pointer to Node. We can follow the Node pointer to get all the Graph.
+Symbol contains a data member, std::vector<NodeEntry> outputs, and NodeEntry contains a pointer to Node. We can follow the Node pointer to get all the Graph.

 And Symbol can be saved to a JSON file.
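
Since the corrected sentence is about following NodeEntry pointers through the Symbol graph, a minimal sketch of the surveyed MXNet Symbol API (assuming the classic `mxnet` symbolic front end is installed) shows both the graph construction and the JSON serialization mentioned in the last context line:

```python
# Sketch of the MXNet Symbol behaviour surveyed above (assumes `mxnet` is installed).
import mxnet as mx

data = mx.sym.Variable("data")                        # a Symbol whose op field is null
fc = mx.sym.FullyConnected(data=data, num_hidden=10)  # a Symbol backed by an op Node
out = mx.sym.SoftmaxOutput(data=fc, name="softmax")

# Following the Node pointers recovers the whole graph, e.g. all argument nodes:
print(out.list_arguments())   # e.g. ['data', 'fullyconnected0_weight', ...]

# "And Symbol can be saved to a JSON file."
with open("symbol.json", "w") as f:
    f.write(out.tojson())
```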

docs/guides/advanced/layer_and_model_en.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ In this guide, you will learn how to define and make use of models in Paddle, an

 In Paddle, most models consist of a series of layers. Layer serves as the foundamental logical unit of a model, composed of two parts: the variable that participates in the computation and the operator(s) that actually perform the execution.

-Constructing a model from scratch could be painful, with tons of nested codes to write and maintain. To make life easier, Paddle provides foundamental data structure ``paddle.nn.Layer`` to simplify the contruction of layer or model. One may easily inherit from ``paddle.nn.Layer`` to define their custom layers or models. In addition, since both model and layer are essentially inherited from ``paddle.nn.Layer``, model is nothing but a special layer in Paddle.
+Constructing a model from scratch could be painful, with tons of nested codes to write and maintain. To make life easier, Paddle provides foundamental data structure ``paddle.nn.Layer`` to simplify the construction of layer or model. One may easily inherit from ``paddle.nn.Layer`` to define their custom layers or models. In addition, since both model and layer are essentially inherited from ``paddle.nn.Layer``, model is nothing but a special layer in Paddle.

 Now let us construct a model using ``paddle.nn.Layer``:
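
The context paragraph above ends right before the guide's own example, which is outside this hunk. As an illustrative stand-in (not the guide's exact code), a custom model built by inheriting from ``paddle.nn.Layer`` could look like this:

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class MnistModel(nn.Layer):
    """A tiny model: in Paddle, a model is just a special Layer."""

    def __init__(self):
        super().__init__()
        # sub-layers hold the variables (parameters) that join the computation
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        # operators that actually perform the execution
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = MnistModel()
logits = model(paddle.rand([4, 1, 28, 28]))  # forward pass on a random batch
print(logits.shape)                          # [4, 10]
```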

docs/templates/common_docs.py

Lines changed: 2 additions & 2 deletions
@@ -23,12 +23,12 @@
 stride (tuple|int): The stride size. It can be a single integer or a tuple containing two integers, representing the strides of the convolution along the height and width. If it is a single integer, the height and width are equal to the integer. Default is 1.
 groups (int, optional): The group number of convolution layer. When group=n, the input and convolution kernels are divided into n groups equally, the first group of convolution kernels and the first group of inputs are subjected to convolution calculation, the second group of convolution kernels and the second group of inputs are subjected to convolution calculation, ……, the nth group of convolution kernels and the nth group of inputs perform convolution calculations. Default is 1.
 regularization (WeightDecayRegularizer, optional): The strategy of regularization. There are two method: :ref:`api_fluid_regularizer_L1Decay` 、 :ref:`api_fluid_regularizer_L2Decay` . If a parameter has set regularizer using :ref:`api_fluid_ParamAttr` already, the regularization setting here in optimizer will be ignored for this parameter. Otherwise, the regularization setting here in optimizer will take effect. Default None, meaning there is no regularization.
-grad_clip (GradientClipBase, optional): Gradient cliping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three cliping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping.
+grad_clip (GradientClipBase, optional): Gradient clipping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three clipping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping.
 dilation (tuple|int): The dilation size. It can be a single integer or a tuple containing two integers, representing the height and width of dilation of the convolution kernel elements. If it is a single integer, the height and width of dilation are equal to the integer. Default is 1.
 stop_gradient (bool, optional): A boolean that mentions whether gradient should flow. Default is True, means stop calculate gradients.
 force_cpu (bool, optional): Whether force to store the output tensor in CPU memory. If force_cpu is False, the output tensor will be stored in running device memory, otherwise it will be stored to the CPU memory. Default is False.
 data_format (str, optional): Specify the input data format, the output data format will be consistent with the input, which can be ``NCHW`` or ``NHWC`` . N is batch size, C is channels, H is height, and W is width. Default is ``NCHW`` .
-grad_clip (GradientClipBase, optional): Gradient cliping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three cliping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default is None, meaning there is no gradient clipping.
+grad_clip (GradientClipBase, optional): Gradient clipping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three clipping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default is None, meaning there is no gradient clipping.
 num_filters (int): The number of filter. It is as same as the output channals numbers.
 dim (int, optional): A dimension along which to operate. Default is 0.
 is_sparse (bool, optional): Whether use sparse updating. For more information, please refer to :ref:`api_guide_sparse_update_en` . If it's True, it will use sparse updating.
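
Since both fixed lines document the ``grad_clip`` argument, a short sketch of plugging one of the three clipping strategies into an optimizer (using the current ``paddle`` 2.x class names rather than the ``fluid`` references in the template; illustrative only) could be:

```python
import paddle

linear = paddle.nn.Linear(10, 10)

# Clip by global norm; ClipGradByNorm and ClipGradByValue are the other two strategies.
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)

# grad_clip defaults to None, i.e. no gradient clipping.
sgd = paddle.optimizer.SGD(learning_rate=0.1,
                           parameters=linear.parameters(),
                           grad_clip=clip)

loss = linear(paddle.rand([4, 10])).mean()
loss.backward()
sgd.step()          # gradients are clipped before the parameter update
sgd.clear_grad()
```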
