From 128598ffe1eb77ab50ccde84eb437999000e57da Mon Sep 17 00:00:00 2001 From: Ricardo-shuo-liu <13838152117@139.com> Date: Thu, 30 Oct 2025 07:13:56 +0800 Subject: [PATCH 1/2] fix-a2-a5 --- _typos.toml | 5 ----- docs/design/mkldnn/inplace/inplace.md | 2 +- docs/design/network/deep_speech_2.md | 2 +- .../api_design_guidelines_standard_cn.md | 2 +- docs/guides/paddle_v3_features/paddle_ir_cn.md | 4 ++-- 5 files changed, 5 insertions(+), 10 deletions(-) diff --git a/_typos.toml b/_typos.toml index d3031fb6f45..d8e41db1ba1 100644 --- a/_typos.toml +++ b/_typos.toml @@ -23,11 +23,6 @@ Nervana = "Nervana" # These words need to be fixed Accuray = "Accuray" -Adventages = "Adventages" -Archetecture = "Archetecture" -Asynchoronous = "Asynchoronous" -Attrbute = "Attrbute" -Attribtue = "Attribtue" Creenshot = "Creenshot" Embeddding = "Embeddding" Embeding = "Embeding" diff --git a/docs/design/mkldnn/inplace/inplace.md b/docs/design/mkldnn/inplace/inplace.md index cc3e4821e8d..5e4f0ae7669 100644 --- a/docs/design/mkldnn/inplace/inplace.md +++ b/docs/design/mkldnn/inplace/inplace.md @@ -15,7 +15,7 @@ Currently assumption is that if operator can have in-place processing then all i - gelu* - sum** -Adventages of in-place computation are: +Advantages of in-place computation are: * lower memory usage * improved performance of operators diff --git a/docs/design/network/deep_speech_2.md b/docs/design/network/deep_speech_2.md index 5497c022bd4..aaaab971bb5 100644 --- a/docs/design/network/deep_speech_2.md +++ b/docs/design/network/deep_speech_2.md @@ -117,7 +117,7 @@ The classical DS2 network contains 15 layers (from bottom to top):

-Figure 1. Archetecture of Deep Speech 2 Network. +Figure 1. Architecture of Deep Speech 2 Network.
We don't have to persist on this 2-3-7-1-1-1 depth \[[2](#references)\]. Similar networks with different depths might also work well. As in \[[1](#references)\], authors use a different depth (e.g. 2-2-3-1-1-1) for final experiments. diff --git a/docs/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.md b/docs/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.md index d13d20995cb..c28310924ba 100644 --- a/docs/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.md +++ b/docs/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.md @@ -538,7 +538,7 @@ | 级联 | coalesced | | | 数据并行 | data parallelism | | | 模型并行 | model parallelism | | -| 异步随机梯度下降 | Asynchoronous Stochastic Gradient Descent | | +| 异步随机梯度下降 | Asynchronous Stochastic Gradient Descent | | | 参数服务器 | parameter server | | | 模型压缩 | model compression | | | 动态结构 | dynamic structure | | diff --git a/docs/guides/paddle_v3_features/paddle_ir_cn.md b/docs/guides/paddle_v3_features/paddle_ir_cn.md index 567399b09fe..afc1fbb043a 100644 --- a/docs/guides/paddle_v3_features/paddle_ir_cn.md +++ b/docs/guides/paddle_v3_features/paddle_ir_cn.md @@ -9,7 +9,7 @@ 在深度学习框架 IR 概念中,「顺序性」和「图语义」是两个非常高频常用的概念。旧的中间表示体系由「顺序性」ProgramDesc 和「图语义」Graph 两个核心类共同承载。用户在静态图 API 或者动转静模块下,产生的中间表示是 Op-by-Op 的 Program,如果要应用更高层面的优化策略(比如算子融合、inplace 策略、剪枝等),框架会将由 Program 构造出 Graph,其由数据节点、算子节点和彼此关联的边构成。 -在新的 Paddle IR 中,飞桨在底层抽象了一套高度可扩展的基础组件,包括 Type、Attrbute、Op、Trait 和 Interface,并引入了 Dialect 的概念,支持开发者灵活扩展、自由定制,提供了完备鲁邦的语义表达能力;在模型表示层,通过多 Dialect 模块化管理,统一多端表示,实现了训推一体的全架构统一表示,无缝衔接组合算子和编译器,支持自动优化和多硬件适配;在图变换层,通过统一底层模块,简化基础概念,向用户提供了低成本开发、易用高性能、丰富可插拔的 Pass 优化机制。 +在新的 Paddle IR 中,飞桨在底层抽象了一套高度可扩展的基础组件,包括 Type、Attribute、Op、Trait 和 Interface,并引入了 Dialect 的概念,支持开发者灵活扩展、自由定制,提供了完备鲁邦的语义表达能力;在模型表示层,通过多 Dialect 模块化管理,统一多端表示,实现了训推一体的全架构统一表示,无缝衔接组合算子和编译器,支持自动优化和多硬件适配;在图变换层,通过统一底层模块,简化基础概念,向用户提供了低成本开发、易用高性能、丰富可插拔的 Pass 优化机制。 飞桨的新一代的 IR 表示坚持 SSA(静态单赋值)原则,模型等价于一个有向无环图。并以 Value、Operation 对计算图进行抽象, Operation 为节点,Value 为边。 * Operation 表示计算图中的节点:一个 Operation 表示一个算子,它里面包含了零个或多个 Region;Region 表示一个闭包,它里面包含了零个或多个 Block;Block 表示一个符合 SSA 的基本块,里面包含了零个或多个 Operation;三者循环嵌套,可以实现任意复杂的语法结构 @@ -96,7 +96,7 @@ print(out) 如上左图所示,新一代 IR 的整体设计自底向上分为三层: ### 1.灵活的基础组件 -飞桨提供了 Trait 和 Interface 两种重要机制实现了对算子 Op 的特征和接口的抽象标记。 比如 InplaceTrait 表示一个 Op 具有 Inplace 特征, InferShapeInterface 表示一个算子定义了 InferShape 函数接口等,这二者都是可以任意扩展的,只要派生自相应的基类、遵循相应的实现规则即可;并对算子体系下核心概念抽出 Type、Attrbute、Op,这三者是基于 Trait 和 Interface 进行定义的。它们会对关联自己所拥有的相应 Trait 和 Interface ;Dialect 用来对 Type、Attribtue、Op 做模块化管理, 比如 BuiltinDialect、PaddleDialect、CinnDialect 等等。一个 Dialect 里面包含了一系列的 Type、Attribtue、Op 的定义。相应的,每个 Type、Attribtue、Op 都是定义在某个唯一的 Dialect 里面。对整个 IR 框架而言, Dialect 是可以随意插拔的,也是可以任意扩展的。 +飞桨提供了 Trait 和 Interface 两种重要机制实现了对算子 Op 的特征和接口的抽象标记。 比如 InplaceTrait 表示一个 Op 具有 Inplace 特征, InferShapeInterface 表示一个算子定义了 InferShape 函数接口等,这二者都是可以任意扩展的,只要派生自相应的基类、遵循相应的实现规则即可;并对算子体系下核心概念抽出 Type、Attribute、Op,这三者是基于 Trait 和 Interface 进行定义的。它们会对关联自己所拥有的相应 Trait 和 Interface ;Dialect 用来对 Type、Attribute、Op 做模块化管理, 比如 BuiltinDialect、PaddleDialect、CinnDialect 等等。一个 Dialect 里面包含了一系列的 Type、Attribute、Op 的定义。相应的,每个 Type、Attribute、Op 都是定义在某个唯一的 Dialect 里面。对整个 IR 框架而言, Dialect 是可以随意插拔的,也是可以任意扩展的。 这一层是 IR 适应多种场景的基础。这一层的每一个要素都是可定制化扩展的,一般情况下,针对一个具体的场景,比如分布式、编译器。都需要定义自己需要用到的 Trait、Interface,然后定义自己的 Dialect,在自己的 Dialect 里面,定义自己需要用到的 Type、Attribute、Op。 From bba7376d9a551b30407f44475a22c18e89e7f7ea Mon Sep 17 00:00:00 2001 From: Ricardo-shuo-liu <13838152117@139.com> Date: Thu, 30 Oct 2025 17:27:50 +0800 Subject: [PATCH 2/2] 
fix-c8-c9-c16-c18 --- _typos.toml | 5 ----- docs/api/paddle/nn/functional/sparse_attention_cn.rst | 4 ++-- docs/design/concurrent/parallel_do.md | 2 +- docs/design/mkldnn/gru/gru.md | 2 +- docs/design/others/graph_survey.md | 2 +- docs/guides/advanced/layer_and_model_en.md | 2 +- docs/templates/common_docs.py | 4 ++-- 7 files changed, 8 insertions(+), 13 deletions(-) diff --git a/_typos.toml b/_typos.toml index d8f38ffaa07..b536cac02a9 100644 --- a/_typos.toml +++ b/_typos.toml @@ -40,12 +40,7 @@ Successed = "Successed" accordding = "accordding" accoustic = "accoustic" accpetance = "accpetance" -cantains = "cantains" classfy = "classfy" -cliping = "cliping" -colunms = "colunms" -containg = "containg" -contruction = "contruction" contxt = "contxt" convertion = "convertion" convinience = "convinience" diff --git a/docs/api/paddle/nn/functional/sparse_attention_cn.rst b/docs/api/paddle/nn/functional/sparse_attention_cn.rst index fd4fa08f4f0..46e0edde37e 100755 --- a/docs/api/paddle/nn/functional/sparse_attention_cn.rst +++ b/docs/api/paddle/nn/functional/sparse_attention_cn.rst @@ -8,7 +8,7 @@ sparse_attention 对 Transformer 模块中的 Attention 矩阵进行了稀疏化,从而减少内存消耗和计算量。 -其稀疏数据排布通过 CSR 格式表示,CSR 格式包含两个参数,``offset`` 和 ``colunms``。计算公式为: +其稀疏数据排布通过 CSR 格式表示,CSR 格式包含两个参数,``offset`` 和 ``columns``。计算公式为: .. math:: result=softmax(\frac{ Q * K^T }{\sqrt{d}}) * V @@ -24,7 +24,7 @@ sparse_attention - **key** (Tensor) - 输入的 Tensor,代表注意力模块中的 ``key``,这是一个 4 维 Tensor,形状为:[batch_size, num_heads, seq_len, head_dim],数据类型为 float32 或 float64。 - **value** (Tensor) - 输入的 Tensor,代表注意力模块中的 ``value``,这是一个 4 维 Tensor,形状为:[batch_size, num_heads, seq_len, head_dim],数据类型为 float32 或 float64。 - **sparse_csr_offset** (Tensor) - 输入的 Tensor,注意力模块中的稀疏特性,稀疏特性使用 CSR 格式表示,``offset`` 代表矩阵中每一行非零元的数量。这是一个 3 维 Tensor,形状为:[batch_size, num_heads, seq_len + 1],数据类型为 int32。 - - **sparse_csr_columns** (Tensor) - 输入的 Tensor,注意力模块中的稀疏特性,稀疏特性使用 CSR 格式表示,``colunms`` 代表矩阵中每一行非零元的列索引值。这是一个 3 维 Tensor,形状为:[batch_size, num_heads, sparse_nnz],数据类型为 int32。 + - **sparse_csr_columns** (Tensor) - 输入的 Tensor,注意力模块中的稀疏特性,稀疏特性使用 CSR 格式表示,``columns`` 代表矩阵中每一行非零元的列索引值。这是一个 3 维 Tensor,形状为:[batch_size, num_heads, sparse_nnz],数据类型为 int32。 返回 ::::::::: diff --git a/docs/design/concurrent/parallel_do.md b/docs/design/concurrent/parallel_do.md index c5b6c094c6a..6e20b9b7b14 100644 --- a/docs/design/concurrent/parallel_do.md +++ b/docs/design/concurrent/parallel_do.md @@ -113,7 +113,7 @@ We can avoid this step by making each device have a copy of the parameter. This 1. In the backward, allreduce param@grad at different devices, this requires 1. `backward.py` add `allreduce` operators at parallel_do_grad 1. `allreduce` operators need to be called in async mode to achieve maximum throughput -1. apply gradients related op(i.e. cliping, normalization, decay, sgd) on different devices in parallel +1. apply gradients related op(i.e. clipping, normalization, decay, sgd) on different devices in parallel By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device". And the ProgramDesc looks like the following diff --git a/docs/design/mkldnn/gru/gru.md b/docs/design/mkldnn/gru/gru.md index c4f719e2886..bfd4bfa7918 100644 --- a/docs/design/mkldnn/gru/gru.md +++ b/docs/design/mkldnn/gru/gru.md @@ -41,7 +41,7 @@ Proof: PaddlePaddle allows user to choose activation functions for update/reset gate and output gate. However, oneDNN supports only default `sigmoid` activation for gates and `tanh` for output. 
Currently oneDNN operator throws an error when user tries to execute it with other activations.
 
 ## oneDNN GRU operator
-oneDNN `GRU` operator is based on Paddle Paddle `fusion_gru` operator. It uses primitive/memory caching mechanism called `AcquireAPI`. Handler containg 2 caching key, one dependent on sentence length used in caching input/output and primitive. The other key (`memory_key`) depends only on other, not changing during inference, parameters and is used to cache weights and bias memory.
+The oneDNN `GRU` operator is based on the PaddlePaddle `fusion_gru` operator. It uses a primitive/memory caching mechanism called `AcquireAPI`. The handler contains 2 caching keys: one depends on the sentence length and is used for caching the input/output and the primitive; the other key (`memory_key`) depends only on parameters that do not change during inference and is used to cache the weights and bias memory.
 
 ### Dimensions in oneDNN RNN primitives
 
diff --git a/docs/design/others/graph_survey.md b/docs/design/others/graph_survey.md
index a3690d1f190..d713849af2f 100644
--- a/docs/design/others/graph_survey.md
+++ b/docs/design/others/graph_survey.md
@@ -30,7 +30,7 @@ def get_symbol(num_classes=10, **kwargs):
 
 Variable here is actually a Symbol. Every basic Symbol will correspond to one Node, and every Node has its own AnyAttr. There is a op field in AnyAttr class, when a Symbol represents Variable(often input data), the op field is null.
 
-Symbol contains a data member, std::vector outputs, and NodeEntry cantains a pointer to Node. We can follow the Node pointer to get all the Graph.
+Symbol contains a data member, std::vector outputs, and NodeEntry contains a pointer to Node. We can follow the Node pointer to traverse the whole graph.
 
 And Symbol can be saved to a JSON file.
 
diff --git a/docs/guides/advanced/layer_and_model_en.md b/docs/guides/advanced/layer_and_model_en.md
index 35f2214a6e3..e6086e7fd6e 100644
--- a/docs/guides/advanced/layer_and_model_en.md
+++ b/docs/guides/advanced/layer_and_model_en.md
@@ -11,7 +11,7 @@ In this guide, you will learn how to define and make use of models in Paddle, an
 
 In Paddle, most models consist of a series of layers. Layer serves as the foundamental logical unit of a model, composed of two parts: the variable that participates in the computation and the operator(s) that actually perform the execution.
 
-Constructing a model from scratch could be painful, with tons of nested codes to write and maintain. To make life easier, Paddle provides foundamental data structure ``paddle.nn.Layer`` to simplify the contruction of layer or model. One may easily inherit from ``paddle.nn.Layer`` to define their custom layers or models. In addition, since both model and layer are essentially inherited from ``paddle.nn.Layer``, model is nothing but a special layer in Paddle.
+Constructing a model from scratch could be painful, with tons of nested code to write and maintain. To make life easier, Paddle provides the fundamental data structure ``paddle.nn.Layer`` to simplify the construction of layers and models. One may easily inherit from ``paddle.nn.Layer`` to define custom layers or models. In addition, since both models and layers are essentially inherited from ``paddle.nn.Layer``, a model is nothing but a special layer in Paddle.
Now let us construct a model using ``paddle.nn.Layer``: diff --git a/docs/templates/common_docs.py b/docs/templates/common_docs.py index 93d0c09deae..594f05961da 100644 --- a/docs/templates/common_docs.py +++ b/docs/templates/common_docs.py @@ -23,12 +23,12 @@ stride (tuple|int): The stride size. It can be a single integer or a tuple containing two integers, representing the strides of the convolution along the height and width. If it is a single integer, the height and width are equal to the integer. Default is 1. groups (int, optional): The group number of convolution layer. When group=n, the input and convolution kernels are divided into n groups equally, the first group of convolution kernels and the first group of inputs are subjected to convolution calculation, the second group of convolution kernels and the second group of inputs are subjected to convolution calculation, ……, the nth group of convolution kernels and the nth group of inputs perform convolution calculations. Default is 1. regularization (WeightDecayRegularizer, optional): The strategy of regularization. There are two method: :ref:`api_fluid_regularizer_L1Decay` 、 :ref:`api_fluid_regularizer_L2Decay` . If a parameter has set regularizer using :ref:`api_fluid_ParamAttr` already, the regularization setting here in optimizer will be ignored for this parameter. Otherwise, the regularization setting here in optimizer will take effect. Default None, meaning there is no regularization. - grad_clip (GradientClipBase, optional): Gradient cliping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three cliping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping. + grad_clip (GradientClipBase, optional): Gradient clipping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three clipping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping. dilation (tuple|int): The dilation size. It can be a single integer or a tuple containing two integers, representing the height and width of dilation of the convolution kernel elements. If it is a single integer,the height and width of dilation are equal to the integer. Default is 1. stop_gradient (bool, optional): A boolean that mentions whether gradient should flow. Default is True, means stop calculate gradients. force_cpu (bool, optional): Whether force to store the output tensor in CPU memory. If force_cpu is False, the output tensor will be stored in running device memory, otherwise it will be stored to the CPU memory. Default is False. data_format (str, optional): Specify the input data format, the output data format will be consistent with the input, which can be ``NCHW`` or ``NHWC`` . N is batch size, C is channels, H is height, and W is width. Default is ``NCHW`` . - grad_clip (GradientClipBase, optional): Gradient cliping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three cliping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default is None, meaning there is no gradient clipping. 
+ grad_clip (GradientClipBase, optional): Gradient clipping strategy, it's an instance of some derived class of ``GradientClipBase`` . There are three clipping strategies ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` , :ref:`api_fluid_clip_GradientClipByValue` ). Default is None, meaning there is no gradient clipping. num_filters (int): The number of filter. It is as same as the output channals numbers. dim (int, optional): A dimension along which to operate. Default is 0. is_sparse (bool, optional): Whether use sparse updating. For more information, please refer to :ref:`api_guide_sparse_update_en` . If it's True, it will use sparse updating.
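For readers of the ``grad_clip`` docstring entries touched in the last hunk, a minimal dynamic-graph sketch of how a clipping strategy is passed to an optimizer; it assumes the Paddle 2.x names `paddle.nn.ClipGradByGlobalNorm` and `paddle.optimizer.SGD`, and the toy model below is illustrative only, not part of this patch:

```python
import paddle

# Toy model: any paddle.nn.Layer subclass works the same way.
model = paddle.nn.Linear(in_features=10, out_features=1)

# Clip gradients by their global L2 norm before each parameter update.
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)

# The optimizer applies the strategy passed via grad_clip during step().
opt = paddle.optimizer.SGD(
    learning_rate=0.1,
    parameters=model.parameters(),
    grad_clip=clip,
)

x = paddle.rand([4, 10])
loss = model(x).mean()
loss.backward()
opt.step()        # gradients are clipped, then parameters are updated
opt.clear_grad()
```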