"<h1>Position-wise Feed-Forward Network (FFN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of position-wise feedforward network used in transformer.</p>\n<p>FFN consists of two fully connected layers. Number of dimensions in the hidden layer <span translate=no>_^_0_^_</span>, is generally set to around four times that of the token embedding <span translate=no>_^_1_^_</span>. So it is sometime also called the expand-and-contract network.</p>\n<p>There is an activation at the hidden layer, which is usually set to ReLU (Rectified Linear Unit) activation, <span translate=no>_^_2_^_</span></p>\n<p>That is, the FFN function is, <span translate=no>_^_3_^_</span> where <span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>, <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> are learnable parameters.</p>\n<p>Sometimes the GELU (Gaussian Error Linear Unit) activation is also used instead of ReLU. <span translate=no>_^_8_^_</span> where <span translate=no>_^_9_^_</span></p>\n<h3>Gated Linear Units</h3>\n<p>This is a generic implementation that supports different variants including <a href=\"https://arxiv.org/abs/2002.05202\">Gated Linear Units</a> (GLU). We have also implemented experiments on these:</p>\n<ul><li><a href=\"glu_variants/experiment.html\">experiment that uses <span translate=no>_^_10_^_</span></a> </li>\n<li><a href=\"glu_variants/simple.html\">simpler version from scratch</a></li></ul>\n": "<h1>\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc (FFN)</h1>\n<p>\u8fd9\u662f\u53d8\u538b\u5668\u4e2d\u4f7f\u7528\u7684\u6309\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc\u7684 <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u3002</p>\n<p>FFN \u7531\u4e24\u4e2a\u5b8c\u5168\u8fde\u63a5\u7684\u5c42\u7ec4\u6210\u3002\u9690\u85cf\u5c42\u4e2d\u7684\u7ef4\u5ea6\u6570<span translate=no>_^_0_^_</span>\uff0c\u901a\u5e38\u8bbe\u7f6e\u4e3a\u4ee4\u724c\u5d4c\u5165\u7684\u56db\u500d\u5de6\u53f3<span translate=no>_^_1_^_</span>\u3002\u56e0\u6b64\uff0c\u5b83\u6709\u65f6\u4e5f\u88ab\u79f0\u4e3a\u6269\u5f20\u548c\u6536\u7f29\u7f51\u7edc\u3002</p>\n<p>\u9690\u85cf\u5c42\u6709\u4e00\u4e2a\u6fc0\u6d3b\uff0c\u901a\u5e38\u8bbe\u7f6e\u4e3aRelU\uff08\u6574\u6d41\u7ebf\u6027\u5355\u5143\uff09\u6fc0\u6d3b\uff0c<span translate=no>_^_2_^_</span></p>\n<p>\u4e5f\u5c31\u662f\u8bf4\uff0cFFN \u51fd\u6570\u662f\u3001<span translate=no>_^_3_^_</span>\u5176\u4e2d<span translate=no>_^_4_^_</span><span translate=no>_^_5_^_</span>\u3001<span translate=no>_^_6_^_</span>\u548c<span translate=no>_^_7_^_</span>\u662f\u53ef\u5b66\u4e60\u7684\u53c2\u6570\u3002</p>\n<p>\u6709\u65f6\u8fd8\u4f1a\u4f7f\u7528 GELU\uff08\u9ad8\u65af\u8bef\u5dee\u7ebf\u6027\u5355\u4f4d\uff09\u6fc0\u6d3b\u6765\u4ee3\u66ff RelU\u3002<span translate=no>_^_8_^_</span>\u5728\u54ea\u91cc<span translate=no>_^_9_^_</span></p>\n<h3>\u95e8\u63a7\u7ebf\u6027\u5355\u5143</h3>\n<p>\u8fd9\u662f\u4e00\u4e2a\u901a\u7528\u5b9e\u73b0\uff0c\u652f\u6301\u4e0d\u540c\u7684\u53d8\u4f53\uff0c\u5305\u62ec<a href=\"https://arxiv.org/abs/2002.05202\">\u95e8\u63a7\u7ebf\u6027\u5355\u5143</a> (GLU)\u3002\u6211\u4eec\u8fd8\u5bf9\u4ee5\u4e0b\u65b9\u9762\u8fdb\u884c\u4e86\u5b9e\u9a8c\uff1a</p>\n<ul><li><a href=\"glu_variants/experiment.html\">\u4f7f\u7528\u7684\u5b9e\u9a8c<span translate=no>_^_10_^_</span></a></li>\n<li><a href=\"glu_variants/simple.html\">\u4ece\u5934\u5f00\u59cb\u66f4\u7b80\u5355\u7684\u7248\u672c</a></li></ul>\n",
2
+
"<h1>Position-wise Feed-Forward Network (FFN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of position-wise feedforward network used in transformer.</p>\n<p>FFN consists of two fully connected layers. Number of dimensions in the hidden layer <span translate=no>_^_0_^_</span>, is generally set to around four times that of the token embedding <span translate=no>_^_1_^_</span>. So it is sometime also called the expand-and-contract network.</p>\n<p>There is an activation at the hidden layer, which is usually set to ReLU (Rectified Linear Unit) activation, <span translate=no>_^_2_^_</span></p>\n<p>That is, the FFN function is, <span translate=no>_^_3_^_</span> where <span translate=no>_^_4_^_</span>, <span translate=no>_^_5_^_</span>, <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> are learnable parameters.</p>\n<p>Sometimes the GELU (Gaussian Error Linear Unit) activation is also used instead of ReLU. <span translate=no>_^_8_^_</span> where <span translate=no>_^_9_^_</span></p>\n<h3>Gated Linear Units</h3>\n<p>This is a generic implementation that supports different variants including <a href=\"https://arxiv.org/abs/2002.05202\">Gated Linear Units</a> (GLU). We have also implemented experiments on these:</p>\n<ul><li><a href=\"glu_variants/experiment.html\">experiment that uses <span translate=no>_^_10_^_</span></a> </li>\n<li><a href=\"glu_variants/simple.html\">simpler version from scratch</a></li></ul>\n": "<h1>\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc (FFN)</h1>\n<p>\u8fd9\u662f Transformer \u4e2d\u4f7f\u7528\u7684\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc\u7684 <a href=\"https://pytorch.org\"> PyTorch </a> \u5b9e\u73b0\u3002</p>\n<p> FFN \u7531\u4e24\u4e2a\u5168\u8fde\u63a5\u5c42\u7ec4\u6210\u3002\u9690\u85cf\u5c42\u4e2d\u7684\u7ef4\u5ea6\u6570<span translate=no>_%5e_0_%5e_</span>\u901a\u5e38\u8bbe\u7f6e\u4e3a\u6807\u8bb0\u5d4c\u5165\u7ef4\u5ea6<span translate=no>_%5e_1_%5e_</span>\u7684\u56db\u500d\u5de6\u53f3\u3002\u56e0\u6b64\uff0c\u5b83\u6709\u65f6\u4e5f\u88ab\u79f0\u4e3a\u6269\u5f20-\u538b\u7f29\u7f51\u7edc\u3002</p>\n<p>\u9690\u85cf\u5c42\u6709\u4e00\u4e2a\u6fc0\u6d3b\u51fd\u6570\uff0c\u901a\u5e38\u8bbe\u7f6e\u4e3a ReLU (Rectified Linear Unit) \u6fc0\u6d3b\u51fd\u6570\uff0c<span translate=no>_%5e_2_%5e_</span></p>\n<p>\u5728\u6b64\u57fa\u7840\u4e0a\uff0c FFN \u51fd\u6570\u53ef\u4ee5\u5199\u4f5c\uff1a<span translate=no>_%5e_3_%5e_</span>\u5176\u4e2d<span translate=no>_%5e_4_%5e_</span><span translate=no>_%5e_5_%5e_</span>\u3001<span translate=no>_%5e_6_%5e_</span>\u548c<span translate=no>_%5e_7_%5e_</span>\u662f\u53ef\u5b66\u4e60\u7684\u53c2\u6570\u3002</p>\n<p>\u6709\u65f6\u8fd8\u4f1a\u4f7f\u7528 GELU (Gaussian Error Linear Unit) \u6fc0\u6d3b\u51fd\u6570\u6765\u4ee3\u66ff ReLU \u3002<span translate=no>_%5e_8_%5e_</span>\u5176\u4e2d<span translate=no>_%5e_9_%5e_</span></p>\n<h3>\u95e8\u63a7\u7ebf\u6027\u5355\u5143</h3>\n<p>\u8fd9\u662f\u4e00\u4e2a\u901a\u7528\u5b9e\u73b0\uff0c\u652f\u6301\u5305\u62ec<a href=\"https://arxiv.org/abs/2002.05202\">\u95e8\u63a7\u7ebf\u6027\u5355\u5143(GLU)</a> \u5728\u5185\u7684\u4e0d\u540c\u53d8\u4f53\u3002\u6211\u4eec\u8fd8\u5bf9\u8fd9\u4e9b\u8fdb\u884c\u4e86\u5b9e\u9a8c\uff1a</p>\n<ul><li><a href=\"glu_variants/experiment.html\">\u4f7f\u7528<span translate=no>_%5e_10_%5e_</span></a>\u7684\u5b9e\u9a8c</li>\n<li><a href=\"glu_variants/simple.html\">\u4ece\u5934\u5f00\u59cb\u7684\u7b80\u5316\u7248\u672c</a></li></ul>\n",
"<p><span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> depending on whether it is gated </p>\n": "<p><span translate=no>_^_0_^_</span>\u6216\u8005<span translate=no>_^_1_^_</span>\u53d6\u51b3\u4e8e\u5b83\u662f\u5426\u6709\u95e8\u63a7</p>\n",
6
-
"<p>Activation function <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6fc0\u6d3b\u529f\u80fd<span translate=no>_^_0_^_</span></p>\n",
"<p>If there is a gate the linear layer to transform inputs to be multiplied by the gate, parameterized by weight <span translate=no>_^_0_^_</span> and bias <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5982\u679c\u6709\u95e8\uff0c\u5219\u8f6c\u6362\u8f93\u5165\u7684\u7ebf\u6027\u5c42\u5c06\u4e58\u4ee5\u95e8\uff0c\u5e76\u901a\u8fc7\u6743\u91cd<span translate=no>_^_0_^_</span>\u548c\u504f\u7f6e\u8fdb\u884c\u53c2\u6570\u5316<span translate=no>_^_1_^_</span></p>\n",
11
-
"<p>Layer one parameterized by weight <span translate=no>_^_0_^_</span> and bias <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7b2c\u4e00\u5c42\u6309\u6743\u91cd<span translate=no>_^_0_^_</span>\u548c\u504f\u5dee\u8fdb\u884c\u53c2\u6570\u5316<span translate=no>_^_1_^_</span></p>\n",
5
+
"<p><span translate=no>_^_0_^_</span> or <span translate=no>_^_1_^_</span> depending on whether it is gated </p>\n": "<p>\u6839\u636e\u662f\u5426\u8fdb\u884c\u95e8\u63a7\uff0c\u8fd4\u56de<span translate=no>_^_0_^_</span>\u6216\u8005<span translate=no>_^_1_^_</span></p>\n",
6
+
"<p>Activation function <span translate=no>_^_0_^_</span> </p>\n": "<p>\u6fc0\u6d3b\u51fd\u6570<span translate=no>_^_0_^_</span></p>\n",
"<p>If there is a gate the linear layer to transform inputs to be multiplied by the gate, parameterized by weight <span translate=no>_^_0_^_</span> and bias <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5982\u679c\u5b58\u5728\u95e8\u63a7\uff0c\u5219\u901a\u8fc7\u7ebf\u6027\u5c42\u5c06\u8f93\u5165\u503c\u4e0e\u95e8\u76f8\u4e58\uff0c\u5e76\u7531\u6743\u91cd<span translate=no>_^_0_^_</span>\u548c\u504f\u7f6e<span translate=no>_^_1_^_</span>\u8fdb\u884c\u53c2\u6570\u5316</p>\n",
11
+
"<p>Layer one parameterized by weight <span translate=no>_^_0_^_</span> and bias <span translate=no>_^_1_^_</span> </p>\n": "<p>\u7b2c\u4e00\u5c42\u7531\u6743\u91cd<span translate=no>_^_0_^_</span>\u548c\u504f\u5dee<span translate=no>_^_1_^_</span>\u8fdb\u884c\u53c2\u6570\u5316</p>\n",
12
12
"<p>Otherwise </p>\n": "<p>\u5426\u5219</p>\n",
- "<p>Whether there is a gate </p>\n": "<p>\u662f\u5426\u6709\u95e8</p>\n",
- "<ul><li><span translate=no>_^_0_^_</span> is the number of features in a token embedding </li>\n<li><span translate=no>_^_1_^_</span> is the number of features in the hidden layer of the FFN </li>\n<li><span translate=no>_^_2_^_</span> is dropout probability for the hidden layer </li>\n<li><span translate=no>_^_3_^_</span> specifies whether the hidden layer is gated </li>\n<li><span translate=no>_^_4_^_</span> specified whether the first fully connected layer should have a learnable bias </li>\n<li><span translate=no>_^_5_^_</span> specified whether the second fully connected layer should have a learnable bias </li>\n<li><span translate=no>_^_6_^_</span> specified whether the fully connected layer for the gate should have a learnable bias</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u4ee4\u724c\u5d4c\u5165\u4e2d\u7684\u8981\u7d20\u6570\u91cf</li>\n<li><span translate=no>_^_1_^_</span>\u662f FFN \u9690\u85cf\u5c42\u4e2d\u7684\u8981\u7d20\u6570\u91cf</li>\n<li><span translate=no>_^_2_^_</span>\u662f\u9690\u85cf\u5c42\u7684\u4e22\u5931\u6982\u7387</li>\n<li><span translate=no>_^_3_^_</span>\u6307\u5b9a\u9690\u85cf\u5c42\u662f\u5426\u4e3a\u95e8\u63a7</li>\n<li><span translate=no>_^_4_^_</span>\u6307\u5b9a\u7b2c\u4e00\u4e2a\u5b8c\u5168\u8fde\u63a5\u7684\u5c42\u662f\u5426\u5e94\u8be5\u6709\u53ef\u5b66\u4e60\u7684\u504f\u5dee</li>\n<li><span translate=no>_^_5_^_</span>\u6307\u5b9a\u7b2c\u4e8c\u4e2a\u5b8c\u5168\u8fde\u63a5\u7684\u5c42\u662f\u5426\u5e94\u8be5\u6709\u53ef\u5b66\u4e60\u7684\u504f\u5dee</li>\n<li><span translate=no>_^_6_^_</span>\u6307\u5b9a\u95e8\u7684\u5168\u8fde\u63a5\u5c42\u662f\u5426\u5e94\u5177\u6709\u53ef\u5b66\u4e60\u7684\u504f\u5dee</li></ul>\n",
- "Documented reusable implementation of the position wise feedforward network.": "\u8bb0\u5f55\u4e86\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc\u7684\u53ef\u91cd\u7528\u5b9e\u73b0\u3002",
+ "<p>Whether there is a gate </p>\n": "<p>\u662f\u5426\u5b58\u5728\u95e8\u63a7</p>\n",
+ "<ul><li><span translate=no>_^_0_^_</span> is the number of features in a token embedding </li>\n<li><span translate=no>_^_1_^_</span> is the number of features in the hidden layer of the FFN </li>\n<li><span translate=no>_^_2_^_</span> is dropout probability for the hidden layer </li>\n<li><span translate=no>_^_3_^_</span> specifies whether the hidden layer is gated </li>\n<li><span translate=no>_^_4_^_</span> specified whether the first fully connected layer should have a learnable bias </li>\n<li><span translate=no>_^_5_^_</span> specified whether the second fully connected layer should have a learnable bias </li>\n<li><span translate=no>_^_6_^_</span> specified whether the fully connected layer for the gate should have a learnable bias</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>\u662f\u6807\u8bb0\u5d4c\u5165\u4e2d\u7684\u7279\u5f81\u6570\u91cf</li>\n<li><span translate=no>_^_1_^_</span>\u662f FFN \u9690\u85cf\u5c42\u4e2d\u7684\u7279\u5f81\u6570\u91cf</li>\n<li><span translate=no>_^_2_^_</span>\u662f\u9690\u85cf\u5c42\u7684 Dropout \u7387</li>\n<li><span translate=no>_^_3_^_</span>\u6307\u5b9a\u4e86\u9690\u85cf\u5c42\u662f\u5426\u4e3a\u95e8\u63a7\u5c42</li>\n<li><span translate=no>_^_4_^_</span>\u6307\u5b9a\u4e86\u7b2c\u4e00\u4e2a\u5168\u8fde\u63a5\u5c42\u662f\u5426\u5e94\u8be5\u5177\u6709\u53ef\u5b66\u4e60\u7684\u504f\u7f6e</li>\n<li><span translate=no>_^_5_^_</span>\u6307\u5b9a\u7b2c\u4e8c\u4e2a\u5168\u8fde\u63a5\u5c42\u662f\u5426\u5e94\u5177\u6709\u53ef\u5b66\u4e60\u7684\u504f\u7f6e</li>\n<li><span translate=no>_^_6_^_</span>\u6307\u5b9a\u95e8\u63a7\u7684\u5168\u8fde\u63a5\u5c42\u662f\u5426\u5e94\u5177\u6709\u53ef\u5b66\u4e60\u7684\u504f\u7f6e</li></ul>\n",
+ "Documented reusable implementation of the position wise feedforward network.": "\u5df2\u8bb0\u5f55\u5e76\u53ef\u91cd\u590d\u4f7f\u7528\u7684\u4f4d\u7f6e\u524d\u9988\u7f51\u7edc\u5b9e\u73b0\u3002",
"<p>Show the target distributions expected by the system. </p>\n": "<p>\u663e\u793a\u7cfb\u7edf\u9884\u671f\u7684\u76ee\u6807\u5206\u5e03\u3002</p>\n",
"<p>Show the target distributions expected by the system. </p>\n": "<p>\u5c55\u793a\u7cfb\u7edf\u671f\u671b\u7684\u76ee\u6807\u5206\u5e03\u3002</p>\n",
"This is an implementation of label smoothing loss, that can be used as an alternative to cross entropy loss for improved accuracy.": "\u8fd9\u662f\u6807\u7b7e\u5e73\u6ed1\u635f\u5931\u7684\u5b9e\u73b0\uff0c\u53ef\u4ee5\u7528\u4f5c\u4ea4\u53c9\u71b5\u635f\u5931\u7684\u66ff\u4ee3\u65b9\u6848\uff0c\u4ee5\u63d0\u9ad8\u51c6\u786e\u6027\u3002"
6
+
"This is an implementation of label smoothing loss, that can be used as an alternative to cross entropy loss for improved accuracy.": "\u8fd9\u662f\u6807\u7b7e\u5e73\u6ed1\u635f\u5931\u7684\u5b9e\u73b0\uff0c\u53ef\u4f5c\u4e3a\u4ea4\u53c9\u71b5\u635f\u5931\u7684\u66ff\u4ee3\u54c1\u4ee5\u63d0\u9ad8\u51c6\u786e\u6027\u3002"