Commit f713f02

committed
spell
1 parent a2d4e4e commit f713f02

File tree

1 file changed: +12 -9 lines changed

_doc/technical/plot_parallelized_reduction.py

Lines changed: 12 additions & 9 deletions
@@ -2,8 +2,8 @@
 Reproducible Parallelized Reduction is difficult
 ================================================

-A reduction is a frequent operation in neural network. It appears in layer normalization,
-softmax. Because of the float precision, the result of the computation
+A reduction is a frequent operation with neural networks. It appears in layer normalization,
+softmax... Because of the float precision, the result of the computation
 changes based on the order of the elements. The following examples show the variation
 based on different hypotheses on the vector distribution.
 We consider a vector :math:`X = (x_1, ..., x_n)`.
@@ -19,11 +19,13 @@

     norm(X)_i = \\frac{ X_i - \\mathbb{E}X}{ \\sqrt{ \\mathbb{V}X}}

-We draw 128 random permutation of X. The average or mean should not change.
-And the normalized vector should have the same value. In the first case, we compute
+With :math:`\\mathbb{E}X = mean(X)`,
+:math:`\\mathbb{V}X = mean\\left(\\left(X - mean(X)\\right)^2\\right)`.
+We draw 128 random permutations of X. The average or mean should not change.
+And the normalized vector should have the same values. In the first case, we compute
 the difference between the highest and the lowest values obtained for the average.
 In the second case, we look for the maximum difference between the original normalized
-vector and the permuted one (both sorted).
+vector and the permuted one, both sorted.

 The computation code
 ++++++++++++++++++++
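
Not part of the commit, but to make the experiment it describes concrete: a minimal sketch of the permutation test, assuming numpy; the distribution, sizes, and variable names here are illustrative only, not the script's actual code.

import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(2.0, 2**20).astype(np.float32)

# Draw 128 random permutations and compare the means: mathematically
# identical, but float32 addition is not associative, so they differ.
means = [x[rng.permutation(x.size)].mean() for _ in range(128)]
print(max(means) - min(means))  # small but non-zero spread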
@@ -144,7 +146,8 @@ def make_value(base, value):
 mean["name"] = "fixed"
 print(mean)

-
+# %%
+# And the normalized vector.
 ln = compute(values, layer_norm)
 ln["name"] = "fixed"
 DATA.append(ln.reset_index(drop=True).max(axis=0))
@@ -163,7 +166,8 @@ def make_value(base, value):
 mean["name"] = "normal"
 print(mean)

-
+# %%
+# And the normalized vector.
 ln = compute(values, layer_norm)
 ln["name"] = "pareto"
 DATA.append(ln.reset_index(drop=True).max(axis=0))
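
Both hunks call compute(values, layer_norm), which is defined earlier in the script and not shown in this diff. For orientation only, and as an assumption about that helper rather than its actual definition, a layer normalization matching the docstring formula could look like:

import numpy as np

def layer_norm(x: np.ndarray) -> np.ndarray:
    # norm(X)_i = (X_i - E[X]) / sqrt(V[X]), as in the docstring above.
    mean = x.mean()
    var = ((x - mean) ** 2).mean()
    return (x - mean) / np.sqrt(var)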
@@ -175,7 +179,6 @@ def make_value(base, value):
 #
 # We consider the maximum difference obtained for any sample size.

-print(DATA)
 df = pandas.DataFrame(DATA).set_index("name")
 print(df)

@@ -192,7 +195,7 @@ def make_value(base, value):
 #
 # Some of the vectors have 500 values, 16x32x1024x1024. A layer normalization
 # does 16x32x1024 ~ 2M reductions, over 20 layers.
-# When a deep neural network is computed with a difference code,
+# When a deep neural network is computed with a different code
 # doing a different parallelization (GPU/CPU for example),
 # the order of the reduction may change and therefore,
 # some errors will appear and propagate.
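
As a standalone illustration of that last point (not from the commit; assumes numpy), summing the same float32 vector sequentially and in chunks, the way a parallel reduction on another device might, already produces slightly different results:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2**20).astype(np.float32)

acc = np.float32(0.0)
for v in x:  # one accumulation order: strictly sequential
    acc += v
chunked = x.reshape(1024, 1024).sum(axis=1).sum()  # another order: per-chunk partial sums

print(acc, chunked, abs(acc - chunked))  # the two results disagree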
