 Reproducible Parallelized Reduction is difficult
 ================================================
 
-A reduction is a frequent operation in neural network. It appears in layer normalization,
-softmax. Because of the float precision, the result of the computation
+A reduction is a frequent operation with neural networks. It appears in layer normalization,
+softmax... Because of the float precision, the result of the computation
 changes based on the order of the elements. The following examples show the variation
 based on different hypotheses on the vector distribution.
 We consider a vector :math:`X = (x_1, ..., x_n)`.
 
     norm(X)_i = \\frac{X_i - \\mathbb{E}X}{\\sqrt{\\mathbb{V}X}}
 
-We draw 128 random permutation of X. The average or mean should not change.
-And the normalized vector should have the same value. In the first case, we compute
+With :math:`\\mathbb{E}X = mean(X)`,
+:math:`\\mathbb{V}X = mean\\left(\\left(X - mean(X)\\right)^2\\right)`.
+We draw 128 random permutations of X. The average or mean should not change.
+And the normalized vector should have the same values. In the first case, we compute
 the difference between the highest and the lowest values obtained for the average.
 In the second case, we look for the maximum difference between the original normalized
-vector and the permuted one (both sorted).
+vector and the permuted one, both sorted.
 
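The instability described above is easy to reproduce on its own: summing the same float32 vector in 128 random orders already yields several distinct results. A minimal sketch, separate from the script this diff modifies; the vector size and the Pareto distribution are arbitrary choices, not taken from the original code.

```python
import numpy as np

# Sketch of the effect: the same float32 vector summed in 128 random
# orders does not always produce the same float32 result.
rng = np.random.default_rng(0)
x = rng.pareto(2.0, 2**16).astype(np.float32)

sums = [float(np.sum(rng.permutation(x))) for _ in range(128)]
spread = max(sums) - min(sums)
print(f"highest sum - lowest sum = {spread}")
```

The spread shrinks or grows with the vector size and the heaviness of the distribution's tail, which is exactly what the experiment below measures.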
 The computation code
 ++++++++++++++++++++
@@ -144,7 +146,8 @@ def make_value(base, value):
 mean["name"] = "fixed"
 print(mean)
 
-
+# %%
+# And the normalized vector.
 ln = compute(values, layer_norm)
 ln["name"] = "fixed"
 DATA.append(ln.reset_index(drop=True).max(axis=0))
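The helpers `compute` and `layer_norm` called in these hunks are defined earlier in the script and are not part of this excerpt. A hedged sketch consistent with the formula in the docstring; the signatures and the dictionary-of-vectors input are assumptions, not the author's exact code.

```python
import numpy as np
import pandas as pd


def layer_norm(x):
    # norm(X)_i = (x_i - E[X]) / sqrt(V[X]), computed in float32.
    x = x.astype(np.float32)
    mean = x.mean(dtype=np.float32)
    var = ((x - mean) ** 2).mean(dtype=np.float32)
    return (x - mean) / np.sqrt(var)


def compute(values, fct, n_perm=128):
    # For each named vector, apply fct to n_perm random permutations and
    # record the maximum difference with the original result, both sorted.
    rng = np.random.default_rng(0)
    rows = {}
    for name, x in values.items():
        base = np.sort(fct(x))
        rows[name] = max(
            np.abs(np.sort(fct(rng.permutation(x))) - base).max()
            for _ in range(n_perm)
        )
    return pd.DataFrame([rows])
```

Sorting both vectors before comparing removes the permutation itself and keeps only the rounding differences, as the docstring describes.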
@@ -163,7 +166,8 @@ def make_value(base, value):
 mean["name"] = "normal"
 print(mean)
 
-
+# %%
+# And the normalized vector.
 ln = compute(values, layer_norm)
 ln["name"] = "pareto"
 DATA.append(ln.reset_index(drop=True).max(axis=0))
@@ -175,7 +179,6 @@ def make_value(base, value):
 #
 # We consider the maximum difference obtained for any sample size.
 
-print(DATA)
 df = pandas.DataFrame(DATA).set_index("name")
 print(df)
 
@@ -192,7 +195,7 @@ def make_value(base, value):
 #
 # Some of the vectors have about 500M values, 16x32x1024x1024. A layer normalization
 # does 16x32x1024 ~ 0.5M reductions, over 20 layers.
-# When a deep neural network is computed with a difference code,
+# When a deep neural network is computed with a different code
 # doing a different parallelization (GPU/CPU for example),
 # the order of the reduction may change and therefore,
 # some errors will appear and propagate.
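The last remark can be illustrated with two reduction orders on the same float32 data: a strict left-to-right loop against numpy's pairwise summation, standing in for two different parallelization schemes. An illustrative sketch, not taken from the original script.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2**16).astype(np.float32)

# Strict left-to-right accumulation in float32, as a sequential loop would do.
sequential = np.float32(0.0)
for v in x:
    sequential = np.float32(sequential + v)

# numpy reduces float32 arrays with pairwise (tree) summation,
# closer to what a parallel reduction computes.
pairwise = x.sum()

# float64 accumulation as a reference, close to the exact value.
reference = x.astype(np.float64).sum()

print(f"sequential={sequential} pairwise={pairwise} reference={reference}")
```

Neither order is wrong; both are valid float32 reductions of the same data, which is why a model run on two backends cannot be expected to match bit for bit.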