This repository was archived by the owner on Jul 10, 2025. It is now read-only.
rfcs/20210119-determinism.md
+47 -16 (47 additions & 16 deletions)
@@ -6,20 +6,19 @@
6
6
|**Sponsor**| Sanjoy Das (Google) |
7
7
|**Updated**| 2021-01-31 |
8
8
9
-
10
9
## Objective
11
10
Allow users to enable deterministic behavior in TensorFlow. This means if the user runs a TensorFlow program multiple times, the model outputs and weights will be the same each time. Determinism will be supported on CPUs and GPUs.
12
-
11
+
13
12
To get deterministic behavior, users must do the following:
14
13
15
14
* Enable determinism using the API proposed in this doc.
16
-
* Use same hardware in every run.
17
-
* Use the same software environment every run (OS, checkpoints, version of TF, environmental variables, etc).
18
-
* Not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module or using multiple threads/processes in ways that influence TensorFlow’s behavior.
19
-
* Do not use nondeterministic custom ops.
15
+
* Use the same hardware configuration in every run.
16
+
* Use the same software environment in every run (OS, checkpoints, versions of CUDA and TF, environment variables, etc.).
17
+
* Do not use constructs outside TensorFlow that are nondeterministic, such as Python’s `random` module (without a fixed seed) or multiple threads/processes in ways that influence TensorFlow’s behavior.
18
+
* Do not use nondeterministic custom ops.
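The requirement about Python’s `random` module can be illustrated with a minimal sketch. The helper name `sample_values` and the seed value are illustrative assumptions, not part of the RFC; the point is only that a fixed seed makes the sequence repeat across runs.

```python
import random


def sample_values(seed, count=5):
    """Return `count` pseudo-random floats from a generator seeded with `seed`.

    `sample_values` is a hypothetical helper used only for illustration.
    """
    # A private generator avoids depending on the module-level global state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


first_run = sample_values(seed=1234)
second_run = sample_values(seed=1234)

# With the same seed, both calls produce the same sequence.
print(first_run == second_run)
```

Without a fixed seed, `random.Random()` is seeded from the OS or the clock, so values fed into a TensorFlow program would differ between runs.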
20
19
21
20
## Motivation
22
-
There are several mission critical applications in life sciences, finance and automation that require deterministic behavior. Determinism is required so that the behavior of these applications can be accurately predicted & demonstrated in a variety of scenarios.
21
+
There are several mission-critical applications in medicine, finance and automation that require deterministic behavior. Determinism is required so that the behavior of these applications can be accurately predicted and demonstrated in a variety of scenarios.
23
22
24
23
Lack of determinism prevents companies from launching products using models developed in TF. For a subset of these industries, deterministic behavior is a regulatory requirement.
25
24
@@ -38,18 +37,32 @@ For ops which do not yet have a deterministic implementation, TensorFlow will ra
38
37
Enabling deterministic execution does not automatically cause a user’s program to become deterministic. If users use nondeterministic constructs outside TensorFlow, such as threads/processes, in ways that influence TensorFlow’s behavior, their program will not be deterministic. To ensure their program is deterministic, users must both enable deterministic execution within TensorFlow and remove any sources of nondeterminism outside TensorFlow.
39
38
40
39
### Existing Flags
41
-
Multiple environmental variables exist today that control determinism. As part of this change, we will deprecate then remove the following:
40
+
There are currently two environment variables in TensorFlow to enable deterministic op functionality.
41
+
42
+
The first environment variable is `TF_CUDNN_DETERMINISTIC`. When set to `'true'` or `'1'`, this:
43
+
44
+
* makes the selection of cuDNN convolution algorithms deterministic,
45
+
* selects deterministic gradient algorithms for `tf.nn.conv*d` and `tf.keras.layers.Conv*D`,
46
+
* selects deterministic gradient algorithms for `tf.nn.max_pool*d` and `tf.keras.layers.MaxPool*D`, and
47
+
* selects a deterministic gradient algorithm for `tf.nn.ctc_loss`.
48
+
49
+
The second environment variable is `TF_DETERMINISTIC_OPS`. This supersedes and replaces `TF_CUDNN_DETERMINISTIC`: it provides the same functionality and, when set to `'true'` or `'1'`, also:
50
+
51
+
* selects deterministic gradient kernels for `tf.nn.bias_add` and the many Keras layers that apply a bias,
52
+
* selects a deterministic algorithm for XLA reductions on GPU, and
53
+
* selects a deterministic gradient algorithm for `tf.image.resize` with `method=ResizeMethod.BILINEAR` and `tf.keras.layers.UpSampling2D` with `interpolation='bilinear'`.
54
+
55
+
Calling `tf.config.enable_deterministic_execution(True)` will be equivalent to setting `TF_DETERMINISTIC_OPS` to `'true'` or `'1'` plus the additional functionality described in this RFC.
42
56
43
-
* TF_DETERMINISTIC_OPS
44
-
* TF_CUDNN_DETERMINISTIC
57
+
The two environment variables will first be deprecated and then removed.
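As a rough illustration of how the existing flag is switched on, the sketch below sets `TF_DETERMINISTIC_OPS` in the process environment and checks it with a hypothetical helper, `determinism_requested`. The helper and its case-insensitive matching are assumptions for illustration; TensorFlow’s own parsing of the variable may differ.

```python
import os


def determinism_requested(environ=None):
    """Return True if TF_DETERMINISTIC_OPS is set to 'true' or '1'.

    Hypothetical helper: it mirrors the accepted values described in the
    text, not TensorFlow's actual parsing code.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("TF_DETERMINISTIC_OPS", "")
    # Case-insensitive comparison is an assumption made for this sketch.
    return value.strip().lower() in ("true", "1")


# Enable the flag for the current process and any child processes.
os.environ["TF_DETERMINISTIC_OPS"] = "1"
print(determinism_requested())
```

Under the proposal, a single call to `tf.config.enable_deterministic_execution(True)` would take the place of setting this variable.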
45
58
46
59
tf.data also has flags for determinism. The system will raise an error if the flags are out of sync, i.e. if `deterministic_execution_enabled` is true but the tf.data option (`tf.data.Options.experimental_deterministic`) is set to `False`. We’ll also add the necessary checks for `Dataset.map` and `Dataset.interleave`. See the [Random ops](#random-ops) section for how random Datasets, such as `tf.data.experimental.RandomDataset`, are handled.
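The out-of-sync check described above could look roughly like the following sketch. The function name `check_tf_data_determinism` and the error text are hypothetical; the RFC does not specify how the check is implemented. Treating `None` as "option not set" is also an assumption.

```python
def check_tf_data_determinism(determinism_enabled, data_option_deterministic):
    """Raise if determinism is enabled but the tf.data option disables it.

    `data_option_deterministic` stands in for the value of
    `tf.data.Options.experimental_deterministic`: True, False, or None
    (None is assumed here to mean the option was left unset).
    """
    if determinism_enabled and data_option_deterministic is False:
        raise ValueError(
            "Deterministic execution is enabled, but the tf.data option "
            "experimental_deterministic is set to False."
        )


# Consistent settings pass silently.
check_tf_data_determinism(determinism_enabled=True, data_option_deterministic=True)

# Conflicting settings are rejected.
try:
    check_tf_data_determinism(determinism_enabled=True, data_option_deterministic=False)
except ValueError as err:
    print(err)
```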
47
60
48
61
### Grappler changes
49
62
Grappler graph optimizations may add nondeterministic behavior. In particular some optimizations will time out if they take too long to run. When determinism is enabled, these timeouts will be disabled.
50
63
51
64
### Random ops
52
-
Legacy random ops, such as `tf.random.normal`, are not deterministic if no seed is set, and so such ops will raise an error when determinism is enabled. To fix, the user should set a global seed with `tf.random.set_seed`. Since most models use legacy random ops, in practice users must call `tf.random.set_seed` when enabling deterministic behavior. Alternatively, users can pass a seed to every individual random operation, but doing so is more inconvenient.
65
+
Legacy random ops, such as `tf.random.normal`, are not deterministic if no seed is set, and so such ops will raise an error when determinism is enabled. To fix, the user should set a global seed with `tf.random.set_seed`. Since most models use legacy random ops (for variable initialization and various other uses), in practice users must call `tf.random.set_seed` when enabling deterministic behavior. Alternatively, users can pass a seed to every individual random operation, but doing so is more inconvenient.
53
66
54
67
Certain random ops, such as `tf.image.sample_distorted_bounding_box` and `tf.nn.fractional_max_pool`, ignore the global seed if a seed is not explicitly passed. For such ops, setting the global seed is not enough to avoid the error, so users must pass a seed directly to the op.
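The seeding rules in this section can be summarized as a small decision function. This is a sketch of the described behavior, not TensorFlow code: `check_random_op` and its parameters are hypothetical names, and `honors_global_seed=False` models ops such as `tf.image.sample_distorted_bounding_box` that ignore the global seed.

```python
def check_random_op(determinism_enabled, global_seed=None, op_seed=None,
                    honors_global_seed=True):
    """Raise if a random op would be nondeterministic under the rules above.

    Hypothetical model: an op is seeded if it receives its own seed, or if
    it honors the global seed and a global seed has been set.
    """
    has_usable_seed = op_seed is not None or (
        honors_global_seed and global_seed is not None
    )
    if determinism_enabled and not has_usable_seed:
        raise RuntimeError("Random op has no seed while determinism is enabled.")


# A legacy op with a global seed is accepted.
check_random_op(determinism_enabled=True, global_seed=42)

# An op that ignores the global seed still needs its own seed.
try:
    check_random_op(determinism_enabled=True, global_seed=42,
                    honors_global_seed=False)
except RuntimeError as err:
    print(err)
```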
55
68
@@ -65,21 +78,19 @@ No error will be raised if a random op or generator is run before determinism is
65
78
Use of parameter servers adds nondeterministic behavior. If a model constructs a `ParameterServerStrategy`, TensorFlow will raise an error. We’ll also note this in the documentation for the flag.
66
79
67
80
### Op Review and changes
68
-
As part of the implementation, we will review all ops to make a determination of their behavior (deterministic vs nondeterministic). Some of the ops that are known to be nondeterministic, at least when running on a GPU, include:
81
+
As part of the implementation, we will review all ops to determine their behavior (deterministic vs nondeterministic). Ops that are known to operate nondeterministically, at least when running on a GPU, include the following:
69
82
70
83
* `tf.nn.softmax_cross_entropy_with_logits`
71
84
* `tf.nn.sparse_softmax_cross_entropy_with_logits`
72
85
* `tf.image.resize` gradient with `method=ResizeMethod.NEAREST`
* `tf.image.crop_and_resize` gradient to both image and boxes
75
-
* `tf.sparse.sparse_dense_matmul` forward
76
88
* `tf.math.unsorted_segment_mean`, `tf.math.unsorted_segment_prod` and `tf.math.unsorted_segment_sqrt_n`; all forward
77
89
* `tf.sparse.sparse_dense_matmul`
78
90
91
+
We have a list of other ops that use CUDA's `atomicAdd` and are therefore likely to be sources of nondeterminism. Once it has been confirmed that those ops function nondeterministically, they will be made to throw errors when determinism is enabled. In the long term, we can add a deterministic implementation to such ops.
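The reason `atomicAdd` is a likely source of nondeterminism is that floating-point addition is not associative, and atomic updates from parallel threads can land in a different order on each run. The sketch below shows the effect on the CPU with plain Python floats; `sum_in_order` is an illustrative helper, not part of TensorFlow or CUDA.

```python
def sum_in_order(values, order):
    """Accumulate `values` in the index order given by `order`."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


values = [0.1, 0.2, 0.3]

forward = sum_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = sum_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# The two accumulation orders give results that differ in the last bits.
print(forward == backward)  # False
print(forward, backward)
```

A deterministic implementation has to fix the accumulation order, for example by reducing in a tree with a fixed shape, which is why such kernels can be slower.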
79
92
80
-
`tf.image.sample_distorted_bounding_box` has been observed to behave nondeterministically unless you set its seed parameter, even if you call tf.random.set_seed. We will review this Op as part the change. Another case that needs review is "pulling a random number from a PRNG before its state has been initialized".
81
-
82
-
Given the large number of ops involved, there is a chance that we might omit raising an error for a nondeterministic Op.
93
+
Given the large number of ops involved, there is a chance that we might omit raising an error for a nondeterministic Op. We will fix such cases as they arise.
83
94
84
95
## Discussion
85
96
@@ -101,5 +112,25 @@ We don’t want TensorFlow developers to have to worry about breaking determinis
101
112
* Sessions are nondeterministic, and making them deterministic requires having the executor run ops in a consistent order. It is probably not worth making sessions deterministic.
102
113
* If performant, we could potentially enable determinism by default, but without raising an error for nondeterministic ops.
103
114
115
+
## Appendix
116
+
117
+
### Existing environment variables
118
+
119
+
There are currently two environment variables in TensorFlow to enable deterministic op functionality.
120
+
121
+
The first environment variable is `TF_CUDNN_DETERMINISTIC`. When set to `'true'` or `'1'`, this:
122
+
123
+
* makes the selection of cuDNN convolution algorithms deterministic,
124
+
* selects deterministic gradient algorithms for `tf.nn.conv*d` and `tf.keras.layers.Conv*D`,
125
+
* selects deterministic gradient algorithms for `tf.nn.max_pool*d` and `tf.keras.layers.MaxPool*D`, and
126
+
* selects a deterministic gradient algorithm for `tf.nn.ctc_loss`.
127
+
128
+
The second environment variable is `TF_DETERMINISTIC_OPS`. This supersedes and replaces `TF_CUDNN_DETERMINISTIC`: it implements the same functionality and, when set to `'true'` or `'1'`, also:
129
+
130
+
* selects deterministic gradient kernels for `tf.nn.bias_add` and the many Keras layers that apply a bias,
131
+
* selects a deterministic algorithm for XLA reductions on GPU, and
132
+
* selects a deterministic gradient algorithm for `tf.image.resize` with `method=ResizeMethod.BILINEAR` and `tf.keras.layers.UpSampling2D` with `interpolation='bilinear'`.
104
133
134
+
Calling `tf.config.enable_deterministic_execution(True)` will be equivalent to setting `TF_DETERMINISTIC_OPS` to `'true'` or `'1'` plus the additional functionality described in this RFC.
105
135
136
+
The two environment variables will first be deprecated and then removed.