This repository was archived by the owner on Jul 10, 2025. It is now read-only.
-The default value of `GRPC_FAIL_FAST` is set to `false` currently. As a result,
-users need to
+The default value of `GRPC_FAIL_FAST` is currently set to `false`. One of the
+consequences is that users and/or high-level distribute libraries (such as
+`ParameterServerStrategy`) need to
 [manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
 to receive reasonable exceptions when workers fail / get preempted; otherwise
 the cluster will hang and cannot recover from failures.
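
For readers unfamiliar with the manual configuration this hunk describes, the following is a minimal sketch of setting the variable from Python. The helper name `ensure_grpc_fail_fast` and its `env` parameter are hypothetical, and the value `"use_caller"` is an assumption about what a distribution library might set; neither is stated in the diff itself.

```python
import os


def ensure_grpc_fail_fast(value="use_caller", env=None):
    """Set GRPC_FAIL_FAST unless the user has already configured it.

    `value` defaults to "use_caller", which is an assumption for this
    sketch; the diff above does not name the value to use.
    `env` is any dict-like mapping; it defaults to the process
    environment and exists mainly so the helper can be tested.
    Returns the value that is in effect after the call.
    """
    env = os.environ if env is None else env
    # setdefault keeps an explicit user setting and only fills in
    # the variable when it is missing.
    env.setdefault("GRPC_FAIL_FAST", value)
    return env["GRPC_FAIL_FAST"]
```

A caller would presumably invoke `ensure_grpc_fail_fast()` before any cluster connection is created, since an environment variable set afterwards may not be picked up. Using `setdefault` rather than plain assignment is a design choice of this sketch so that an explicit user override is respected.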
@@ -72,12 +74,17 @@ deployment.
 
 ## User Impact
 
+When this change is made to the codebase, subsequent TensorFlow 2 releases will
+have this new default behavior. TensorFlow 1.x users who use the stable releases
+(e.g., TensorFlow 1.15 or earlier) should not be affected by this change. Users
+who build TensorFlow directly from source at the head will also be affected.
+
 Most users should see the new default as expected behaviors in distributed
 execution. Users can take advantage of the built-in fault tolerance support in
 `ParameterServerStrategy` without having to make changes to the environment
 variable configurations. In other setups, exceptions will be raised to the model
 training loop code, where users can catch and handle these errors with custom
-logic.
+logic instead of hanging indefinitely.
 
 Certain users might receive "false alarms" if there are transient connection
 errors to the remote servers. We expect this to happen very rarely since GRPC