This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit 9aa8330

Added more clarification on proposed changes and user impact
1 parent 14c00e2 commit 9aa8330

1 file changed, +15 -8 lines changed

rfcs/20210218-grpc-fail-fast-use-caller.md

Lines changed: 15 additions & 8 deletions
@@ -5,7 +5,7 @@
 | **RFC #** | [355](https://github.com/tensorflow/community/pull/355) |
 | **Author(s)** | Haoyu Zhang ([email protected]) |
 | **Sponsor** | Bramandia Ramadhana ([email protected]) |
-| **Updated** | 2021-02-18 |
+| **Updated** | 2021-03-04 |
 
 ## Objective
 
@@ -23,16 +23,18 @@ with remote servers. It can be configured to the following values:
 
 * `true`, which immediately reports an `UnavailableError` when there is a
 connection issue for all RPCs, regardless of the per-RPC configurations;
-* `false`, which will (in most cases) hang until successfully connected to the
-remote server for all RPCs (see
+* `false`, which blocks and waits until successfully connected to the remote
+server (see
 [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)),
 regardless of the per-RPC configurations;
-* `use_caller`, which is `true` for RPCs used in distributed execution (such
-as `RecvTensor`, `RunComponentFunction`), and `false` for RPCs in
+* `use_caller`, which allows customization per RPC basis; in the current
+implementation, `true` is used for RPCs used in distributed execution (such
+as `RecvTensor`, `RunComponentFunction`), and `false` is used for RPCs in
 initializing remote execution environments (e.g., `GetStatus`).
 
-The default value of `GRPC_FAIL_FAST` is set to `false` currently. As a result,
-users need to
+The default value of `GRPC_FAIL_FAST` is currently set to `false`. One of the
+consequences is that users and/or high-level distribute libraries (such as
+`ParameterServerStrategy`) need to
 [manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
 to receive reasonable exceptions when workers fail / get preempted; otherwise
 the cluster will hang and cannot recover from failures.
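
Reviewer note: the hunk above describes the three accepted values of `GRPC_FAIL_FAST` and the manual configuration that users currently need. A minimal sketch of that manual step is shown below. The helper name `configure_grpc_fail_fast` and the `environ` parameter are hypothetical and only illustrate the idea; the sketch also assumes the variable has to be set before TensorFlow sets up its connections to the cluster, which the RFC text does not state explicitly.

```python
import os

# The three values listed in the RFC text.
VALID_FAIL_FAST_VALUES = ("true", "false", "use_caller")


def configure_grpc_fail_fast(value="use_caller", environ=os.environ):
    """Set GRPC_FAIL_FAST in the given environment mapping (hypothetical helper).

    Assumption: this runs before TensorFlow connects to the cluster, so
    that the setting is picked up when the connections are created.
    """
    if value not in VALID_FAIL_FAST_VALUES:
        raise ValueError(
            "GRPC_FAIL_FAST must be one of %s, got %r"
            % (", ".join(VALID_FAIL_FAST_VALUES), value)
        )
    environ["GRPC_FAIL_FAST"] = value
    return environ["GRPC_FAIL_FAST"]
```

Calling `configure_grpc_fail_fast()` at the top of a training script, before any distribution strategy is created, would mirror the manual configuration the RFC says is needed while the default remains `false`.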
@@ -72,12 +74,17 @@ deployment.
 
 ## User Impact
 
+When this change is made to the codebase, subsequent TensorFlow 2 releases will
+have this new default behavior. TensorFlow 1.x users who use the stable releases
+(e.g., TensorFlow 1.15 or earlier) should not be affected by this change. Users
+who build TensorFlow directly from source at the head will also be affected.
+
 Most users should see the new default as expected behaviors in distributed
 execution. Users can take advantage of the built-in fault tolerance support in
 `ParameterServerStrategy` without having to make changes to the environment
 variable configurations. In other setups, exceptions will be raised to the model
 training loop code, where users can catch and handle these errors with custom
-logic.
+logic instead of hanging indefinitely.
 
 Certain users might receive "false alarms" if there are transient connection
 errors to the remote servers. We expect this to happen very rarely since GRPC
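
Reviewer note: the User Impact text says that, outside `ParameterServerStrategy`, exceptions reach the training loop, where users can handle them with custom logic. One possible shape of such logic is a bounded retry around a training step, sketched below. To keep the sketch runnable without TensorFlow, `UnavailableError` here is a local stand-in class; in a real program one would presumably catch `tf.errors.UnavailableError` instead. The names `run_with_retries`, `step_fn`, `max_attempts`, and `backoff_secs` are hypothetical.

```python
import time


class UnavailableError(Exception):
    """Local stand-in for tf.errors.UnavailableError (assumption for this sketch)."""


def run_with_retries(step_fn, max_attempts=3, backoff_secs=0.0):
    """Call step_fn, retrying when it raises UnavailableError.

    Re-raises the error once max_attempts calls have failed, so the
    caller sees the failure instead of waiting forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step_fn()
        except UnavailableError:
            if attempt == max_attempts:
                raise
            # Linear backoff before the next attempt.
            time.sleep(backoff_secs * attempt)
```

A bounded retry like this could also absorb the transient "false alarm" case mentioned above, since a short-lived connection error would only cost one extra attempt.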
