Added more clarification on proposed changes and user impact

haoyuz · haoyuz · commit 9aa833016b01 · 2021-03-04T11:24:48.000-08:00
diff --git a/rfcs/20210218-grpc-fail-fast-use-caller.md b/rfcs/20210218-grpc-fail-fast-use-caller.md
@@ -5,7 +5,7 @@
 | **RFC #**     | [355](https://github.com/tensorflow/community/pull/355) |
 | **Author(s)** | Haoyu Zhang (haoyuzhang@google.com)                     |
 | **Sponsor**   | Bramandia Ramadhana (bramandia@google.com)              |
-| **Updated**   | 2021-02-18                                              |
+| **Updated**   | 2021-03-04                                              |
 
 ## Objective
 
@@ -23,16 +23,18 @@ with remote servers. It can be configured to the following values:
 
 *   `true`, which immediately reports an `UnavailableError` when there is a
     connection issue for all RPCs, regardless of the per-RPC configurations;
-*   `false`, which will (in most cases) hang until successfully connected to the
-    remote server for all RPCs (see
+*   `false`, which blocks and waits until successfully connected to the remote
+    server (see
     [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)),
     regardless of the per-RPC configurations;
-*   `use_caller`, which is `true` for RPCs used in distributed execution (such
-    as `RecvTensor`, `RunComponentFunction`), and `false` for RPCs in
+*   `use_caller`, which allows customization on a per-RPC basis; in the current
+    implementation, `true` is used for RPCs used in distributed execution (such
+    as `RecvTensor`, `RunComponentFunction`), and `false` is used for RPCs in
     initializing remote execution environments (e.g., `GetStatus`).
 
-The default value of `GRPC_FAIL_FAST` is set to `false` currently. As a result,
-users need to
+The default value of `GRPC_FAIL_FAST` is currently set to `false`. One of the
+consequences is that users and/or high-level distribution libraries (such as
+`ParameterServerStrategy`) need to
 [manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
 to receive reasonable exceptions when workers fail / get preempted; otherwise
 the cluster will hang and cannot recover from failures.
@@ -72,12 +74,17 @@ deployment.
 
 ## User Impact
 
+When this change is made to the codebase, subsequent TensorFlow 2 releases will
+have the new default behavior, as will builds made directly from source at
+head. TensorFlow 1.x users on stable releases (e.g., TensorFlow 1.15 or
+earlier) should not be affected by this change.
+
 Most users should see the new default as expected behaviors in distributed
 execution. Users can take advantage of the built-in fault tolerance support in
 `ParameterServerStrategy` without having to make changes to the environment
 variable configurations. In other setups, exceptions will be raised to the model
 training loop code, where users can catch and handle these errors with custom
-logic.
+logic instead of hanging indefinitely.
 
 Certain users might receive "false alarms" if there are transient connection
 errors to the remote servers. We expect this to happen very rarely since GRPC