 # Distributed tf.data service

-| Status        | Proposed |
+| Status        | Accepted |
 | :------------ | :------------------------------------------------------ |
 | **RFC #**     | [195](https://github.com/tensorflow/community/pull/195) |
 | **Author(s)** | Andrew Audibert ([email protected]), Rohan Jain ([email protected]) |
 | **Sponsor**   | Jiri Simsa ([email protected]) |
-| **Updated**   | 2019-01-24 |
+| **Updated**   | 2019-01-30 |

 ## Objective
@@ -143,14 +143,16 @@ here to implement datasets which produce per-replica elements, enabling
 idiomatic control flow.

 ```python
-def tf.data.experimental.service.distribute(address):
+def tf.data.experimental.service.distribute(address_or_resolver):
   """Marks that a dataset should be processed by the tf.data service.

   ds = ...  # dataset to distribute
-  ds = ds.apply(tf.data.experimental.service.distribute(address))
+  ds = ds.apply(
+      tf.data.experimental.service.distribute(address_or_resolver))

   Args:
-    address: The address of the tf.data service master.
+    address_or_resolver: The address of the tf.data service master, or a
+      cluster resolver that can be used to determine the master address.

   Returns:
     A function that can be passed to `dataset.apply()`.
@@ -622,22 +624,25 @@ service. We will also provide a tutorial for using the tf.data service.
 *   How should we communicate that distributing a dataset will change the order
     in which elements are processed? If users' datasets rely on elements being
     processed in a certain order, they could face unpleasant surprises.
-    -   Current plan is to address this through documentation.
+    -   Final decision: Address this through documentation.
 *   Should we support splitting `skip`, `take`, and `scan` by having them
     operate at a per-task level (e.g. skip or take the first `N` elements within
     each task)?
-    -   Leaning towards supporting these operations at a per-task level. This is
-        consistent with how skip/take/scan behave today when using distribution
-        strategies to distribute a dataset.
+    -   Final decision: Prohibit distributing these transformations, and tell
+        users to instead use these transformations *after* applying the
+        `distribute` transformation.
 *   Is there a more user-friendly way to share iteration ids across consumers?
     Distribution strategy is well-equipped with collective ops to share the
     iteration ids, but sharing the iteration id could be a heavy burden for
     some users.
-    -   Distributing iteration ids is simple in the common case where a single
-        process builds the graph. If users are advanced enough to do distributed
-        training without distribution strategies, they will likely have a
-        different mechanism available for distributing iteration ids.
+    -   Final decision: It is a reasonable expectation for users to either use
+        distribution strategies, or distribute their own iteration ids.
+        TensorFlow will soon have public APIs for collective operations that
+        would make it easy to broadcast iteration ids.
 *   Can `service.distribute` take a `ClusterResolver` so that the master
     hostname isn't baked into the dataset definition?
-    -   We can achieve this by having the `distribute` transformation take a
-        master_address_or_resolver.
+    -   Final decision: Accept `master_address_or_resolver`, and wait to resolve
+        the master address until iteration begins. The `ClusterResolver` will be
+        stored in the Python `Dataset` object. In the future, we may want C++
+        implementations of `ClusterResolver` so that we can represent the
+        resolver within the dataset graph.
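Taken together, the decisions above (apply `skip`/`take` after `distribute`, and accept either a master address or a resolver) suggest usage along these lines. This is an illustrative sketch against the API *as proposed in this RFC*; the master address is a made-up placeholder, and running it would require an actual tf.data service deployment:

```python
import tensorflow as tf

# Hypothetical master address; per the decision above, a ClusterResolver
# could be passed here instead, and resolution would be deferred until
# iteration begins.
MASTER_ADDRESS = "grpc://dataservice-master:5000"  # placeholder

ds = tf.data.Dataset.range(1000)
ds = ds.map(lambda x: 2 * x)

# Hand off processing to the tf.data service.
ds = ds.apply(tf.data.experimental.service.distribute(MASTER_ADDRESS))

# Apply `take` *after* `distribute`: distributing skip/take/scan is
# prohibited under the final decision, so they run on the consumer side.
ds = ds.take(10)
```

Note that the released TensorFlow API may differ from this proposed signature (for example, by requiring additional arguments), so treat the snippet as a reading aid for the RFC rather than working code.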