| **RFC #** | [195](https://github.com/tensorflow/community/pull/195) |
| **Author(s)** | Andrew Audibert ([email protected]), Rohan Jain ([email protected]) |
| **Sponsor** | Jiri Simsa ([email protected]) |
- | **Updated** | 2019-01-09 |
+ | **Updated** | 2019-01-24 |

## Objective

@@ -98,20 +98,20 @@ provides dataset elements to consumers over RPC.
**Consumer**: A machine which consumes data from the tf.data service. The
consumer may be attached to a GPU or TPU, or use data for on-CPU training.

- #### Separate Cluster Architecture
+ #### Option 1: Separate Cluster Architecture

Each server is run on a separate host from the TensorFlow cluster. This
configuration gives users a way to provide horizontally scalable CPU for
processing their input pipelines and quickly feeding data to accelerators.

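For intuition, a deployment script for this option might look like the sketch
below. The server-startup API is not specified in this excerpt, so the
`MasterServer` and `WorkerServer` names and their constructor arguments are
illustrative assumptions only, not the proposed interface:

```python
# Hypothetical sketch of Option 1: tf.data servers run on dedicated hosts,
# separate from the TensorFlow cluster. The MasterServer/WorkerServer
# classes below are assumptions for illustration.
import sys

import tensorflow as tf

MASTER_ADDRESS = "grpc://master-host:5000"  # assumed address of the master host

if sys.argv[1] == "master":
  # Run once, on a dedicated host outside the TensorFlow cluster.
  server = tf.data.experimental.service.MasterServer(port=5000)  # hypothetical
else:
  # Run on each dedicated input-processing host; adding hosts scales
  # input processing horizontally.
  server = tf.data.experimental.service.WorkerServer(            # hypothetical
      port=5001, master_address=MASTER_ADDRESS)
server.join()  # serve RPCs until the process is terminated
```
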
- #### Embedded Cluster Architecture
+ #### Option 2: Embedded Cluster Architecture

Each TensorFlow server runs the tf.data worker gRPC service, and one server also
runs the master gRPC service. This lets users leverage the tf.data service
without needing to provision additional compute resources, and gives all the
benefits of the tf.data service except for horizontal scaling.

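A rough sketch of the embedded option, reusing the assumed server classes from
the Option 1 sketch (again, hypothetical names, not the proposed interface):

```python
# Hypothetical sketch of Option 2: every TensorFlow server starts a tf.data
# worker in-process, and the first server also hosts the master.
import tensorflow as tf

task_index = 0  # this process's index in the TensorFlow cluster (assumed)

if task_index == 0:
  # One TensorFlow server also runs the master gRPC service.
  master = tf.data.experimental.service.MasterServer(port=5000)  # hypothetical

# Every TensorFlow server runs the worker gRPC service in the background,
# alongside its normal training work, so no extra hosts are provisioned.
worker = tf.data.experimental.service.WorkerServer(              # hypothetical
    port=5001, master_address="grpc://worker0:5000")
```
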
- #### Hybrid Architecture
+ #### Option 3: Hybrid Architecture

Users could run tf.data workers embedded in their TensorFlow cluster, and also
run additional tf.data workers (and potentially the tf.data master) outside the
@@ -136,6 +136,12 @@ code. The steps for distributed iteration over a dataset are
5. Create per-consumer iterators using `make_iterator`, and use these iterators
   to read data from the tf.data service.

+ We move away from the idiomatic `for element in dataset` control flow because
+ there is now an extra step when going from dataset to iterator: creating an
+ iteration. A higher-level API such as tf.distribute may use the API presented
+ here to implement datasets which produce per-replica elements, enabling
+ idiomatic control flow.
+
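To make the new control flow concrete, distributed iteration might look like
the sketch below. Only `distribute` is defined in this excerpt (immediately
below); `create_iteration` and `make_iterator` are the endpoints referenced in
the steps above, with assumed signatures:

```python
# Sketch of the dataset -> iteration -> iterator flow. Everything except
# `distribute` uses assumed names and signatures based on the numbered
# steps above.
import tensorflow as tf

service = tf.data.experimental.service

# Build a dataset and mark it for processing by the tf.data service,
# assuming `distribute` follows the usual `apply` transformation convention.
dataset = tf.data.Dataset.range(1000)
dataset = dataset.apply(service.distribute("grpc://master:5000"))

# The extra step discussed above: create an iteration over the dataset.
iteration = service.create_iteration(dataset)  # assumed endpoint

# Step 5: create this consumer's iterator and read from the service.
iterator = service.make_iterator(dataset, iteration)  # assumed signature
for element in iterator:
  tf.print(element)  # stand-in for a real training step
```
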
139145``` python
140146def tf.data.experimental.service.distribute(address):
141147 """ Marks that a dataset should be processed by the tf.data service.
@@ -237,7 +243,9 @@ It will be entirely in C++, and we don't currently have any plans to expose
splitting through Python.

The API focuses on producing and consuming `Split`s. A `Split` is a variant
- Tensor that can be subclassed to represent arbitrary types of splitting.
+ Tensor that can be subclassed to represent arbitrary types of splitting. The
+ `Split` base class is intentionally general so that subclasses have the
+ flexibility to define splits however they like.

```cpp
class Split {
@@ -614,6 +622,8 @@ service. We will also provide a tutorial for using the tf.data service.
* How should we communicate that distributing a dataset will change the order
  in which elements are processed? If users' datasets rely on elements being
  processed in a certain order, they could face unpleasant surprises.
+ * Should we support splitting `skip` and `take` by having them operate at a
+   per-task level (skip or take the first `N` elements within each task)?
* Is there a more user-friendly way to share iteration data across consumers?
  Distribution strategy is well-equipped with collective ops to share the
  iteration data, but sharing the iteration data could be a heavy burden for