This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit beff086 (parent: d941c33)
Author: Frank Chen

Updated design doc by removing some unneeded parameters
1 file changed: 22 additions (+), 36 deletions (-)

rfcs/20200107-tf-data-snapshot.md
````diff
@@ -70,14 +70,10 @@ We are proposing the following API for the snapshot transformation.
 ```python
 def snapshot(path,
              compression=None,
-             reader_path_prefix=None,
-             writer_path_prefix=None,
              shard_size_bytes=None,
              pending_snapshot_expiry_seconds=None,
              num_reader_threads=None,
-             reader_buffer_size=None,
              num_writer_threads=None,
-             writer_buffer_size=None,
              shuffle_on_read=None,
              shuffle_seed=None,
              mode=None,
````
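To make the surviving parameter surface concrete, here is a minimal stand-in with the post-change signature. It is hypothetical: the real op is a tf.data dataset transformation, whereas this stub only validates arguments and returns the resolved configuration for illustration.

```python
# Hypothetical stand-in for the proposed `snapshot` transformation.
# It is NOT the real implementation; it only checks argument values and
# reports the resolved configuration.
_VALID_COMPRESSION = (None, "GZIP", "SNAPPY")
_VALID_MODES = (None, "auto", "read", "write", "passthrough")


def snapshot(path,
             compression=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_reader_threads=None,
             num_writer_threads=None,
             shuffle_on_read=None,
             shuffle_seed=None,
             mode=None,
             snapshot_name=None):
    if compression not in _VALID_COMPRESSION:
        raise ValueError("compression must be GZIP, SNAPPY or None")
    if mode not in _VALID_MODES:
        raise ValueError("mode must be auto, read, write or passthrough")
    return {
        "path": path,
        "compression": compression,
        "mode": mode or "auto",  # `auto` is the documented default mode
        "snapshot_name": snapshot_name,
    }
```

Called as `snapshot("/tmp/training_data")`, every tuning knob falls back to its documented default; invalid values for `compression` or `mode` are rejected up front.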
```diff
@@ -88,60 +84,44 @@ def snapshot(path,
 1. `path`: Required. A directory where we want to save our snapshots and/or
    read from a previously saved snapshot.
 
-2. `compression`: Optional. The type of compression to apply to the snapshot
+1. `compression`: Optional. The type of compression to apply to the snapshot
    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
    AUTO.
 
-3. `reader_path_prefix`: Optional. A prefix to add to the path when reading
-   from snapshots. This is useful for filesystems where configuration is passed
-   in through the path. Defaults to None.
-
-4. `writer_path_prefix`: Optional. A prefix to add to the path when writing to
-   snapshots. This is useful for filesystems where configuration is passed in
-   through the path. Defaults to None.
-
-5. `shard_size_bytes`: Optional. The maximum size of each data file to be
+1. `shard_size_bytes`: Optional. The maximum size of each data file to be
    written by the snapshot dataset op. Defaults to AUTO.
 
-6. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
+1. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
    before the snapshot op considers a previously unfinished snapshot to be
    stale and starts writing a snapshot from scratch again. Defaults to 86400
    seconds (1 day).
 
-7. `num_reader_threads`: Optional. Number of threads to parallelize reading
+1. `num_reader_threads`: Optional. Number of threads to parallelize reading
    from snapshot. Especially useful if compression is turned on since the
    decompression operation tends to be intensive. If > 1, then
    this might introduce non-determinism i.e. the order in which the elements
    are read from the snapshot are different from the order they're written.
    Defaults to AUTO.
 
-8. `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
-   reading from the snapshot. Increasing this might improve
-   performance but will increase memory consumption. Defaults to AUTO.
-
-9. `num_writer_threads`: Optional. Number of threads to parallelize writing
+1. `num_writer_threads`: Optional. Number of threads to parallelize writing
    from snapshot. We'll open up `num_writer_threads` files and write to them in
    parallel. Especially useful if compression is turned on since the
    compression operation tends to be intensive. If > 1, then
    this might introduce non-determinism i.e. the order in which the elements
    are read from the upstream iterator are different from the order they're
    written. Defaults to AUTO.
 
-10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to fill
-    up the buffer before writing them out using `num_writer_threads`. Defaults
-    to AUTO.
-
-11. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
+1. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
    order in which the snapshot files are read back. This emulates shuffling
    of the input files during a training run (e.g. when `Dataset.list_files`
    is called with `shuffle` turned on). Defaults to False.
 
-12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
+1. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
    generator used for shuffling (when `shuffle_on_read` is turned on) is seeded
    by the given seed. Otherwise, it is seeded by a random seed that differs for
    every run.
 
-13. `mode`: Optional. The mode at which snapshot should operate. Valid options
+1. `mode`: Optional. The mode at which snapshot should operate. Valid options
    are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,
    where the snapshot op will automatically determine what mode to operate in.
 
```
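The `pending_snapshot_expiry_seconds` behavior above reduces to a simple staleness check. A minimal sketch (the helper name and shape are hypothetical, not part of the proposed API):

```python
# Hypothetical helper illustrating the expiry rule for unfinished snapshots.
DEFAULT_PENDING_SNAPSHOT_EXPIRY_SECONDS = 86400  # documented default: 1 day


def pending_snapshot_is_stale(pending_since_seconds, now_seconds,
                              expiry_seconds=DEFAULT_PENDING_SNAPSHOT_EXPIRY_SECONDS):
    # An unfinished snapshot left pending longer than the expiry window is
    # treated as stale, and a new snapshot is written from scratch.
    return (now_seconds - pending_since_seconds) > expiry_seconds
```

With the default window, a snapshot pending for an hour is still considered in progress, while one pending for more than a day triggers a fresh write.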
```diff
@@ -150,25 +130,34 @@ def snapshot(path,
    materialization currently exists. In other words, we enter the **WRITE**
    state immediately.
 
-2. `read` mode forces the snapshot transformation to read from the latest
+1. `read` mode forces the snapshot transformation to read from the latest
    version of the materialization on disk, regardless of whether the data
    stored on disk is complete and valid. In other words, we enter the
    **READ** state immediately.
 
-3. `passthrough` mode turns the snapshot transformation into a no-op. In
+1. `passthrough` mode turns the snapshot transformation into a no-op. In
    other words, we enter the **PASSTHROUGH** state immediately.
 
-4. `auto` retains the default behavior of snapshot. See the "Standard
+1. `auto` retains the default behavior of snapshot. See the "Standard
    Kernel Workflow" section for the default behavior.
 
-14. `snapshot_name`: Optional. If set, use the supplied string as a named
+1. `snapshot_name`: Optional. If set, use the supplied string as a named
    snapshot name instead of introspecting the data pipeline and automatically
    generating a unique identifier for the specific data pipeline.
 
    1. Instead of generating a new fingerprint of the input processing graph or
      and `run_id` (see the _Detailed Design_ section for details), we will
      use the `snapshot_name` to uniquely identify the snapshot.
 
+   1. Multiple concurrent training jobs with the same "snapshot_name" may
+      result in concurrent write collisions and a potentially invalid snapshot
+      if the jobs try to read from and then write to the metadata file at
+      exactly the same time.
+
+      The user is expected to handle these cases and explicitly specify `mode`s
+      to ensure that only one run is set to `write` mode at any point if
+      collisions are a possibility.
+
 Note: `AUTO` options above indicates that snapshot will attempt to pick a
 reasonable default that is suitable for most use cases. We will eventually add
 tf.data autotuning to pick the right parameters for the best performance for
```
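The four modes map onto the kernel states named in this hunk. A minimal sketch, assuming the `auto` branch can be reduced to a single "does a valid materialization exist" check (the real decision is the "Standard Kernel Workflow" and is more involved):

```python
def resolve_snapshot_state(mode, valid_snapshot_exists):
    # Hypothetical mapping from the user-facing `mode` argument to the
    # internal kernel state; `resolve_snapshot_state` is an illustrative
    # name, not part of the proposed API.
    if mode == "write":
        return "WRITE"        # force a fresh write, even if a snapshot exists
    if mode == "read":
        return "READ"         # force a read, even if the data may be invalid
    if mode == "passthrough":
        return "PASSTHROUGH"  # snapshot becomes a no-op
    # `auto` (the default): simplified stand-in for the Standard Kernel
    # Workflow -- read a valid materialization if one exists, else write one.
    return "READ" if valid_snapshot_exists else "WRITE"
```

Note how `read` and `write` ignore the on-disk state entirely; only `auto` consults it, which is why the RFC asks users to pin `mode` explicitly when concurrent jobs share a `snapshot_name`.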
```diff
@@ -195,10 +184,7 @@ select whether to train or preprocess on their own, which is not good.
 
 ### Performance Implications
 
-* Do you expect any (speed / memory)? How will you confirm?
-* There should be microbenchmarks. Are there?
-* There should be end-to-end tests and benchmarks. If there are not (since
-  this is still a design), how will you track that these will be created?
+Benchmarks for this feature will be included as part of Dataset microbenchmarks.
 
 ### Dependencies
 
```