This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit 2bee40f

Author: Frank Chen
Message: Updated design doc after comments from various folks
1 parent: 958ae63

1 file changed: rfcs/20200107-tf-data-snapshot.md (+31 additions, −20 deletions)
@@ -90,7 +90,7 @@ def snapshot(path,
 
 2. `compression`: Optional. The type of compression to apply to the snapshot
    written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
-   None.
+   AUTO.
 
 3. `reader_path_prefix`: Optional. A prefix to add to the path when reading
    from snapshots. This is useful for filesystems where configuration is passed
@@ -101,7 +101,7 @@ def snapshot(path,
    through the path. Defaults to None.
 
 5. `shard_size_bytes`: Optional. The maximum size of each data file to be
-   written by the snapshot dataset op. Defaults to 10 GiB.
+   written by the snapshot dataset op. Defaults to AUTO.
 
 6. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
    before the snapshot op considers a previously unfinished snapshot to be
@@ -110,28 +110,31 @@ def snapshot(path,
 
 7. `num_reader_threads`: Optional. Number of threads to parallelize reading
    from snapshot. Especially useful if compression is turned on since the
-   decompression operation tends to be intensive. Defaults to 1. If > 1, then
+   decompression operation tends to be intensive. If > 1, then
    this might introduce non-determinism i.e. the order in which the elements
    are read from the snapshot are different from the order they're written.
+   Defaults to AUTO.
 
 8. `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
-   reading from the snapshot. Defaults to 1. Increasing this might improve
-   performance but will increase memory consumption.
+   reading from the snapshot. Increasing this might improve
+   performance but will increase memory consumption. Defaults to AUTO.
 
 9. `num_writer_threads`: Optional. Number of threads to parallelize writing
    from snapshot. We'll open up `num_writer_threads` files and write to them in
    parallel. Especially useful if compression is turned on since the
-   compression operation tends to be intensive. Defaults to 1. If > 1, then
+   compression operation tends to be intensive. If > 1, then
    this might introduce non-determinism i.e. the order in which the elements
    are read from the upstream iterator are different from the order they're
-   written.
+   written. Defaults to AUTO.
 
 10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to fill
-    up the buffer before writing them out using `num_writer_threads`.
+    up the buffer before writing them out using `num_writer_threads`. Defaults
+    to AUTO.
 
-11. `shuffle_on_read`: Optional. If this is True, then the order in which
-    examples are produced when reading from a snapshot will be random. Defaults
-    to False.
+11. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
+    order in which the snapshot files are read back. This emulates shuffling
+    of the input files during a training run (e.g. when `Dataset.list_files`
+    is called with `shuffle` turned on). Defaults to False.
 
 12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
     generator used for shuffling (when `shuffle_on_read` is turned on) is seeded
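The interaction between `writer_buffer_size` and `num_writer_threads` described in the hunk above can be illustrated with a toy plain-Python model. This is not the tf.data implementation: `write_snapshot`, its defaults, and the in-memory "writes" are all illustrative, with list appends standing in for file writes. The point is only that a buffer drained by more than one worker thread need not preserve the input order.

```python
import concurrent.futures
import threading

def write_snapshot(elements, num_writer_threads=1, writer_buffer_size=4):
    """Toy model of snapshot writing: buffer elements, then drain in parallel.

    Mirrors the RFC's note that num_writer_threads > 1 may reorder elements
    relative to the upstream iterator; the real op writes to shard files.
    """
    written = []
    lock = threading.Lock()

    def write_one(element):
        with lock:                  # appends stand in for file writes
            written.append(element)

    with concurrent.futures.ThreadPoolExecutor(num_writer_threads) as pool:
        buffer = []
        for element in elements:
            buffer.append(element)
            if len(buffer) == writer_buffer_size:
                list(pool.map(write_one, buffer))  # drain a full buffer
                buffer = []
        list(pool.map(write_one, buffer))          # flush the remainder
    return written

print(write_snapshot(range(8), num_writer_threads=1))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

With a single writer thread the order is deterministic; with several, only the set of written elements is guaranteed.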
@@ -166,12 +169,15 @@ def snapshot(path,
    and `run_id` (see the _Detailed Design_ section for details), we will
    use the `snapshot_name` to uniquely identify the snapshot.
 
+Note: the `AUTO` options above indicate that snapshot will attempt to pick a
+reasonable default that is suitable for most use cases. We will eventually add
+tf.data autotuning to pick the right parameters for the best performance for
+individual workloads.
+
 ### External API Guarantees
 
 Externally, we guarantee that snapshots written by a particular version of
-TensorFlow will be readable by that specific version of TensorFlow. Eventually,
-we can also guarantee that snapshots written will be readable by all future
-versions of TensorFlow.
+TensorFlow will be readable by that specific version of TensorFlow.
 
 We are not currently handling the case where workers do not go through the
 entire training set at least once.
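One way to picture the `AUTO` sentinel introduced in this hunk is as a value that resolves to a concrete fallback at op-construction time. The sketch below is purely illustrative: the fallback values are hypothetical (the RFC leaves them to the implementation and, eventually, to tf.data autotuning), except `shard_size_bytes`, whose pre-AUTO default of 10 GiB appears in the diff.

```python
# AUTO sentinel: "let snapshot pick a reasonable value for this parameter".
AUTO = object()

# Hypothetical concrete fallbacks; only shard_size_bytes (10 GiB) comes from
# the RFC's earlier default. A future autotuner could replace this table.
_FALLBACKS = {
    "compression": "SNAPPY",
    "shard_size_bytes": 10 * 1024 ** 3,
    "num_reader_threads": 1,
    "reader_buffer_size": 1,
    "num_writer_threads": 1,
    "writer_buffer_size": 1,
}

def resolve(name, value):
    """Replace AUTO with a fallback; explicit user values pass through."""
    return _FALLBACKS[name] if value is AUTO else value

print(resolve("shard_size_bytes", AUTO))   # → 10737418240
print(resolve("num_reader_threads", 8))    # → 8
```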
@@ -285,14 +291,17 @@ WRITE, PASSTHROUGH, or READ state.
 1. If the snapshot directory is non-existent, empty or it doesn’t contain a
    `metadata` file, we will enter the **WRITE** state.
 
-1. If the snapshot directory contains a `metadata` file, we will read the
-   metadata file.
+1. If the snapshot directory contains a `metadata.final` file, we will read
+   the final metadata file and proceed to the **READ** state.
 
-1. The metadata file contains the following fields:
-   1. A training run ID
-   1. A boolean indicating if the snapshot is complete
+1. The file contains the following fields:
+   1. A training run ID.
+   1. A boolean indicating if the snapshot is complete.
    1. A training run start-time.
 
+1. If the snapshot directory contains a `metadata` file but not a
+   `metadata.final` file, we will read the metadata file.
+
 1. If the training run start-time is more than the (configurable) training run
    timeout (set with the `pending_snapshot_expiry_seconds` parameter), we will
    enter the **WRITE** state.
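The revised state decision in this hunk, including the new `metadata.final` check, can be sketched as follows. This is a toy model under loudly stated assumptions: the function name and string states are illustrative, and the file's modification time stands in for the training-run start-time, which the real op stores inside the metadata file itself.

```python
import os
import time

def determine_state(snapshot_dir, pending_snapshot_expiry_seconds):
    """Toy sketch of the WRITE/PASSTHROUGH/READ decision with metadata.final."""
    final_path = os.path.join(snapshot_dir, "metadata.final")
    metadata_path = os.path.join(snapshot_dir, "metadata")

    # A metadata.final file marks a completed snapshot: read it back.
    if os.path.exists(final_path):
        return "READ"

    # Missing/empty directory or no metadata file: start a fresh snapshot.
    if not os.path.exists(metadata_path):
        return "WRITE"

    # Only a plain metadata file: another run may still be writing. If it has
    # outlived the expiry window, take over; otherwise pass data through.
    age = time.time() - os.path.getmtime(metadata_path)  # stand-in for the
    if age > pending_snapshot_expiry_seconds:            # stored start-time
        return "WRITE"
    return "PASSTHROUGH"
```

The ordering matters: `metadata.final` wins over a stale `metadata` file, so a finalized snapshot is always read rather than overwritten.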
@@ -315,7 +324,9 @@ WRITE, PASSTHROUGH, or READ state.
    the snapshot.metadata file to determine whether it contains the same
    training run ID.
 
-1. If it does, we set the complete bit to true to finalize the directory.
+1. If it does, we write a `metadata.final` file containing the
+   same information as the `metadata` file but with the complete
+   bit set to true.
 1. If it does not, it means that someone else is concurrently writing the
    snapshot and we lost the race to them. We delete all data in the
    training run directory.
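The finalization step above can be sketched like this. Everything concrete here is an assumption for illustration: JSON metadata (the real implementation serializes differently), a per-run subdirectory named after the run ID, and the `finalize_snapshot` name itself. The write-then-rename pattern is one way to make publishing `metadata.final` atomic on POSIX filesystems.

```python
import json
import os
import shutil

def finalize_snapshot(snapshot_dir, my_run_id):
    """Toy sketch of finalizing a snapshot via a `metadata.final` file."""
    with open(os.path.join(snapshot_dir, "metadata")) as f:
        metadata = json.load(f)  # JSON is an assumption for this sketch

    if metadata["run_id"] == my_run_id:
        # We won the race: publish metadata.final, identical to metadata but
        # with the complete bit set to true. os.replace makes the final
        # rename step atomic.
        final = dict(metadata, complete=True)
        tmp_path = os.path.join(snapshot_dir, "metadata.final.tmp")
        with open(tmp_path, "w") as f:
            json.dump(final, f)
        os.replace(tmp_path, os.path.join(snapshot_dir, "metadata.final"))
        return True

    # Someone else won the race; delete our partial training-run data
    # (assumed here to live in a subdirectory named after our run ID).
    shutil.rmtree(os.path.join(snapshot_dir, my_run_id), ignore_errors=True)
    return False
```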
