@@ -90,7 +90,7 @@ def snapshot(path,
 
 2.  `compression`: Optional. The type of compression to apply to the snapshot
     written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
-    None.
+    AUTO.
 
 3.  `reader_path_prefix`: Optional. A prefix to add to the path when reading
     from snapshots. This is useful for filesystems where configuration is passed
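The codecs named for the `compression` option behave like ordinary block compressors. As a toy illustration (using Python's stdlib `gzip`, not the snapshot writer itself) of the round-trip a compressed snapshot shard implies:

```python
import gzip

# Sketch of what the GZIP compression option implies: serialized dataset
# elements are compressed before being written to a snapshot shard and
# decompressed on read. (SNAPPY would work the same way with another codec.)
record = b"serialized dataset element " * 100

compressed = gzip.compress(record)
restored = gzip.decompress(compressed)

assert restored == record
# Repetitive records compress well; the matching decompression cost on read
# is what motivates the `num_reader_threads` option described below.
print(len(record), "->", len(compressed))
```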
@@ -101,7 +101,7 @@ def snapshot(path,
     through the path. Defaults to None.
 
 5.  `shard_size_bytes`: Optional. The maximum size of each data file to be
-    written by the snapshot dataset op. Defaults to 10 GiB.
+    written by the snapshot dataset op. Defaults to AUTO.
 
 6.  `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
     before the snapshot op considers a previously unfinished snapshot to be
@@ -110,28 +110,31 @@ def snapshot(path,
 
 7.  `num_reader_threads`: Optional. Number of threads to parallelize reading
     from snapshot. Especially useful if compression is turned on since the
-    decompression operation tends to be intensive. Defaults to 1. If > 1, then
+    decompression operation tends to be intensive. If > 1, then
     this might introduce non-determinism i.e. the order in which the elements
     are read from the snapshot are different from the order they're written.
+    Defaults to AUTO.
 
 8.  `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
-    reading from the snapshot. Defaults to 1. Increasing this might improve
-    performance but will increase memory consumption.
+    reading from the snapshot. Increasing this might improve
+    performance but will increase memory consumption. Defaults to AUTO.
 
 9.  `num_writer_threads`: Optional. Number of threads to parallelize writing
     from snapshot. We'll open up `num_writer_threads` files and write to them in
     parallel. Especially useful if compression is turned on since the
-    compression operation tends to be intensive. Defaults to 1. If > 1, then
+    compression operation tends to be intensive. If > 1, then
     this might introduce non-determinism i.e. the order in which the elements
     are read from the upstream iterator are different from the order they're
-    written.
+    written. Defaults to AUTO.
 
 10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to fill
-    up the buffer before writing them out using `num_writer_threads`.
+    up the buffer before writing them out using `num_writer_threads`. Defaults
+    to AUTO.
 
-11. `shuffle_on_read`: Optional. If this is True, then the order in which
-    examples are produced when reading from a snapshot will be random. Defaults
-    to False.
+11. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
+    order in which the snapshot files are read back. This emulates shuffling
+    of the input files during a training run (e.g. when `Dataset.list_files`
+    is called with `shuffle` turned on). Defaults to False.
 
 12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
     generator used for shuffling (when `shuffle_on_read` is turned on) is seeded
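The `shuffle_on_read` / `shuffle_seed` semantics described in this hunk can be sketched as follows. The helper name and shard file names are hypothetical; this stands in for the snapshot reader's actual file-ordering logic, not the TensorFlow implementation:

```python
import random

# Hypothetical sketch of shuffle_on_read: the reader randomizes the order in
# which snapshot shard files are read back, mirroring `Dataset.list_files`
# with shuffle enabled. A fixed shuffle_seed makes the order reproducible.
def snapshot_read_order(shard_files, shuffle_on_read=False, shuffle_seed=None):
    files = list(shard_files)
    if shuffle_on_read:
        random.Random(shuffle_seed).shuffle(files)
    return files

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Without shuffling, shards come back in the order they were written.
assert snapshot_read_order(shards) == shards

# With the same seed, two reads produce the same (shuffled) order.
a = snapshot_read_order(shards, shuffle_on_read=True, shuffle_seed=42)
b = snapshot_read_order(shards, shuffle_on_read=True, shuffle_seed=42)
assert a == b
```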
@@ -166,12 +169,15 @@ def snapshot(path,
     and `run_id` (see the _Detailed Design_ section for details), we will
     use the `snapshot_name` to uniquely identify the snapshot.
 
+Note: The `AUTO` options above indicate that snapshot will attempt to pick a
+reasonable default that is suitable for most use cases. We will eventually add
+tf.data autotuning to pick the right parameters for the best performance for
+individual workloads.
+
 ### External API Guarantees
 
 Externally, we guarantee that snapshots written by a particular version of
-TensorFlow will be readable by that specific version of TensorFlow. Eventually,
-we can also guarantee that snapshots written will be readable by all future
-versions of TensorFlow.
+TensorFlow will be readable by that specific version of TensorFlow.
 
 We are not currently handling the case where workers do not go through the
 entire training set at least once.
@@ -285,14 +291,17 @@ WRITE, PASSTHROUGH, or READ state.
 1.  If the snapshot directory is non-existent, empty or it doesn’t contain a
     `metadata` file, we will enter the **WRITE** state.
 
-1.  If the snapshot directory contains a `metadata` file, we will read the
-    metadata file.
+1.  If the snapshot directory contains a `metadata.final` file, we will read
+    the final metadata file and proceed to the **READ** state.
 
-    1.  The metadata file contains the following fields:
-        1.  A training run ID
-        1.  A boolean indicating if the snapshot is complete
+    1.  The file contains the following fields:
+        1.  A training run ID,
+        1.  A boolean indicating if the snapshot is complete.
         1.  A training run start-time.
 
+1.  If the snapshot directory contains a `metadata` file but not a
+    `metadata.final` file, we will read the metadata file.
+
 1.  If the training run start-time is more than the (configurable) training run
     timeout (set with the `pending_snapshot_expiry_seconds` parameter), we will
     enter the **WRITE** state.
@@ -315,7 +324,9 @@ WRITE, PASSTHROUGH, or READ state.
     the snapshot.metadata file to determine whether it contains the same
     training run ID.
 
-    1.  If it does, we set the complete bit to true to finalize the directory.
+    1.  If it does, we write a `metadata.final` file containing the
+        same information as the `metadata` file but with the complete
+        bit set to true.
     1.  If it does not, it means that someone else is concurrently writing the
         snapshot and we lost the race to them. We delete all data in the
         training run directory.
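Taken together, the state-determination steps changed in the hunks above can be summarized as a small sketch. The directory layout (`metadata`, `metadata.final`) follows the text, but the comma-separated metadata encoding and function name here are assumptions for illustration, not TensorFlow's actual implementation:

```python
import os
import time

# Illustrative sketch of choosing between the WRITE, PASSTHROUGH, and READ
# states from the metadata files described above. The metadata file is assumed
# to hold "run_id,complete,start_time" as plain text (an assumption).
def determine_state(snapshot_dir, pending_snapshot_expiry_seconds=86400):
    metadata = os.path.join(snapshot_dir, "metadata")
    final = os.path.join(snapshot_dir, "metadata.final")

    # A metadata.final file means the snapshot is complete: read it back.
    if os.path.exists(final):
        return "READ"

    # Non-existent/empty directory or no metadata file: start a new snapshot.
    if not os.path.exists(metadata):
        return "WRITE"

    # Otherwise parse the non-final metadata: run ID, complete bit, start time.
    run_id, complete, start_time = open(metadata).read().split(",")

    # A stale unfinished snapshot has expired and is overwritten by this run.
    if time.time() - float(start_time) > pending_snapshot_expiry_seconds:
        return "WRITE"

    # Someone else is still writing: pass elements through untouched.
    return "PASSTHROUGH"
```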