@@ -70,14 +70,10 @@ We are proposing the following API for the snapshot transformation.
```python
def snapshot(path,
             compression=None,
-            reader_path_prefix=None,
-            writer_path_prefix=None,
             shard_size_bytes=None,
             pending_snapshot_expiry_seconds=None,
             num_reader_threads=None,
-            reader_buffer_size=None,
             num_writer_threads=None,
-            writer_buffer_size=None,
             shuffle_on_read=None,
             shuffle_seed=None,
             mode=None,
@@ -88,60 +84,44 @@ def snapshot(path,
1. `path`: Required. A directory where we want to save our snapshots and/or
   read from a previously saved snapshot.

-2. `compression`: Optional. The type of compression to apply to the snapshot
+1. `compression`: Optional. The type of compression to apply to the snapshot
   written to disk. This will support `GZIP`, `SNAPPY` or None. Defaults to
   AUTO.

-3. `reader_path_prefix`: Optional. A prefix to add to the path when reading
-   from snapshots. This is useful for filesystems where configuration is passed
-   in through the path. Defaults to None.
-
-4. `writer_path_prefix`: Optional. A prefix to add to the path when writing to
-   snapshots. This is useful for filesystems where configuration is passed in
-   through the path. Defaults to None.
-
-5. `shard_size_bytes`: Optional. The maximum size of each data file to be
+1. `shard_size_bytes`: Optional. The maximum size of each data file to be
   written by the snapshot dataset op. Defaults to AUTO.
-6. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
+1. `pending_snapshot_expiry_seconds`: Optional. How long to wait (in seconds)
   before the snapshot op considers a previously unfinished snapshot to be
   stale and starts writing a snapshot from scratch again. Defaults to 86400
   seconds (1 day).

-7. `num_reader_threads`: Optional. Number of threads to parallelize reading
+1. `num_reader_threads`: Optional. Number of threads to parallelize reading
   from snapshot. Especially useful if compression is turned on since the
   decompression operation tends to be intensive. If > 1, then this might
   introduce non-determinism, i.e. the order in which the elements are read
   from the snapshot is different from the order they're written. Defaults
   to AUTO.
-8. `reader_buffer_size`: Optional. Maximum number of elements we can prefetch
-   reading from the snapshot. Increasing this might improve
-   performance but will increase memory consumption. Defaults to AUTO.
-
-9. `num_writer_threads`: Optional. Number of threads to parallelize writing
+1. `num_writer_threads`: Optional. Number of threads to parallelize writing
   to the snapshot. We'll open up `num_writer_threads` files and write to them
   in parallel. Especially useful if compression is turned on since the
   compression operation tends to be intensive. If > 1, then this might
   introduce non-determinism, i.e. the order in which the elements are read
   from the upstream iterator is different from the order they're written.
   Defaults to AUTO.
-10. `writer_buffer_size`: Optional. Maximum number of pipeline elements to
-    fill up the buffer before writing them out using `num_writer_threads`.
-    Defaults to AUTO.
-
-11. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
+1. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
   order in which the snapshot files are read back. This emulates shuffling
   of the input files during a training run (e.g. when `Dataset.list_files`
   is called with `shuffle` turned on). Defaults to False.

-12. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
+1. `shuffle_seed`: Optional. If `shuffle_seed` is set, the random number
   generator used for shuffling (when `shuffle_on_read` is turned on) is
   seeded by the given seed. Otherwise, it is seeded by a random seed that
   differs for every run.

-13. `mode`: Optional. The mode at which snapshot should operate. Valid options
+1. `mode`: Optional. The mode in which snapshot should operate. Valid options
   are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,
   where the snapshot op will automatically determine what mode to operate in.

@@ -150,25 +130,34 @@ def snapshot(path,
      materialization currently exists. In other words, we enter the **WRITE**
      state immediately.

-   2. `read` mode forces the snapshot transformation to read from the latest
+   1. `read` mode forces the snapshot transformation to read from the latest
      version of the materialization on disk, regardless of whether the data
      stored on disk is complete and valid. In other words, we enter the
      **READ** state immediately.

-   3. `passthrough` mode turns the snapshot transformation into a no-op. In
+   1. `passthrough` mode turns the snapshot transformation into a no-op. In
      other words, we enter the **PASSTHROUGH** state immediately.

-   4. `auto` retains the default behavior of snapshot. See the "Standard
+   1. `auto` retains the default behavior of snapshot. See the "Standard
      Kernel Workflow" section for the default behavior.

-14. `snapshot_name`: Optional. If set, use the supplied string as a named
+1. `snapshot_name`: Optional. If set, use the supplied string as a named
   snapshot name instead of introspecting the data pipeline and automatically
   generating a unique identifier for the specific data pipeline.

   1. Instead of generating a new fingerprint of the input processing graph
      or a `run_id` (see the _Detailed Design_ section for details), we will
      use the `snapshot_name` to uniquely identify the snapshot.

+   1. Multiple concurrent training jobs with the same `snapshot_name` may
+      result in concurrent write collisions and a potentially invalid snapshot
+      if the jobs try to read from and then write to the metadata file at
+      exactly the same time.
+
+      The user is expected to handle these cases and explicitly specify `mode`
+      to ensure that only one run is set to `write` mode at any point if
+      collisions are a possibility.
+
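The single-writer discipline asked of the user above can be sketched with a few lines of caller-side orchestration. This is illustrative only: `assign_mode` and the job indices are not part of the proposal, just one way a caller might elect exactly one `write`-mode job among several concurrent jobs sharing a `snapshot_name`.

```python
# Illustrative only: `assign_mode` is NOT part of the proposed API. When
# several jobs share a snapshot_name, the caller elects a single writer and
# pins every other job to `read` mode, avoiding metadata-file collisions.
def assign_mode(job_index):
    """Job 0 writes the snapshot; all other concurrent jobs only read it."""
    return "write" if job_index == 0 else "read"

modes = [assign_mode(i) for i in range(4)]  # four concurrent training jobs
```

With this scheme at most one job can ever race on the metadata file, which is the invariant the caution above asks callers to maintain.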
Note: `AUTO` options above indicate that snapshot will attempt to pick a
reasonable default that is suitable for most use cases. We will eventually add
tf.data autotuning to pick the right parameters for the best performance for
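The mapping from `mode` values to kernel states described above can be condensed into a short sketch. The `auto` branch here is a deliberate simplification made for illustration (reuse a finished materialization, otherwise write one); the authoritative decision procedure is the "Standard Kernel Workflow" section.

```python
# Sketch of the mode-to-state mapping described above. The `auto` branch is a
# simplification; the real logic lives in the Standard Kernel Workflow.
def resolve_state(mode, materialization_complete):
    if mode == "write":
        return "WRITE"          # always write a fresh materialization
    if mode == "read":
        return "READ"           # always read the latest materialization
    if mode == "passthrough":
        return "PASSTHROUGH"    # snapshot becomes a no-op
    # mode == "auto": reuse a finished materialization, otherwise produce one.
    return "READ" if materialization_complete else "WRITE"
```

Note that `read` mode returns **READ** even when `materialization_complete` is false, matching the "regardless of whether the data stored on disk is complete and valid" wording above.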
@@ -195,10 +184,7 @@ select whether to train or preprocess on their own, which is not good.

### Performance Implications

-* Do you expect any (speed / memory)? How will you confirm?
-* There should be microbenchmarks. Are there?
-* There should be end-to-end tests and benchmarks. If there are not (since
-  this is still a design), how will you track that these will be created?
+Benchmarks for this feature will be included as part of Dataset microbenchmarks.

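As a rough illustration of the quantity such a microbenchmark would measure, the toy stand-in below (plain Python, not tf.data; `preprocess` is a hypothetical expensive transformation) times recomputing a pipeline against reusing a materialized copy of its output:

```python
import timeit

# Toy stand-in, NOT the actual Dataset microbenchmark: `preprocess` mimics an
# expensive input transformation; `materialized` mimics a snapshot read back
# from disk instead of being recomputed.
def preprocess():
    return [x * x for x in range(10_000)]

materialized = preprocess()  # one-time cost, analogous to the snapshot write

recompute_s = timeit.timeit(preprocess, number=50)          # no snapshot
reuse_s = timeit.timeit(lambda: list(materialized), number=50)  # with snapshot
```

A real benchmark would additionally vary compression, thread counts, and shard sizes, since those parameters dominate snapshot read/write cost.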
### Dependencies
