@@ -72,10 +72,8 @@ def snapshot(path,
              compression=None,
              shard_size_bytes=None,
              pending_snapshot_expiry_seconds=None,
-             num_reader_threads=None,
              num_writer_threads=None,
-             shuffle_on_read=None,
-             shuffle_seed=None,
+             reader_fn=None,
              mode=None,
              snapshot_name=None):
   pass  # Implementation goes here.
@@ -96,13 +94,6 @@ def snapshot(path,
     stale and starts writing a snapshot from scratch again. Defaults to 86400
     seconds (1 day).
 
-1.  `num_reader_threads`: Optional. Number of threads to parallelize reading
-    from the snapshot. Especially useful if compression is turned on, since
-    the decompression operation tends to be intensive. If > 1, this might
-    introduce non-determinism, i.e. the order in which elements are read from
-    the snapshot differs from the order in which they were written.
-    Defaults to AUTO.
-
 1.  `num_writer_threads`: Optional. Number of threads to parallelize writing
     the snapshot. We'll open up `num_writer_threads` files and write to them
     in parallel. Especially useful if compression is turned on since the
@@ -111,15 +102,47 @@ def snapshot(path,
     are read from the upstream iterator are different from the order they're
     written. Defaults to AUTO.
 
-1.  `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
-    order in which the snapshot files are read back. This emulates shuffling
-    of the input files during a training run (e.g. when `Dataset.list_files`
-    is called with `shuffle` turned on). Defaults to False.
-
-1.  `shuffle_seed`: Optional. If `shuffle_seed` is set, the random number
-    generator used for shuffling (when `shuffle_on_read` is turned on) is
-    seeded by the given seed. Otherwise, it is seeded by a random seed that
-    differs for every run.
+1.  `reader_fn`: Optional. A user-provided reader function to use when reading
+    the snapshot back. This allows the user to specify the concurrency and
+    randomization required when reading from the snapshot.
+
+    `reader_fn` should be a function that accepts two arguments: (1) a list of
+    snapshot file paths, and (2) a reference to a `SnapshotDataset` class.
+    The function should return a `Dataset` class.
+
+    The `SnapshotDataset` class is a `Dataset` (similar to other source datasets
+    like `TFRecordDataset` or `TextLineDataset`) with the following constructor:
+    ```python
+    class SnapshotDataset(dataset_ops.DatasetSource):
+      def __init__(self, filenames):
+        """Creates a `SnapshotDataset`.
+
+        Args:
+          filenames: A `tf.string` tensor or a `tf.data.Dataset` containing one
+            or more filenames.
+        """
+        pass
+    ```
+
+    If `reader_fn` is not specified, a default equivalent to the following
+    will be used:
+    ```python
+    def reader_fn(filenames, SnapshotDataset):
+      return SnapshotDataset(filenames)
+    ```
+
+    Users can optionally add snapshot file shuffling and parallelism by passing
+    a `reader_fn` similar to the one here:
+    ```python
+    def reader_fn(filenames, SnapshotDataset):
+      file_ds = Dataset.from_tensor_slices(filenames)
+      file_ds = file_ds.shuffle(1000)
+      reader_ds = file_ds.interleave(
+          lambda x: SnapshotDataset(x),
+          cycle_length=32,
+          num_parallel_calls=32)
+      return reader_ds
+    ```
 
 1.  `mode`: Optional. The mode at which snapshot should operate. Valid options
     are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,
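
The `reader_fn` contract added in this diff can be sketched outside of TensorFlow with plain Python stand-ins. The sketch below is an assumption-laden illustration only: `FakeSnapshotDataset`, the shard names, and the `:recN` record naming are all hypothetical, not real tf.data APIs. It shows the two-argument shape of a `reader_fn` and how a shuffling reader returns the same records as the default reader, just in a different shard order.

```python
import random

class FakeSnapshotDataset:
    """Hypothetical stand-in for the proposed `SnapshotDataset` source."""

    def __init__(self, filenames):
        # Accept a single filename or a list, loosely mirroring the
        # "tensor or Dataset of filenames" constructor in the RFC.
        if isinstance(filenames, str):
            filenames = [filenames]
        self.filenames = list(filenames)

    def __iter__(self):
        # Pretend each snapshot shard file holds two records named after it.
        for f in self.filenames:
            yield f + ":rec0"
            yield f + ":rec1"

def default_reader_fn(filenames, snapshot_dataset_cls):
    # Mirrors the RFC's default: read all shards back in written order.
    return snapshot_dataset_cls(filenames)

def shuffling_reader_fn(filenames, snapshot_dataset_cls):
    # Emulates the shuffle-then-read reader from the RFC's example:
    # randomize the shard order before constructing the dataset.
    files = list(filenames)
    random.Random(42).shuffle(files)  # seeded only to keep this sketch deterministic
    return snapshot_dataset_cls(files)

shards = ["shard0", "shard1", "shard2"]
ordered = list(default_reader_fn(shards, FakeSnapshotDataset))
shuffled = list(shuffling_reader_fn(shards, FakeSnapshotDataset))
# Both readers yield the same records; only the shard order may differ.
assert sorted(ordered) == sorted(shuffled)
```

The point of the contract is that all read-side policy (shard order, interleave width, parallelism) lives in the user's `reader_fn`, while the snapshot dataset class stays a dumb record source.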