This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit a7a7c5b

Author: Frank Chen (committed)
Removed miscellaneous reader options and added a reader_fn parameter

1 parent beff086 commit a7a7c5b

File tree

1 file changed

+42
-19
lines changed


rfcs/20200107-tf-data-snapshot.md

Lines changed: 42 additions & 19 deletions
````diff
@@ -72,10 +72,8 @@ def snapshot(path,
              compression=None,
              shard_size_bytes=None,
              pending_snapshot_expiry_seconds=None,
-             num_reader_threads=None,
              num_writer_threads=None,
-             shuffle_on_read=None,
-             shuffle_seed=None,
+             reader_fn=None,
              mode=None,
              snapshot_name=None):
   pass  # Implementation goes here.
@@ -96,13 +94,6 @@ def snapshot(path,
    stale and starts writing a snapshot from scratch again. Defaults to 86400
    seconds (1 day).
 
-1. `num_reader_threads`: Optional. Number of threads to parallelize reading
-   from snapshot. Especially useful if compression is turned on since the
-   decompression operation tends to be intensive. If > 1, then
-   this might introduce non-determinism, i.e. the order in which the elements
-   are read from the snapshot differs from the order in which they were
-   written. Defaults to AUTO.
-
 1. `num_writer_threads`: Optional. Number of threads to parallelize writing
    from snapshot. We'll open up `num_writer_threads` files and write to them in
    parallel. Especially useful if compression is turned on since the
@@ -111,15 +102,47 @@ def snapshot(path,
    are read from the upstream iterator are different from the order they're
    written. Defaults to AUTO.
 
-1. `shuffle_on_read`: Optional. If this is True, then snapshot randomizes the
-   order in which the snapshot files are read back. This emulates shuffling
-   of the input files during a training run (e.g. when `Dataset.list_files`
-   is called with `shuffle` turned on). Defaults to False.
-
-1. `shuffle_seed`: Optional. If shuffle_seed is set, the random number
-   generator used for shuffling (when `shuffle_on_read` is turned on) is seeded
-   by the given seed. Otherwise, it is seeded by a random seed that differs for
-   every run.
+1. `reader_fn`: Optional. A user-provided reader function to use when reading
+   the snapshot back. This allows the user to specify the concurrency and
+   randomization required when reading from the snapshot.
+
+   `reader_fn` should be a function that accepts two arguments: (1) a list of
+   snapshot file paths, and (2) a reference to a `SnapshotDataset` class.
+   The function should return a `Dataset` class.
+
+   The `SnapshotDataset` class is a `Dataset` (similar to other source datasets
+   like `TFRecordDataset` or `TextLineDataset`) with the following constructor:
+   ```python
+   class SnapshotDataset(dataset_ops.DatasetSource):
+     def __init__(self, filenames):
+       """Creates a `SnapshotDataset`.
+
+       Args:
+         filenames: A `tf.string` tensor or a `tf.data.Dataset` containing one
+           or more filenames.
+       """
+       pass
+   ```
+
+   If `reader_fn` is not specified, a default equivalent to the following
+   will be used:
+   ```python
+   def reader_fn(filenames, SnapshotDataset):
+     return SnapshotDataset(filenames)
+   ```
+
+   Users can optionally add snapshot file shuffling and parallelism by passing
+   a `reader_fn` similar to the one here:
+   ```python
+   def reader_fn(filenames, SnapshotDataset):
+     file_ds = Dataset.from_tensor_slices(filenames)
+     file_ds = file_ds.shuffle(1000)
+     reader_ds = file_ds.interleave(
+         lambda x: SnapshotDataset(x),
+         cycle_length=32,
+         num_parallel_calls=32)
+     return reader_ds
+   ```
 
 1. `mode`: Optional. The mode at which snapshot should operate. Valid options
    are `auto`, `read`, `write`, and `passthrough`. The default mode is `auto`,
````
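The `reader_fn` contract introduced by this commit can be illustrated without TensorFlow. The sketch below is a plain-Python stand-in, not the tf.data implementation: `FakeSnapshotDataset`, its two fake records per file, and the `part-*.snapshot` file names are all invented for illustration, and the shuffling variant reads files sequentially rather than with a parallel interleave.

```python
import random


class FakeSnapshotDataset:
    """Hypothetical stand-in for the proposed SnapshotDataset source.

    A real SnapshotDataset would deserialize elements from a snapshot
    file; here each file just yields two labeled fake records.
    """

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Pretend each snapshot file contains exactly two records.
        return iter([f"{self.filename}:rec0", f"{self.filename}:rec1"])


def default_reader_fn(filenames, SnapshotDataset):
    # Mirrors the RFC's default reader_fn: read the files in order.
    records = []
    for f in filenames:
        records.extend(SnapshotDataset(f))
    return records


def shuffling_reader_fn(filenames, SnapshotDataset, seed=0):
    # Mirrors the shuffle-then-read pattern from the RFC's example,
    # minus the parallel interleave (kept sequential for clarity).
    files = list(filenames)
    random.Random(seed).shuffle(files)
    return default_reader_fn(files, SnapshotDataset)


files = ["part-0.snapshot", "part-1.snapshot"]
print(default_reader_fn(files, FakeSnapshotDataset))
# ['part-0.snapshot:rec0', 'part-0.snapshot:rec1',
#  'part-1.snapshot:rec0', 'part-1.snapshot:rec1']
```

Either function satisfies the proposed contract: it takes the file list plus the dataset class and returns something iterable over the snapshot's records, with ordering and concurrency left entirely to the caller.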
