This repository was archived by the owner on Jul 10, 2025. It is now read-only.

Commit d941c33

Author: Frank Chen
Commit message: Add clarification regarding number of workers being the same
Parent: 2bee40f

File tree

1 file changed: 11 additions, 3 deletions

rfcs/20200107-tf-data-snapshot.md

@@ -241,9 +241,17 @@ the MVP this is designed for:
    training dataset and be able to store that data on disk. Otherwise, a
    snapshot will never get created.
 
-2. In case there are multiple workers and the dataset is sharded across
-   workers, we assume that the number of workers remains the same from one run
-   to another. If the number changes, we’ll trigger another snapshot.
+2. In the cases where there are multiple workers and the dataset is sharded with
+   `Dataset.shard`, we assume that the number of workers remains the same from
+   the initial (writing) run through to the reading runs.
+
+   If the number of workers changes, then the `num_shards` parameter to
+   `Dataset.shard` will change, and this will result in a different graph
+   fingerprint and another snapshot write will be triggered.
+
+   If all workers use the exact same input pipeline with no sharding (e.g. all
+   workers will read from all the files), then snapshot will still be able to
+   read from previous snapshots even if the number of workers is different.
 
 3. Any `repeat`s in the dataset should be moved to after the `snapshot` op, to
    avoid writing large (or infinite) amounts of data during a snapshot writing
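Why a change in worker count changes what each worker sees: `Dataset.shard(num_shards, index)` keeps every element whose position modulo `num_shards` equals `index`. The pure-Python sketch below illustrates only those semantics (the `shard` helper here is hypothetical; the real tf.data implementation streams elements rather than materializing lists):

```python
# Hypothetical helper mirroring the semantics of
# tf.data's Dataset.shard(num_shards, index): worker `index`
# keeps every element whose position % num_shards == index.
def shard(elements, num_shards, index):
    """Return the subset of `elements` that worker `index` would see."""
    return [x for i, x in enumerate(elements) if i % num_shards == index]

data = list(range(10))

# With 2 workers, the dataset splits into two disjoint halves.
print(shard(data, num_shards=2, index=0))  # [0, 2, 4, 6, 8]
print(shard(data, num_shards=2, index=1))  # [1, 3, 5, 7, 9]

# With 3 workers, every worker sees a different subset than before,
# which is why a per-shard snapshot written by a 2-worker run cannot
# simply be reused by a 3-worker run.
print(shard(data, num_shards=3, index=0))  # [0, 3, 6, 9]
```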
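The re-snapshot trigger described above follows from hashing the input pipeline: since `num_shards` is baked into the dataset graph, changing the worker count changes the graph fingerprint, and a new fingerprint maps to a fresh snapshot write. The sketch below is a toy stand-in for that mechanism (the real fingerprint hashes the serialized dataset graph def, not a string; the pipeline strings are illustrative):

```python
import hashlib

def graph_fingerprint(pipeline_desc: str) -> str:
    """Toy fingerprint: hash a serialized description of the pipeline."""
    return hashlib.sha256(pipeline_desc.encode()).hexdigest()[:16]

# Changing num_shards changes the pipeline, hence the fingerprint,
# hence which snapshot directory is used: a new write is triggered.
fp_2_workers = graph_fingerprint("TFRecordDataset(files) -> shard(num_shards=2, index=0)")
fp_3_workers = graph_fingerprint("TFRecordDataset(files) -> shard(num_shards=3, index=0)")
assert fp_2_workers != fp_3_workers

# An unsharded pipeline is identical on every worker, so every worker
# (and every run) computes the same fingerprint and reuses the snapshot.
assert graph_fingerprint("TFRecordDataset(files)") == graph_fingerprint("TFRecordDataset(files)")
```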

0 commit comments
