docs/source/process.mdx (1 addition & 0 deletions)
@@ -657,6 +657,7 @@ In this case, the new dataset is constructed by getting examples one by one from
  You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e. the dataset construction is stopped as soon as one of the datasets runs out of samples.
  You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every sample in every dataset has been added at least once. In practice, this means that when a dataset is exhausted, iteration restarts from the beginning of that dataset until the stop criterion is reached.
  Note that if no sampling probabilities are specified, the new dataset will have `max_length_datasets*nb_dataset` samples.
+ There is also `stopping_strategy=all_exhausted_without_replacement` to ensure that every sample is seen exactly once.
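To make the three strategies concrete, here is a minimal pure-Python sketch of round-robin interleaving without sampling probabilities. It is a simplified model, not the `datasets` implementation: the function name `interleave_round_robin` and the plain-list sources are hypothetical, and the ordering shown for `all_exhausted_without_replacement` (exhausted sources are simply skipped) is an assumption.

```python
def interleave_round_robin(sources, stopping_strategy="first_exhausted"):
    """Simplified model of round-robin interleaving over non-empty lists.

    Hypothetical helper for illustration only; it does not call the
    `datasets` library and may differ from its exact behaviour.
    """
    lengths = [len(s) for s in sources]
    if not sources or min(lengths) == 0:
        raise ValueError("expected at least one source, all non-empty")

    if stopping_strategy == "first_exhausted":
        # Subsampling: stop after the shortest source has been consumed.
        return [s[i] for i in range(min(lengths)) for s in sources]

    if stopping_strategy == "all_exhausted":
        # Oversampling: shorter sources wrap around to their beginning
        # until the longest source has been consumed once.
        return [s[i % len(s)] for i in range(max(lengths)) for s in sources]

    if stopping_strategy == "all_exhausted_without_replacement":
        # Assumption: exhausted sources are skipped, so every sample
        # appears exactly once.
        return [s[i] for i in range(max(lengths)) for s in sources if i < len(s)]

    raise ValueError(f"unknown stopping_strategy: {stopping_strategy!r}")


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]

print(interleave_round_robin([d1, d2, d3], "first_exhausted"))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]
print(interleave_round_robin([d1, d2, d3], "all_exhausted"))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
print(interleave_round_robin([d1, d2, d3], "all_exhausted_without_replacement"))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 24]
```

In this model the `all_exhausted` result has `max_length_datasets*nb_dataset` = 5 * 3 = 15 samples, while the without-replacement result has exactly the 12 samples contained in the three sources.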
docs/source/stream.mdx (1 addition & 0 deletions)
@@ -197,6 +197,7 @@ Around 80% of the final dataset is made of the `es_dataset`, and 20% of the `fr_
  You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e. the dataset construction is stopped as soon as one of the datasets runs out of samples.
  You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every sample in every dataset has been added at least once. In practice, this means that when a dataset is exhausted, iteration restarts from the beginning of that dataset until the stop criterion is reached.
  Note that if no sampling probabilities are specified, the new dataset will have `max_length_datasets*nb_dataset` samples.
+ There is also `stopping_strategy=all_exhausted_without_replacement` to ensure that every sample is seen exactly once.
  By default, `first_exhausted` is an undersampling strategy, i.e. the dataset construction is stopped as soon as one dataset has run out of samples.
  If the strategy is `all_exhausted`, we use an oversampling strategy, i.e. the dataset construction is stopped as soon as every sample of every dataset has been added at least once.
+ When the strategy is `all_exhausted_without_replacement`, we make sure that each sample in each dataset is sampled only once.
  Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous:
  - with no probabilities, the resulting dataset will have `max_length_datasets*nb_dataset` samples.
  - with given probabilities, the resulting dataset will have more samples if some datasets have a really low probability of being visited.
  # if undersampling ("first_exhausted"), we stop as soon as one dataset is exhausted
  # if oversampling ("all_exhausted"), we stop as soon as every dataset is exhausted, i.e. as soon as every sample of every dataset has been visited at least once
Interleave several datasets (sources) into a single dataset.
@@ -55,9 +57,10 @@ def interleave_datasets(
  Name of the dataset split.
  <Added version="2.4.0"/>
  stopping_strategy (`str`, defaults to `first_exhausted`):
- Two strategies are proposed right now, `first_exhausted` and `all_exhausted`.
+ Three strategies are proposed right now, `first_exhausted`, `all_exhausted` and `all_exhausted_without_replacement`.
  By default, `first_exhausted` is an undersampling strategy, i.e. the dataset construction is stopped as soon as one dataset has run out of samples.
  If the strategy is `all_exhausted`, we use an oversampling strategy, i.e. the dataset construction is stopped as soon as every sample of every dataset has been added at least once.
+ When the strategy is `all_exhausted_without_replacement`, we make sure that each sample in each dataset is sampled only once.
  Note that if the strategy is `all_exhausted`, the interleaved dataset size can get enormous:
  - with no probabilities, the resulting dataset will have `max_length_datasets*nb_dataset` samples.
  - with given probabilities, the resulting dataset will have more samples if some datasets have a really low probability of being visited.
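The size warning for `all_exhausted` with probabilities can be illustrated with a small simulation. This is a hedged sketch in plain Python, not the library's sampling code: the helper name `interleave_all_exhausted`, the use of `random.Random`, and the list-based sources are all assumptions made for illustration.

```python
import random


def interleave_all_exhausted(sources, probabilities, seed=0):
    """Simplified model of `all_exhausted` with sampling probabilities.

    Hypothetical helper: at each step one source is drawn according to
    `probabilities`; a source that runs out restarts from its beginning.
    Construction stops once every source has been fully visited at least
    once. Assumes non-empty sources and strictly positive probabilities.
    """
    rng = random.Random(seed)
    positions = [0] * len(sources)
    exhausted = [False] * len(sources)
    result = []
    while not all(exhausted):
        k = rng.choices(range(len(sources)), weights=probabilities)[0]
        result.append(sources[k][positions[k]])
        positions[k] += 1
        if positions[k] == len(sources[k]):
            exhausted[k] = True
            positions[k] = 0  # oversampling: wrap around to the start
    return result


sources = [list(range(10)), list(range(100, 110))]
balanced = interleave_all_exhausted(sources, [0.5, 0.5])
skewed = interleave_all_exhausted(sources, [0.99, 0.01])

# The rarely visited source needs many draws before it is exhausted,
# so the skewed result is much longer than the balanced one.
print(len(balanced), len(skewed))
```

In this model the second source is drawn only about once every 100 steps in the skewed case, so the first source is repeated many times before the stop criterion is met.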
@@ -143,15 +146,20 @@ def interleave_datasets(
  raise ValueError(
  f"Unable to interleave a {dataset_type.__name__} (at position 0) with a {other_type.__name__} (at position {i}). Expected a list of Dataset objects or a list of IterableDataset objects."
  # if undersampling ("first_exhausted"), we stop as soon as one dataset is exhausted
  # if oversampling ("all_exhausted"), we stop as soon as every dataset is exhausted, i.e. as soon as every sample of every dataset has been visited at least once