Arbitrary Dataset Mixtures with DataModules #12731
Unanswered
siddk
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 2 comments
-
Do you mean the distribution should apply across all of training, or per epoch? e.g. choose dataset1 for 50% of all epochs, dataset2 for 25%, ...
-
I was thinking the latter -- throughout training, each batch is made up of 50% dataset1, 25% dataset2, ... The use case is that a lot of language modeling datasets come from sources of varying quality; I'd want to upsample the high-quality sources within each batch, but still see the full diversity of data.
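One common way to get this expected per-batch mixture in PyTorch is to concatenate the sub-datasets with `torch.utils.data.ConcatDataset` and drive a `torch.utils.data.WeightedRandomSampler` with per-sample weights of `mixture[k] / len(dataset_k)`. Here is a minimal sketch of just the weight computation; `per_sample_weights` is a hypothetical helper name, not a Lightning API:

```python
def per_sample_weights(dataset_sizes, mixture):
    """Per-sample weights for a ConcatDataset of the given sub-dataset
    sizes, so that dataset k contributes mixture[k] of each batch in
    expectation when used with WeightedRandomSampler."""
    weights = []
    for size, p in zip(dataset_sizes, mixture):
        # Every example in dataset k gets the same weight p / size,
        # so the whole dataset's total probability mass is p.
        weights.extend([p / size] * size)
    return weights

# Two sub-datasets of sizes 2 and 4, mixed 50/50:
w = per_sample_weights([2, 4], [0.5, 0.5])
# First dataset's examples each get 0.25, second's each get 0.125.
```

The resulting list would be passed as the `weights` argument to `WeightedRandomSampler(weights, num_samples=..., replacement=True)` for the loader returned by your DataModule's `train_dataloader()`.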
-
Hey folks - I have a semi-standard use case that I was hoping to use DataModules for. I'd like to train a language model on multiple different training datasets (each processed individually), sampling my full batch from each sub-dataset according to some mixture parameters (e.g. 50% from dataset 1, 25% from dataset 2, 25% from dataset 3).
Is there a nice way to do this with DataModules? Separately, if I wanted to extend this to streaming datasets, what would I need to do?
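A framework-agnostic sketch of the sampling itself, for map-style (indexable) datasets: for each slot in the batch, first pick a sub-dataset according to the mixture weights, then pick an example uniformly from it. `mixture_batch` is a hypothetical helper, not a Lightning API; inside a `LightningDataModule` you would typically express the same idea with `ConcatDataset` plus `WeightedRandomSampler` in `train_dataloader()` instead:

```python
import random

def mixture_batch(datasets, weights, batch_size, rng=random):
    """Draw one batch whose slots are assigned to sub-datasets
    according to the given mixture weights (in expectation)."""
    # Choose which sub-dataset fills each slot in the batch.
    picks = rng.choices(range(len(datasets)), weights=weights, k=batch_size)
    # Sample an example uniformly from each chosen sub-dataset.
    return [datasets[i][rng.randrange(len(datasets[i]))] for i in picks]

# Three toy "datasets" mixed 50/25/25:
d1 = [("d1", i) for i in range(100)]
d2 = [("d2", i) for i in range(100)]
d3 = [("d3", i) for i in range(100)]
batch = mixture_batch(
    [d1, d2, d3], weights=[0.5, 0.25, 0.25],
    batch_size=8, rng=random.Random(0),
)
```

For streaming (iterable) datasets, the same two-step pick still applies, but instead of random indexing you would pull the next item from the chosen stream, e.g. by keeping an iterator per sub-dataset.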