Refactor/Extend training #15
Draft
Enhanced Pre-Training Pipeline
The main changes in this PR are:
- a new `Trainer` class
- `scripts/`: added a simple, slimmer training config `pretrain_classification_new.py`
- `hydra` support for better config handling

This is still WIP. There are still issues, specifically with DDP, which are due to the current structure of the `DataLoader`/`PriorDataSets` and which make it really difficult to handle DDP properly.
@AlexanderPfefferle the problem imo right now is that the datasets themselves handle batching. This leads to the following issue: when training with DDP, one would normally use something like a distributed sampler to split a batch of size `b` into `b/m` smaller batches, where `m` is the number of GPUs we are using. However, because our datasets already return batched elements, the internal `DataLoader` inside the data pipeline effectively runs with a `batch_size` of `None`, so there are no per-sample indices left to split and we cannot do the batch splitting there.
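To make the mismatch concrete, here is a minimal, self-contained sketch; `PreBatchedPriorData` and its shapes are illustrative stand-ins, not the actual dataset classes in this PR:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class PreBatchedPriorData(Dataset):
    """Stand-in for a dataset whose items are already full batches of synthetic datasets."""

    def __init__(self, num_items=16, batch_size=64, seq_len=128, num_features=8):
        self.num_items = num_items
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.num_features = num_features

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        # every item is already a batch of b synthetic datasets
        x = torch.randn(self.batch_size, self.seq_len, self.num_features)
        y = torch.randint(0, 2, (self.batch_size, self.seq_len))
        return x, y


# batch_size=None means the DataLoader just forwards the pre-batched items,
# so a DistributedSampler has no per-sample indices it could split across GPUs.
loader = DataLoader(PreBatchedPriorData(), batch_size=None)
x, y = next(iter(loader))
assert x.shape == (64, 128, 8)
```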
As far as I see it right now we could change it in two ways:

1. The `Dataset` object (pre-loaded or prior) returns only one dataset at a time. This however makes loading way more expensive, as our stored data / the data from the prior usually already comes in a batched format. In that setting we would leave the batching entirely to the `DataLoader` inside the `Trainer` class (see the sketch after this list).
2. Handle the batch splitting in the `PriorDataLoaders` itself, though I think this would really complicate things, as we would need to handle this stuff outside of the `Trainer`, which I would not want to do.
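A rough sketch of what the first option could look like, assuming sample-wise items and a standard `DistributedSampler`; the class name and the hard-coded `world_size`/`rank` values are placeholders and would normally come from `torch.distributed`:

```python
import torch
from torch.utils.data import DataLoader, Dataset, DistributedSampler


class SampleWisePriorData(Dataset):
    """Stand-in for a dataset that returns a single synthetic dataset per index."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        x = torch.randn(128, 8)          # one dataset: (seq_len, num_features)
        y = torch.randint(0, 2, (128,))  # its targets
        return x, y


global_batch_size, world_size, rank = 64, 4, 0  # rank/world_size taken from the process group in practice
dataset = SampleWisePriorData()
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)

# the DataLoader inside the Trainer now owns batching, so the usual DDP splitting works
loader = DataLoader(dataset, batch_size=global_batch_size // world_size, sampler=sampler)
x, y = next(iter(loader))
assert x.shape == (global_batch_size // world_size, 128, 8)
```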
At least for on-the-fly generation this entire thing is not really a problem, as one could just change the seed for each machine inside the `PriorDataset`; each machine would then draw different datasets from the prior anyway, and you would increase the effective batch size to `b*m`.
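A minimal sketch of that per-rank seeding idea; `OnTheFlyPriorData` is a stand-in generator and the seed-offset scheme is just one possible choice, not what the current `PriorDataset` does:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class OnTheFlyPriorData(IterableDataset):
    """Stand-in for a prior that samples fresh pre-batched datasets on the fly."""

    def __init__(self, base_seed, rank, batch_size=16, seq_len=128, num_features=8):
        # offsetting the seed by the rank makes every machine draw different data,
        # so the effective batch size becomes batch_size * world_size
        self.generator = torch.Generator().manual_seed(base_seed + rank)
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.num_features = num_features

    def __iter__(self):
        while True:
            x = torch.randn(self.batch_size, self.seq_len, self.num_features,
                            generator=self.generator)
            y = torch.randint(0, 2, (self.batch_size, self.seq_len),
                              generator=self.generator)
            yield x, y


rank = 0  # would normally come from torch.distributed.get_rank()
loader = DataLoader(OnTheFlyPriorData(base_seed=42, rank=rank), batch_size=None)
x, y = next(iter(loader))
assert x.shape == (16, 128, 8)
```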
I think we should find a solution for that before merging, and also adapt the entire structure of the `PriorDataLoader` to be consistent with the rest of the project.

Open TODOs