Add unmasking and deprecate use_legacy_pretraining_format #565

Merged

mergify[bot] merged 6 commits into instructlab:main on Apr 8, 2025
Conversation
Force-pushed 1e3686e to 2641838
Force-pushed 2641838 to 143f41e
Member (Author)

Spoke with @bbrowning offline. One consideration we'll need to make is ensuring that if we are concatenating datasets with other ones, we preserve backwards compatibility. I.e., if we have an older pre-computed skills dataset, we would need to update all of those samples to also use the unmask flag prior to merging. We will add some tests to this PR for that.
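As a rough illustration of the backwards-compatibility concern above, here is a minimal sketch of that kind of backfill, assuming the mixing pipeline works with HuggingFace `datasets` objects; the helper name `backfill_unmask` and the `False` default are assumptions, not code from this PR:

```python
# Hypothetical sketch (not the PR's actual code): backfill a default
# `unmask` value onto an older precomputed skills dataset so its schema
# matches newer samples before the datasets are concatenated.
from datasets import Dataset, concatenate_datasets

def backfill_unmask(ds: Dataset) -> Dataset:
    """Add unmask=False to samples that predate the field."""
    if "unmask" not in ds.column_names:
        ds = ds.add_column("unmask", [False] * len(ds))
    return ds

# Usage (both dataset names are placeholders):
# mixed = concatenate_datasets([backfill_unmask(old_skills_ds), new_knowledge_ds])
```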
Contributor

@Mergifyio rebase
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Contributor

✅ Branch has been successfully rebased
This test just ensures the new unmask field mixes properly with our previous format of precomputed skills datasets that did not have the unmask field. Signed-off-by: Ben Browning <bbrownin@redhat.com>
Contributor

bbrowning approved these changes on Apr 8, 2025 and left a comment:

I reviewed the changes and they look good. I did rebase this on top of the latest main and added one test to ensure we can mix samples with the format in our precomputed skills dataset with these new samples that have unmask.
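For illustration only, a rough sketch of what such a mixing test might look like; the fixture data, helper calls, and assertions below are assumptions, not the test actually added in this PR:

```python
# Hypothetical mixing test (assumes HuggingFace `datasets`); not the PR's real test.
from datasets import Dataset, concatenate_datasets

def test_mix_legacy_and_unmask_samples():
    # Older precomputed skills sample: no unmask field.
    legacy = Dataset.from_list(
        [{"messages": [{"role": "user", "content": "hello"}]}]
    )
    # Newer sample in this PR's format: carries the unmask flag.
    new = Dataset.from_list(
        [{"messages": [{"role": "user", "content": "hello"}], "unmask": True}]
    )

    # Backfill the missing column so both schemas line up before mixing.
    legacy = legacy.add_column("unmask", [False] * len(legacy))
    mixed = concatenate_datasets([legacy, new])

    assert mixed.num_rows == 2
    assert mixed["unmask"] == [False, True]
```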
eshwarprasadS approved these changes on Apr 8, 2025
RobotSail commented on Apr 11, 2025
The way that pretraining samples are created today is by manually formatting the samples based on a model's specific chat template, and requiring the user to specify which format gets used.

However, this formatting should not be a responsibility of data mixing/post-processing, as it imposes an unnecessary burden on the repo to maintain specific information about a particular model.

This PR removes this burden by replacing the formatting logic with `unmask` fields that are set on general samples. Where a knowledge sample would previously be handled by replacing the user/assistant messages with a pretraining message, the user/assistant messages are now preserved, and an `unmask` boolean field can be found on each sample. Consuming libraries may now choose to use this information or simply ignore it as necessary.
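To make the change concrete, here is a rough sketch of the two sample shapes; aside from the `unmask` flag itself, the field names and the legacy rendering shown are assumptions rather than exact output of this repo:

```python
# Hypothetical example of the old and new shapes (everything except the
# `unmask` field is illustrative, not taken verbatim from this PR).

# Legacy behavior: the mixer rendered a single pretraining-style message
# using the target model's chat template (use_legacy_pretraining_format).
legacy_sample = {
    "messages": [
        {"role": "pretraining", "content": "<|user|> What is X? <|assistant|> X is ..."},
    ],
}

# New behavior: the user/assistant messages are preserved and a boolean
# flag tells the consumer that the whole sample may be unmasked in training.
new_sample = {
    "messages": [
        {"role": "user", "content": "What is X?"},
        {"role": "assistant", "content": "X is ..."},
    ],
    "unmask": True,
}

# A consuming library can honor the flag or ignore it entirely, e.g.:
# loss targets = all tokens if sample.get("unmask") else assistant tokens only
```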