Skip to content

Add unmasking and deprecate use_legacy_pretraining_format#565

Merged
mergify[bot] merged 6 commits intoinstructlab:mainfrom
RobotSail:add-unmasking
Apr 8, 2025
Merged

Add unmasking and deprecate use_legacy_pretraining_format#565
mergify[bot] merged 6 commits intoinstructlab:mainfrom
RobotSail:add-unmasking

Conversation

@RobotSail
Copy link
Copy Markdown
Member

The way that pretraining samples are created today is by manually formatting the samples based on a model's specific chat template, and requiring the user to specify which format gets used.

However; this formatting should not be a responsibility of data-mixing/post-processing, as it imposes an unnecessary burden for the repo to maintain specific information about a particular model.

This PR removes this burden by replacing the formatting logic with unmask fields that are set on general samples.

Where a knowledge sample would previously be treated by replacing the user/assistant messages with a pretraining message. Now, the user/assistant messages are preserved, and an unmask boolean field can be found on each sample. Consuming libraries may now choose to use this information or simply ignore it as necessary.

@mergify mergify bot added testing Relates to testing ci-failure and removed ci-failure labels Feb 23, 2025
@mergify mergify bot added CI/CD Affects CI/CD configuration and removed ci-failure labels Feb 23, 2025
@RobotSail
Copy link
Copy Markdown
Member Author

Spoke with @bbrowning offline, one consideration we'll need to make is to make sure that if we are concatenating datasets with other ones that we preserve backwards compatibility. I.e., if we have an older pre-computed skills dataset, we would need to update all of those samples to also use the unmask flag prior to merging.

We will add some tests to this PR for that.

@bbrowning
Copy link
Copy Markdown
Contributor

@Mergifyio rebase

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 4, 2025

rebase

✅ Branch has been successfully rebased

This test just ensures the new unmask field mixes properly with our
previous format of precomputed skills datasets that did not have the
unmask field.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Copy link
Copy Markdown
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the changes and they look good. I did rebase this on top of the latest main and added one test to ensure we can mix samples with the format in our precomputed skills dataset with these new samples that have unmask.

@mergify mergify bot added the one-approval label Apr 8, 2025
@mergify mergify bot removed the one-approval label Apr 8, 2025
@mergify mergify bot merged commit 0a572b9 into instructlab:main Apr 8, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Affects CI/CD configuration testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants