Add unmasking and deprecate use_legacy_pretraining_format #565

Merged

mergify[bot] merged 6 commits into instructlab:main on Apr 8, 2025
Conversation
Force-pushed 1e3686e to 2641838
Force-pushed 2641838 to 143f41e
Member (Author)

Spoke with @bbrowning offline. One consideration we'll need to make is ensuring that if we are concatenating datasets with other ones, we preserve backwards compatibility. I.e., if we have an older pre-computed skills dataset, we would need to update all of those samples to also use the unmask flag prior to merging. We will add some tests to this PR for that.
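As a rough illustration of the backwards-compatibility concern above, here is a minimal sketch of that kind of backfill, assuming the mixing pipeline works with HuggingFace `datasets` objects; the helper name `backfill_unmask` and the `False` default are assumptions, not code from this PR:

```python
# Hypothetical sketch (not the PR's actual code): backfill a default
# `unmask` value onto an older precomputed skills dataset so its schema
# matches newer samples before the datasets are concatenated.
from datasets import Dataset, concatenate_datasets

def backfill_unmask(ds: Dataset) -> Dataset:
    """Add unmask=False to samples that predate the field."""
    if "unmask" not in ds.column_names:
        ds = ds.add_column("unmask", [False] * len(ds))
    return ds

# Usage (both dataset names are placeholders):
# mixed = concatenate_datasets([backfill_unmask(old_skills_ds), new_knowledge_ds])
```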
Contributor

@Mergifyio rebase
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Contributor

✅ Branch has been successfully rebased
This test just ensures the new unmask field mixes properly with our previous format of precomputed skills datasets that did not have the unmask field. Signed-off-by: Ben Browning <bbrownin@redhat.com>
Contributor

bbrowning approved these changes on Apr 8, 2025 and left a comment:

I reviewed the changes and they look good. I did rebase this on top of the latest main and added one test to ensure we can mix samples with the format in our precomputed skills dataset with these new samples that have unmask.
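For illustration only, a rough sketch of what such a mixing test might look like; the fixture data, helper calls, and assertions below are assumptions, not the test actually added in this PR:

```python
# Hypothetical mixing test (assumes HuggingFace `datasets`); not the PR's real test.
from datasets import Dataset, concatenate_datasets

def test_mix_legacy_and_unmask_samples():
    # Older precomputed skills sample: no unmask field.
    legacy = Dataset.from_list(
        [{"messages": [{"role": "user", "content": "hello"}]}]
    )
    # Newer sample in this PR's format: carries the unmask flag.
    new = Dataset.from_list(
        [{"messages": [{"role": "user", "content": "hello"}], "unmask": True}]
    )

    # Backfill the missing column so both schemas line up before mixing.
    legacy = legacy.add_column("unmask", [False] * len(legacy))
    mixed = concatenate_datasets([legacy, new])

    assert mixed.num_rows == 2
    assert mixed["unmask"] == [False, True]
```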
eshwarprasadS approved these changes on Apr 8, 2025
RobotSail commented on Apr 11, 2025
The way that pretraining samples are created today is by manually formatting the samples based on a model's specific chat template, and requiring the user to specify which format gets used.

However, this formatting should not be a responsibility of data mixing/post-processing, as it imposes an unnecessary burden on the repo to maintain specific information about a particular model.

This PR removes this burden by replacing the formatting logic with `unmask` fields that are set on general samples. Where a knowledge sample would previously be handled by replacing the user/assistant messages with a pretraining message, the user/assistant messages are now preserved, and an `unmask` boolean field can be found on each sample. Consuming libraries may now choose to use this information or simply ignore it as necessary.
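To make the change concrete, here is a rough sketch of the two sample shapes; aside from the `unmask` flag itself, the field names and the legacy rendering shown are assumptions rather than exact output of this repo:

```python
# Hypothetical example of the old and new shapes (everything except the
# `unmask` field is illustrative, not taken verbatim from this PR).

# Legacy behavior: the mixer rendered a single pretraining-style message
# using the target model's chat template (use_legacy_pretraining_format).
legacy_sample = {
    "messages": [
        {"role": "pretraining", "content": "<|user|> What is X? <|assistant|> X is ..."},
    ],
}

# New behavior: the user/assistant messages are preserved and a boolean
# flag tells the consumer that the whole sample may be unmasked in training.
new_sample = {
    "messages": [
        {"role": "user", "content": "What is X?"},
        {"role": "assistant", "content": "X is ..."},
    ],
    "unmask": True,
}

# A consuming library can honor the flag or ignore it entirely, e.g.:
# loss targets = all tokens if sample.get("unmask") else assistant tokens only
```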