
Support preformatted datasets #169

Merged
yubofredwang merged 20 commits into sgl-project:main from
shimizust:sshimizu/pre-formatted-dataset
Aug 26, 2025

Conversation

@shimizust
Collaborator

@shimizust shimizust commented Aug 24, 2025

Motivation

  • Generic datasets like sharegpt/ultrachat do not contain user/assistant special tokens, so we expect to apply a specific model's chat template to them. However, there are use cases where we want to train a draft model on the dataset used during fine-tuning of the target model, which has its own system prompt and may have been saved as the raw prompt along with the raw target-model generations (including user/assistant/eot tokens).
  • Currently, we would need to reverse-engineer the chat template or create new templates with custom system prompts, which is inconvenient and error-prone.

Modifications

  • Add support for passing pre-formatted conversational text directly, without applying additional chat templating, via the --is-preformatted flag to the training scripts.
  • If --is-preformatted is passed, we expect JSON data in the following format (a sketch of how preprocessing might branch on this flag follows the two formats below):
{
  "id": ...,
  "text": "<|im_start|>system\nYou are a ...<|im_end|>\n<|im_start|>user\nHello...<|im_start|>assistant\n<think>..."
}

The current dataset format for conversations that are not pre-formatted:

{
  "id": ...,
  "conversations": [{"role": "user", "content": ...}, {"role": "assistant", "content": ...}]
}
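
A minimal sketch of how preprocessing might branch on the flag; build_input_text is a hypothetical helper, and the wiring here is illustrative rather than the exact code in this PR:

# Hypothetical helper illustrating the branch; not an identifier from this PR.
from transformers import AutoTokenizer

def build_input_text(sample: dict, tokenizer, is_preformatted: bool) -> str:
    if is_preformatted:
        # "text" already contains the system/user/assistant/eot special
        # tokens, so it is used verbatim with no chat templating.
        return sample["text"]
    # Otherwise, render the conversation with the model's chat template.
    return tokenizer.apply_chat_template(
        sample["conversations"], tokenize=False, add_generation_prompt=False
    )

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
sample = {"id": 0, "conversations": [{"role": "user", "content": "Hello"}]}
print(build_input_text(sample, tokenizer, is_preformatted=False))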

Accuracy Test

  • Added a visual debugging test for preprocessing conversations with and without pre-formatting.
  • Ran training successfully both without pre-formatting (sharegpt) and with a pre-formatted dataset; a sketch for producing such a dataset file follows the command below:
torchrun \
    --nnodes 1 \
    --nproc_per_node 2 \
    train_eagle3_online.py \
    --target-model-path <TARGET_MODEL_PATH> \
    --draft-model-config <MODEL_CONFIG_PATH> \
    --train-data-path <PATH_TO_PREFORMATTED_DATASET> \
    --output-dir <OUTPUT_DIR> \
    --num-epochs 10 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length 8192 \
    --tp-size 1 \
    --chat-template qwen \
    --is-preformatted \
    --cache-dir /dev/shm/cache \
    --log-steps 50 \
    --report-to mlflow
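
For reference, a minimal sketch of writing a record in the pre-formatted format described above; the file name and field values are placeholders, and JSONL (one record per line) is an assumption about the loader:

import json

# Placeholder record; the "text" field carries the raw, already-templated
# conversation, including special tokens.
record = {
    "id": 0,
    "text": (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\nHello<|im_end|>\n"
        "<|im_start|>assistant\nHi there!<|im_end|>"
    ),
}

# Assumes JSONL output; adjust if the loader expects a JSON array instead.
with open("preformatted_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")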

@shimizust shimizust marked this pull request as ready for review August 24, 2025 08:57
Collaborator

@FlamingoPg FlamingoPg left a comment

LGTM, any other comment? @FrankLeeeee @shuaills

@shimizust
Collaborator Author

Thanks @FlamingoPg, lmk your thoughts @FrankLeeeee @shuaills

@yubofredwang yubofredwang merged commit 8337db0 into sgl-project:main Aug 26, 2025
2 checks passed