
Support preformatted datasets #169

Merged
yubofredwang merged 20 commits into sgl-project:main from
shimizust:sshimizu/pre-formatted-dataset
Aug 26, 2025

Conversation

@shimizust
Collaborator

@shimizust shimizust commented Aug 24, 2025

Motivation

  • Generic datasets like sharegpt/ultrachat do not contain user/assistant special tokens, so we expect to apply a specific model's chat template to them. However, there are use cases where we want to train a draft model on the dataset used during fine-tuning of the target model, which has its own system prompt and may have been saved as the raw prompt along with the raw target-model generations (including user/assistant/eot tokens).
  • Currently, we would need to reverse-engineer the chat template or create new templates with custom system prompts, which is inconvenient and error-prone.

Modifications

  • Add support for passing pre-formatted conversational text directly, without applying additional chat templating, via the --is-preformatted flag to the training scripts.
  • If --is-preformatted is passed, we expect JSON data in the following format (a sketch of how preprocessing might branch on this flag follows the two formats below):
{
  "id": ...,
  "text": "<|im_start|>system\nYou are a ...<|im_end|>\n<|im_start|>user\nHello...<|im_start|>assistant\n<think>..."
}

The current dataset format for conversations that are not pre-formatted:

{
  "id": ...,
  "conversations": [{"role": "user", "content": ...}, {"role": "assistant", "content": ...}]
}
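
A minimal sketch of how preprocessing might branch on the flag; build_input_text is a hypothetical helper, and the wiring here is illustrative rather than the exact code in this PR:

# Hypothetical helper illustrating the branch; not an identifier from this PR.
from transformers import AutoTokenizer

def build_input_text(sample: dict, tokenizer, is_preformatted: bool) -> str:
    if is_preformatted:
        # "text" already contains the system/user/assistant/eot special
        # tokens, so it is used verbatim with no chat templating.
        return sample["text"]
    # Otherwise, render the conversation with the model's chat template.
    return tokenizer.apply_chat_template(
        sample["conversations"], tokenize=False, add_generation_prompt=False
    )

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
sample = {"id": 0, "conversations": [{"role": "user", "content": "Hello"}]}
print(build_input_text(sample, tokenizer, is_preformatted=False))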

Accuracy Test

  • Added a visual debugging test for preprocessing conversations with and without pre-formatting.
  • Ran training successfully both without pre-formatting (sharegpt) and with a pre-formatted dataset; a sketch for producing such a dataset file follows the command below:
torchrun \
    --nnodes 1 \
    --nproc_per_node 2 \
    train_eagle3_online.py \
    --target-model-path <TARGET_MODEL_PATH> \
    --draft-model-config <MODEL_CONFIG_PATH> \
    --train-data-path <PATH_TO_PREFORMATTED_DATASET> \
    --output-dir <OUTPUT_DIR> \
    --num-epochs 10 \
    --batch-size 1 \
    --learning-rate 1e-4 \
    --max-length 8192 \
    --tp-size 1 \
    --chat-template qwen \
    --is-preformatted \
    --cache-dir /dev/shm/cache \
    --log-steps 50 \
    --report-to mlflow
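
For reference, a minimal sketch of writing a record in the pre-formatted format described above; the file name and field values are placeholders, and JSONL (one record per line) is an assumption about the loader:

import json

# Placeholder record; the "text" field carries the raw, already-templated
# conversation, including special tokens.
record = {
    "id": 0,
    "text": (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\nHello<|im_end|>\n"
        "<|im_start|>assistant\nHi there!<|im_end|>"
    ),
}

# Assumes JSONL output; adjust if the loader expects a JSON array instead.
with open("preformatted_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")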

@shimizust shimizust marked this pull request as ready for review August 24, 2025 08:57
Collaborator

@FlamingoPg FlamingoPg left a comment

LGTM, any other comment? @FrankLeeeee @shuaills

@shimizust
Collaborator Author

Thanks @FlamingoPg, lmk your thoughts @FrankLeeeee @shuaills

@yubofredwang yubofredwang merged commit 8337db0 into sgl-project:main Aug 26, 2025
2 checks passed