synthetic datasets for benchmarking and testing#3518

Merged
winglian merged 3 commits into main from synthetic-dataset
Mar 22, 2026
winglian (Collaborator) commented Mar 20, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced synthetic dataset generation for testing and development with configurable parameters: dataset size, token ID ranges, sequence length, and reproducible random seed.
    • Configuration system now recognizes and automatically handles synthetic datasets.

coderabbitai bot (Contributor) commented Mar 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5f9c40a-3291-48f1-bbc9-d0103927fa14

Walkthrough

This change introduces a new synthetic dataset generation feature that enables creating random token-based datasets for training without external data sources. It includes a new strategy module, updated configuration schemas to support the SyntheticDataset type, integration into the data loading pipeline, and comprehensive tests for the new functionality.
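To make the idea concrete, here is a minimal sketch of random-token generation, written with only the standard library. It is not the code from `src/axolotl/prompt_strategies/synthetic.py`: the function name `make_synthetic_rows` is hypothetical, the `labels` column and the half-open ID range `[min_input_id, max_input_id)` are assumptions, and the real strategy builds a `Dataset` rather than a plain dict.

```python
import random
from typing import Dict, List, Optional


def make_synthetic_rows(
    length: int,
    sequence_length: int,
    min_input_id: int,
    max_input_id: int,
    seed: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Build `length` rows of random token IDs, each `sequence_length` long.

    Hypothetical helper for illustration; the upper bound is treated as
    exclusive here, which is an assumption rather than the PR's behavior.
    """
    # A dedicated Random instance keeps results reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(seed)
    input_ids = [
        [rng.randrange(min_input_id, max_input_id) for _ in range(sequence_length)]
        for _ in range(length)
    ]
    # Every position is a real token, so the mask is all ones.
    attention_mask = [[1] * sequence_length for _ in range(length)]
    # Assumption: labels mirror input_ids, as in plain causal LM training.
    labels = [list(row) for row in input_ids]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
```

Calling it twice with the same seed yields identical rows, which is the reproducibility property the tests in this PR are described as checking.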

Changes

| Cohort / File(s) | Summary |
|---|---|
| Synthetic Dataset Strategy: src/axolotl/prompt_strategies/synthetic.py, tests/prompt_strategies/test_synthetic.py | New SyntheticDatasetStrategy class that generates datasets with random token IDs within configured ranges and fixed sequence lengths. Factory load() function reads overrides from dataset config. Comprehensive test suite validates row count, token field lengths, value ranges, reproducibility via seeding, and configuration override behavior. |
| Dataset Schema: src/axolotl/utils/schemas/datasets.py | Added new SyntheticDataset Pydantic model with fields for length, sequence_length, token ID bounds (min_input_id, max_input_id), and optional seed. Updated DatasetConfig union to include SyntheticDataset. |
| Configuration Schema: src/axolotl/utils/schemas/config.py | Updated type annotations for AxolotlInputConfig.datasets and test_datasets fields to include SyntheticDataset in their union types alongside existing dataset types. |
| Data Loading & Validation: src/axolotl/utils/config/__init__.py, src/axolotl/utils/data/sft.py | Updated validate_config to recognize path == "synthetic" entries and convert them to SyntheticDataset objects. Modified _load_and_process_single_dataset to handle synthetic datasets by creating minimal in-memory placeholder datasets before wrapping via strategy. |
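The schema and validation steps summarized above can be sketched roughly as follows. This uses a standard-library dataclass as a stand-in for the Pydantic model so the example runs without dependencies. The class name `SyntheticDatasetSketch`, the helper `coerce_synthetic_entries`, and every default value are hypothetical; only the field names and the `path == "synthetic"` check come from the summary.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union


@dataclass
class SyntheticDatasetSketch:
    """Stand-in for the SyntheticDataset model; defaults are made up."""

    path: str = "synthetic"
    length: int = 1000            # hypothetical default
    sequence_length: int = 512    # hypothetical default
    min_input_id: int = 100       # hypothetical default
    max_input_id: int = 32000     # hypothetical default
    seed: Optional[int] = None


def coerce_synthetic_entries(
    entries: List[Dict[str, Any]],
) -> List[Union[SyntheticDatasetSketch, Dict[str, Any]]]:
    """Convert entries whose path is "synthetic"; pass others through."""
    converted: List[Union[SyntheticDatasetSketch, Dict[str, Any]]] = []
    for entry in entries:
        if entry.get("path") == "synthetic":
            # Unknown keys raise TypeError here; a Pydantic model would
            # apply its own validation rules instead.
            converted.append(SyntheticDatasetSketch(**entry))
        else:
            converted.append(entry)
    return converted
```

For example, `coerce_synthetic_entries([{"path": "synthetic", "length": 16, "seed": 7}])` returns a single `SyntheticDatasetSketch` with the overrides applied and the remaining fields left at their placeholder defaults.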

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'synthetic datasets for benchmarking and testing' directly and clearly summarizes the main changes: adding a new synthetic dataset generation feature with accompanying configuration schemas and tests. |

coderabbitai bot (Contributor) left a comment

🧹 Nitpick comments (1)
src/axolotl/prompt_strategies/synthetic.py (1)

68-68: Avoid list multiplication for nested lists—use a comprehension instead.

[[1] * self.sequence_length] * self.length creates self.length references to the same inner list. While this works correctly here because Dataset.from_dict copies the data into Arrow format, it's a subtle footgun that could cause issues if the code is refactored or the list is used before conversion.

♻️ Safer approach using list comprehension
-        attention_mask = [[1] * self.sequence_length] * self.length
+        attention_mask = [[1] * self.sequence_length for _ in range(self.length)]
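
The aliasing described in this nitpick is easy to reproduce in isolation. The snippet below is a standalone illustration with small made-up sizes, not code from the PR: it mutates one row of each structure and shows that only the multiplied list changes every row.

```python
sequence_length, length = 3, 2

# Outer multiplication repeats a reference to the same inner list.
shared = [[1] * sequence_length] * length
# The comprehension builds a fresh inner list on each iteration.
independent = [[1] * sequence_length for _ in range(length)]

shared[0][0] = 0
independent[0][0] = 0

print(shared)       # [[0, 1, 1], [0, 1, 1]]
print(independent)  # [[0, 1, 1], [1, 1, 1]]
```

Since the rows are never mutated before `Dataset.from_dict` copies them, the original line behaves correctly; the comprehension just removes the trap for later edits.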
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/synthetic.py` at line 68, The attention_mask is
built with nested list multiplication which creates shared inner-list
references; update the construction of attention_mask in synthetic.py (the
assignment to attention_mask that uses self.sequence_length and self.length) to
use a list comprehension that creates independent inner lists (e.g., iterate
over range(self.length) and create [1] * self.sequence_length for each) so each
row is a distinct list before passing to Dataset.from_dict.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa26e1be-b85c-4a3e-85cf-668ec4e28446

📥 Commits

Reviewing files that changed from the base of the PR and between 1fc86d5 and 07e83ab.

📒 Files selected for processing (6)
  • src/axolotl/prompt_strategies/synthetic.py
  • src/axolotl/utils/config/__init__.py
  • src/axolotl/utils/data/sft.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/utils/schemas/datasets.py
  • tests/prompt_strategies/test_synthetic.py

github-actions bot (Contributor) commented Mar 20, 2026

📖 Documentation Preview: https://69bd40c71a433c223a77aa76--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 2a50620

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 97.67442% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/utils/data/sft.py | 66.66% | 1 Missing ⚠️ |

winglian added the scheduled_release label ("This PR is slated for the upcoming release") on Mar 21, 2026
winglian merged commit fc3b3d1 into main on Mar 22, 2026
15 of 16 checks passed
winglian deleted the synthetic-dataset branch on March 22, 2026 02:47
winglian removed the scheduled_release label on Mar 22, 2026