synthetic datasets for benchmarking and testing by winglian · Pull Request #3518 · axolotl-ai-cloud/axolotl

winglian · 2026-03-20T03:30:05Z

Summary by CodeRabbit

Release Notes

New Features
- Introduced synthetic dataset generation for testing and development with configurable parameters: dataset size, token ID ranges, sequence length, and reproducible random seed.
- Configuration system now recognizes and automatically handles synthetic datasets.

coderabbitai · 2026-03-20T03:31:25Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5f9c40a-3291-48f1-bbc9-d0103927fa14


Walkthrough

This change introduces a new synthetic dataset generation feature that enables creating random token-based datasets for training without external data sources. It includes a new strategy module, updated configuration schemas to support the SyntheticDataset type, integration into the data loading pipeline, and comprehensive tests for the new functionality.
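
As a rough illustration of the idea described above, the following sketch generates fixed-length rows of random token IDs from a seeded generator using only the standard library. The helper name `make_synthetic_rows`, the inclusive upper bound, and the `labels` column are assumptions made for this example; the PR's actual `SyntheticDatasetStrategy` may differ on all three.

```python
import random


def make_synthetic_rows(length, sequence_length, min_input_id, max_input_id, seed=None):
    """Build `length` rows of random token IDs, each `sequence_length` long.

    Hypothetical helper for illustration; not the PR's SyntheticDatasetStrategy.
    """
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    # Assumption: both bounds are inclusive (random.Random.randint semantics).
    input_ids = [
        [rng.randint(min_input_id, max_input_id) for _ in range(sequence_length)]
        for _ in range(length)
    ]
    # Every position is a real token, so the mask is all ones; one fresh list per row.
    attention_mask = [[1] * sequence_length for _ in range(length)]
    # Assumption: labels mirror input_ids, as in plain causal-LM training.
    labels = [list(row) for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


rows = make_synthetic_rows(
    length=4, sequence_length=8, min_input_id=100, max_input_id=200, seed=42
)
```

Calling the helper twice with the same `seed` yields identical rows, which is the property that makes a dataset like this usable for repeatable benchmarks.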

Changes

| Cohort / File(s) | Summary |
|---|---|
| Synthetic Dataset Strategy: `src/axolotl/prompt_strategies/synthetic.py`, `tests/prompt_strategies/test_synthetic.py` | New `SyntheticDatasetStrategy` class that generates datasets with random token IDs within configured ranges and fixed sequence lengths. Factory `load()` function reads overrides from dataset config. Comprehensive test suite validates row count, token field lengths, value ranges, reproducibility via seeding, and configuration override behavior. |
| Dataset Schema: `src/axolotl/utils/schemas/datasets.py` | Added new `SyntheticDataset` Pydantic model with fields for `length`, `sequence_length`, token ID bounds (`min_input_id`, `max_input_id`), and optional `seed`. Updated `DatasetConfig` union to include `SyntheticDataset`. |
| Configuration Schema: `src/axolotl/utils/schemas/config.py` | Updated type annotations for `AxolotlInputConfig.datasets` and `test_datasets` fields to include `SyntheticDataset` in their union types alongside existing dataset types. |
| Data Loading & Validation: `src/axolotl/utils/config/__init__.py`, `src/axolotl/utils/data/sft.py` | Updated `validate_config` to recognize `path == "synthetic"` entries and convert them to `SyntheticDataset` objects. Modified `_load_and_process_single_dataset` to handle synthetic datasets by creating minimal in-memory placeholder datasets before wrapping via strategy. |
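
The schema and validation rows above can be pictured with a small stand-in. This sketch uses a standard-library dataclass in place of the Pydantic model, and every default value, the class name `SyntheticDatasetSketch`, and the function name `convert_synthetic_entries` are invented for illustration. It follows the `path == "synthetic"` rule as the walkthrough states it; a later commit in this PR is titled "use type=_synthetic", so the merged code may key on a different field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticDatasetSketch:
    """Stand-in for the PR's Pydantic `SyntheticDataset` model (defaults are made up)."""

    path: str = "synthetic"
    length: int = 1000
    sequence_length: int = 512
    min_input_id: int = 100
    max_input_id: int = 32000
    seed: Optional[int] = None


def convert_synthetic_entries(datasets):
    """Replace dict entries whose path is "synthetic" with typed objects; keep the rest."""
    converted = []
    for entry in datasets:
        if isinstance(entry, dict) and entry.get("path") == "synthetic":
            converted.append(SyntheticDatasetSketch(**entry))
        else:
            converted.append(entry)
    return converted


# Hypothetical dataset list, shaped like the `datasets` section of a config.
datasets = convert_synthetic_entries(
    [
        {"path": "synthetic", "length": 16, "sequence_length": 128, "seed": 0},
        {"path": "some/other-dataset", "type": "alpaca"},
    ]
)
```

Fields omitted from the config entry fall back to the defaults, which is roughly how the "reads overrides from dataset config" behaviour in the table could work.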

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.14%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'synthetic datasets for benchmarking and testing' directly and clearly summarizes the main changes: adding a new synthetic dataset generation feature with accompanying configuration schemas and tests. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai

🧹 Nitpick comments (1)

src/axolotl/prompt_strategies/synthetic.py (1)
Line 68: Avoid list multiplication for nested lists; use a comprehension instead.

`[[1] * self.sequence_length] * self.length` creates `self.length` references to the same inner list. While this works correctly here because `Dataset.from_dict` copies the data into Arrow format, it's a subtle footgun that could cause issues if the code is refactored or the list is used before conversion.
♻️ Safer approach using list comprehension
-        attention_mask = [[1] * self.sequence_length] * self.length
+        attention_mask = [[1] * self.sequence_length for _ in range(self.length)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/synthetic.py` at line 68, The attention_mask is
built with nested list multiplication which creates shared inner-list
references; update the construction of attention_mask in synthetic.py (the
assignment to attention_mask that uses self.sequence_length and self.length) to
use a list comprehension that creates independent inner lists (e.g., iterate
over range(self.length) and create [1] * self.sequence_length for each) so each
row is a distinct list before passing to Dataset.from_dict.
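
The aliasing described in this comment is easy to reproduce outside the PR. The sketch below uses small made-up sizes and plain lists with no `Dataset` involved, to show how the two constructions differ once a row is mutated.

```python
sequence_length, length = 3, 2

# List multiplication repeats a reference to one inner list.
shared = [[1] * sequence_length] * length
shared[0][0] = 0  # mutating "row 0" ...
# ... also changes row 1, because both rows are the same object.
print(shared)  # [[0, 1, 1], [0, 1, 1]]

# A comprehension evaluates `[1] * sequence_length` once per row.
independent = [[1] * sequence_length for _ in range(length)]
independent[0][0] = 0  # only row 0 changes
print(independent)  # [[0, 1, 1], [1, 1, 1]]
```

As long as the rows are never mutated before conversion, both forms produce the same data, which is why the comment is filed as a nitpick rather than a bug.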


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa26e1be-b85c-4a3e-85cf-668ec4e28446

📥 Commits

Reviewing files that changed from the base of the PR and between 1fc86d5 and 07e83ab.

📒 Files selected for processing (6)

src/axolotl/prompt_strategies/synthetic.py
src/axolotl/utils/config/__init__.py
src/axolotl/utils/data/sft.py
src/axolotl/utils/schemas/config.py
src/axolotl/utils/schemas/datasets.py
tests/prompt_strategies/test_synthetic.py

github-actions · 2026-03-20T03:37:37Z

📖 Documentation Preview: https://69bd40c71a433c223a77aa76--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 2a50620

codecov · 2026-03-20T05:01:28Z

Codecov Report

❌ Patch coverage is 97.67442% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| `src/axolotl/utils/data/sft.py` | 66.66% | 1 Missing ⚠️ |


synthetic datasets for benchmarking and testing

07e83ab

winglian mentioned this pull request Mar 20, 2026

feat: add dataset weight support for weighted data mixing #3501

Open

8 tasks

coderabbitai bot reviewed Mar 20, 2026


fix synthetic dataset parse from config and add tests

c806f9e

use type=_synthetic

2a50620

winglian added the `scheduled_release` label ("This PR is slated for the upcoming release") Mar 21, 2026

winglian merged commit fc3b3d1 into main Mar 22, 2026
15 of 16 checks passed

winglian deleted the synthetic-dataset branch March 22, 2026 02:47

winglian removed the `scheduled_release` label ("This PR is slated for the upcoming release") Mar 22, 2026

winglian merged 3 commits into `main` from `synthetic-dataset`
