
DataArcTech/DataArc-SynData-Toolkit


DataArc SynData Toolkit

Python 3.10+ Framework: uv Pydantic v2

A modular, highly user-friendly synthetic data generation toolkit supporting multi-source, multi-language data synthesis.

Easily synthesize training data for LLMs with a zero-code CLI and GUI!

📖 [ English | 中文 ]

🎯 Project Overview

DataArc SynData Toolkit is a synthetic data generation toolkit developed and open-sourced by DataArcTech (https://www.dataarctech.com/) and the International Digital Economy Academy (https://www.idea.edu.cn/). It lets users generate customized training data in one step from a simple configuration file tailored to their requirements.

💡 Key Features

  • Extremely Simple Usage: Synthesize data with a single command and a configuration file. GUI is also provided for easy operations.
  • Support for Multi-Source Synthetic Data:
    • Local Synthesis: Support for generating data based on local corpora.
    • Huggingface Integration: Automatically crawl and filter data from Huggingface.
    • Model Distillation: Enable synthetic data generation through model distillation.
  • Integrated Post-Training Module: End-to-end model training workflows powered by verl, supporting SFT and GRPO.
  • Multilingual Support: Supports English and various low-resource languages.
  • Multi-Provider Model Support: Works with local deployment, OpenAI APIs, and more.
  • Highly Extensible: The entire synthetic data workflow is modular, so developers can customize each step independently.

🎥 Demo

Watch our 2-minute demo to experience how DataArc SynData Toolkit works in practice.

demo.mp4

🔬 Performance

| Model                       | Medical | Finance | Law    |
|-----------------------------|---------|---------|--------|
| Qwen-2.5-7B-Instruct        | 42.34%  | 52.91%  | 19.80% |
| Trained with Synthetic Data | 64.57%  | 73.93%  | 42.80% |

A few lines of code deliver gains of more than 20 percentage points in each of the three domains.
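
To make that figure concrete, the gains can be computed as percentage-point differences between the two rows. This is a minimal sketch: the variable names are illustrative, and the numbers are copied from the table above.

```python
# Scores (%) copied from the performance table above.
baseline = {"Medical": 42.34, "Finance": 52.91, "Law": 19.80}
trained = {"Medical": 64.57, "Finance": 73.93, "Law": 42.80}

# Absolute gain in percentage points for each domain.
gains = {domain: round(trained[domain] - baseline[domain], 2) for domain in baseline}

for domain, gain in gains.items():
    print(f"{domain}: +{gain:.2f} percentage points")
```

The smallest gain is in Finance (21.02 points) and the largest in Law (23.00 points).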

📓 Changelog

[25/11/17] 🎉 We open-sourced our synthetic data platform.
[25/11/27] We added a parallel processing module that significantly accelerates the synthetic data generation pipeline.
[25/11/28] We added intermediate result saving, so users can resume from the last successful stage instead of restarting the entire pipeline, which saves a large number of tokens.
[25/12/25] 🔥 Major upgrade:

  • Frontend–Backend Separation: DataArc SynData Toolkit now adopts a fully frontend–backend separated architecture, featuring a FastAPI backend (REST APIs + SSE streaming for real-time progress) and a standalone React frontend for improved visualization, usability, and scalability.
  • Post-Training Support via verl: Introduced an integrated post-training module powered by verl, enabling end-to-end model training workflows including SFT and GRPO on synthesized data.
  • Multilingual Expansion: Added support for generating Arabic datasets, leveraging an Arabic translation model to produce fully localized synthetic data outputs.
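
The resume behaviour described in the [25/11/28] entry can be pictured as stage-level checkpointing. The following is a minimal sketch of that idea and not the toolkit's implementation: the stage names, the `run_with_resume` helper, and the one-JSON-file-per-stage layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Stage = Callable[[list], list]


def run_with_resume(stages: dict[str, Stage], data: list, workdir: Path) -> list:
    """Run stages in order, reusing any result already saved in workdir."""
    workdir.mkdir(parents=True, exist_ok=True)
    for name, stage in stages.items():
        checkpoint = workdir / f"{name}.json"
        if checkpoint.exists():
            # A previous run finished this stage: load its output instead of recomputing.
            data = json.loads(checkpoint.read_text(encoding="utf-8"))
            continue
        data = stage(data)
        checkpoint.write_text(json.dumps(data), encoding="utf-8")
    return data


calls: list[str] = []  # records which stages actually executed


def generate(_: list) -> list:
    calls.append("generate")
    return ["sample a", "sample b", ""]


def filter_empty(samples: list) -> list:
    calls.append("filter")
    return [s for s in samples if s]


with tempfile.TemporaryDirectory() as tmp:
    stages = {"generate": generate, "filter": filter_empty}
    first = run_with_resume(stages, [], Path(tmp))
    second = run_with_resume(stages, [], Path(tmp))  # every stage is loaded from disk
```

In this sketch the second call executes no stage at all, which is where the token savings would come from when a stage involves LLM calls.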

Tip

If you cannot use the latest feature, please pull the latest code.

🏭 DataArc SynData Toolkit Pipeline

DataArc SynData Toolkit is designed to synthesize data in a modular pipeline, allowing users to customize the strategies and implementation methods of each step. The main components include:

  • Synthetic Data Generation: Generate data through methods such as local synthesis, Huggingface dataset retrieval, and model distillation.
  • Data Filtering and Rewriting: Filter and rewrite initially synthesized data according to the target model's requirements.

dataarc-sdg_pipeline

By decoupling modules, developers can achieve flexible customization of functional modules based on specific needs.
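
The decoupling described above can be sketched as two interchangeable interfaces, one per pipeline component. This is an illustration only: `Generator`, `Refiner`, `TemplateGenerator`, `LengthFilter`, and `run_pipeline` are hypothetical names, not classes from `sdgsystem`.

```python
from dataclasses import dataclass
from typing import Protocol


class Generator(Protocol):
    def generate(self, n: int) -> list[str]: ...


class Refiner(Protocol):
    def refine(self, samples: list[str]) -> list[str]: ...


@dataclass
class TemplateGenerator:
    """Stand-in for a generation strategy (local, Hugging Face, or distillation)."""
    topic: str

    def generate(self, n: int) -> list[str]:
        return [f"Question {i} about {self.topic}" for i in range(1, n + 1)]


@dataclass
class LengthFilter:
    """Stand-in for the filtering and rewriting step."""
    max_chars: int

    def refine(self, samples: list[str]) -> list[str]:
        return [s.strip() for s in samples if len(s) <= self.max_chars]


def run_pipeline(generator: Generator, refiner: Refiner, n: int) -> list[str]:
    # Each step depends only on the interface, so either side can be swapped.
    return refiner.refine(generator.generate(n))


samples = run_pipeline(TemplateGenerator("finance"), LengthFilter(max_chars=24), n=10)
```

Swapping in a different generation strategy means passing another object with a `generate` method; `run_pipeline` itself does not change.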

🧩 Use Cases

We provide three different use cases that synthesize data with DataArc SynData Toolkit.

📁 Project Structure

DataArc-SynData-Toolkit/
├── configs/                        # YAML configuration examples
│   ├── sdg.yaml                    # SDG pipeline config
│   ├── sft.yaml                    # SFT training config
│   └── grpo.yaml                   # GRPO training config
│
├── sdgsystem/                      # Core System
│   ├── app/                        # FastAPI backend (REST + SSE)
│   ├── generation/                 # Data generation
│   ├── documents/                  # File parsing & retrieval
│   ├── huggingface/                # HF dataset integration
│   ├── distillation/               # Model distillation synthesis
│   ├── tasks/                      # SDG execution tasks
│   ├── evaluation/                 # Quality evaluation
│   ├── models/                     # Unified LLM interface & postprocess
│   ├── trainer/                    # Post-training (verl: SFT + GRPO)
│   ├── translation/                # Multilingual support
│   ├── webui/                      # React frontend
│   ├── pipeline.py                 # Core SDG pipeline
│   └── cli.py                      # CLI entry
│
├── verl/                           # Integrated verl framework
├── docs/                           # Documentation
├── pyproject.toml
└── README.md

🚀 Quick Start

1. Install DataArc SynData Toolkit

# 1. Clone the repository
git clone https://github.com/DataArcTech/DataArc-SynData-Toolkit.git
cd DataArc-SynData-Toolkit

# 2. Install uv if not already installed
pip install uv

# 3. Install dependencies 
uv sync

For hardware requirements and dependency details, refer to the dependency and installation guide.

2. Configuration

Please refer to the example configuration file and modify the configuration based on your requirements.

3. Synthesize Data

Run through CLI:

Create a .env file and specify the following fields.

API_KEY=sk-xxx   # your api key
BASE_URL=https://api.openai.com/v1  # Optional: your base url

Then run the following command.

uv run sdg generate configs/sdg.yaml  # or change to your .yaml file
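
For reference, a .env file in the format above is a list of `KEY=VALUE` lines with optional `#` comments. The sketch below parses that format using only the standard library. It illustrates the file format and is not how the toolkit itself loads the file; `parse_env` is a hypothetical helper.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        # Drop an inline comment that is separated from the value by whitespace.
        line = raw_line.split(" #", 1)[0].strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


example = """\
API_KEY=sk-xxx   # your api key
BASE_URL=https://api.openai.com/v1  # Optional: your base url
"""
settings = parse_env(example)
print(settings["BASE_URL"])
```

Keep the real .env file out of version control, since it holds your API key.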

🔀 Training with Synthesized Data

DataArc SynData Toolkit integrates an end-to-end model training module powered by verl, enabling you to train models directly on your synthesized data. We support two training methods: SFT (Supervised Fine-Tuning) and GRPO (Group Relative Policy Optimization).

Quick Start with CLI

1. Prepare Your Configuration

Create a training configuration file based on the SFT Configuration Example or GRPO Configuration Example.

2. Run Training

# SFT training
uv run sdg train configs/sft.yaml

# GRPO training
uv run sdg train configs/grpo.yaml

For detailed configuration options, refer to the example YAML files.
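
If you want to script synthesis and training together, the CLI commands shown in this README can be chained from Python. This is a minimal sketch: `build_commands` and `run_all` are hypothetical helpers, and only the `uv run sdg generate` and `uv run sdg train` invocations come from the sections above.

```python
import subprocess


def build_commands(sdg_config: str, train_config: str) -> list[list[str]]:
    """Return the generate and train commands as argument lists."""
    return [
        ["uv", "run", "sdg", "generate", sdg_config],
        ["uv", "run", "sdg", "train", train_config],
    ]


def run_all(commands: list[list[str]]) -> None:
    for command in commands:
        # check=True stops the chain as soon as one step fails.
        subprocess.run(command, check=True)


commands = build_commands("configs/sdg.yaml", "configs/sft.yaml")
# run_all(commands)  # uncomment inside a checkout with dependencies installed
```

Passing argument lists rather than a single shell string avoids quoting problems when a config path contains spaces.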

🖥️ Run with GUI

Start the FastAPI server with the following command.

uv run fastapi dev sdgsystem/app/main.py

Open another terminal and start the frontend with the following commands.

cd sdgsystem/webui

# Install dependencies
pnpm install

# Start development server
pnpm dev

If you have any questions regarding the Web UI, see the Web UI documentation.

📅 Schedule for the Next Release

  • Multimodal Dataset Synthesis: Support for synthesizing datasets that include images.

🀝 Contributing

We welcome contributions!

About

Synthetic Data Generation Platform By DataArcTech
