Generate text datasets for LLMs in minutes, not weeks.
- Get initial evaluation text data instead of starting your LLM project blind.
- Increase diversity and coverage of an existing dataset by generating more data.
- Quickly experiment with and test LLM-based application PoCs.
- Make your own datasets to fine-tune and evaluate language models for your application.
🌟 Star this repo if you find this useful!
- ✅ Text Classification Dataset
- ✅ Raw Text Generation Dataset
- ✅ Instruction Dataset (Ultrachat-like)
- ✅ Multiple Choice Question (MCQ) Dataset
- ✅ Preference Dataset
- ⏳ more to come...
Currently, we support the following LLM providers:
- ✔︎ OpenAI
- ✔︎ Anthropic
- ✔︎ Google Gemini
- ✔︎ Ollama (local LLM server; see the sketch after this list)
- ⏳ more to come...
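Ollama lets you generate with a local model instead of a hosted API. A minimal sketch, assuming a provider class named `OllamaProvider` exposed alongside the others (the class name and the model tag below are assumptions, not confirmed by this README):

```python
from datafast.llms import OllamaProvider  # assumed name, analogous to OpenAIProvider

# Requires a local Ollama server (start one with `ollama serve`)
local_provider = OllamaProvider(model_id="llama3.2")  # any model you have pulled locally
```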
Try it in Colab:
```bash
pip install datafast
```
Make sure you have created a `secrets.env` file with your API keys. The HF token is only needed if you want to push the dataset to your HF Hub; the other keys depend on which LLM providers you use.
```
GEMINI_API_KEY=XXXX
OPENAI_API_KEY=sk-XXXX
ANTHROPIC_API_KEY=sk-ant-XXXXX
HF_TOKEN=hf_XXXXX
```
```python
from datafast.datasets import ClassificationDataset
from datafast.schema.config import ClassificationDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider
from dotenv import load_dotenv

# Load environment variables
load_dotenv("secrets.env")  # <--- your API keys

# Configure the dataset for text classification
config = ClassificationDatasetConfig(
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions or approval"},
        {"name": "negative", "description": "Text expressing negative emotions or criticism"}
    ],
    num_samples_per_prompt=5,
    output_file="outdoor_activities_sentiments.jsonl",
    languages={
        "en": "English",
        "fr": "French"
    },
    prompts=[
        (
            "Generate {num_samples} reviews in {language_name} which are diverse "
            "and representative of a '{label_name}' sentiment class. "
            "{label_description}. The reviews should be {{style}} and in the "
            "context of {{context}}."
        )
    ],
    expansion=PromptExpansionConfig(
        placeholders={
            "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
            "style": ["brief", "detailed"]
        },
        combinatorial=True
    )
)

# Create LLM providers
providers = [
    OpenAIProvider(model_id="gpt-4.1-mini-2025-04-14"),
    AnthropicProvider(model_id="claude-3-5-haiku-latest"),
    GeminiProvider(model_id="gemini-2.0-flash")
]

# Generate the dataset and save it locally
dataset = ClassificationDataset(config)
dataset.generate(providers)

# Optional: push to the Hugging Face Hub
dataset.push_to_hub(
    repo_id="YOUR_USERNAME/YOUR_DATASET_NAME",
    train_size=0.6
)
```
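Once generation completes, you can sanity-check the local output before pushing it anywhere. A minimal sketch using the Hugging Face `datasets` library (the filename matches `output_file` above; the column names are whatever datafast writes, which this README doesn't specify):

```python
from datasets import load_dataset

# Load the locally generated JSONL file from the config above
ds = load_dataset("json", data_files="outdoor_activities_sentiments.jsonl", split="train")
print(ds)     # row count and inferred columns
print(ds[0])  # first generated example
```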
Check out our guides for different dataset types:
- How to Generate a Text Classification Dataset
- How to Create a Raw Text Dataset
- How to Create a Preference Dataset
- How to Create a Multiple Choice Question (MCQ) Dataset
- How to Create an Instruction (Ultrachat) Dataset
- Star and watch this github repo to get updates 🌟
- Simple, easy-to-use interface 🚀
- Multilingual dataset generation 🌍
- Multiple LLMs used to boost dataset diversity 🤖
- Flexible prompts: use our default prompts or provide your own custom ones 📝
- Prompt expansion: Combinatorial variation of prompts to maximize diversity (see the sketch after this list) 🔄
- Hugging Face Integration: Push generated datasets to the Hub 🤗
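To make the prompt expansion concrete: in the quickstart above, combinatorial expansion crosses 2 classes × 2 languages × 3 `context` values × 2 `style` values, i.e. 24 expanded prompts. A minimal counting sketch with plain `itertools` (this is not datafast's internal API, and the assumption that every provider runs every expanded prompt is mine):

```python
from itertools import product

classes = ["positive", "negative"]
languages = ["en", "fr"]
contexts = ["hike review", "speedboat tour review", "outdoor climbing experience"]
styles = ["brief", "detailed"]

# Each combination of class, language, and placeholder values is one prompt
expanded = list(product(classes, languages, contexts, styles))
print(len(expanded))          # 24 expanded prompts
print(len(expanded) * 5 * 3)  # 360 rows if 3 providers each return 5 samples per prompt
```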
Warning
This library is in its early stages of development and might change significantly.
Contributions are welcome! If you are new to the project, pick an issue labelled "good first issue".
How to proceed?
- Pick an issue
- Comment on the issue to let others know you are working on it
- Fork the repository
- Clone your fork locally
- Create a new branch with a name like `feature/my-awesome-feature`
- Make your changes
- If you feel like it, write a few tests for your changes
- To run the current tests, run `pytest` in the root directory. You can ignore `UserWarning: Pydantic serializer warnings`. Note that for the LLM tests to run successfully you'll need to have:
  - an OpenAI API key
  - an Anthropic API key
  - a Gemini API key
  - an Ollama server running (use `ollama serve` from the command line)
- Commit your changes, push to your fork, and create a pull request from your fork's branch to the datafast main branch.
- Explain your pull request clearly and concisely; I'll review it as soon as possible.
- RAG datasets
- Personas
- Seeds
- More types of instructions datasets (not just ultrachat)
- More LLM providers
- Deduplication, filtering
- Dataset cards generation
Made with ❤️ by Patrick Fleith.
This is volunteer work, star this repo to show your support! 🙏
- Status: Work in Progress (APIs may change)
- License: Apache 2.0