SDG Hub: Synthetic Data Generation Toolkit

A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.

SDG Hub is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful flows for generating data and processing tasks. Define complex workflows using nothing but YAML configuration files.


✨ Key Features

  • Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.

  • Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.

  • LLM-Agnostic: Works with any language model through configurable prompt templates and generation parameters.

  • Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.
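To make the modular block idea above concrete, here is a minimal, library-independent sketch of chaining blocks that filter rows and call a model. It does not use the sdg_hub API: `Block`, `run_blocks`, `fake_llm`, and the row fields are hypothetical names chosen for illustration, and the "LLM" is a stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class Block:
    """Hypothetical reusable step: maps a list of rows to a new list of rows."""
    name: str
    fn: Callable[[List[Row]], List[Row]]


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption: no network access needed).
    return f"Q: What is the main point of '{prompt}'?"


def drop_short_documents(rows: List[Row]) -> List[Row]:
    # Filtering block: keep only documents with at least 10 characters.
    return [r for r in rows if len(r["document"]) >= 10]


def generate_question(rows: List[Row]) -> List[Row]:
    # "LLM" block: add a question column derived from each document.
    return [{**r, "question": fake_llm(r["document"])} for r in rows]


def run_blocks(blocks: List[Block], rows: List[Row]) -> List[Row]:
    # Apply each block in order; the output of one feeds the next.
    for block in blocks:
        rows = block.fn(rows)
    return rows


flow = [
    Block("drop_short_documents", drop_short_documents),
    Block("generate_question", generate_question),
]
result = run_blocks(
    flow,
    [
        {"document": "short"},
        {"document": "Photosynthesis converts light into chemical energy."},
    ],
)
```

In SDG Hub itself the equivalent chain is declared in a YAML flow file rather than in Python; the sketch only illustrates the composition pattern.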

🚀 Installation

Stable Release (Recommended)

pip install sdg-hub

Development Version

pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

📚 Documentation

Explore the full documentation for detailed guides.

🏁 Quick Start

Generate with an Existing Flow

You can invoke any built-in flow using run_flow:

from sdg_hub.flow_runner import run_flow

run_flow(
    ds_path="path/to/dataset.json",         # input dataset
    save_path="path/to/output.json",        # where the generated data is written
    endpoint="https://api.openai.com/v1",   # LLM API endpoint
    flow_path="path/to/flow.yaml",          # YAML flow definition to run
    checkpoint_dir="path/to/checkpoints",   # directory for checkpoints
    batch_size=8,
    num_workers=32,
    save_freq=2,
)
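Before launching a long generation run, it can help to check the inputs first. The sketch below is one possible pre-flight helper, not part of sdg_hub: `build_run_flow_kwargs` is a hypothetical name, and the assumption that missing output directories should be created up front is ours. It only assembles the keyword arguments shown in the example above; the demo uses temporary placeholder files so it runs without the library installed.

```python
import tempfile
from pathlib import Path


def build_run_flow_kwargs(ds_path, save_path, flow_path, checkpoint_dir,
                          endpoint, batch_size=8, num_workers=32, save_freq=2):
    """Hypothetical helper: validate paths and return kwargs for run_flow."""
    ds_path, flow_path = Path(ds_path), Path(flow_path)
    for label, path in (("dataset", ds_path), ("flow", flow_path)):
        if not path.is_file():
            raise FileNotFoundError(f"{label} file not found: {path}")
    # Assumption: creating output directories in advance is harmless.
    Path(checkpoint_dir).mkdir(parents=True, exist_ok=True)
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    return {
        "ds_path": str(ds_path),
        "save_path": str(save_path),
        "endpoint": endpoint,
        "flow_path": str(flow_path),
        "checkpoint_dir": str(checkpoint_dir),
        "batch_size": batch_size,
        "num_workers": num_workers,
        "save_freq": save_freq,
    }


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "dataset.json").write_text("[]")
    (tmp / "flow.yaml").write_text("# placeholder flow\n")
    kwargs = build_run_flow_kwargs(
        ds_path=tmp / "dataset.json",
        save_path=tmp / "out" / "output.json",
        flow_path=tmp / "flow.yaml",
        checkpoint_dir=tmp / "checkpoints",
        endpoint="https://api.openai.com/v1",
    )
    checkpoint_dir_created = (tmp / "checkpoints").is_dir()

# With sdg_hub installed and real paths, the call would then be:
#   from sdg_hub.flow_runner import run_flow
#   run_flow(**kwargs)
```

Failing fast on a missing dataset or flow file avoids discovering the problem only after workers have started.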

📂 Available Built-in Flows

You can start with any of these YAML flows out of the box:

🔎 Knowledge Flows

| Flow Name | Description |
| --- | --- |
| flows/generation/knowledge/synth_knowledge.yaml | Produces document-grounded questions and answers for factual memorization |
| flows/generation/knowledge/synth_knowledge1.5.yaml | Improved version that builds intermediate representations for better recall |

🧠 Skills Flows

| Flow Name | Description |
| --- | --- |
| flows/generation/skills/synth_skills.yaml | Freeform skills QA generation (e.g., "Create a new GitHub issue to add type hints") |
| flows/generation/skills/synth_grounded_skills.yaml | Domain-specific skill generation (e.g., "From the given conversation create a table for feature requests") |
| flows/generation/skills/improve_responses.yaml | Uses planning and critique-based refinement to improve generated answers |

All of these flows can be found in the flows directory.
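When scripting over several flows, the paths listed above can be kept in a small lookup table. This is only a convenience sketch: `BUILT_IN_FLOWS` and `flows_for` are hypothetical names, not part of sdg_hub, and the paths are copied as written in the tables, so how they resolve inside an installed package is left as an assumption for the reader to check.

```python
# Relative flow paths exactly as listed in the tables above.
BUILT_IN_FLOWS = {
    "knowledge": [
        "flows/generation/knowledge/synth_knowledge.yaml",
        "flows/generation/knowledge/synth_knowledge1.5.yaml",
    ],
    "skills": [
        "flows/generation/skills/synth_skills.yaml",
        "flows/generation/skills/synth_grounded_skills.yaml",
        "flows/generation/skills/improve_responses.yaml",
    ],
}


def flows_for(category: str) -> list:
    """Return a copy of the flow paths for a category, or raise ValueError."""
    try:
        return list(BUILT_IN_FLOWS[category])
    except KeyError:
        known = ", ".join(sorted(BUILT_IN_FLOWS))
        raise ValueError(f"unknown category {category!r}; expected one of: {known}") from None


skills_flows = flows_for("skills")
knowledge_flows = flows_for("knowledge")
```

Each returned path can then be passed as flow_path to run_flow once it has been resolved against wherever the flows directory lives in your environment.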

📺 Video Tutorial

For a comprehensive walkthrough of sdg_hub:

SDG Hub Tutorial

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Built with ❤️ by the Red Hat AI Innovation Team
