A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.
Important Links: Documentation, Examples, Video Tutorial & GitHub.
SDG Hub is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful flows for generating data and processing tasks. Define complex workflows using nothing but YAML configuration files.
- Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.
- Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.
- LLM-Agnostic: Works with any language model through configurable prompt templates and generation parameters.
- Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.
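To make the block idea concrete, here is a minimal conceptual sketch of chaining a prompt-filling block and a filtering block over a list of rows. It is not SDG Hub's actual API: `PromptBlock`, `FilterBlock`, `run_blocks`, and `fake_llm` are hypothetical names invented for illustration, and the model call is replaced by a stand-in function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class PromptBlock:
    """Hypothetical block: fills a prompt template per row and stores the model reply."""
    template: str
    output_key: str
    llm: Callable[[str], str]

    def __call__(self, rows: List[Row]) -> List[Row]:
        # Each row's fields are substituted into the template by name.
        return [
            {**row, self.output_key: self.llm(self.template.format(**row))}
            for row in rows
        ]


@dataclass
class FilterBlock:
    """Hypothetical block: keeps rows whose value under `key` is long enough."""
    key: str
    min_length: int

    def __call__(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if len(row[self.key]) >= self.min_length]


def run_blocks(blocks: List[Callable[[List[Row]], List[Row]]], rows: List[Row]) -> List[Row]:
    """Apply each block in order, feeding its output to the next block."""
    for block in blocks:
        rows = block(rows)
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): upper-cases the prompt."""
    return prompt.upper()


blocks = [
    PromptBlock(
        template="Write a question about: {document}",
        output_key="question",
        llm=fake_llm,
    ),
    FilterBlock(key="question", min_length=30),
]
rows = [{"document": "photosynthesis"}, {"document": "x"}]
result = run_blocks(blocks, rows)
```

In SDG Hub itself, this kind of composition is declared in a YAML flow file rather than written in Python; the sketch only illustrates why self-contained blocks with a shared input and output shape are easy to reorder and reuse.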
Install from PyPI:

    pip install sdg-hub

Or install the latest version from source:

    pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

Explore the full documentation for detailed guides:
- Architecture Guide - Core concepts and design principles
- Available Blocks - Complete reference of all blocks
- Prompt Configuration - How to configure LLM prompts
You can invoke any built-in flow using run_flow:

    from sdg_hub.flow_runner import run_flow

    run_flow(
        ds_path="path/to/dataset.json",        # input dataset
        save_path="path/to/output.json",       # where the generated data is written
        endpoint="https://api.openai.com/v1",  # OpenAI-compatible API endpoint
        flow_path="path/to/flow.yaml",         # YAML flow definition to run
        checkpoint_dir="path/to/checkpoints",
        batch_size=8,
        num_workers=32,
        save_freq=2,
    )

You can start with any of these YAML flows out of the box:
| Flow Name | Description |
|---|---|
| `flows/generation/knowledge/synth_knowledge.yaml` | Produces document-grounded questions and answers for factual memorization |
| `flows/generation/knowledge/synth_knowledge1.5.yaml` | Improved version that builds intermediate representations for better recall |

| Flow Name | Description |
|---|---|
| `flows/generation/skills/synth_skills.yaml` | Freeform skills QA generation (e.g., "Create a new github issue to add type hints") |
| `flows/generation/skills/synth_grounded_skills.yaml` | Domain-specific skill generation (e.g., "From the given conversation create a table for feature requests") |
| `flows/generation/skills/improve_responses.yaml` | Uses planning and critique-based refinement to improve generated answers |
All these can be found here: flows
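If you want to run more than one of the flows listed above, a small wrapper can build the `run_flow` arguments for each flow so that outputs and checkpoints do not overwrite each other. The sketch below is only one way to do it. It assumes that `run_flow` accepts exactly the keyword arguments shown in the earlier example, that the flow paths resolve from your working directory, and that each dataset has the columns its flow expects. `build_run_flow_kwargs`, `run_all`, `JOBS`, and the dataset paths are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# (flow path, dataset path) pairs; the dataset paths are placeholders.
JOBS: List[Tuple[str, str]] = [
    ("flows/generation/knowledge/synth_knowledge.yaml", "data/knowledge_seed.json"),
    ("flows/generation/skills/synth_skills.yaml", "data/skills_seed.json"),
]


def build_run_flow_kwargs(
    flow_path: str, ds_path: str, output_dir: str, endpoint: str
) -> Dict[str, object]:
    """Build keyword arguments for run_flow with per-flow output locations."""
    flow_name = Path(flow_path).stem  # e.g. "synth_knowledge"
    out = Path(output_dir)
    return {
        "ds_path": ds_path,
        "save_path": (out / f"{flow_name}.json").as_posix(),
        "endpoint": endpoint,
        "flow_path": flow_path,
        "checkpoint_dir": (out / "checkpoints" / flow_name).as_posix(),
        # Same tuning values as the example above; adjust for your setup.
        "batch_size": 8,
        "num_workers": 32,
        "save_freq": 2,
    }


def run_all(jobs: List[Tuple[str, str]], output_dir: str, endpoint: str) -> None:
    """Run each flow in turn (requires sdg_hub and a reachable endpoint)."""
    # Imported lazily so the helper above can be used without sdg_hub installed.
    from sdg_hub.flow_runner import run_flow

    for flow_path, ds_path in jobs:
        run_flow(**build_run_flow_kwargs(flow_path, ds_path, output_dir, endpoint))


example_kwargs = build_run_flow_kwargs(
    JOBS[0][0], JOBS[0][1], "out", "https://api.openai.com/v1"
)
```

Calling `run_all(JOBS, "out", "https://api.openai.com/v1")` would then run both flows one after the other, writing each result to its own file under `out/`.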
For a comprehensive walkthrough of sdg_hub, see the video tutorial.
We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Built with ❤️ by the Red Hat AI Innovation Team
