
Commit dc523e5

Merge pull request #700 from bbrowning/point-to-sdg_hub
Point users and contributors to sdg_hub
2 parents e10cad8 + 787e00f commit dc523e5

File tree: 3 files changed (+1, -256 lines changed)


.github/workflows/docs.yml

Lines changed: 0 additions & 42 deletions
This file was deleted.

.github/workflows/spellcheck.yml

Lines changed: 0 additions & 40 deletions
This file was deleted.

README.md

Lines changed: 1 addition & 174 deletions
@@ -1,176 +1,3 @@

# Synthetic Data Generation (SDG)

![Lint](https://github.com/instructlab/sdg/actions/workflows/lint.yml/badge.svg?branch=main)
![Build](https://github.com/instructlab/sdg/actions/workflows/pypi.yaml/badge.svg?branch=main)
![Release](https://img.shields.io/github/v/release/instructlab/sdg)
![License](https://img.shields.io/github/license/instructlab/sdg)

![`e2e-nvidia-l4-x1.yaml` on `main`](https://github.com/instructlab/sdg/actions/workflows/e2e-nvidia-l4-x1.yml/badge.svg?branch=main)
![`e2e-nvidia-l40s-x4.yml` on `main`](https://github.com/instructlab/sdg/actions/workflows/e2e-nvidia-l40s-x4.yml/badge.svg?branch=main)

The SDG Framework is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a “no-code” manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units into powerful pipelines for data generation and processing tasks.

## Core Design Principles

The framework is built around the following principles:

1. **Modular Design**: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
2. **No-Code Workflow Creation**: Specify workflows using simple YAML configuration files.
3. **Scalability and Performance**: Optimized for handling large-scale workflows with millions of records.

---

## Framework Architecture

![overview](assets/imgs/overview.png)

### Blocks: The Fundamental Unit

At the heart of the framework is the **Block**. Each block is a self-contained computational unit that performs specific tasks, such as:

- Making LLM calls
- Performing data transformations
- Applying filters

Blocks are designed to be:

- **Modular**: Reusable across multiple pipelines.
- **Composable**: Easily chained together to create workflows.

These blocks are implemented in the [src/instructlab/sdg/blocks](src/instructlab/sdg/blocks) directory.
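
To make the Block abstraction concrete, here is a minimal, hypothetical sketch of what a block boils down to: a named unit that takes a Hugging Face Dataset in and returns a transformed Dataset out. The class name, constructor, and `generate` signature below are illustrative assumptions, not the library's actual base class; see [src/instructlab/sdg/blocks](src/instructlab/sdg/blocks) for the real interfaces.

```python
# Hypothetical, self-contained block sketch (not the library's Block class).
from datasets import Dataset


class UppercaseQuestionBlock:
    """Toy transformation block: upper-cases a `question` column."""

    def __init__(self, block_name: str, column: str = "question"):
        self.block_name = block_name
        self.column = column

    def generate(self, samples: Dataset) -> Dataset:
        # Dataset.map gives the columnar, parallelizable semantics that
        # transformation blocks rely on.
        return samples.map(lambda row: {self.column: row[self.column].upper()})


if __name__ == "__main__":
    ds = Dataset.from_list([{"question": "What is synthetic data generation?"}])
    block = UppercaseQuestionBlock("uppercase_questions")
    print(block.generate(ds)[0])
```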

### Pipelines: Higher-Level Abstraction

Blocks can be chained together to form a **Pipeline**. Pipelines enable:

- Linear or recursive chaining of blocks.
- Execution of complex workflows by chaining multiple pipelines together.

There are four default pipelines shipped in SDG: `simple`, `full`, `eval`, and `llama`. Each pipeline has its own hardware requirements.

#### Simple Pipeline

The [simple pipeline](src/instructlab/sdg/pipelines/simple) is designed to be used with [quantized Merlinite](https://huggingface.co/instructlab/merlinite-7b-lab-GGUF) as the teacher model. It enables basic data generation on low-end consumer-grade hardware, such as laptops and desktops with small or no discrete GPUs.

#### Full Pipeline

The [full pipeline](src/instructlab/sdg/pipelines/full) is designed to be used with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the teacher model, but it has also been successfully tested with smaller models such as [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher-end consumer-grade hardware and all enterprise hardware.

#### Eval Pipeline

The [eval pipeline](src/instructlab/sdg/pipelines/eval) is used to generate [MMLU](https://en.wikipedia.org/wiki/MMLU) benchmark data that can later be used to evaluate a trained model on your knowledge dataset. It does not generate data for use during model training.

#### Llama Pipeline

The Llama pipeline is designed for use with [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) as the teacher model. Currently, our Llama pipeline support focuses on knowledge generation, aimed at producing high-quality, context-aware educational content on higher-end consumer hardware and enterprise systems.

_Note: Support for Llama-based skills pipelines is still under development and will be rolled out in future releases._

---

### YAML-Based Workflow: The Pipeline Configuration

The Pipeline YAML configuration file is central to defining data generation workflows in the SDG Framework. This configuration file describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

Pipeline configuration must adhere to our [JSON schema](src/instructlab/sdg/pipelines/schema/v1.json) to be considered valid.

#### Key Features of Pipeline Configuration

1. **Modular Design**:
   - Pipelines are composed of blocks, which can be chained together.
   - Each block performs a specific task, such as generating, filtering, or transforming data.

2. **Reusability**:
   - Blocks and their configurations can be reused across different workflows.
   - YAML makes it easy to tweak or extend workflows without significant changes.

3. **Ease of Configuration**:
   - Users can specify block types, configurations, and data processing details in a simple and intuitive manner.

---

### Sample Pipeline Configuration

Here is an example of a Pipeline configuration:

```yaml
version: "1.0"
blocks:
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_questions.yaml
      output_cols:
        - question
      batch_kwargs:
        num_samples: 30
    drop_duplicates:
      - question
  - name: filter_questions
    type: FilterByValueBlock
    config:
      filter_column: score
      filter_value: 1.0
      operation: eq
      convert_dtype: float
    drop_columns:
      - evaluation
      - score
      - num_samples
  - name: gen_responses
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_responses.yaml
      output_cols:
        - response
```
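
For orientation, here is a hedged sketch of how a pipeline YAML like the one above might be loaded and run from Python. The `PipelineContext` arguments and the `Pipeline.from_file` / `generate` calls are written as assumptions about the library's interface, and the endpoint, model name, and file path are placeholders; check the code under `src/instructlab/sdg` for the exact signatures.

```python
# Illustrative sketch only -- argument names, endpoint, and paths are assumptions.
from datasets import Dataset
from openai import OpenAI

from instructlab.sdg.pipeline import Pipeline, PipelineContext

# OpenAI-compatible client pointed at a locally served teacher model (placeholder URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

ctx = PipelineContext(
    client=client,
    model_family="mixtral",  # assumed values for illustration
    model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# Load the pipeline definition (e.g. a file containing the sample YAML above)
# and run it over a small seed dataset.
pipeline = Pipeline.from_file(ctx, "pipeline.yaml")
seed_data = Dataset.from_list([{"task_description": "Write a short question about clouds."}])
generated = pipeline.generate(seed_data)
print(generated)
```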

### Data Flow and Storage

- **Data Representation**: Data flow between blocks and pipelines is handled using **Hugging Face Datasets**, which are based on Arrow tables. This provides:
  - Native parallelization capabilities (e.g., maps, filters).
  - Support for efficient data transformations.

- **Data Checkpoints**: Intermediate caches of generated data. Checkpoints allow users to:
  - Resume workflows from the last successful state if interrupted.
  - Improve reliability for long-running workflows.
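
Because data moves between blocks as Hugging Face Datasets, the parallel maps and filters mentioned above are the ordinary `datasets` primitives. A small standalone illustration (not SDG-specific):

```python
# Standalone illustration of the Dataset operations blocks build on:
# Arrow-backed columnar storage plus map/filter transformations.
from datasets import Dataset

ds = Dataset.from_list(
    [
        {"question": "What is SDG?", "score": 1.0},
        {"question": "low-quality sample", "score": 0.2},
    ]
)

# Map: derive a new column across the whole table.
ds = ds.map(lambda row: {"question_length": len(row["question"])})

# Filter: keep rows passing a quality gate, analogous to FilterByValueBlock above.
ds = ds.filter(lambda row: row["score"] >= 1.0)

print(ds.to_list())
```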

---

## Installing the SDG library

Clone the library and navigate to the repo:

```bash
git clone https://github.com/instructlab/sdg
cd sdg
```

Install the library:

```bash
pip install .
```

### Using the library

You can import SDG into your Python files with the following items:

```python
from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException
```
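
A minimal usage sketch built on those two imports follows. The keyword arguments passed to `generate_data` here are illustrative placeholders rather than the function's documented signature, so treat this as the shape of a call, not a copy-paste recipe; the `GenerateException` handling shows how a caller can surface generation failures.

```python
# Illustrative only: parameter names and values below are assumptions --
# see src/instructlab/sdg/generate_data.py for the actual signature.
from openai import OpenAI

from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

try:
    generate_data(
        client=client,
        taxonomy="/path/to/taxonomy",   # placeholder paths and values
        taxonomy_base="empty",
        output_dir="generated",
        pipeline="full",
    )
except GenerateException as exc:
    # Report that data generation failed and why.
    print(f"Data generation failed: {exc}")
```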

## Repository structure

```bash
|-- src/instructlab/ (1)
|-- docs/ (2)
|-- scripts/ (3)
|-- tests/ (4)
```

1. Contains the SDG code that interacts with InstructLab.
2. Contains documentation on various SDG methodologies.
3. Contains utility scripts that are not part of any supported API.
4. Contains all the tests for the SDG repository.

Future contributions to InstructLab's Synthetic Data Generation capabilities should be redirected to the [sdg_hub](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub) repository. Please see the [community announcement](https://github.com/instructlab/#community-announcement-sept-2-2025) for more details.
