
Red-Hat-AI-Innovation-Team/IL-Skills-Demo


⚠️ Deprecated Example Notice
These example files are no longer actively maintained and may be outdated.

👉 For the latest and fully supported examples, please visit the official repository:
Red Hat AI Innovation Team – SDG Hub Skills Tuning Examples

InstructLab Skills Synthetic Data Generation

InstructLab Banner

The provided notebooks demonstrate how to customize language models by generating training data for specific skills, following the methodology outlined in the LAB (Large-Scale Alignment for ChatBots) framework [paper link].

Customizing Model Behavior

The LAB framework enables us to shape how a model responds to various tasks by training it on carefully crafted examples. Want your model to write emails in your company's tone? Need it to follow specific formatting guidelines? This customization is achieved through what the paper defines as compositional skills.

Compositional skills are tasks that combine different abilities to handle complex queries. For example, if you want your model to write company emails about quarterly performance, it needs to:

  • Understand financial concepts
  • Perform basic arithmetic
  • Write in your preferred communication style
  • Follow your organization's email format

Demo Overview

The example notebooks will show you how to:

  1. Set up a teacher model for generating training data
  2. Create examples that reflect your preferred style and approach
  3. Generate synthetic data
  4. Validate that the generated data matches your requirements

The end goal is to create training data that will help align the model with your specific needs, whether that's matching your company's communication style, following particular protocols, or handling specialized tasks in your preferred way.
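Step 4, validation, can start as simple programmatic checks on the generated samples. The sketch below is illustrative only: the field names (`question`, `response`) and the rules (minimum length, a house-style sign-off) are assumptions for the quarterly-email scenario, not the notebooks' actual validation logic.

```python
# Toy validation pass over generated samples. Field names and rules are
# assumptions; adapt them to your own required format.

def passes_requirements(sample: dict) -> bool:
    # Require both fields, a minimum response length, and the sign-off
    # that the target email style demands.
    return (
        bool(sample.get("question"))
        and bool(sample.get("response"))
        and len(sample["response"].split()) >= 5
        and sample["response"].rstrip().endswith("Best regards,\nThe Finance Team")
    )

generated = [
    {"question": "Summarize Q3 revenue.",
     "response": "Q3 revenue rose 8% year over year.\n\n"
                 "Best regards,\nThe Finance Team"},
    {"question": "Summarize Q4 revenue.", "response": "Up."},
]

# Keep only samples that satisfy every requirement.
valid = [s for s in generated if passes_requirements(s)]
print(len(valid))  # only the first sample passes
```

In practice the notebooks lean on model-based evaluation (see the pipeline below), but cheap structural checks like these are a useful first filter.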

InstructLab Grounded Skills Generation Pipeline

InstructLab produces synthetic data through a multi-step process of generation, evaluation, and filtering. For grounded skills, the pipeline looks like this:

Skills Pipeline
  • Context Generation (gen_contexts)
    Generates diverse, relevant contexts for the skill
    Produces 10 unique contexts per run

  • Question Generation & Validation
    gen_grounded_questions: Creates 3 questions per context
    eval_grounded_questions: Evaluates question quality
    filter_grounded_questions: Keeps only perfect scores (1.0)

  • Response Generation & Quality Control
    gen_grounded_responses: Generates appropriate responses
    evaluate_grounded_qa_pair: Scores Q&A pair quality
    filter_grounded_qa_pair: Retains high-quality pairs (score ≥ 2.0)

  • Final Processing
    combine_question_and_context: Merges context with questions for complete examples
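The control flow above can be sketched in plain Python. The stage names mirror the pipeline blocks, but every teacher-model call is replaced by a stub so that only the generate/evaluate/filter structure is visible; this is not the SDG library's actual API.

```python
# Stubbed sketch of the grounded-skills pipeline. Each function stands in
# for a pipeline block; real implementations call the teacher model.

def gen_contexts(n=10):
    # Teacher model writes n diverse contexts; stubbed here.
    return [f"context-{i}" for i in range(n)]

def gen_grounded_questions(context, per_context=3):
    # Teacher model writes questions grounded in the context.
    return [(context, f"question-{j} about {context}") for j in range(per_context)]

def eval_grounded_questions(pairs):
    # Teacher model scores question quality; stub scores everything 1.0.
    return [(ctx, q, 1.0) for ctx, q in pairs]

def filter_grounded_questions(scored, threshold=1.0):
    # Only perfect-score questions survive.
    return [(ctx, q) for ctx, q, s in scored if s >= threshold]

def gen_grounded_responses(pairs):
    return [(ctx, q, f"response to {q}") for ctx, q in pairs]

def evaluate_grounded_qa_pair(triples):
    # Teacher model rates each Q&A pair; stubbed at 2.5.
    return [(ctx, q, a, 2.5) for ctx, q, a in triples]

def filter_grounded_qa_pair(scored, threshold=2.0):
    # Retain pairs scoring at or above the threshold.
    return [(ctx, q, a) for ctx, q, a, s in scored if s >= threshold]

def combine_question_and_context(triples):
    # Merge context and question into the final training example.
    return [{"input": f"{ctx}\n\n{q}", "output": a} for ctx, q, a in triples]

# Run the stubbed pipeline end to end.
contexts = gen_contexts()
questions = filter_grounded_questions(
    eval_grounded_questions(
        [p for c in contexts for p in gen_grounded_questions(c)]))
qa = filter_grounded_qa_pair(
    evaluate_grounded_qa_pair(gen_grounded_responses(questions)))
examples = combine_question_and_context(qa)
print(len(examples))  # 10 contexts x 3 questions = 30 candidate examples
```

With real model calls, the two filter stages would typically discard a meaningful fraction of candidates, which is the point: quantity from generation, quality from evaluation.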

Providing the Seed Data

When teaching a language model a new skill, carefully crafted seed examples are the foundation. Seed examples show the model what good behavior looks like by pairing inputs with ideal outputs, allowing the model to learn patterns, structure, reasoning, and formatting that generalize beyond the examples themselves.

A strong seed example, regardless of domain, should:

✅ Clearly define the task context and expected behavior

✅ Provide a realistic, natural input that mimics what users or systems would actually produce

✅ Include a high-quality output that fully satisfies the task requirements—accurate, complete, and formatted correctly

✅ Minimize ambiguity: avoid examples where multiple interpretations are possible without explanation

✅ Reflect diverse edge cases: cover a variety of structures, phrasings, or difficulty levels to help the model generalize
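To make the checklist concrete, here is what a grounded-skill seed example might contain, written as a Python dict. The field names are illustrative of the context/question/answer shape, not the exact schema the notebooks consume.

```python
# Hypothetical grounded-skill seed example for the quarterly-email task.
# Field names are illustrative; check the notebooks for the real schema.

seed_example = {
    "task_description": (
        "Write internal emails summarizing quarterly results "
        "in the company's house style."
    ),
    # Realistic input the model will be grounded in.
    "context": "Q2 revenue: $4.2M (up 12% QoQ); operating costs flat.",
    "question": "Draft a short email to staff summarizing Q2 performance.",
    # High-quality output: accurate, complete, correctly formatted.
    "answer": (
        "Subject: Q2 Results\n\n"
        "Team,\n\n"
        "Revenue reached $4.2M in Q2, up 12% quarter over quarter, "
        "while operating costs held flat.\n\n"
        "Best regards,\nThe Finance Team"
    ),
}

# Sanity check: the example carries every part the checklist asks for.
required = {"task_description", "context", "question", "answer"}
print(required.issubset(seed_example))  # True
```

Note how the answer restates only figures present in the context; a seed example that invents numbers would teach the model exactly the wrong habit.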


About

Demo notebooks using InstructLab SDG
