Transform raw data into structured training datasets using LLM APIs. Spec-driven, DataFrame-centric, no hallucinations.
A pipeline that takes raw data in most common formats (text, JSON, Parquet) and transforms it into training datasets through declarative YAML configs. Instead of writing custom code for each transformation, you write a config file (or just tell Claude Code what you want!) that specifies what to extract and how to transform it.
Key idea: You provide raw data + extraction specs → LLM extracts faithfully → structured training data.
- Python 3.8+
- Vertex AI API access (currently required, but you can modify `io/upload_api_batch.py` to support OpenAI, Anthropic, or other APIs)
```bash
# Install dependencies
pip install -r requirements.txt

# Copy example config
cp example_qa_extraction.yaml my_config.yaml

# Edit my_config.yaml to point to your data and Vertex project
# Then run:
python workflows/simple_unified_processor.py --config my_config.yaml
```

Let's walk through `example_qa_extraction.yaml` to understand the system:
```yaml
# example_qa_extraction.yaml
goal: Extract Q&A pairs from text documents and augment with reasoning traces

teacher_model: gemini-2.5-flash
api_type: vertex
vertex_config:
  project_id: your-project-id   # Your GCP project
  location: us-central1
  bucket_name: your-bucket-name

input_data_metadata:
  path: data/your_data_directory/   # Point to your text files
  chunk_size: 100000                # Chunk large files
  tokens_per_part: 100000000        # API batch size

data_gen_procedures:
  round_1:
    prompt_name: extract_qa_pairs
    extraction_config: extraction_configs/list_extraction_example.yaml
    has_branching: true    # 1:N transformation (one chunk → many Q&A pairs)
  round_2:
    prompt_name: add_thinking_traces
    extraction_config: extraction_configs/add_thinking_traces.yaml
    has_branching: false   # 1:1 transformation (one Q&A → one Q&A+thinking)

validation:
  enabled: true
  filter_1: is_valid == True only if the value of column thinking is not null

output:
  output_format: format_templates/chat_format_qa_with_thinking.py
```

**Round 1: Extract Q&A pairs**
- The system loads your text files and chunks them
- Applies the `prompts/extract_qa_pairs.py` template to each chunk
- Sends to LLM API: "Extract Q&A pairs from this text"
- Uses `extraction_configs/list_extraction_example.yaml` to parse responses
- Output: multiple Q&A pairs per chunk (that's why `has_branching: true`)
**Round 2: Add thinking traces**

- Takes Round 1 output (Q&A pairs)
- Applies the `prompts/add_thinking_traces.py` template
- Sends to LLM API: "Given this Q&A, generate reasoning that leads to the answer"
- Uses `extraction_configs/add_thinking_traces.yaml` to parse responses
- Filters out items where `thinking` is null
- Output: Q&A pairs with thinking traces
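To make the round mechanics concrete, here is a minimal sketch of what a prompt template like `prompts/extract_qa_pairs.py` might look like. The `build_prompt` function name and the exact wording are illustrative assumptions, not the repo's actual code; the real templates ship in `prompts/`.

```python
# Hypothetical sketch of a prompt template in the style of prompts/extract_qa_pairs.py.
# Assumption: a template receives one DataFrame row and formats columns via f-strings.

def build_prompt(row: dict) -> str:
    """Build an extraction prompt from one DataFrame row (dict of columns)."""
    return f"""Extract question-answer pairs from the text below.
Return each pair wrapped in XML tags, one block per pair:

<ITEM>
<QUESTION>...</QUESTION>
<ANSWER>...</ANSWER>
</ITEM>

Only use facts stated in the text. Do not invent information.

Text:
{row['text']}"""

prompt = build_prompt({"text": "Photosynthesis converts light into chemical energy."})
print(prompt)
```

Note how the template both tells the LLM what to extract and pins down the XML output format that the extraction config will later parse.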
This is what you'll customize most often. Extraction configs define how to parse LLM responses using regex.
**Example 1: List extraction** (`extraction_configs/list_extraction_example.yaml`)

```yaml
extraction_mode: list                      # Extract multiple items from one response
item_pattern: '<ITEM>(.*?)</ITEM>'         # Find each item block
patterns:
  question: '<QUESTION>(.*?)</QUESTION>'   # Extract question
  answer: '<ANSWER>(.*?)</ANSWER>'         # Extract answer
defaults:
  question: null
  answer: null
```

This tells the system: "The LLM will return multiple `<ITEM>` blocks; within each block, extract the question and answer."
**Example 2: Single extraction** (`extraction_configs/add_thinking_traces.yaml`)

```yaml
patterns:
  thinking: '<THINKING>(.*?)</THINKING>'   # Extract just the thinking
defaults:
  thinking: null
```

This tells the system: "Extract one thinking trace per response."
The LLM response (controlled by your prompt) must match these patterns:

```xml
<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>
```
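Under the hood, applying a list-extraction config is plain regex work. A minimal sketch of how such a config could be applied to the sample response above (the `parse_list_response` helper is an illustrative name, not the repo's actual API):

```python
import re

response = """<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>"""

config = {
    "item_pattern": r"<ITEM>(.*?)</ITEM>",
    "patterns": {
        "question": r"<QUESTION>(.*?)</QUESTION>",
        "answer": r"<ANSWER>(.*?)</ANSWER>",
    },
    "defaults": {"question": None, "answer": None},
}

def parse_list_response(text, cfg):
    rows = []
    # re.DOTALL lets '.' match newlines inside each block
    for block in re.findall(cfg["item_pattern"], text, re.DOTALL):
        row = {}
        for field, pattern in cfg["patterns"].items():
            m = re.search(pattern, block, re.DOTALL)
            # Fall back to the configured default when a field is missing
            row[field] = m.group(1).strip() if m else cfg["defaults"][field]
        rows.append(row)
    return rows

pairs = parse_list_response(response, config)
print(pairs)  # two dicts, each with 'question' and 'answer' keys
```

The `defaults` section is what keeps a partially malformed response from crashing the pipeline: missing fields become `null` and can be filtered out later by validation rules.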
```bash
# Generate API requests
python workflows/simple_unified_processor.py --config example_qa_extraction.yaml
# This creates: outputs/round_1_requests.jsonl

# Upload to Vertex AI
python io/upload_api_batch.py outputs/round_1_requests.jsonl --config example_qa_extraction.yaml

# Wait for batch to complete, then download
python io/download_extract_results.py BATCH_JOB_NAME \
  --config example_qa_extraction.yaml \
  --extraction-config extraction_configs/list_extraction_example.yaml

# Repeat for round 2...
```

Final output in chat format, ready for training:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful expert assistant.",
      "thinking": null
    },
    {
      "role": "user",
      "content": "What is X?",
      "thinking": null
    },
    {
      "role": "assistant",
      "content": "X is...",
      "thinking": "To answer this, I need to consider... Therefore X is..."
    }
  ]
}
```

You declare what you want in YAML; the system handles the DataFrame operations, API batching, and result merging automatically.
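A format template is essentially a function mapping one DataFrame row to the chat structure above. The sketch below shows the general shape such a template might take; the `to_chat` name and the system message text are assumptions about `format_templates/chat_format_qa_with_thinking.py`, not its actual code.

```python
import json

def to_chat(row: dict) -> dict:
    """Convert one Q&A-with-thinking row into the chat training format.
    Hypothetical sketch; the repo's real template lives in format_templates/."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful expert assistant.", "thinking": None},
            {"role": "user", "content": row["question"], "thinking": None},
            {"role": "assistant", "content": row["answer"], "thinking": row["thinking"]},
        ]
    }

record = to_chat({
    "question": "What is X?",
    "answer": "X is...",
    "thinking": "To answer this, I need to consider... Therefore X is...",
})
print(json.dumps(record, indent=2))
```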
This is where you'll spend most time. Your extraction config must match what the LLM outputs (which is controlled by your prompt template). If the LLM returns `<THINKING>` tags, your extraction config needs a `thinking` pattern that matches `<THINKING>(.*?)</THINKING>`.
- `has_branching: true` → 1:N transformation (one input creates multiple outputs)
  - Example: one text chunk → 10 Q&A pairs
  - IDs change: `chunk_001` → `chunk_001_000`, `chunk_001_001`, etc.
- `has_branching: false` → 1:1 transformation (one input creates one output)
  - Example: one Q&A pair → the same Q&A pair + thinking
  - IDs stay the same: `chunk_001_000` → `chunk_001_000`
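The branching ID scheme can be sketched in a few lines of pandas: a branching round explodes each input row into N output rows and suffixes the parent ID with a zero-padded counter. This is illustrative code under that assumption, not the repo's implementation.

```python
import pandas as pd

def branch_ids(df: pd.DataFrame, items_col: str) -> pd.DataFrame:
    """Explode a column of lists into one row per item, suffixing the parent ID.
    Sketch of the 1:N ID convention (chunk_001 -> chunk_001_000, _001, ...)."""
    out = df.explode(items_col, ignore_index=True)
    # Per-parent counter, computed before the IDs are rewritten
    counter = out.groupby("id").cumcount()
    out["id"] = out["id"] + "_" + counter.astype(str).str.zfill(3)
    return out

df = pd.DataFrame({
    "id": ["chunk_001"],
    "qa_pairs": [[{"q": "What is X?"}, {"q": "How does Y work?"}]],
})
branched = branch_ids(df, "qa_pairs")
print(branched["id"].tolist())  # ['chunk_001_000', 'chunk_001_001']
```

Because the suffixes are deterministic, a later 1:1 round (like adding thinking traces) can merge its results back onto the branched rows by ID.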
Every operation uses pandas DataFrames. At any point you can inspect:

```python
df = load_source_data_as_df('data/')
print(df.head())           # See your data
print(df['text'].iloc[0])  # Inspect first chunk
```

To create your own pipeline:
1. Write a prompt template (Python file in `prompts/`)
   - Uses f-strings to reference DataFrame columns
   - Tells the LLM what to extract/generate
   - Defines the output format (XML tags)
2. Write an extraction config (YAML file in `extraction_configs/`)
   - Regex patterns matching your prompt's output format
   - Defines what fields to extract
   - Single vs. list extraction mode
3. Create a config file (YAML)
   - Specify rounds, prompts, extraction configs
   - Point to your data
   - Set validation rules
4. Run the pipeline
Currently uses Vertex AI. To use other APIs:

- Modify `io/upload_api_batch.py` to call your API
- Modify `io/download_extract_results.py` to download from your API
- Update `api_type` in your config

The rest of the system (DataFrame operations, extraction, formatting) works with any API.
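Because requests and responses flow through plain JSONL files, swapping the API mostly comes down to one send/receive loop. A hedged sketch of such an adapter: the `call_model` callable stands in for whatever client you use, and the `id`/`prompt`/`response` field names are assumptions for illustration, not the repo's actual request schema.

```python
import json

def run_batch(requests_path: str, responses_path: str, call_model) -> int:
    """Send each prompt in a requests JSONL file through any chat-API callable
    and write one response record per line. Returns the record count.
    Field names ('id', 'prompt', 'response') are illustrative assumptions."""
    n = 0
    with open(requests_path) as fin, open(responses_path, "w") as fout:
        for line in fin:
            req = json.loads(line)
            reply = call_model(req["prompt"])
            fout.write(json.dumps({"id": req["id"], "response": reply}) + "\n")
            n += 1
    return n

# Usage with a stub model (replace the lambda with a real API client):
with open("requests.jsonl", "w") as f:
    f.write(json.dumps({"id": "chunk_001", "prompt": "Extract Q&A pairs..."}) + "\n")
count = run_batch("requests.jsonl", "responses.jsonl", lambda p: "<ITEM>...</ITEM>")
print(count)  # 1
```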
- `example_qa_extraction.yaml`: Two-round pipeline (extract Q&A → add thinking)
- `example_add_thinking.yaml`: Single-round pipeline (augment existing Q&A data)
- `prompts/`: Example prompt templates
- `extraction_configs/`: Example extraction patterns
- `format_templates/`: Example output formats
- No hallucinations: Extract from existing data only
- Reproducible: Same config = same output
- Flexible: Works with any data format or domain
- Debuggable: Inspect DataFrames at every step
- Scalable: Automatic API batching for large datasets
- `CLAUDE.md`: Complete technical documentation
- Example configs demonstrate all features
- All core functions are in `processing/dataframe_core.py`
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.