Transform raw data into structured training datasets using LLM APIs. Spec-driven, DataFrame-centric, no hallucinations.
A pipeline that takes raw data in most common formats (text, JSON, Parquet) and transforms it into training datasets through declarative YAML configs. Instead of writing custom code for each transformation, you write a config file (or just tell Claude Code what you want!) that specifies what to extract and how to transform it.
Key idea: You provide raw data + extraction specs → LLM extracts faithfully → structured training data.
- Python 3.8+
- Vertex AI API access (currently required, but you can modify `io/upload_api_batch.py` to support OpenAI, Anthropic, or other APIs)
```bash
# Install dependencies
pip install -r requirements.txt

# Copy example config
cp example_qa_extraction.yaml my_config.yaml

# Edit my_config.yaml to point to your data and Vertex project
# Then run:
python workflows/simple_unified_processor.py --config my_config.yaml
```

Let's walk through `example_qa_extraction.yaml` to understand the system:
```yaml
# example_qa_extraction.yaml
goal: Extract Q&A pairs from text documents and augment with reasoning traces

teacher_model: gemini-2.5-flash
api_type: vertex
vertex_config:
  project_id: your-project-id   # Your GCP project
  location: us-central1
  bucket_name: your-bucket-name

input_data_metadata:
  path: data/your_data_directory/   # Point to your text files
  chunk_size: 100000                # Chunk large files
  tokens_per_part: 100000000        # API batch size

data_gen_procedures:
  round_1:
    prompt_name: extract_qa_pairs
    extraction_config: extraction_configs/list_extraction_example.yaml
    has_branching: true    # 1:N transformation (one chunk → many Q&A pairs)
  round_2:
    prompt_name: add_thinking_traces
    extraction_config: extraction_configs/add_thinking_traces.yaml
    has_branching: false   # 1:1 transformation (one Q&A → one Q&A+thinking)

validation:
  enabled: true
  filter_1: is_valid == True only if the value of column thinking is not null

output:
  output_format: format_templates/chat_format_qa_with_thinking.py
```

**Round 1: Extract Q&A pairs**
- The system loads your text files and chunks them
- Applies the `prompts/extract_qa_pairs.py` template to each chunk
- Sends to LLM API: "Extract Q&A pairs from this text"
- Uses `extraction_configs/list_extraction_example.yaml` to parse responses
- Output: multiple Q&A pairs per chunk (that's why `has_branching: true`)
**Round 2: Add thinking traces**

- Takes Round 1 output (Q&A pairs)
- Applies the `prompts/add_thinking_traces.py` template
- Sends to LLM API: "Given this Q&A, generate reasoning that leads to the answer"
- Uses `extraction_configs/add_thinking_traces.yaml` to parse responses
- Filters out items where `thinking` is null
- Output: Q&A pairs with thinking traces
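To make the round mechanics concrete, here is a minimal sketch of what a prompt template like `prompts/extract_qa_pairs.py` might look like. The `build_prompt` function name and the exact wording are illustrative assumptions, not the repo's actual code; the real templates ship in `prompts/`.

```python
# Hypothetical sketch of a prompt template in the style of prompts/extract_qa_pairs.py.
# Assumption: a template receives one DataFrame row and formats columns via f-strings.

def build_prompt(row: dict) -> str:
    """Build an extraction prompt from one DataFrame row (dict of columns)."""
    return f"""Extract question-answer pairs from the text below.
Return each pair wrapped in XML tags, one block per pair:

<ITEM>
<QUESTION>...</QUESTION>
<ANSWER>...</ANSWER>
</ITEM>

Only use facts stated in the text. Do not invent information.

Text:
{row['text']}"""

prompt = build_prompt({"text": "Photosynthesis converts light into chemical energy."})
print(prompt)
```

Note how the template both tells the LLM what to extract and pins down the XML output format that the extraction config will later parse.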
This is what you'll customize most often. Extraction configs define how to parse LLM responses using regex.
**Example 1: List extraction** (`extraction_configs/list_extraction_example.yaml`)

```yaml
extraction_mode: list                      # Extract multiple items from one response
item_pattern: '<ITEM>(.*?)</ITEM>'         # Find each item block
patterns:
  question: '<QUESTION>(.*?)</QUESTION>'   # Extract question
  answer: '<ANSWER>(.*?)</ANSWER>'         # Extract answer
defaults:
  question: null
  answer: null
```

This tells the system: "The LLM will return multiple `<ITEM>` blocks; within each block, extract the question and answer."
**Example 2: Single extraction** (`extraction_configs/add_thinking_traces.yaml`)

```yaml
patterns:
  thinking: '<THINKING>(.*?)</THINKING>'   # Extract just the thinking
defaults:
  thinking: null
```

This tells the system: "Extract one thinking trace per response."
The LLM response (controlled by your prompt) must match these patterns:

```xml
<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>
```
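Under the hood, applying a list-extraction config is plain regex work. A minimal sketch of how such a config could be applied to the sample response above (the `parse_list_response` helper is an illustrative name, not the repo's actual API):

```python
import re

response = """<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>"""

config = {
    "item_pattern": r"<ITEM>(.*?)</ITEM>",
    "patterns": {
        "question": r"<QUESTION>(.*?)</QUESTION>",
        "answer": r"<ANSWER>(.*?)</ANSWER>",
    },
    "defaults": {"question": None, "answer": None},
}

def parse_list_response(text, cfg):
    rows = []
    # re.DOTALL lets '.' match newlines inside each block
    for block in re.findall(cfg["item_pattern"], text, re.DOTALL):
        row = {}
        for field, pattern in cfg["patterns"].items():
            m = re.search(pattern, block, re.DOTALL)
            # Fall back to the configured default when a field is missing
            row[field] = m.group(1).strip() if m else cfg["defaults"][field]
        rows.append(row)
    return rows

pairs = parse_list_response(response, config)
print(pairs)  # two dicts, each with 'question' and 'answer' keys
```

The `defaults` section is what keeps a partially malformed response from crashing the pipeline: missing fields become `null` and can be filtered out later by validation rules.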
```bash
# Generate API requests
python workflows/simple_unified_processor.py --config example_qa_extraction.yaml
# This creates: outputs/round_1_requests.jsonl

# Upload to Vertex AI
python io/upload_api_batch.py outputs/round_1_requests.jsonl --config example_qa_extraction.yaml

# Wait for batch to complete, then download
python io/download_extract_results.py BATCH_JOB_NAME \
  --config example_qa_extraction.yaml \
  --extraction-config extraction_configs/list_extraction_example.yaml

# Repeat for round 2...
```

Final output in chat format, ready for training:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful expert assistant.",
      "thinking": null
    },
    {
      "role": "user",
      "content": "What is X?",
      "thinking": null
    },
    {
      "role": "assistant",
      "content": "X is...",
      "thinking": "To answer this, I need to consider... Therefore X is..."
    }
  ]
}
```

You declare what you want in YAML; the system handles the DataFrame operations, API batching, and result merging automatically.
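A format template is essentially a function mapping one DataFrame row to the chat structure above. The sketch below shows the general shape such a template might take; the `to_chat` name and the system message text are assumptions about `format_templates/chat_format_qa_with_thinking.py`, not its actual code.

```python
import json

def to_chat(row: dict) -> dict:
    """Convert one Q&A-with-thinking row into the chat training format.
    Hypothetical sketch; the repo's real template lives in format_templates/."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful expert assistant.", "thinking": None},
            {"role": "user", "content": row["question"], "thinking": None},
            {"role": "assistant", "content": row["answer"], "thinking": row["thinking"]},
        ]
    }

record = to_chat({
    "question": "What is X?",
    "answer": "X is...",
    "thinking": "To answer this, I need to consider... Therefore X is...",
})
print(json.dumps(record, indent=2))
```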
This is where you'll spend most time. Your extraction config must match what the LLM outputs (which is controlled by your prompt template). If the LLM returns `<THINKING>` tags, your extraction config needs a `thinking` pattern that matches `<THINKING>(.*?)</THINKING>`.
- `has_branching: true` → 1:N transformation (one input creates multiple outputs)
  - Example: one text chunk → 10 Q&A pairs
  - IDs change: `chunk_001` → `chunk_001_000`, `chunk_001_001`, etc.
- `has_branching: false` → 1:1 transformation (one input creates one output)
  - Example: one Q&A pair → the same Q&A pair + thinking
  - IDs stay the same: `chunk_001_000` → `chunk_001_000`
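The branching ID scheme can be sketched in a few lines of pandas: a branching round explodes each input row into N output rows and suffixes the parent ID with a zero-padded counter. This is illustrative code under that assumption, not the repo's implementation.

```python
import pandas as pd

def branch_ids(df: pd.DataFrame, items_col: str) -> pd.DataFrame:
    """Explode a column of lists into one row per item, suffixing the parent ID.
    Sketch of the 1:N ID convention (chunk_001 -> chunk_001_000, _001, ...)."""
    out = df.explode(items_col, ignore_index=True)
    # Per-parent counter, computed before the IDs are rewritten
    counter = out.groupby("id").cumcount()
    out["id"] = out["id"] + "_" + counter.astype(str).str.zfill(3)
    return out

df = pd.DataFrame({
    "id": ["chunk_001"],
    "qa_pairs": [[{"q": "What is X?"}, {"q": "How does Y work?"}]],
})
branched = branch_ids(df, "qa_pairs")
print(branched["id"].tolist())  # ['chunk_001_000', 'chunk_001_001']
```

Because the suffixes are deterministic, a later 1:1 round (like adding thinking traces) can merge its results back onto the branched rows by ID.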
Every operation uses pandas DataFrames. At any point you can inspect:

```python
df = load_source_data_as_df('data/')
print(df.head())           # See your data
print(df['text'].iloc[0])  # Inspect first chunk
```

To create your own pipeline:
1. Write a prompt template (Python file in `prompts/`)
   - Uses f-strings to reference DataFrame columns
   - Tells the LLM what to extract/generate
   - Defines the output format (XML tags)
2. Write an extraction config (YAML file in `extraction_configs/`)
   - Regex patterns matching your prompt's output format
   - Defines what fields to extract
   - Single vs. list extraction mode
3. Create a config file (YAML)
   - Specify rounds, prompts, extraction configs
   - Point to your data
   - Set validation rules
4. Run the pipeline
Currently uses Vertex AI. To use other APIs:

- Modify `io/upload_api_batch.py` to call your API
- Modify `io/download_extract_results.py` to download from your API
- Update `api_type` in your config

The rest of the system (DataFrame operations, extraction, formatting) works with any API.
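Because requests and responses flow through plain JSONL files, swapping the API mostly comes down to one send/receive loop. A hedged sketch of such an adapter: the `call_model` callable stands in for whatever client you use, and the `id`/`prompt`/`response` field names are assumptions for illustration, not the repo's actual request schema.

```python
import json

def run_batch(requests_path: str, responses_path: str, call_model) -> int:
    """Send each prompt in a requests JSONL file through any chat-API callable
    and write one response record per line. Returns the record count.
    Field names ('id', 'prompt', 'response') are illustrative assumptions."""
    n = 0
    with open(requests_path) as fin, open(responses_path, "w") as fout:
        for line in fin:
            req = json.loads(line)
            reply = call_model(req["prompt"])
            fout.write(json.dumps({"id": req["id"], "response": reply}) + "\n")
            n += 1
    return n

# Usage with a stub model (replace the lambda with a real API client):
with open("requests.jsonl", "w") as f:
    f.write(json.dumps({"id": "chunk_001", "prompt": "Extract Q&A pairs..."}) + "\n")
count = run_batch("requests.jsonl", "responses.jsonl", lambda p: "<ITEM>...</ITEM>")
print(count)  # 1
```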
- `example_qa_extraction.yaml`: Two-round pipeline (extract Q&A → add thinking)
- `example_add_thinking.yaml`: Single-round pipeline (augment existing Q&A data)
- `prompts/`: Example prompt templates
- `extraction_configs/`: Example extraction patterns
- `format_templates/`: Example output formats
- No hallucinations: Extract from existing data only
- Reproducible: Same config = same output
- Flexible: Works with any data format or domain
- Debuggable: Inspect DataFrames at every step
- Scalable: Automatic API batching for large datasets
- `CLAUDE.md`: Complete technical documentation
- Example configs demonstrate all features
- All core functions are in `processing/dataframe_core.py`
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details.