
curator

Transform raw data into structured training datasets using LLM APIs. Spec-driven, DataFrame-centric, no hallucinations.

What is this?

A pipeline that takes common data formats (text, JSON, Parquet) and transforms them into training datasets through declarative YAML configs. Instead of writing custom code for each transformation, you write a config file (or just tell Claude Code what you want!) that specifies what to extract and how to transform it.

Key idea: You provide raw data + extraction specs → LLM extracts faithfully → structured training data.

Requirements

  • Python 3.8+
  • Vertex AI API access (currently required, but you can modify io/upload_api_batch.py to support OpenAI, Anthropic, or other APIs)

Quick Start

# Install dependencies
pip install -r requirements.txt

# Copy example config
cp example_qa_extraction.yaml my_config.yaml

# Edit my_config.yaml to point to your data and Vertex project
# Then run:
python workflows/simple_unified_processor.py --config my_config.yaml

How It Works: A Complete Example

Let's walk through example_qa_extraction.yaml to understand the system:

1. The Config File

# example_qa_extraction.yaml
goal: Extract Q&A pairs from text documents and augment with reasoning traces
teacher_model: gemini-2.5-flash
api_type: vertex

vertex_config:
  project_id: your-project-id      # Your GCP project
  location: us-central1
  bucket_name: your-bucket-name

input_data_metadata:
  path: data/your_data_directory/  # Point to your text files
  chunk_size: 100000               # Chunk large files
  tokens_per_part: 100000000       # API batch size

data_gen_procedures:
  round_1:
    prompt_name: extract_qa_pairs
    extraction_config: extraction_configs/list_extraction_example.yaml
    has_branching: true  # 1:N transformation (one chunk → many Q&A pairs)

  round_2:
    prompt_name: add_thinking_traces
    extraction_config: extraction_configs/add_thinking_traces.yaml
    has_branching: false  # 1:1 transformation (one Q&A → one Q&A+thinking)
    validation:
      enabled: true
      filter_1: is_valid == True only if the value of column thinking is not null

output:
  output_format: format_templates/chat_format_qa_with_thinking.py

2. What Happens in Each Round

Round 1: Extract Q&A pairs

  • System loads your text files, chunks them
  • Applies prompts/extract_qa_pairs.py template to each chunk
  • Sends to LLM API: "Extract Q&A pairs from this text"
  • Uses extraction_configs/list_extraction_example.yaml to parse responses
  • Output: Multiple Q&A pairs per chunk (that's why has_branching: true)

Round 2: Add thinking traces

  • Takes Round 1 output (Q&A pairs)
  • Applies prompts/add_thinking_traces.py template
  • Sends to LLM API: "Given this Q&A, generate reasoning that leads to the answer"
  • Uses extraction_configs/add_thinking_traces.yaml to parse responses
  • Filters out items where thinking is null
  • Output: Q&A pairs with thinking traces
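The null-thinking filter in Round 2 boils down to a one-line pandas operation. A minimal sketch (the DataFrame here is illustrative, not the pipeline's actual intermediate output):

```python
import pandas as pd

# Hypothetical round-2 output: each row is one Q&A pair plus its extracted trace.
df = pd.DataFrame({
    "question": ["What is X?", "How does Y work?"],
    "answer": ["X is...", "Y works by..."],
    "thinking": ["To answer this...", None],  # second extraction failed to match
})

# The rule "is_valid only if thinking is not null" reduces to:
valid = df[df["thinking"].notna()].reset_index(drop=True)
print(len(valid))  # -> 1 row survives the filter
```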

3. The Critical Part: Extraction Configs

This is what you'll customize most often. Extraction configs define how to parse LLM responses using regex.

Example 1: List extraction (extraction_configs/list_extraction_example.yaml)

extraction_mode: list  # Extract multiple items from one response
item_pattern: '<ITEM>(.*?)</ITEM>'  # Find each item block

patterns:
  question: '<QUESTION>(.*?)</QUESTION>'  # Extract question
  answer: '<ANSWER>(.*?)</ANSWER>'        # Extract answer

defaults:
  question: null
  answer: null

This tells the system: "The LLM will return multiple <ITEM> blocks, and within each block, extract the question and answer."

Example 2: Single extraction (extraction_configs/add_thinking_traces.yaml)

patterns:
  thinking: '<THINKING>(.*?)</THINKING>'  # Extract just the thinking

defaults:
  thinking: null

This tells the system: "Extract one thinking trace per response."

The LLM response (controlled by your prompt) must match these patterns:

<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>
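To make the config semantics concrete, here is a minimal sketch of how list-mode extraction could parse that response. The patterns are copied from the example config; `extract_items` is an illustrative name, not the project's actual API, and note the `re.DOTALL` flag, which the real parser presumably applies so `(.*?)` can span newlines:

```python
import re

response = """<ITEM>
<QUESTION>What is X?</QUESTION>
<ANSWER>X is...</ANSWER>
</ITEM>
<ITEM>
<QUESTION>How does Y work?</QUESTION>
<ANSWER>Y works by...</ANSWER>
</ITEM>"""

ITEM = re.compile(r"<ITEM>(.*?)</ITEM>", re.DOTALL)
FIELDS = {
    "question": re.compile(r"<QUESTION>(.*?)</QUESTION>", re.DOTALL),
    "answer": re.compile(r"<ANSWER>(.*?)</ANSWER>", re.DOTALL),
}
DEFAULTS = {"question": None, "answer": None}

def extract_items(text):
    """Return one dict per <ITEM> block; missing fields fall back to the defaults."""
    rows = []
    for block in ITEM.findall(text):
        row = {}
        for name, pattern in FIELDS.items():
            m = pattern.search(block)
            row[name] = m.group(1).strip() if m else DEFAULTS[name]
        rows.append(row)
    return rows

print(extract_items(response))
# -> [{'question': 'What is X?', 'answer': 'X is...'},
#     {'question': 'How does Y work?', 'answer': 'Y works by...'}]
```

Single-extraction mode is the degenerate case: skip the `<ITEM>` split and run the field patterns once over the whole response.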

4. Running the Pipeline

# Generate API requests
python workflows/simple_unified_processor.py --config example_qa_extraction.yaml

# This creates: outputs/round_1_requests.jsonl

# Upload to Vertex AI
python io/upload_api_batch.py outputs/round_1_requests.jsonl --config example_qa_extraction.yaml

# Wait for batch to complete, then download
python io/download_extract_results.py BATCH_JOB_NAME \
  --config example_qa_extraction.yaml \
  --extraction-config extraction_configs/list_extraction_example.yaml

# Repeat for round 2...

5. What You Get

Final output in chat format ready for training:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful expert assistant.",
      "thinking": null
    },
    {
      "role": "user",
      "content": "What is X?",
      "thinking": null
    },
    {
      "role": "assistant",
      "content": "X is...",
      "thinking": "To answer this, I need to consider... Therefore X is..."
    }
  ]
}

Key Concepts

Spec-Driven = Declarative

You declare what you want in YAML. The system handles the DataFrame operations, API batching, and result merging automatically.

Extraction Configs = The Key

This is where you'll spend most time. Your extraction config must match what the LLM outputs (which is controlled by your prompt template). If the LLM returns <THINKING> tags, your extraction config needs a thinking pattern that matches <THINKING>(.*?)</THINKING>.

Branching vs Non-Branching

  • has_branching: true → 1:N transformation (one input creates multiple outputs)

    • Example: One text chunk → 10 Q&A pairs
    • IDs change: chunk_001 → chunk_001_000, chunk_001_001, etc.
  • has_branching: false → 1:1 transformation (one input creates one output)

    • Example: One Q&A pair → same Q&A pair + thinking
    • IDs stay same: chunk_001_000 → chunk_001_000
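The 1:N ID scheme maps naturally onto pandas. A sketch, assuming each row's extracted items land in a list column (column names here are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# One chunk whose round-1 extraction produced several Q&A pairs (1:N branching).
df = pd.DataFrame({
    "id": ["chunk_001"],
    "qa_pairs": [[{"q": "What is X?"}, {"q": "How does Y work?"}]],
})

# explode() turns the list column into one row per item.
exploded = df.explode("qa_pairs").reset_index(drop=True)

# Suffix each child with its position within its parent:
# chunk_001 -> chunk_001_000, chunk_001_001, ...
suffix = exploded.groupby("id").cumcount().map("{:03d}".format)
exploded["id"] = exploded["id"] + "_" + suffix
print(exploded["id"].tolist())  # -> ['chunk_001_000', 'chunk_001_001']
```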

DataFrame-Centric

Every operation uses pandas DataFrames. At any point you can inspect:

from processing.dataframe_core import load_source_data_as_df  # core functions live here

df = load_source_data_as_df('data/')
print(df.head())           # See your data
print(df['text'].iloc[0])  # Inspect first chunk

Customizing for Your Use Case

To create your own pipeline:

  1. Write a prompt template (Python file in prompts/)

    • Uses f-strings to reference DataFrame columns
    • Tells LLM what to extract/generate
    • Defines output format (XML tags)
  2. Write an extraction config (YAML file in extraction_configs/)

    • Regex patterns matching your prompt's output format
    • Defines what fields to extract
    • Single vs list extraction mode
  3. Create a config file (YAML)

    • Specify rounds, prompts, extraction configs
    • Point to your data
    • Set validation rules
  4. Run the pipeline
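Step 1 above might look like the following sketch. The real template interface in prompts/ may differ; this only illustrates the idea of a template whose placeholders are DataFrame column names:

```python
# A hypothetical prompts/extract_qa_pairs.py-style template.
PROMPT_TEMPLATE = """Extract question-answer pairs from the text below.
Wrap each pair like this:
<ITEM>
<QUESTION>...</QUESTION>
<ANSWER>...</ANSWER>
</ITEM>

Text:
{text}"""

def render(row):
    # `row` is one DataFrame row (or dict); column names become format keys.
    return PROMPT_TEMPLATE.format(**row)

prompt = render({"text": "Photosynthesis converts light into chemical energy."})
print(prompt)
```

The XML tags declared here are exactly what the matching extraction config's regexes must capture in step 2.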

Using Other APIs (OpenAI, Anthropic, etc.)

Currently uses Vertex AI. To use other APIs:

  1. Modify io/upload_api_batch.py to call your API
  2. Modify io/download_extract_results.py to download from your API
  3. Update api_type in your config

The rest of the system (DataFrame operations, extraction, formatting) works with any API.

Examples Included

  • example_qa_extraction.yaml: Two-round pipeline (extract Q&A → add thinking)
  • example_add_thinking.yaml: Single-round (augment existing Q&A data)
  • prompts/: Example prompt templates
  • extraction_configs/: Example extraction patterns
  • format_templates/: Example output formats

Why This Approach?

  • No hallucinations: Extract from existing data only
  • Reproducible: Same config = same output
  • Flexible: Works with any data format or domain
  • Debuggable: Inspect DataFrames at every step
  • Scalable: Automatic API batching for large datasets

Documentation

  • CLAUDE.md: Complete technical documentation
  • Example configs demonstrate all features
  • All core functions in processing/dataframe_core.py

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.
