
Agent1 - Research & Data Science Pipeline

A clean, modular implementation built on the Claude Agent SDK for deep research, data science workflows, and benchmark evaluations.

Features

  • Deep Research Pipeline: Multi-phase research on any topic with web search
  • Data Science Workflows: Exploratory analysis, statistical analysis, and ML modeling
  • GAIA Benchmark Evaluation: Evaluate Claude agents on the GAIA dataset
  • Hydra Configuration: Clean configuration management with YAML files
  • Rich Console Output: Beautiful progress tracking and logging
  • Async Execution: Efficient concurrent task processing

Installation

cd agent1
uv venv
uv pip install -e .
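
The agents call the Anthropic API, so an API key must be available. Assuming the standard ANTHROPIC_API_KEY environment variable read by the Claude Agent SDK:

export ANTHROPIC_API_KEY=<your-api-key>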

Quick Start

Deep Research

Research a topic:

python examples/dr.py research.topic="Impact of AI on healthcare"

Different research depths:

python examples/dr.py research.topic="Climate change" research=quick
python examples/dr.py research.topic="Quantum computing" research=exhaustive

Save output to file:

python examples/dr.py research.topic="AI Ethics" research.output_file=report.md

Data Science

Analyze data:

python examples/ds.py data_science.task="Analyze sales trends" data_science.data_path=sales.csv

Build a model:

python examples/ds.py data_science.modeling.task="Predict customer churn" data_science.data_path=customers.csv

GAIA Benchmark Evaluation

Run GAIA evaluation on validation set:

python examples/run_gaia.py gaia.split=validation gaia.max_tasks=5

Full test set evaluation:

python examples/run_gaia.py gaia.split=test

Project Structure

agent1/
├── examples/
│   ├── dr.py               # Deep research CLI
│   ├── ds.py               # Data science CLI
│   └── run_gaia.py         # GAIA evaluation script
├── src/
│   ├── configs/
│   │   ├── deep_research.yaml  # Research configuration
│   │   ├── data_scientist.yaml # Data science configuration
│   │   └── gaia.yaml           # GAIA benchmark configuration
│   ├── claude.py           # Claude agent executor
│   ├── pipelines.py        # Pipeline implementations
│   ├── logger.py           # Rich console logger
│   └── gaia_utils.py       # GAIA dataset utilities
└── data/
    └── GAIA/              # GAIA dataset (add manually)

Configuration

All configurations use Hydra and are stored in src/configs/. Key options are listed below, followed by an illustrative config sketch:

Model Configuration

  • model.name: Claude model to use (default: claude-sonnet-4-5-20250929)
  • model.temperature: Sampling temperature
  • model.max_tokens: Maximum tokens

Research Configuration

  • research.topic: Research topic (required)
  • research.depth: quick, standard, comprehensive, exhaustive
  • research.output_file: Optional output file path

GAIA Configuration

  • gaia.split: validation or test
  • gaia.max_tasks: Maximum tasks to evaluate
  • gaia.batch_size: Concurrent batch size
  • gaia.results_path: Output JSONL path
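
For illustration, a gaia.yaml combining the model and GAIA options above might look like the following. Only the keys listed above come from this README; the concrete values and the results/ path are assumptions:

model:
  name: claude-sonnet-4-5-20250929
  temperature: 0.0       # assumed value
  max_tokens: 8192       # assumed value

gaia:
  split: validation
  max_tasks: 5
  batch_size: 2
  results_path: results/gaia_validation.jsonl  # assumed path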

GAIA Benchmark

Setup

  1. Download the GAIA dataset to data/GAIA/:

    • 2023_validation.json - Validation set with ground truth
    • 2023_test.json - Test set without ground truth
  2. Run evaluation:

python examples/run_gaia.py gaia.split=validation

Features

  • Smart Resume: Automatically skips completed tasks
  • Batch Processing: Concurrent execution with configurable batch size
  • Comprehensive Metrics: Accuracy calculation and detailed reports
  • Error Recovery: Graceful error handling with detailed logging
  • Result Persistence: JSONL format with metadata and costs

Output Format

Results are saved in JSONL format, one record per line:

{
  "task_id": "test_001",
  "question": "What is 2 + 2?",
  "prediction": "4",
  "true_answer": "4",
  "tools_used": ["WebSearch"],
  "num_turns": 3,
  "cost_usd": 0.002,
  "duration_ms": 5432
}
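
Because each record is a single JSON object per line, results are easy to post-process. A minimal sketch, assuming the field names shown above and an output path matching gaia.results_path:

import json

correct, total, cost = 0, 0, 0.0
with open("results/gaia_validation.jsonl") as f:  # assumed output path
    for line in f:
        record = json.loads(line)
        total += 1
        correct += record["prediction"] == record["true_answer"]
        cost += record.get("cost_usd", 0.0)

print(f"accuracy: {correct / total:.1%} on {total} tasks, ${cost:.2f} total")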

Testing

Test GAIA setup:

python test_gaia_setup.py

API Components

AgentExecutor

Executes a single agent with the specified tools and configuration:

from src.claude import create_agent_executor

executor = create_agent_executor()

# execute_agent is a coroutine, so this call must run inside an
# async function (e.g. driven by asyncio.run).
result = await executor.execute_agent(
    prompt="Research quantum computing",
    agent_type="research",
    allowed_tools=["WebSearch", "WebFetch"]
)

PipelineExecutor

Orchestrates multi-phase pipelines:

from src.claude import create_pipeline_executor

pipeline = create_pipeline_executor()
result = await pipeline.execute_pipeline(
    phases=[...],
    initial_context="Topic: AI"
)

Research & Data Science Pipelines

High-level interfaces for specific workflows:

from src.pipelines import DeepResearchPipeline, DataSciencePipeline

# Research
research = DeepResearchPipeline()
result = await research.research("AI ethics", depth="comprehensive")

# Data science
ds = DataSciencePipeline()
result = await ds.analyze_data(data_path="data.csv", analysis_type="exploratory")

Development

Adding New Pipelines

  1. Create a configuration in src/configs/
  2. Extend BasePipeline in src/pipelines.py (a sketch follows below)
  3. Add a CLI script in examples/
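
A sketch of step 2, assuming BasePipeline wraps the pipeline executor from the API section; the class name, method, and phase schema here are illustrative, not the repository's actual API:

from src.pipelines import BasePipeline

class CodeReviewPipeline(BasePipeline):
    """Hypothetical two-phase pipeline built on BasePipeline."""

    async def review(self, repo_path: str):
        # Phase structure mirrors execute_pipeline() from the API
        # section; the exact phase schema is an assumption.
        return await self.execute_pipeline(
            phases=[
                {"name": "survey", "prompt": f"Map the structure of {repo_path}"},
                {"name": "critique", "prompt": "Flag bugs and risky patterns"},
            ],
            initial_context=f"Repository: {repo_path}",
        )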

Custom Agent Types

Modify allowed_tools in the agent configurations (an example override follows the list):

  • Research: WebSearch, WebFetch, Read, Write
  • Analysis: Read, Write, Bash, Grep, Glob
  • Coding: Read, Write, Edit, Bash
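
Because the configs are Hydra-managed, a tool list can also be overridden from the command line. Assuming a hypothetical agent.allowed_tools key, the override would look like:

python examples/dr.py research.topic="AI Ethics" 'agent.allowed_tools=[WebSearch,WebFetch,Read]'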

Troubleshooting

  • Import Errors: Ensure dependencies are installed with uv pip install -e .
  • API Errors: Check that the Claude API key (ANTHROPIC_API_KEY) is set
  • Dataset Not Found: Download the GAIA dataset to data/GAIA/
  • Out of Memory: Reduce batch_size in the configuration

License

MIT

About

A minimalist agent evaluation framework.
