This document explains the BIRD-SQL-DEV evaluation system, a pipeline separate from BIRD-CRITIC for text-to-SQL generation against SQLite databases with golden SQL comparison.
The BIRD-SQL-DEV evaluation system is designed specifically for traditional text-to-SQL evaluation tasks:
- Dataset: Local JSON file (external/BIRD-SQL-DEV/dev.json)
- Database: SQLite databases (not PostgreSQL)
- Evaluation: Both execution-based AND golden SQL comparison
- Focus: Traditional text-to-SQL generation with accuracy metrics
- Golden SQL: Available for exact match and normalized match comparison
- Execution Equivalence: Compares actual query results
| Feature | BIRD-CRITIC | BIRD-SQL-DEV |
|---|---|---|
| Purpose | SQL debugging/fixing | Text-to-SQL generation |
| Database | PostgreSQL with bird_ prefix | SQLite files |
| Golden SQL | ❌ No | ✅ Yes |
| Evaluation | Execution only | Execution + SQL comparison + Result comparison |
| Use Case | Fixing broken SQL | Generating SQL from natural language |
Place the BIRD-SQL-DEV dataset at:
external/BIRD-SQL-DEV/dev.json
Place SQLite database files in:
external/BIRD-SQL-DEV/dev_databases/
Database files should be named: {db_id}.sqlite, {db_id}.db, or {db_id}.sqlite3
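The file lookup described above can be sketched as follows (a hypothetical helper for illustration; the system's actual resolution logic lives in _get_database_path):

```python
from pathlib import Path

def find_database_file(database_dir: str, db_id: str) -> Path:
    """Resolve the SQLite file for db_id, trying each supported extension in order."""
    base = Path(database_dir)
    for ext in (".sqlite", ".db", ".sqlite3"):
        candidate = base / f"{db_id}{ext}"
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"Database file not found for db_id: {db_id}")
```

Trying extensions in a fixed order keeps the behavior deterministic when multiple files share the same db_id stem.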
Example structure:
external/BIRD-SQL-DEV/
├── dev.json
└── dev_databases/
├── california_schools.sqlite
├── college_2.sqlite
├── concert_singer.sqlite
└── ...
# Show help
poetry run sens bird-sql-dev --help
# Run evaluation on all tasks
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--output bird_sql_dev_results.json
# Run limited evaluation for testing
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--limit 10 \
--output test_results.json
# Enable debug mode
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--limit 5 \
--debug \
--output debug_results.json
# Custom database directory
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--database-dir /path/to/sqlite/files \
--limit 5

# Analyze results
poetry run sens bird-sql-dev analyze bird_sql_dev_results.json
# Check setup status
poetry run sens bird-sql-dev status
poetry run sens bird-sql-dev status --database-dir /custom/path

🐦 Starting BIRD-SQL-DEV Evaluation
========================================
📥 Loading tasks from external/BIRD-SQL-DEV/dev.json
📊 Loaded 1534 tasks, evaluating 10
⚠️ Limiting evaluation to first 10 tasks for testing
🔍 Checking database availability...
✅ Database check passed (5/5 checked)
🚀 Starting evaluation of 10 tasks...
Evaluating tasks [####################################] 100%
🎯 BIRD-SQL-DEV Evaluation Results
========================================
Total tasks: 10
Execution success rate: 70.0%
Exact match rate: 40.0%
Normalized match rate: 50.0%
Execution match rate: 60.0%
Average execution time: 234.5ms
📈 Performance by Difficulty:
Simple: 80.0% success, 60.0% exact match (5 tasks)
Moderate: 66.7% success, 33.3% exact match (3 tasks)
Challenging: 50.0% success, 0.0% exact match (2 tasks)
💾 Saving results to bird_sql_dev_results.json
✅ Results saved!
📊 Metrics saved to bird_sql_dev_results_metrics.json
🎉 BIRD-SQL-DEV evaluation completed!

POST /api/evaluation/bird-sql-dev/start
Content-Type: application/json
{
"dataset_path": "external/BIRD-SQL-DEV/dev.json",
"database_dir": "external/BIRD-SQL-DEV/dev_databases",
"limit": 10,
"debug_mode": true,
"max_iterations": 25
}

Response:
{
"evaluation_id": "abc123...",
"status": "running",
"total_tasks": 10,
"completed_tasks": 0,
"successful_tasks": 0,
"failed_tasks": 0,
"start_time": "2024-01-15T10:30:00Z"
}

GET /api/evaluation/bird-sql-dev/stream/{evaluation_id}
Accept: text/event-stream

Real-time events:
- evaluation_start: Evaluation begins
- task_start: New task starts
- thinking: Agent reasoning steps
- action: Agent actions
- observation: Action results
- sql_generated: SQL query generated
- task_complete: Task finished with results
- metrics_update: Updated metrics
- evaluation_complete: All tasks finished
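A minimal sketch of parsing this stream on the client side, assuming the standard text/event-stream wire format (event: / data: lines separated by blank lines; the event payloads being JSON is an assumption):

```python
import json

def parse_sse_events(raw: str) -> list:
    """Parse a text/event-stream payload into (event_type, data) pairs."""
    events = []
    for block in raw.strip().split("\n\n"):
        event_type, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        payload = "\n".join(data_lines)
        try:
            # Event data is assumed to be JSON; fall back to the raw string.
            payload = json.loads(payload)
        except ValueError:
            pass
        events.append((event_type, payload))
    return events
```

A real client would read the response incrementally and dispatch on the event types listed above (task_start, sql_generated, task_complete, and so on).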
GET /api/evaluation/bird-sql-dev/status/{evaluation_id}

- Execution Success Rate: Percentage of tasks where generated SQL executes without errors
- Exact Match Rate: Percentage of tasks where generated SQL exactly matches golden SQL
- Normalized Match Rate: Percentage of tasks where generated SQL matches after normalization
- Execution Match Rate: Percentage of tasks where generated and golden SQL produce identical results
- Performance by Difficulty: Success rates broken down by task difficulty
- Database Performance: Success rates by database
- Error Analysis: Breakdown of error types
- Execution Time: Average query execution time
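The three SQL-comparison checks above can be sketched roughly as follows (illustrative helpers, not the evaluator's actual _compare_sql implementation; the normalization rules shown are assumptions):

```python
import re
import sqlite3

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

def compare_sql(generated: str, golden: str, db_path: str) -> dict:
    exact = generated.strip() == golden.strip()
    normalized = normalize_sql(generated) == normalize_sql(golden)
    # Execution match: run both queries and compare result sets,
    # ignoring row order.
    with sqlite3.connect(db_path) as conn:
        gen_rows = sorted(conn.execute(generated).fetchall())
        gold_rows = sorted(conn.execute(golden).fetchall())
    return {
        "exact_match": exact,
        "normalized_match": normalized,
        "execution_match": gen_rows == gold_rows,
    }
```

Note the one-way implications: an exact match implies a normalized match, and a normalized match implies an execution match, but not the reverse, which is why the reported rates typically increase from exact to normalized to execution.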
{
"total_tasks": 100,
"execution_success_rate": 0.75,
"exact_match_rate": 0.45,
"normalized_match_rate": 0.52,
"execution_match_rate": 0.68,
"avg_execution_time_ms": 187.3,
"difficulty_breakdown": {
"simple": {
"total": 40,
"success": 32,
"exact_match": 25,
"success_rate": 0.80,
"exact_match_rate": 0.625
},
"moderate": {
"total": 35,
"success": 24,
"exact_match": 12,
"success_rate": 0.686,
"exact_match_rate": 0.343
},
"challenging": {
"total": 25,
"success": 19,
"exact_match": 8,
"success_rate": 0.76,
"exact_match_rate": 0.32
}
}
}

{
"question_id": "123",
"db_id": "california_schools",
"question": "What is the highest eligible free rate for K-12 students?",
"generated_sql": "SELECT MAX(free_rate) FROM schools WHERE type = 'K-12'",
"golden_sql": "SELECT MAX(eligible_free_rate) FROM frpm WHERE grade_level = 'K-12'",
"execution_success": true,
"execution_error": null,
"result_count": 1,
"execution_time_ms": 45.2,
"exact_match": false,
"normalized_match": false,
"execution_match": false,
"difficulty": "simple",
"evidence": "Free rate refers to percentage of students eligible for free meals"
}

- Database File Not Found

  Database file not found: /path/to/california_schools.sqlite

  Solution: Ensure SQLite files exist with the correct naming convention

- Dataset File Missing

  Dataset file not found: external/BIRD-SQL-DEV/dev.json

  Solution: Download and place the BIRD-SQL-DEV dataset correctly

- SQLite Connection Error

  no such table: frpm

  Solution: Verify the database schema matches the expected tables
- SQL Generation Failure

  SQL generation failed: Unable to map user terms to database values

  Solution: The agent could not generate valid SQL for the question; enable debug mode to inspect its reasoning
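For errors like "no such table", a quick way to verify a database's schema before running is to list its tables (a small check script; the table name frpm is taken from the example above):

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return all table names in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Comparing this list against the tables referenced by the golden SQL quickly distinguishes a wrong database file from a genuinely broken query.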
Enable debug mode for detailed logging:
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--limit 1 \
--debug

Debug output includes:
- Detailed task information
- Agent thinking steps
- SQL generation process
- Execution details
- Error stack traces
The BIRD-SQL-DEV evaluation system can be integrated with the web interface:
- Dataset Selection: Separate tab for BIRD-SQL-DEV evaluation
- Real-time Streaming: Shows agent thinking steps like BIRD-CRITIC
- SQL Comparison View: Side-by-side view of generated vs golden SQL
- Metrics Dashboard: BIRD-SQL-DEV specific metrics display
- Difficulty Analysis: Performance breakdown by difficulty level
# Run simple structure test
poetry run python test_bird_sql_dev_simple.py
# Test with actual dataset (if available)
poetry run sens bird-sql-dev run \
--dataset external/BIRD-SQL-DEV/dev.json \
--limit 1 \
--debug

- Custom SQL Comparison: Extend the _compare_sql method
- Additional Metrics: Update the calculate_metrics method
- New Database Support: Modify _get_database_path for other formats
- Enhanced Streaming: Add new event types to _transform_agent_event_for_bird_sql_dev
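As an illustration of the "Additional Metrics" extension point, a new metric can be derived from the per-task result records shown earlier (a sketch operating on plain result dicts; the metric name is hypothetical):

```python
def add_plausible_but_wrong_rate(metrics: dict, task_results: list) -> dict:
    """Among tasks whose SQL executed successfully, report the share that
    still failed to match the golden result -- 'plausible but wrong' SQL."""
    eligible = [r for r in task_results if r.get("execution_success")]
    misses = [r for r in eligible if not r.get("execution_match")]
    metrics["plausible_but_wrong_rate"] = (
        len(misses) / len(eligible) if eligible else 0.0
    )
    return metrics
```

This kind of metric separates schema-grounding failures (SQL that runs but queries the wrong tables or columns) from outright syntax or execution failures.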
- SQLite Performance: Generally faster than PostgreSQL for read-only operations
- Parallel Evaluation: Tasks are evaluated sequentially (consider parallelization for production)
- Memory Usage: Results are stored in memory during evaluation
- File I/O: SQLite files are accessed for each task (consider connection pooling)
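The connection-pooling suggestion above could be prototyped with a simple per-file connection cache (a sketch, not the system's current behavior; suitable for the read-only access pattern described):

```python
import sqlite3
from functools import lru_cache

@lru_cache(maxsize=32)
def get_connection(db_path: str) -> sqlite3.Connection:
    """Reuse one connection per database file instead of reopening
    the SQLite file for every evaluated task."""
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.row_factory = sqlite3.Row
    return conn
```

Because tasks for the same db_id cluster together in the dataset, even a small cache avoids most of the repeated file-open cost; a parallelized evaluator would additionally need per-thread connections or locking.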
- Semantic SQL Comparison: Beyond text matching using AST comparison
- Execution Plan Analysis: Compare query execution plans
- Schema-aware Evaluation: Consider database schema in evaluation
- Batch Processing: Parallel task evaluation
- Result Caching: Cache evaluation results for analysis