This module provides a pipeline for evaluating Text-to-SQL model outputs on the LLMSQL benchmark. It checks your model’s SQL predictions against the gold-standard queries and database, logs mismatches, and generates detailed evaluation reports.
You can now use it directly via the `evaluate()` function:
```bash
pip install llmsql
```

```python
from llmsql import evaluate

# Evaluate outputs from a JSONL file
report = evaluate("path_to_your_outputs.jsonl")
print(report)

# Or evaluate from a list of prediction dicts
predictions = [
    {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"},
    {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"},
]
report = evaluate(predictions)
print(report)
```

The full signature of `evaluate()`:

```python
evaluate(
    outputs,
    *,
    workdir_path: str | None = "llmsql_workdir",
    questions_path: str | None = None,
    db_path: str | None = None,
    save_report: str | None = None,
    show_mismatches: bool = True,
    max_mismatches: int = 5,
)
```

| Argument | Description |
|---|---|
| `outputs` | Required. Either a path to a JSONL file or a list of dicts with predictions. |
| `workdir_path` | Directory for automatic download of benchmark files (ignored if both `questions_path` and `db_path` are provided). Default: `"llmsql_workdir"`. |
| `questions_path` | Optional path to the benchmark questions JSONL file. |
| `db_path` | Optional path to the SQLite DB with evaluation tables. |
| `save_report` | Optional path for saving the detailed JSON report. Defaults to `evaluation_results_{uuid}.json`. |
| `show_mismatches` | Print mismatches while evaluating. Default: `True`. |
| `max_mismatches` | Maximum number of mismatches to print. Default: `5`. |
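If you already have the benchmark questions and database locally, you can pass them explicitly so nothing is downloaded into the work directory. Below is a minimal sketch; the keyword arguments are the ones documented above, but the file paths are placeholders for your own setup.

```python
from llmsql import evaluate

# Evaluate against local benchmark files instead of auto-downloading them.
# All paths below are placeholders; adjust them to your environment.
report = evaluate(
    "path_to_your_outputs.jsonl",
    questions_path="data/questions.jsonl",   # local benchmark questions
    db_path="data/benchmark.sqlite",          # local SQLite DB with evaluation tables
    save_report="reports/eval_run.json",      # where to write the JSON report
    show_mismatches=True,
    max_mismatches=10,
)
print(report)
```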
Your model predictions must be in JSONL format (one JSON object per line):

```json
{"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"}
{"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"}
{"question_id": "3", "predicted_sql": "SELECT * FROM Table WHERE active=1"}
```

- `question_id` must match the IDs in `questions.jsonl`.
- `predicted_sql` should contain your model’s SQL output (extra text is allowed; SQL is extracted automatically).
The evaluation returns a dictionary containing:
- `total` – Total queries evaluated
- `matches` – Queries where predicted SQL results match gold results
- `pred_none` – Queries where the model returned `NULL` or no result
- `gold_none` – Queries where the gold reference is `NULL` or no result
- `sql_errors` – Invalid SQL or execution errors
- `accuracy` – Overall exact match accuracy
- `mismatches` – List of mismatched queries with details
- `timestamp` – Evaluation timestamp
- `input_mode` – Whether results were provided as a JSONL path or a dict list
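Because the result is a plain dictionary, you can work with it directly. A minimal sketch, assuming the keys listed above (the exact fields inside each mismatch entry depend on the report schema):

```python
from llmsql import evaluate

report = evaluate("path_to_your_outputs.jsonl")

# Summarize the run using the documented keys.
print("Accuracy:", report["accuracy"])
print("Matches:", report["matches"], "of", report["total"])
print("SQL errors:", report["sql_errors"])

# Inspect a few mismatched queries; each entry is printed as-is,
# since its internal structure is defined by the report schema.
for mismatch in report["mismatches"][:3]:
    print(mismatch)
```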
Example console output:
```
Evaluating ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100
Total: 100 | Matches: 82 | Pred None: 5 | Gold None: 3 | SQL Errors: 2
```
Report Saving
- By default, the report is saved as `evaluation_results_{uuid}.json` in the current directory.
- The report includes the evaluation timestamp and input mode (JSONL path or dict list).
- You can override the save path via the `save_report` argument.
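Since the report is plain JSON, it can be reloaded later for further analysis. A minimal sketch, assuming a report saved under the default naming pattern (the file name below is illustrative):

```python
import json

# Load a previously saved evaluation report.
# The file name is illustrative; the default follows evaluation_results_{uuid}.json.
with open("evaluation_results_1234abcd.json", encoding="utf-8") as f:
    saved_report = json.load(f)

print(saved_report["accuracy"], saved_report["timestamp"])
```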