
Commit fb4a5f4

Merge pull request #57 from LLMSQL/56-convert-evaluation-class-to-a-function
evaluation function added
2 parents 53f00b7 + 98dc147 commit fb4a5f4

31 files changed: +1657 -525 lines

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -10,3 +10,5 @@ dist/
 
 .coverage
 llmsql_workdir
+
+evaluation_*

README.md

Lines changed: 30 additions & 17 deletions

@@ -21,7 +21,7 @@ Our datasets are available for different scenarios on our [HuggingFace page](https://huggingface.co/llmsql-bench)
 pip3 install llmsql
 ```
 
-This repository provides the **LLMSQL Benchmark** — a modernized, cleaned, and extended version of WikiSQL, designed for evaluating and fine-tuning large language models (LLMs) on **Text-to-SQL** tasks.
+This repository provides the **LLMSQL Benchmark** — a modernized, cleaned, and extended version of WikiSQL, designed for evaluating large language models (LLMs) on **Text-to-SQL** tasks.
 
 ### Note
 The package doesn't have the dataset, it is stored on our [HuggingFace page](https://huggingface.co/llmsql-bench).

@@ -79,6 +79,8 @@ pip3 install llmsql
 
 ### 1. Run Inference
 
+#### Transformers inference
+
 ```python
 from llmsql import inference_transformers
 

@@ -94,22 +96,9 @@ results = inference_transformers(
     "torch_dtype": "bfloat16",
   }
 )
-
 ```
 
-### 2. Evaluate Results
-
-```python
-from llmsql import LLMSQLEvaluator
-
-evaluator = LLMSQLEvaluator(workdir_path="llmsql_workdir")
-report = evaluator.evaluate(outputs_path="path_to_your_outputs.jsonl")
-print(report)
-```
-
-
-
-## Vllm inference (Recommended)
+#### Vllm inference (Recommended)
 
 To speed up your inference we recommend using vllm inference. You can do it with optional llmsql[vllm] dependency group
 ```bash

@@ -128,12 +117,36 @@ results = inference_vllm(
 ```
 for fast inference.
 
+### 2. Evaluate Results
+
+```python
+from llmsql import evaluate
+
+report = evaluate(outputs="path_to_your_outputs.jsonl")
+print(report)
+```
+
+Or with the results from the inference:
+
+```python
+from llmsql import evaluate
+
+# results = inference_transformers(...) or inference_vllm(...)
+
+report = evaluate(outputs=results)
+print(report)
+```
+
+
+
+
+
 
 
 ## Suggested Workflow
 
-* **Primary**: Run inference on `dataset/questions.jsonl` with vllm → Evaluate with `evaluation/`.
-* **Secondary (optional)**: Fine-tune on `train/val` → Test on `test_questions.jsonl`.
+* **Primary**: Run inference on all questions with vllm or transformers → Evaluate with `evaluate()`.
+* **Secondary (optional)**: Fine-tune on `train/val` → Test on `test_questions.jsonl`. You can find the datasets here: [HF Finetune Ready](https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready).
 
 
 ## Contributing
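
The net effect of the README change above is an API migration from the evaluator class to a module-level function. A minimal before/after sketch, based only on the removed and added README lines shown in the diff (the output path is a placeholder):

```python
# Before this commit (removed from the README): class-based evaluator
# from llmsql import LLMSQLEvaluator
# evaluator = LLMSQLEvaluator(workdir_path="llmsql_workdir")
# report = evaluator.evaluate(outputs_path="path_to_your_outputs.jsonl")

# After this commit: module-level evaluate() function
from llmsql import evaluate

report = evaluate(outputs="path_to_your_outputs.jsonl")
print(report)
```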
2 binary files changed (6.26 KB and 326 Bytes): binary files not shown.

docs/_build/html/_modules/index.html

Lines changed: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ <h3>Navigation</h3>
 <div class="body" role="main">
 
 <h1>All modules for which code is available</h1>
-<ul><li><a href="llmsql/evaluation/evaluator.html">llmsql.evaluation.evaluator</a></li>
+<ul><li><a href="llmsql/evaluation/evaluate.html">llmsql.evaluation.evaluate</a></li>
 <li><a href="llmsql/inference/inference_transformers.html">llmsql.inference.inference_transformers</a></li>
 <li><a href="llmsql/inference/inference_vllm.html">llmsql.inference.inference_vllm</a></li>
 </ul>

docs/_build/html/_modules/llmsql/evaluation/evaluate.html

Lines changed: 237 additions & 0 deletions
Large diffs are not rendered by default.

docs/_build/html/_modules/llmsql/inference/inference_transformers.html

Lines changed: 8 additions & 8 deletions

@@ -6,7 +6,7 @@
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
 <title>llmsql.inference.inference_transformers &#8212; LLMSQL 0.1.13 documentation</title>
 <link rel="stylesheet" type="text/css" href="../../../_static/pygments.css?v=5349f25f" />
-<link rel="stylesheet" type="text/css" href="../../../_static/basic.css?v=29da98fa" />
+<link rel="stylesheet" type="text/css" href="../../../_static/basic.css?v=5c69cfe2" />
 <link rel="stylesheet" type="text/css" href="../../../_static/copybutton.css?v=76b2166b" />
 <link rel="stylesheet" type="text/css" href="../../../_static/styles/front_page.css?v=9e26f69c" />
 <script src="../../../_static/documentation_options.js?v=8d02545a"></script>

@@ -15,9 +15,9 @@
 <script src="../../../_static/clipboard.min.js?v=a7894cd8"></script>
 <script src="../../../_static/copybutton.js?v=ccdb6887"></script>
 <script src="../../../_static/scripts/front_page.js?v=a59558f4"></script>
-<link rel="icon" href="../../../_static/logo.jpg"/>
+<link rel="icon" href="../../../_static/favicon.png"/>
 <link rel="index" title="Index" href="../../../genindex.html" />
-<link rel="search" title="Search" href="../../../search.html" />
+<link rel="search" title="Search" href="../../../search.html" />
 </head><body>
 <div class="related" role="navigation" aria-label="Related">
 <h3>Navigation</h3>

@@ -30,15 +30,15 @@ <h3>Navigation</h3>
 >modules</a> |</li>
 <li class="nav-item nav-item-0"><a href="../../../index.html">LLMSQL 0.1.13 documentation</a> &#187;</li>
 <li class="nav-item nav-item-1"><a href="../../index.html" accesskey="U">Module code</a> &#187;</li>
-<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
+<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
 </ul>
-</div>
+</div>
 
 <div class="document">
 <div class="documentwrapper">
 <div class="bodywrapper">
 <div class="body" role="main">
-
+
 <h1>Source code for llmsql.inference.inference_transformers</h1><div class="highlight"><pre>
 <span></span><span class="sd">&quot;&quot;&quot;</span>
 <span class="sd">LLMSQL Transformers Inference Function</span>

@@ -364,10 +364,10 @@ <h3>Navigation</h3>
 >modules</a> |</li>
 <li class="nav-item nav-item-0"><a href="../../../index.html">LLMSQL 0.1.13 documentation</a> &#187;</li>
 <li class="nav-item nav-item-1"><a href="../../index.html" >Module code</a> &#187;</li>
-<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
+<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
 </ul>
 </div>
 <div class="footer" role="contentinfo">
 </div>
 </body>
-</html>
+</html>

docs/_build/html/_modules/llmsql/inference/inference_vllm.html

Lines changed: 1 addition & 0 deletions

@@ -15,6 +15,7 @@
 <script src="../../../_static/clipboard.min.js?v=a7894cd8"></script>
 <script src="../../../_static/copybutton.js?v=ccdb6887"></script>
 <script src="../../../_static/scripts/front_page.js?v=a59558f4"></script>
+<link rel="icon" href="../../../_static/favicon.png"/>
 <link rel="index" title="Index" href="../../../genindex.html" />
 <link rel="search" title="Search" href="../../../search.html" />
 </head><body>

docs/_build/html/_sources/docs/evaluation.rst.txt

Lines changed: 105 additions & 1 deletion

@@ -1,7 +1,111 @@
 Evaluation API Reference
 ========================
 
-.. automodule:: llmsql.evaluation.evaluator
+The `evaluate()` function allows you to benchmark Text-to-SQL model outputs
+against the LLMSQL gold queries and SQLite database. It prints metrics, logs
+mismatches, and saves detailed reports automatically.
+
+Features
+--------
+- Evaluates model predictions from JSONL files or Python dicts.
+- Automatically downloads benchmark questions and the SQLite DB if missing.
+- Prints mismatch summaries and supports configurable reporting.
+- Saves a detailed JSON report with metrics, mismatches, timestamp, and input mode.
+
+Usage Examples
+--------------
+
+Evaluate from a JSONL file:
+
+.. code-block:: python
+
+   from llmsql.evaluation.evaluate import evaluate
+
+   report = evaluate("path_to_outputs.jsonl")
+   print(report)
+
+Evaluate from a list of Python dicts:
+
+.. code-block:: python
+
+   predictions = [
+       {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"},
+       {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"},
+   ]
+
+   report = evaluate(predictions)
+   print(report)
+
+Providing your own DB and questions (skip workdir):
+
+.. code-block:: python
+
+   report = evaluate(
+       "path_to_outputs.jsonl",
+       questions_path="bench/questions.jsonl",
+       db_path="bench/sqlite_tables.db",
+       workdir_path=None
+   )
+
+Function Arguments
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Argument
+     - Description
+   * - outputs
+     - Path to a JSONL file or a list of prediction dicts (required).
+   * - workdir_path
+     - Directory for automatic benchmark downloads. Ignored if both questions_path and db_path are provided. Default: "llmsql_workdir".
+   * - questions_path
+     - Optional path to the benchmark questions JSONL file.
+   * - db_path
+     - Optional path to the SQLite DB with evaluation tables.
+   * - save_report
+     - Path to save the detailed JSON report. Defaults to "evaluation_results_{uuid}.json".
+   * - show_mismatches
+     - Print mismatches while evaluating. Default: True.
+   * - max_mismatches
+     - Maximum number of mismatches to display. Default: 5.
+
+Input Format
+------------
+
+The predictions should be in JSONL format:
+
+.. code-block:: json
+
+   {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"}
+   {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"}
+   {"question_id": "3", "predicted_sql": "SELECT * FROM Table WHERE active=1"}
+
+Output Metrics
+--------------
+
+The function returns a dictionary with the following keys:
+
+- total – Total queries evaluated
+- matches – Queries where predicted SQL results match gold results
+- pred_none – Queries where the model returned NULL or no result
+- gold_none – Queries where the reference result was NULL or no result
+- sql_errors – Invalid SQL or execution errors
+- accuracy – Overall exact-match accuracy
+- mismatches – List of mismatched queries with details
+- timestamp – Evaluation timestamp
+- input_mode – How results were provided ("jsonl_path" or "dict_list")
+
+Report Saving
+-------------
+
+By default, a report is saved automatically as `evaluation_results_{uuid}.json` in the current directory.
+It contains metrics, mismatches, timestamp, and input mode. You can override this path using `save_report`.
+
+---
+
+.. automodule:: llmsql.evaluation.evaluate
    :members:
    :undoc-members:
    :show-inheritance:
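
As a rough illustration of how the returned report could be consumed, the sketch below relies only on the keys listed under Output Metrics and assumes `llmsql` is installed and an outputs file exists at the placeholder path; the per-entry structure of `mismatches` is not documented in this section, so each entry is printed as-is.

```python
from llmsql import evaluate

report = evaluate(outputs="path_to_your_outputs.jsonl")

# Keys documented under "Output Metrics"
total = report["total"]
matches = report["matches"]
print(f"Exact-match accuracy: {report['accuracy']:.3f} ({matches}/{total})")
print(f"SQL errors: {report['sql_errors']}, "
      f"empty predictions: {report['pred_none']}, "
      f"empty gold results: {report['gold_none']}")

# Inspect a few mismatches (their exact fields are not specified above)
for entry in report["mismatches"][:3]:
    print(entry)
```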
