
Commit fb4a5f4

Merge pull request #57 from LLMSQL/56-convert-evaluation-class-to-a-function
evaluation function added
2 parents 53f00b7 + 98dc147 commit fb4a5f4

31 files changed: +1657 -525 lines

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -10,3 +10,5 @@ dist/
 
 .coverage
 llmsql_workdir
+
+evaluation_*

README.md

Lines changed: 30 additions & 17 deletions

@@ -21,7 +21,7 @@ Our datasets are available for different scenarios on our [HuggingFace page](https://huggingface.co/llmsql-bench)
 pip3 install llmsql
 ```
 
-This repository provides the **LLMSQL Benchmark** — a modernized, cleaned, and extended version of WikiSQL, designed for evaluating and fine-tuning large language models (LLMs) on **Text-to-SQL** tasks.
+This repository provides the **LLMSQL Benchmark** — a modernized, cleaned, and extended version of WikiSQL, designed for evaluating large language models (LLMs) on **Text-to-SQL** tasks.
 
 ### Note
 The package doesn't have the dataset, it is stored on our [HuggingFace page](https://huggingface.co/llmsql-bench).

@@ -79,6 +79,8 @@ pip3 install llmsql
 
 ### 1. Run Inference
 
+#### Transformers inference
+
 ```python
 from llmsql import inference_transformers
 

@@ -94,22 +96,9 @@ results = inference_transformers(
     "torch_dtype": "bfloat16",
   }
 )
-
 ```
 
-### 2. Evaluate Results
-
-```python
-from llmsql import LLMSQLEvaluator
-
-evaluator = LLMSQLEvaluator(workdir_path="llmsql_workdir")
-report = evaluator.evaluate(outputs_path="path_to_your_outputs.jsonl")
-print(report)
-```
-
-
-
-## Vllm inference (Recommended)
+#### Vllm inference (Recommended)
 
 To speed up your inference we recommend using vllm inference. You can do it with optional llmsql[vllm] dependency group
 ```bash

@@ -128,12 +117,36 @@ results = inference_vllm(
 ```
 for fast inference.
 
+### 2. Evaluate Results
+
+```python
+from llmsql import evaluate
+
+report = evaluate(outputs="path_to_your_outputs.jsonl")
+print(report)
+```
+
+Or with the results from the inference:
+
+```python
+from llmsql import evaluate
+
+# results = inference_transformers(...) or inference_vllm(...)
+
+report = evaluate(outputs=results)
+print(report)
+```
+
+
+
+
+
 
 
 ## Suggested Workflow
 
-* **Primary**: Run inference on `dataset/questions.jsonl` with vllm → Evaluate with `evaluation/`.
-* **Secondary (optional)**: Fine-tune on `train/val` → Test on `test_questions.jsonl`.
+* **Primary**: Run inference on all questions with vllm or transformers → Evaluate with `evaluate()`.
+* **Secondary (optional)**: Fine-tune on `train/val` → Test on `test_questions.jsonl`. You can find the datasets here: [HF Finetune Ready](https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready).
 
 
 ## Contributing
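
The net effect of the README change above is an API migration from the evaluator class to a module-level function. A minimal before/after sketch, based only on the removed and added README lines shown in the diff (the output path is a placeholder):

```python
# Before this commit (removed from the README): class-based evaluator
# from llmsql import LLMSQLEvaluator
# evaluator = LLMSQLEvaluator(workdir_path="llmsql_workdir")
# report = evaluator.evaluate(outputs_path="path_to_your_outputs.jsonl")

# After this commit: module-level evaluate() function
from llmsql import evaluate

report = evaluate(outputs="path_to_your_outputs.jsonl")
print(report)
```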
2 binary files changed (6.26 KB and 326 Bytes): binary files not shown.

docs/_build/html/_modules/index.html

Lines changed: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ <h3>Navigation</h3>
 <div class="body" role="main">
 
 <h1>All modules for which code is available</h1>
-<ul><li><a href="llmsql/evaluation/evaluator.html">llmsql.evaluation.evaluator</a></li>
+<ul><li><a href="llmsql/evaluation/evaluate.html">llmsql.evaluation.evaluate</a></li>
 <li><a href="llmsql/inference/inference_transformers.html">llmsql.inference.inference_transformers</a></li>
 <li><a href="llmsql/inference/inference_vllm.html">llmsql.inference.inference_vllm</a></li>
 </ul>

docs/_build/html/_modules/llmsql/evaluation/evaluate.html

Lines changed: 237 additions & 0 deletions
Large diffs are not rendered by default.

docs/_build/html/_modules/llmsql/inference/inference_transformers.html

Lines changed: 8 additions & 8 deletions

@@ -6,7 +6,7 @@
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
 <title>llmsql.inference.inference_transformers &#8212; LLMSQL 0.1.13 documentation</title>
 <link rel="stylesheet" type="text/css" href="../../../_static/pygments.css?v=5349f25f" />
-<link rel="stylesheet" type="text/css" href="../../../_static/basic.css?v=29da98fa" />
+<link rel="stylesheet" type="text/css" href="../../../_static/basic.css?v=5c69cfe2" />
 <link rel="stylesheet" type="text/css" href="../../../_static/copybutton.css?v=76b2166b" />
 <link rel="stylesheet" type="text/css" href="../../../_static/styles/front_page.css?v=9e26f69c" />
 <script src="../../../_static/documentation_options.js?v=8d02545a"></script>

@@ -15,9 +15,9 @@
 <script src="../../../_static/clipboard.min.js?v=a7894cd8"></script>
 <script src="../../../_static/copybutton.js?v=ccdb6887"></script>
 <script src="../../../_static/scripts/front_page.js?v=a59558f4"></script>
-<link rel="icon" href="../../../_static/logo.jpg"/>
+<link rel="icon" href="../../../_static/favicon.png"/>
 <link rel="index" title="Index" href="../../../genindex.html" />
-<link rel="search" title="Search" href="../../../search.html" />
+<link rel="search" title="Search" href="../../../search.html" />
 </head><body>
 <div class="related" role="navigation" aria-label="Related">
 <h3>Navigation</h3>

@@ -30,15 +30,15 @@ <h3>Navigation</h3>
 >modules</a> |</li>
 <li class="nav-item nav-item-0"><a href="../../../index.html">LLMSQL 0.1.13 documentation</a> &#187;</li>
 <li class="nav-item nav-item-1"><a href="../../index.html" accesskey="U">Module code</a> &#187;</li>
-<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
+<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
 </ul>
-</div>
+</div>
 
 <div class="document">
 <div class="documentwrapper">
 <div class="bodywrapper">
 <div class="body" role="main">
-
+
 <h1>Source code for llmsql.inference.inference_transformers</h1><div class="highlight"><pre>
 <span></span><span class="sd">&quot;&quot;&quot;</span>
 <span class="sd">LLMSQL Transformers Inference Function</span>

@@ -364,10 +364,10 @@ <h3>Navigation</h3>
 >modules</a> |</li>
 <li class="nav-item nav-item-0"><a href="../../../index.html">LLMSQL 0.1.13 documentation</a> &#187;</li>
 <li class="nav-item nav-item-1"><a href="../../index.html" >Module code</a> &#187;</li>
-<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
+<li class="nav-item nav-item-this"><a href="">llmsql.inference.inference_transformers</a></li>
 </ul>
 </div>
 <div class="footer" role="contentinfo">
 </div>
 </body>
-</html>
+</html>

docs/_build/html/_modules/llmsql/inference/inference_vllm.html

Lines changed: 1 addition & 0 deletions

@@ -15,6 +15,7 @@
 <script src="../../../_static/clipboard.min.js?v=a7894cd8"></script>
 <script src="../../../_static/copybutton.js?v=ccdb6887"></script>
 <script src="../../../_static/scripts/front_page.js?v=a59558f4"></script>
+<link rel="icon" href="../../../_static/favicon.png"/>
 <link rel="index" title="Index" href="../../../genindex.html" />
 <link rel="search" title="Search" href="../../../search.html" />
 </head><body>

docs/_build/html/_sources/docs/evaluation.rst.txt

Lines changed: 105 additions & 1 deletion

@@ -1,7 +1,111 @@
 Evaluation API Reference
 ========================
 
-.. automodule:: llmsql.evaluation.evaluator
+The `evaluate()` function allows you to benchmark Text-to-SQL model outputs
+against the LLMSQL gold queries and SQLite database. It prints metrics, logs
+mismatches, and saves detailed reports automatically.
+
+Features
+--------
+- Evaluates model predictions from JSONL files or Python dicts.
+- Automatically downloads benchmark questions and the SQLite DB if missing.
+- Prints mismatch summaries and supports configurable reporting.
+- Saves a detailed JSON report with metrics, mismatches, timestamp, and input mode.
+
+Usage Examples
+--------------
+
+Evaluate from a JSONL file:
+
+.. code-block:: python
+
+   from llmsql.evaluation.evaluate import evaluate
+
+   report = evaluate("path_to_outputs.jsonl")
+   print(report)
+
+Evaluate from a list of Python dicts:
+
+.. code-block:: python
+
+   predictions = [
+       {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"},
+       {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"},
+   ]
+
+   report = evaluate(predictions)
+   print(report)
+
+Providing your own DB and questions (skip workdir):
+
+.. code-block:: python
+
+   report = evaluate(
+       "path_to_outputs.jsonl",
+       questions_path="bench/questions.jsonl",
+       db_path="bench/sqlite_tables.db",
+       workdir_path=None
+   )
+
+Function Arguments
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Argument
+     - Description
+   * - outputs
+     - Path to a JSONL file or a list of prediction dicts (required).
+   * - workdir_path
+     - Directory for automatic benchmark downloads. Ignored if both questions_path and db_path are provided. Default: "llmsql_workdir".
+   * - questions_path
+     - Optional path to the benchmark questions JSONL file.
+   * - db_path
+     - Optional path to the SQLite DB with evaluation tables.
+   * - save_report
+     - Path to save the detailed JSON report. Defaults to "evaluation_results_{uuid}.json".
+   * - show_mismatches
+     - Print mismatches while evaluating. Default: True.
+   * - max_mismatches
+     - Maximum number of mismatches to display. Default: 5.
+
+Input Format
+------------
+
+The predictions should be in JSONL format:
+
+.. code-block:: json
+
+   {"question_id": "1", "predicted_sql": "SELECT name FROM Table WHERE age > 30"}
+   {"question_id": "2", "predicted_sql": "SELECT COUNT(*) FROM Table"}
+   {"question_id": "3", "predicted_sql": "SELECT * FROM Table WHERE active=1"}
+
+Output Metrics
+--------------
+
+The function returns a dictionary with the following keys:
+
+- total – Total queries evaluated
+- matches – Queries where predicted SQL results match gold results
+- pred_none – Queries where the model returned NULL or no result
+- gold_none – Queries where the reference result was NULL or no result
+- sql_errors – Invalid SQL or execution errors
+- accuracy – Overall exact-match accuracy
+- mismatches – List of mismatched queries with details
+- timestamp – Evaluation timestamp
+- input_mode – How results were provided ("jsonl_path" or "dict_list")
+
+Report Saving
+-------------
+
+By default, a report is saved automatically as `evaluation_results_{uuid}.json` in the current directory.
+It contains metrics, mismatches, timestamp, and input mode. You can override this path using `save_report`.
+
+---
+
+.. automodule:: llmsql.evaluation.evaluate
    :members:
    :undoc-members:
    :show-inheritance:
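
As a rough illustration of how the returned report could be consumed, the sketch below relies only on the keys listed under Output Metrics and assumes `llmsql` is installed and an outputs file exists at the placeholder path; the per-entry structure of `mismatches` is not documented in this section, so each entry is printed as-is.

```python
from llmsql import evaluate

report = evaluate(outputs="path_to_your_outputs.jsonl")

# Keys documented under "Output Metrics"
total = report["total"]
matches = report["matches"]
print(f"Exact-match accuracy: {report['accuracy']:.3f} ({matches}/{total})")
print(f"SQL errors: {report['sql_errors']}, "
      f"empty predictions: {report['pred_none']}, "
      f"empty gold results: {report['gold_none']}")

# Inspect a few mismatches (their exact fields are not specified above)
for entry in report["mismatches"][:3]:
    print(entry)
```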
