
Commit 0ba1e1e

Merge pull request #4 from sciknoworg/dev
add more rubrics
2 parents 316d48e + 38c3393 commit 0ba1e1e

File tree: 16 files changed (+944 / -28 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ celerybeat.pid
 # Environments
 .env
 .venv
+myenv/
 env/
 venv/
 ENV/

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 ## Changelog
 
+### v0.3.0 (December 20, 2025)
+- Add more rubrics (PR #3)
+- Update documentation for new rubrics
+- Minor bug fixing
+- Update Readme
+
 ### v0.2.0 (May 30, 2025)
 - Add custom judge module.
 - Add documentation.

README.md

Lines changed: 33 additions & 13 deletions
@@ -87,32 +87,52 @@ Judges within YESciEval are defined as follows:
 | `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
 | `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
 | `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
-| `CustomAutoJudge`| Custom LLM that can be used as a judge within YESciEval rubrics |
+| `CustomAutoJudge`| Custom LLM (open-source LLMs) that can be used as a judge within YESciEval rubrics |
 
-A total of nine evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
+A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
 
 ```python
-from yescieval import Informativeness, Correctness, Completeness,
-                      Coherence, Relevancy, Integration,
-                      Cohesion, Readability, Conciseness
+from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy, \
+                      Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
+                      InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
+                      MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
+                      StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
+                      SpeculativeStatements, NoveltyIndicators
+
 ```
 
 A complete list of rubrics are available at YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
 
 ## 💡 Acknowledgements
 
-If you use YESciEval in your research, please cite:
+If you find this repository helpful or use YESciEval in your work or research, feel free to cite our publication:
+
 
 ```bibtex
-@article{d2025yescieval,
-  title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
-  author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
-  journal={arXiv preprint arXiv:2505.14279},
-  year={2025}
-}
+@inproceedings{dsouza-etal-2025-yescieval,
+    title = "{YES}ci{E}val: Robust {LLM}-as-a-Judge for Scientific Question Answering",
+    author = {D{'}Souza, Jennifer and
+      Babaei Giglou, Hamed and
+      M{\"u}nch, Quentin},
+    editor = "Che, Wanxiang and
+      Nabende, Joyce and
+      Shutova, Ekaterina and
+      Pilehvar, Mohammad Taher",
+    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+    month = jul,
+    year = "2025",
+    address = "Vienna, Austria",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.acl-long.675/",
+    doi = "10.18653/v1/2025.acl-long.675",
+    pages = "13749--13783",
+    ISBN = "979-8-89176-251-0"
+}
 ```
+> For other type of citations please refer to https://aclanthology.org/2025.acl-long.675/.
 
-This work is licensed under a [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
+This software is licensed under a [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
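
Editor's note: for readers skimming this diff, here is a minimal sketch of how the newly exported rubric classes are used downstream. It assumes the new rubric classes share the constructor and `instruct()` method shown in the quickstart example added by this same commit; the `papers` structure and sample strings are placeholders, not project-provided data.

```python
from yescieval import GeographicCoverage

# Placeholder inputs -- in practice these come from your own synthesis pipeline.
papers = {
    "Paper A": "Abstract of the first source paper ...",
    "Paper B": "Abstract of the second source paper ...",
}
question = "How do restoration interventions affect pollinator diversity?"
answer = "Synthesized answer to be judged ..."

# Mirrors the quickstart pattern: build a rubric from papers, question, and answer,
# then render the instruction prompt that a judge model scores against.
rubric = GeographicCoverage(papers=papers, question=question, answer=answer)
print(rubric.instruct())
```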

docs/source/quickstart.rst

Lines changed: 21 additions & 1 deletion
@@ -1,7 +1,7 @@
 Quickstart
 =================
 
-YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** using a pretrained judge and parse LLM output into structured JSON.
+YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** and **gap identification** using a pretrained & a custom judge and parse LLM output into structured JSON.
 
 
 **Example: Evaluating an Answer Using Informativeness + AskAutoJudge**
@@ -46,6 +46,26 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
 - Use the ``device="cuda"`` if running on GPU for better performance.
 - Add more rubrics such as ``Informativeness``, ``Relevancy``, etc for multi-criteria evaluation.
 
+
+**Example: Evaluating an Answer Using GapIdentification + CustomAutoJudge**
+
+.. code-block:: python
+
+    from yescieval import GapIdentification, CustomAutoJudge
+
+    # Step 1: Create a rubric
+    rubric = GapIdentification(papers=papers, question=question, answer=answer)
+    instruction_prompt = rubric.instruct()
+
+    # Step 2: Load the evaluation model (judge)
+    judge = CustomAutoJudge()
+    judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
+
+    # Step 3: Evaluate the answer
+    result = judge.evaluate(rubric=rubric)
+    print("Raw Evaluation Output:")
+    print(result)
+
 **Parsing Raw Output with GPTParser**
 
 If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.
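
Editor's note: the parsing step itself is not part of this hunk. The sketch below only illustrates the intent; the constructor argument (`api_key`) and the `parse()` method name are assumptions for illustration, not the documented `GPTParser` API, so check the parser reference before copying.

```python
from yescieval import GPTParser

# Hypothetical usage sketch: argument and method names below are assumptions,
# not the documented GPTParser signature.
parser = GPTParser(api_key="your_openai_api_key")
parsed = parser.parse(result)  # 'result' is the raw judge output from the example above
print(parsed)                  # expected: structured JSON with per-rubric scores
```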

docs/source/rubrics.rst

Lines changed: 100 additions & 4 deletions
@@ -2,7 +2,7 @@
 Rubrics
 ===================
 
-A total of nine evaluation rubrics were defined as part of the YESciEval test framework.
+A total of twenty three (23) evaluation rubrics were defined as part of the YESciEval test framework.
 
 Linguistic & Stylistic Quality
 ---------------------------------
@@ -59,6 +59,99 @@ Following ``Content Accuracy & Informativeness`` ensures that the response is bo
    * - **9. Informativeness:**
      - Is the answer a useful and informative reply to the problem?
 
+Research Depth Assessment
+---------------------------------
+
+Following ``Research Depth Assessment`` quantifies the mechanistic and analytical sophistication of synthesis outputs.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **10. Mechanistic Understanding:**
+     - Does the answer show understanding of ecological processes, using indicators like “feedback,” “nutrient cycling,” or “trophic cascade”?
+   * - **11. Causal Reasoning:**
+     - Does the answer show clear cause-effect relationships using words like “because,” “results in,” or “drives”?
+   * - **12. Temporal Precision:**
+     - Does the answer include specific time references, like intervals (“within 6 months”) or dates (“1990–2020”)?
+
+Research Breadth Assessment
+---------------------------------
+
+Following ``Research Breadth Assessment`` evaluates the diversity of evidence across spatial, ecological, and methodological contexts.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **13. Geographic Coverage:**
+     - Does the answer cover multiple biogeographic zones, such as “Tropical” or “Boreal”?
+   * - **14. Intervention Diversity:**
+     - Does the answer include a variety of management practices?
+   * - **15. Biodiversity Dimensions:**
+     - Does the answer mention different aspects of biodiversity, like taxonomic, functional, phylogenetic, or spatial diversity?
+   * - **16. Ecosystem Services:**
+     - Does the answer include relevant ecosystem services, based on the Millennium Ecosystem Assessment vocabulary?
+   * - **17. Spatial Scale:**
+     - Does the answer specify the spatial scale, using terms like “local,” “regional,” or “continental” and area measures?
+
+Scientific Rigor Assessment
+---------------------------------
+
+Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodological integrity of the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **18. Statistical Sophistication:**
+     - Does the answer use statistical methods or analyses, showing quantitative rigor and depth?
+   * - **19. Citation Practices:**
+     - Does the answer properly cite sources, using parenthetical or narrative citations (e.g., “(Smith et al., 2021)”)?
+   * - **20. Uncertainty Acknowledgment:**
+     - Does the answer explicitly mention limitations or uncertainty, using terms like “unknown,” “limited evidence,” or “unclear”?
+
+Innovation Capacity Assessment
+---------------------------------
+
+Following ``Innovation Capacity Assessment`` evaluates the novelty of the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **21. Speculative Statements:**
+     - Does the answer include cautious or hypothetical statements, using words like “might,” “could,” or “hypothetical”?
+   * - **22. Novelty Indicators:**
+     - Does the answer highlight innovation using terms like “novel,” “pioneering,” or “emerging”?
+
+
+Research Gap Assessment
+---------------------------------
+
+Following ``Research Gap Assessment`` detects explicit acknowledgment of unanswered questions or understudied areas in the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **23. Gap Identification:**
+     - Does the answer point out unanswered questions or understudied areas, using terms like “research gap” or “understudied”?
 
 
 Usage Example
@@ -68,9 +161,12 @@ Here is a simple example of how to import rubrics in your code:
 
 .. code-block:: python
 
-    from yescieval import Informativeness, Correctness, Completeness,
-                          Coherence, Relevancy, Integration,
-                          Cohesion, Readability, Conciseness
+    from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,
+                          Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
+                          InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
+                          MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
+                          StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
+                          SpeculativeStatements, NoveltyIndicators
 
 And to use rubrics:
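
Editor's note: unlike the README version of this import added in the same commit, the multi-line import in this ``rubrics.rst`` hunk has no line-continuation characters, so it is not valid Python if copied verbatim into a script. A runnable equivalent simply wraps the names in parentheses:

```python
# Grouping the imports in parentheses avoids the need for backslash continuations.
from yescieval import (
    Informativeness, Correctness, Completeness, Coherence, Relevancy,
    Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
    InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
    MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
    StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
    SpeculativeStatements, NoveltyIndicators,
)
```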

pyproject.toml

Lines changed: 9 additions & 4 deletions
@@ -1,8 +1,7 @@
 [tool.poetry]
 name = "YESciEval"
 
-version = "0.2.0"
-
+version = "0.0.0"
 description = "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering."
 authors = ["Hamed Babaei Giglou <[email protected]>"]
 license = "MIT License"
@@ -30,6 +29,12 @@ wheel = "*"
 twine = "*"
 pytest = "*"
 
+[tool.poetry-dynamic-versioning]
+enable = true
+style = "semver"
+source = "attr"
+attr = "yescieval.__version__"
+
 [build-system]
-requires = ["poetry-core>=1.0.0"]
-build-backend = "poetry.core.masonry.api"
+requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning>=1.4.0"]
+build-backend = "poetry_dynamic_versioning.backend"

setup.py

Lines changed: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
 from setuptools import setup, find_packages
+import os
 
 with open("README.md", encoding="utf-8") as f:
     long_description = f.read()
 
 setup(
     name="YESciEval",
-    version="0.2.0",
+    version=open(os.path.join(os.path.dirname(__file__), 'yescieval/VERSION')).read().strip(),
     author="Hamed Babaei Giglou",
     author_email="[email protected]",
     description="YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering.",

yescieval/VERSION

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.3.0

yescieval/__init__.py

Lines changed: 7 additions & 2 deletions
@@ -1,9 +1,14 @@
+from pathlib import Path
 
-__version__ = "0.2.0"
+__version__ = (Path(__file__).parent / "VERSION").read_text().strip()
 
 from .base import Rubric, Parser
 from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
-                     Integration, Cohesion, Readability, Conciseness)
+                     Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
+                     InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
+                     MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
+                     StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
+                     SpeculativeStatements, NoveltyIndicators)
 from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
 from .parser import GPTParser
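
Editor's note: a quick sanity check of the new single-source version wiring (yescieval/VERSION → `yescieval.__version__` → setup.py/pyproject). This is a sketch that assumes the package is importable from the repository checkout, where the VERSION file sits next to `__init__.py`:

```python
from pathlib import Path

import yescieval

# __init__.py now reads yescieval/VERSION, so both values should match ("0.3.0" in this commit).
file_version = (Path(yescieval.__file__).parent / "VERSION").read_text().strip()
assert yescieval.__version__ == file_version
print(yescieval.__version__)
```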

yescieval/judge/judges.py

Lines changed: 2 additions & 2 deletions
@@ -42,13 +42,13 @@ class AskAutoJudge(AutoJudge):
     def from_pretrained(self, model_id:str="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
                         device:str="auto",
                         token:str =""):
-        return super()._from_pretrained(model_id=model_id, device=device, token=token)
+        self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)
 
 class BioASQAutoJudge(AutoJudge):
     def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B",
                         device: str = "auto",
                         token: str = ""):
-        return super()._from_pretrained(model_id=model_id, device=device, token=token)
+        self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)
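
Editor's note: the net effect of this change is that `from_pretrained()` now stores the loaded model and tokenizer on the judge instance instead of returning them. A minimal sketch of the calling pattern (the Hugging Face token is a placeholder, and loading the adapter requires network access):

```python
from yescieval import AskAutoJudge

judge = AskAutoJudge()
# After this commit, from_pretrained() assigns self.model and self.tokenizer,
# so the same judge object can be used for evaluation directly.
judge.from_pretrained(model_id="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
                      device="auto",
                      token="your_huggingface_token")
print(judge.model is not None, judge.tokenizer is not None)  # expected: True True
```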
