
Commit 0ba1e1e

Merge pull request #4 from sciknoworg/dev
add more rubrics
2 parents 316d48e + 38c3393 commit 0ba1e1e

File tree: 16 files changed (+944 / -28 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ celerybeat.pid
 # Environments
 .env
 .venv
+myenv/
 env/
 venv/
 ENV/

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 ## Changelog
 
+### v0.3.0 (December 20, 2025)
+- Add more rubrics (PR #3)
+- Update documentation for new rubrics
+- Minor bug fixing
+- Update Readme
+
 ### v0.2.0 (May 30, 2025)
 - Add custom judge module.
 - Add documentation.

README.md

Lines changed: 33 additions & 13 deletions
@@ -87,32 +87,52 @@ Judges within YESciEval are defined as follows:
 | `AutoJudge` | Base class for loading and running evaluation models with PEFT adapters. |
 | `AskAutoJudge` | Multidisciplinary judge tuned on the ORKGSyn dataset from the Open Research Knowledge Graph. |
 | `BioASQAutoJudge` | Biomedical domain judge tuned on the BioASQ dataset from the BioASQ challenge. |
-| `CustomAutoJudge`| Custom LLM that can be used as a judge within YESciEval rubrics |
+| `CustomAutoJudge`| Custom LLM (open-source LLMs) that can be used as a judge within YESciEval rubrics |
 
-A total of nine evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
+A total of **23** evaluation rubrics were defined as part of the YESciEval test framework and can be used via ``yescieval``. Following simple example shows how to import rubrics in your code:
 
 ```python
-from yescieval import Informativeness, Correctness, Completeness,
-                      Coherence, Relevancy, Integration,
-                      Cohesion, Readability, Conciseness
+from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy, \
+                      Integration, Cohesion, Readability, Conciseness, GeographicCoverage, \
+                      InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale, \
+                      MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification, \
+                      StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment, \
+                      SpeculativeStatements, NoveltyIndicators
+
 ```
 
 A complete list of rubrics are available at YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page.
 
 ## 💡 Acknowledgements
 
-If you use YESciEval in your research, please cite:
+If you find this repository helpful or use YESciEval in your work or research, feel free to cite our publication:
+
 
 ```bibtex
-@article{d2025yescieval,
-  title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
-  author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
-  journal={arXiv preprint arXiv:2505.14279},
-  year={2025}
-}
+@inproceedings{dsouza-etal-2025-yescieval,
+    title = "{YES}ci{E}val: Robust {LLM}-as-a-Judge for Scientific Question Answering",
+    author = {D{'}Souza, Jennifer and
+      Babaei Giglou, Hamed and
+      M{\"u}nch, Quentin},
+    editor = "Che, Wanxiang and
+      Nabende, Joyce and
+      Shutova, Ekaterina and
+      Pilehvar, Mohammad Taher",
+    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+    month = jul,
+    year = "2025",
+    address = "Vienna, Austria",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.acl-long.675/",
+    doi = "10.18653/v1/2025.acl-long.675",
+    pages = "13749--13783",
+    ISBN = "979-8-89176-251-0"
+}
 ```
+> For other type of citations please refer to https://aclanthology.org/2025.acl-long.675/.
 
-This work is licensed under a [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
+This software is licensed under a [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
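
Editor's note: for readers skimming this diff, here is a minimal sketch of how the newly exported rubric classes are used downstream. It assumes the new rubric classes share the constructor and `instruct()` method shown in the quickstart example added by this same commit; the `papers` structure and sample strings are placeholders, not project-provided data.

```python
from yescieval import GeographicCoverage

# Placeholder inputs -- in practice these come from your own synthesis pipeline.
papers = {
    "Paper A": "Abstract of the first source paper ...",
    "Paper B": "Abstract of the second source paper ...",
}
question = "How do restoration interventions affect pollinator diversity?"
answer = "Synthesized answer to be judged ..."

# Mirrors the quickstart pattern: build a rubric from papers, question, and answer,
# then render the instruction prompt that a judge model scores against.
rubric = GeographicCoverage(papers=papers, question=question, answer=answer)
print(rubric.instruct())
```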

docs/source/quickstart.rst

Lines changed: 21 additions & 1 deletion
@@ -1,7 +1,7 @@
 Quickstart
 =================
 
-YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** using a pretrained judge and parse LLM output into structured JSON.
+YESciEval is a library designed to evaluate the quality of synthesized scientific answers using predefined rubrics and advanced LLM-based judgment models. This guide walks you through how to evaluate answers based on **informativeness** and **gap identification** using a pretrained & a custom judge and parse LLM output into structured JSON.
 
 
 **Example: Evaluating an Answer Using Informativeness + AskAutoJudge**
@@ -46,6 +46,26 @@ YESciEval is a library designed to evaluate the quality of synthesized scientifi
 - Use the ``device="cuda"`` if running on GPU for better performance.
 - Add more rubrics such as ``Informativeness``, ``Relevancy``, etc for multi-criteria evaluation.
 
+
+**Example: Evaluating an Answer Using GapIdentification + CustomAutoJudge**
+
+.. code-block:: python
+
+    from yescieval import GapIdentification, CustomAutoJudge
+
+    # Step 1: Create a rubric
+    rubric = GapIdentification(papers=papers, question=question, answer=answer)
+    instruction_prompt = rubric.instruct()
+
+    # Step 2: Load the evaluation model (judge)
+    judge = CustomAutoJudge()
+    judge.from_pretrained(model_id="Qwen/Qwen3-8B", device="cpu", token="your_huggingface_token")
+
+    # Step 3: Evaluate the answer
+    result = judge.evaluate(rubric=rubric)
+    print("Raw Evaluation Output:")
+    print(result)
+
 **Parsing Raw Output with GPTParser**
 
 If the model outputs unstructured or loosely structured text, you can use GPTParser to parse it into valid JSON.
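
Editor's note: the parsing step itself is not part of this hunk. The sketch below only illustrates the intent; the constructor argument (`api_key`) and the `parse()` method name are assumptions for illustration, not the documented `GPTParser` API, so check the parser reference before copying.

```python
from yescieval import GPTParser

# Hypothetical usage sketch: argument and method names below are assumptions,
# not the documented GPTParser signature.
parser = GPTParser(api_key="your_openai_api_key")
parsed = parser.parse(result)  # 'result' is the raw judge output from the example above
print(parsed)                  # expected: structured JSON with per-rubric scores
```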

docs/source/rubrics.rst

Lines changed: 100 additions & 4 deletions
@@ -2,7 +2,7 @@
 Rubrics
 ===================
 
-A total of nine evaluation rubrics were defined as part of the YESciEval test framework.
+A total of twenty three (23) evaluation rubrics were defined as part of the YESciEval test framework.
 
 Linguistic & Stylistic Quality
 ---------------------------------
@@ -59,6 +59,99 @@ Following ``Content Accuracy & Informativeness`` ensures that the response is bo
    * - **9. Informativeness:**
      - Is the answer a useful and informative reply to the problem?
 
+Research Depth Assessment
+---------------------------------
+
+Following ``Research Depth Assessment`` quantifies the mechanistic and analytical sophistication of synthesis outputs.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **10. Mechanistic Understanding:**
+     - Does the answer show understanding of ecological processes, using indicators like “feedback,” “nutrient cycling,” or “trophic cascade”?
+   * - **11. Causal Reasoning:**
+     - Does the answer show clear cause-effect relationships using words like “because,” “results in,” or “drives”?
+   * - **12. Temporal Precision:**
+     - Does the answer include specific time references, like intervals (“within 6 months”) or dates (“1990–2020”)?
+
+Research Breadth Assessment
+---------------------------------
+
+Following ``Research Breadth Assessment`` evaluates the diversity of evidence across spatial, ecological, and methodological contexts.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **13. Geographic Coverage:**
+     - Does the answer cover multiple biogeographic zones, such as “Tropical” or “Boreal”?
+   * - **14. Intervention Diversity:**
+     - Does the answer include a variety of management practices?
+   * - **15. Biodiversity Dimensions:**
+     - Does the answer mention different aspects of biodiversity, like taxonomic, functional, phylogenetic, or spatial diversity?
+   * - **16. Ecosystem Services:**
+     - Does the answer include relevant ecosystem services, based on the Millennium Ecosystem Assessment vocabulary?
+   * - **17. Spatial Scale:**
+     - Does the answer specify the spatial scale, using terms like “local,” “regional,” or “continental” and area measures?
+
+Scientific Rigor Assessment
+---------------------------------
+
+Following ``Scientific Rigor Assessment`` assesses the evidentiary and methodological integrity of the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **18. Statistical Sophistication:**
+     - Does the answer use statistical methods or analyses, showing quantitative rigor and depth?
+   * - **19. Citation Practices:**
+     - Does the answer properly cite sources, using parenthetical or narrative citations (e.g., “(Smith et al., 2021)”)?
+   * - **20. Uncertainty Acknowledgment:**
+     - Does the answer explicitly mention limitations or uncertainty, using terms like “unknown,” “limited evidence,” or “unclear”?
+
+Innovation Capacity Assessment
+---------------------------------
+
+Following ``Innovation Capacity Assessment`` evaluates the novelty of the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **21. Speculative Statements:**
+     - Does the answer include cautious or hypothetical statements, using words like “might,” “could,” or “hypothetical”?
+   * - **22. Novelty Indicators:**
+     - Does the answer highlight innovation using terms like “novel,” “pioneering,” or “emerging”?
+
+
+Research Gap Assessment
+---------------------------------
+
+Following ``Research Gap Assessment`` detects explicit acknowledgment of unanswered questions or understudied areas in the synthesis.
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 80
+
+   * - Evaluation Rubric
+     - Description
+   * - **23. Gap Identification:**
+     - Does the answer point out unanswered questions or understudied areas, using terms like “research gap” or “understudied”?
 
 
 Usage Example
@@ -68,9 +161,12 @@ Here is a simple example of how to import rubrics in your code:
 
 .. code-block:: python
 
-    from yescieval import Informativeness, Correctness, Completeness,
-                          Coherence, Relevancy, Integration,
-                          Cohesion, Readability, Conciseness
+    from yescieval import Informativeness, Correctness, Completeness, Coherence, Relevancy,
+                          Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
+                          InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
+                          MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
+                          StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
+                          SpeculativeStatements, NoveltyIndicators
 
 And to use rubrics:
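
Editor's note: unlike the README version of this import added in the same commit, the multi-line import in this ``rubrics.rst`` hunk has no line-continuation characters, so it is not valid Python if copied verbatim into a script. A runnable equivalent simply wraps the names in parentheses:

```python
# Grouping the imports in parentheses avoids the need for backslash continuations.
from yescieval import (
    Informativeness, Correctness, Completeness, Coherence, Relevancy,
    Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
    InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
    MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
    StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
    SpeculativeStatements, NoveltyIndicators,
)
```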

pyproject.toml

Lines changed: 9 additions & 4 deletions
@@ -1,8 +1,7 @@
 [tool.poetry]
 name = "YESciEval"
 
-version = "0.2.0"
-
+version = "0.0.0"
 description = "YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering."
 authors = ["Hamed Babaei Giglou <[email protected]>"]
 license = "MIT License"
@@ -30,6 +29,12 @@ wheel = "*"
 twine = "*"
 pytest = "*"
 
+[tool.poetry-dynamic-versioning]
+enable = true
+style = "semver"
+source = "attr"
+attr = "yescieval.__version__"
+
 [build-system]
-requires = ["poetry-core>=1.0.0"]
-build-backend = "poetry.core.masonry.api"
+requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning>=1.4.0"]
+build-backend = "poetry_dynamic_versioning.backend"

setup.py

Lines changed: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
 from setuptools import setup, find_packages
+import os
 
 with open("README.md", encoding="utf-8") as f:
     long_description = f.read()
 
 setup(
     name="YESciEval",
-    version="0.2.0",
+    version=open(os.path.join(os.path.dirname(__file__), 'yescieval/VERSION')).read().strip(),
     author="Hamed Babaei Giglou",
     author_email="[email protected]",
     description="YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering.",

yescieval/VERSION

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+0.3.0

yescieval/__init__.py

Lines changed: 7 additions & 2 deletions
@@ -1,9 +1,14 @@
+from pathlib import Path
 
-__version__ = "0.2.0"
+__version__ = (Path(__file__).parent / "VERSION").read_text().strip()
 
 from .base import Rubric, Parser
 from .rubric import (Informativeness, Correctness, Completeness, Coherence, Relevancy,
-                     Integration, Cohesion, Readability, Conciseness)
+                     Integration, Cohesion, Readability, Conciseness, GeographicCoverage,
+                     InterventionDiversity, BiodiversityDimensions, EcosystemServices, SpatialScale,
+                     MechanisticUnderstanding, CausalReasoning, TemporalPrecision, GapIdentification,
+                     StatisticalSophistication, CitationPractices, UncertaintyAcknowledgment,
+                     SpeculativeStatements, NoveltyIndicators)
 from .judge import AutoJudge, AskAutoJudge, BioASQAutoJudge, CustomAutoJudge
 from .parser import GPTParser
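
Editor's note: a quick sanity check of the new single-source version wiring (yescieval/VERSION → `yescieval.__version__` → setup.py/pyproject). This is a sketch that assumes the package is importable from the repository checkout, where the VERSION file sits next to `__init__.py`:

```python
from pathlib import Path

import yescieval

# __init__.py now reads yescieval/VERSION, so both values should match ("0.3.0" in this commit).
file_version = (Path(yescieval.__file__).parent / "VERSION").read_text().strip()
assert yescieval.__version__ == file_version
print(yescieval.__version__)
```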

yescieval/judge/judges.py

Lines changed: 2 additions & 2 deletions
@@ -42,13 +42,13 @@ class AskAutoJudge(AutoJudge):
     def from_pretrained(self, model_id:str="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
                         device:str="auto",
                         token:str =""):
-        return super()._from_pretrained(model_id=model_id, device=device, token=token)
+        self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)
 
 class BioASQAutoJudge(AutoJudge):
     def from_pretrained(self, model_id: str = "SciKnowOrg/YESciEval-BioASQ-Llama-3.1-8B",
                         device: str = "auto",
                         token: str = ""):
-        return super()._from_pretrained(model_id=model_id, device=device, token=token)
+        self.model, self.tokenizer = super()._from_pretrained(model_id=model_id, device=device, token=token)
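
Editor's note: the net effect of this change is that `from_pretrained()` now stores the loaded model and tokenizer on the judge instance instead of returning them. A minimal sketch of the calling pattern (the Hugging Face token is a placeholder, and loading the adapter requires network access):

```python
from yescieval import AskAutoJudge

judge = AskAutoJudge()
# After this commit, from_pretrained() assigns self.model and self.tokenizer,
# so the same judge object can be used for evaluation directly.
judge.from_pretrained(model_id="SciKnowOrg/YESciEval-ASK-Llama-3.1-8B",
                      device="auto",
                      token="your_huggingface_token")
print(judge.model is not None, judge.tokenizer is not None)  # expected: True True
```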
