Refactor MEDEC environment to match new envs (#93)
* Refactor MEDEC environment to use fewer LLM calls and an updated No Free Labels prompt
* Add judge-setting helpers
* Use a multi-axis rubric for the judge
The environment supports three evaluation modes, controlled by the `eval_method` argument (an example invocation follows the list):

* **`"judge"` (Default Mode)**
  * Uses a **multi-part rubric** where the primary score is derived from a robust *LLM-as-a-Judge* evaluation using a No Free Labels-inspired multi-axis judge rubric.
* **`"metrics"` (Replication Mode)**
  * Designed for **direct replication** of the paper's results.
  * Disables the LLM-as-a-Judge and calculates ROUGE, BERTScore, and BLEURT.
  * Primary score = **weighted average** of `flag_accuracy` and the paper's original metrics.
* **`"both"` (Combined Mode)**
  * Computes both the LLM-as-a-Judge score and the ROUGE, BERTScore, and BLEURT replication metrics.
  * The replication metrics are assigned `weight=0` and **do not affect the primary score**.
  * Recommended for **semantically nuanced evaluation** and useful for **comprehensive analysis**.
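
For example, to run the combined mode, pass `eval_method` through the environment arguments with the same `-a` flag used in the Quickstart below (the model name here is just the default judge model, shown for illustration):

```bash
# Combined mode: judge-based primary score plus weight=0 replication metrics
uv run vf-eval medec -m gpt-4o-mini -a '{"eval_method": "both"}' -n 10 -s
```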
## Quickstart
### 1. Export API Key

The default judge model is **GPT-4o-mini**, which expects the `OPENAI_API_KEY` environment variable.

```bash
export OPENAI_API_KEY="your-openai-api-key"
```
### 2. Run Evaluation (Default Judge Mode)
Run an evaluation on **10 examples** from the `test_ms` split using the **LLM-as-a-Judge** for scoring.
```bash
uv run vf-eval medec -m gpt-4o-mini -n 10 -s
```
### 3. Run Evaluation (Paper Replication Mode)
To replicate the paper’s scoring methodology, explicitly set the evaluation mode to `"metrics"`.
```bash
uv run vf-eval medec -m gpt-4o-mini -a '{"eval_method": "metrics"}' -n 10 -s
```
### 4. Evaluate a Different Model (e.g., Anthropic)

To evaluate an **Anthropic model** while using the default OpenAI judge:
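
A minimal sketch of such an invocation, assuming `vf-eval` accepts `-k` (name of the API key environment variable) and `-b` (API base URL) flags and that Anthropic's OpenAI-compatible endpoint is used; treat the flags, model name, and URL below as assumptions rather than confirmed parts of this environment:

```bash
# Assumed flags: -k selects the API key env var, -b overrides the API base URL
export ANTHROPIC_API_KEY="your-anthropic-api-key"
uv run vf-eval medec -m claude-3-5-sonnet-latest -k ANTHROPIC_API_KEY -b https://api.anthropic.com/v1 -n 10 -s
```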