
Commit bf97c90

Refactor MEDEC environment to match new envs (#93)
* Refactor MEDEC environment to use fewer LLM calls and the updated No Free Labels prompt
* add judge setting helpers
* use multi-axis rubric for judge
1 parent 31e330b commit bf97c90

File tree: 3 files changed (+337, -259 lines)


environments/medec/README.md

Lines changed: 57 additions & 38 deletions
````diff
@@ -26,59 +26,61 @@ A benchmark for medical error detection, extraction, and correction in clinical

 **Type:** `single-turn`
 **Parser:** `vf.XMLParser`
-**Fields:** `error_flag`, `error_sentence`, `corrected_sentence`
+**Fields:** `error_id`, `incorrect_sentence`, `correction`


 ## Rubric Overview

 The environment supports two distinct evaluation modes, controlled by the `eval_method` argument:

 * **`"judge"` (Default Mode)**
-  Uses a **multi-part rubric** where the primary score is derived from a robust *LLM-as-a-Judge* evaluation.
-
-  * Also calculates ROUGE, BERTScore, and BLEURT for reference.
-  * These are assigned `weight=0` and **do not affect the primary score**.
-  * Recommended for **semantically nuanced evaluation**.
+  Uses a **multi-part rubric** where the primary score is derived from a robust *LLM-as-a-Judge* evaluation using a No Free Labels inspired multi-axis judge rubric.

 * **`"metrics"` (Replication Mode)**

   * Designed for **direct replication** of the paper's results.
-  * Disables the LLM-as-a-Judge.
+  * Disables the LLM-as-a-Judge and calculates ROUGE, BERTScore, and BLEURT.
   * Primary score = **weighted average** of `flag_accuracy` and the paper’s original metrics.

+* **`"both"` (Combined Mode)**
+  * Computes both the LLM-as-a-Judge score and the ROUGE, BERTScore, and BLEURT replication metrics.
+  * These are assigned `weight=0` and **do not affect the primary score**.
+  * Recommended for **semantically nuanced evaluation**.
+  * Useful for **comprehensive analysis**.
+

 ## Quickstart

 ### 1. Export API Key

-The default judge and model under evaluation is **DeepSeek Chat**, which expects the `DEEPSEEK_API_KEY` environment variable.
+The default judge model is **GPT-4o-mini**, which expects the `OPENAI_API_KEY` environment variable.

 ```bash
-export DEEPSEEK_API_KEY="your-deepseek-api-key"
+export OPENAI_API_KEY="your-openai-api-key"
 ```

 ### 2. Run Evaluation (Default Judge Mode)

 Run an evaluation on **10 examples** from the `test_ms` split using the **LLM-as-a-Judge** for scoring.

 ```bash
-uv run vf-eval medec -m deepseek-chat -n 10 -s
+uv run vf-eval medec -m gpt-4o-mini -n 10 -s
 ```

 ### 3. Run Evaluation (Paper Replication Mode)

 To replicate the paper’s scoring methodology, explicitly set the evaluation mode to `"metrics"`.

 ```bash
-uv run vf-eval medec -m deepseek-chat -a '{"eval_method": "metrics"}' -n 10 -s
+uv run vf-eval medec -m gpt-4o-mini -a '{"eval_method": "metrics"}' -n 10 -s
 ```

 ### 4. Evaluate a Different Model (e.g., Anthropic)

-To evaluate an **Anthropic model** while using the default DeepSeek judge:
+To evaluate an **Anthropic model** while using the default OpenAI judge:

 ```bash
-export DEEPSEEK_API_KEY="your-deepseek-api-key"
+export OPENAI_API_KEY="your-openai-api-key"
 export ANTHROPIC_API_KEY="your-anthropic-api-key"

 uv run vf-eval medec \
````
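The updated Quickstart covers the `judge` and `metrics` modes but not the new combined mode. A minimal sketch of a combined-mode run, assuming `"both"` is passed through the same `-a` JSON argument as `"metrics"`:

```bash
# Hypothetical combined-mode run: assumes "both" is accepted by eval_method
# exactly like "metrics" in the commands above.
export OPENAI_API_KEY="your-openai-api-key"
uv run vf-eval medec -m gpt-4o-mini -a '{"eval_method": "both"}' -n 10 -s
```

Per the rubric overview above, this keeps the judge-based score as the primary reward and attaches the ROUGE, BERTScore, and BLEURT values at weight 0 for analysis.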
````diff
@@ -91,28 +93,45 @@ uv run vf-eval medec \

 ## Environment Arguments

-| Argument | Type | Default | Description |
-| ---------------- | ---- | ------------------------------- | -------------------------------------------------------------------------------- |
-| `repo_id` | str | `"sauravlmx/MEDEC-MS"` | Hugging Face Hub repository ID for the dataset. |
-| `split` | str | `"test_ms"` | Dataset split to use (`train_ms`, `validation_ms`, `test_ms`). |
-| `eval_method` | str | `"judge"` | Evaluation mode (`"judge"` or `"metrics"`). |
-| `judge_model` | str | `"deepseek-chat"` | Model used for judge-based scoring (in `"judge"` mode). |
-| `judge_base_url` | str | `"https://api.deepseek.com/v1"` | API endpoint for the judge model. |
-| `judge_api_key` | str | `None` | API key for the judge model (defaults to `DEEPSEEK_API_KEY`). |
-| `device` | str | `None` | Device to use for metrics (`cpu`, `cuda:0`, etc.). Defaults to GPU if available. |
-
-
-## Metrics (in Default `"Judge"` Mode)
-
-When `eval_method="judge"`, the following metrics are calculated.
-The **primary reward score** is the weighted sum of the first three metrics.
-
-| Metric | Weight | Meaning |
-| ------------------------ | ------ | --------------------------------------------------------------------------- |
-| `flag_accuracy` | 0.2 | 1.0 if predicted `error_flag` matches ground truth; else 0.0. |
-| `extraction_similarity` | 0.4 | 1.0 if LLM judge deems `error_sentence` semantically equivalent; else 0.0. |
-| `correction_equivalence` | 0.4 | 1.0 if LLM judge deems `corrected_sentence` medically equivalent; else 0.0. |
-| `rouge_reward` | 0 | ROUGE-1 F1 score (for analysis only). |
-| `bertscore_reward` | 0 | BERTScore F1 (for analysis only). |
-| `bleurt_reward` | 0 | BLEURT score (for analysis only). |
-| `reward` | N/A | Final weighted sum of non-zero weight metrics (0.0–1.0). |
+| Argument | Type | Default | Description |
+| ---------------- | ---- | --------------- | -------------------------------------------------------------------------------- |
+| `judge_model` | str | `"gpt-4o-mini"` | Model used for judge-based scoring (in `"judge"` or `"both"` mode). |
+| `judge_base_url` | str | `None` | API endpoint for the judge model (defaults to OpenAI API). |
+| `judge_api_key` | str | `None` | API key for the judge model (defaults to `OPENAI_API_KEY`). |
+| `eval_method` | str | `"judge"` | Evaluation mode (`"judge"`, `"metrics"`, or `"both"`). |
+| `device` | str | `None` | Device to use for metrics (`cpu`, `cuda:0`, etc.). Defaults to GPU if available. |
+
+
+## Metrics
+
+### Judge Mode (`eval_method="judge"`)
+
+The **primary reward score** is the weighted sum of three metrics:
+
+| Metric | Weight | Meaning |
+| ------------------ | ------ | --------------------------------------------------------------------------- |
+| `error_flag` | 1/3 | 1.0 if predicted `error_id` matches ground truth; else 0.0. |
+| `error_sentence` | 1/3 | 1.0 if predicted `incorrect_sentence` matches ground truth; else 0.0. |
+| `error_correction` | 1/3 | 1.0 if LLM judge deems `correction` medically equivalent; else 0.0. |
+
+### Metrics Mode (`eval_method="metrics"`)
+
+For replicating the paper's evaluation methodology:
+
+| Metric | Weight | Meaning |
+| ---------------- | ------ | --------------------------------------------------------------------- |
+| `error_flag` | 1/3 | 1.0 if predicted `error_id` matches ground truth; else 0.0. |
+| `error_sentence` | 1/3 | 1.0 if predicted `incorrect_sentence` matches ground truth; else 0.0. |
+| `rouge_score` | 1/6 | ROUGE-1 F1 score. |
+| `bertscore` | 1/6 | BERTScore F1. |
+| `bleurt` | 1/6 | BLEURT score. |
+
+### Both Mode (`eval_method="both"`)
+
+Same as Judge mode, plus paper's evaluation metrics with weight 0 (for analysis only):
+
+| Metric | Weight | Meaning |
+| ------------- | ------ | ----------------------------- |
+| `rouge_score` | 0 | ROUGE-1 F1 score. |
+| `bertscore` | 0 | BERTScore F1. |
+| `bleurt` | 0 | BLEURT score. |
````
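The judge settings and the `device` argument from the new table can be supplied through the same `-a` JSON used in the Quickstart; a sketch, assuming any OpenAI-compatible endpoint (the base URL below is only a placeholder):

```bash
# Hypothetical overrides: judge settings passed via the -a JSON; the base URL is a
# placeholder for any OpenAI-compatible endpoint.
uv run vf-eval medec -m gpt-4o-mini -n 10 -s \
  -a '{"eval_method": "judge", "judge_model": "gpt-4o-mini", "judge_base_url": "https://api.openai.com/v1"}'

# Paper-replication metrics forced onto CPU.
uv run vf-eval medec -m gpt-4o-mini -n 10 -s \
  -a '{"eval_method": "metrics", "device": "cpu"}'
```

With the judge-mode weights of 1/3 each, a completion that matches `error_id` and `incorrect_sentence` but fails the correction judge scores roughly 2/3, assuming the reward is the plain weighted sum described in the table.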
