Casereportbench environment by ss8319 · Pull Request #99 · MedARC-AI/med-lm-envs

ss8319 · 2026-01-20T13:39:07Z

This PR implements the CaseReportBench environment for dense clinical information extraction from case reports.

Metric Replication: Implemented Token Set Ratio (TSR), BLEU-1/4, ROUGE-L, Omission, and Hallucination metrics exactly as found in the author's eval_metrics.py.
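For readers unfamiliar with the last two metrics, here is a minimal sketch of how omission and hallucination rates could be computed over flattened fields. The definitions are an assumption for illustration (omission: the reference has content but the prediction is empty; hallucination: the reference is empty but the prediction has content) and are not copied from eval_metrics.py. The names `is_empty` and `omission_hallucination_rates` are hypothetical.

```python
def is_empty(value) -> bool:
    """Treat None and whitespace-only strings as 'no content'."""
    return value is None or str(value).strip() == ""


def omission_hallucination_rates(pred: dict, gold: dict) -> tuple[float, float]:
    """Return (omission_rate, hallucination_rate) under the assumed definitions.

    Assumption: both rates are computed per field over the keys of `gold`.
    """
    filled = [k for k in gold if not is_empty(gold[k])]
    blank = [k for k in gold if is_empty(gold[k])]

    # Omission: gold has content, prediction is missing or empty.
    omitted = sum(is_empty(pred.get(k)) for k in filled)
    # Hallucination: gold is empty, prediction has content anyway.
    hallucinated = sum(not is_empty(pred.get(k)) for k in blank)

    omission = omitted / len(filled) if filled else 0.0
    hallucination = hallucinated / len(blank) if blank else 0.0
    return omission, hallucination
```

Both rates fall back to 0.0 when their denominator is empty, which is a design choice of this sketch rather than something taken from the source.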

Prompts: Extracted DSPy prompts.

Two items remained ambiguous in the source repository; I opened an issue in the original repo for them:

Missing Lab_Image Prompt: This category appears in the author's evaluation scripts but has no corresponding DSPy signature in the source code, so it is excluded from this implementation.
Hence, the scope includes 13 extraction categories instead of 14.

Missing Preprocessing Logic: The original repository references a preprocessing_llm_output.py file that is not included in the public repo. I used medarc_verifiers.parsers.JSONParser plus flattening logic based on the paper's description to extract structured data from model outputs reliably.
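As a rough illustration of the parse-then-flatten step, the sketch below uses only the standard library. It is not the medarc_verifiers.parsers.JSONParser implementation and does not reproduce the missing preprocessing_llm_output.py; `extract_json` and `flatten` are hypothetical helper names, and the dotted-key output format is an assumption.

```python
import json


def extract_json(text: str) -> dict:
    """Return the outermost JSON object in a model response, or {} on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys; lists are joined into one string."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            flat[path] = "; ".join(str(v) for v in value)
        else:
            # Missing values become empty strings so every field is comparable.
            flat[path] = "" if value is None else str(value)
    return flat
```

A response with surrounding chatter, such as `Result: {"Neuro": {"motor": null}} done`, would then reduce to a single-level dict keyed by `Neuro.motor`.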

UCP/UGP prompting methods: faithful DSPy prompt reproduction.

ss8319 · 2026-01-24T03:19:08Z

This commit adds a few changes:
Implemented Zero-Shot (ZS) and Few-Shot (FS) prompts as in the paper.
Extracted the ZERO_SHOT prompt from the full (FS) prompt; it is just FS without the examples.
Added an FS/ZS toggle: implemented and verified.
Added UCP/UGP prompting methods with faithful DSPy prompt reproduction.

ss8319 · 2026-01-24T04:00:37Z

@warner-benjamin Adding a short summary for your reference.

There are 2 types of 'Prompting', Zero-Shot (ZS) and Few-Shot (FS), and 3 types of 'Methods': Unified Global Prompting (UGP), Uniform Category-Specific Prompting (UCP), and Filtered Category-Specific Prompting (FCSP).

ZS contains only instructions, while FS contains instructions plus output examples. To be clear, there is also ZS-CoT, which adds the phrase "Let's think step by step" to the prompt. The paper suggests skipping it because it did not significantly improve clinical extraction; it was only applied to Qwen2.5:32B.
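The ZS/FS distinction can be sketched as a small toggle. This is an illustration only, not the code in this PR: `PromptMode`, `build_prompt`, the `cot` flag, and the "Examples:" separator are hypothetical, and the real prompts come from the extracted DSPy signatures.

```python
from enum import Enum


class PromptMode(str, Enum):
    ZERO_SHOT = "zs"
    FEW_SHOT = "fs"


COT_PHRASE = "Let's think step by step"


def build_prompt(instructions: str, examples: list[str],
                 mode: PromptMode, cot: bool = False) -> str:
    """Assemble a prompt: ZS is the instructions alone, FS appends examples."""
    prompt = instructions
    if mode is PromptMode.FEW_SHOT:
        prompt += "\n\nExamples:\n" + "\n".join(examples)
    if cot:
        # ZS-CoT variant described above; off by default.
        prompt += "\n\n" + COT_PHRASE
    return prompt
```

Keeping ZS as a strict prefix of FS mirrors the point that ZS is just FS without the examples.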

UGP is the baseline approach: one large prompt combines all 13 categories into a single LLM call.
UCP treats each category as a separate model call, resulting in 13 calls per case report.
FCSP is very hard to reproduce, and probably not reproducible.
Like UCP, FCSP makes one call per category. However, instead of sending the whole report, you send only the relevant text segment to the corresponding category prompt.

I am not sure whether the input segments are available in the HF dataset. The authors don't provide the inputs for FCSP prompting.
This code seems to be related to this task, but it is not clear how to use it.
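To make the difference in call counts concrete, here is a minimal sketch of how UGP and UCP requests could be planned for one case report. It is an assumption-laden illustration, not the environment's actual code: `Method`, `plan_calls`, and the "Case report:" layout are hypothetical, and FCSP is left out because its input segments are unavailable.

```python
from enum import Enum


class Method(str, Enum):
    UGP = "ugp"  # one combined prompt, one call
    UCP = "ucp"  # one prompt per category, one call each


def plan_calls(report: str, category_prompts: dict[str, str],
               method: Method) -> list[str]:
    """Return the list of prompts to send for a single case report."""
    if method is Method.UGP:
        combined = "\n\n".join(category_prompts.values())
        return [f"{combined}\n\nCase report:\n{report}"]
    # UCP: every category sees the full report in its own call.
    return [f"{prompt}\n\nCase report:\n{report}"
            for prompt in category_prompts.values()]
```

With 13 category prompts, `plan_calls` yields a single request under UGP and 13 requests under UCP.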

ss8319 added 2 commits January 21, 2026 00:32

casereportbench environment

bf1b4dd

FS/ZS Toggle: implemented and verified.

72b875f

UCP/UGP prompting methods faithful DSPy prompt reproduction

ss8319 added 5 commits January 24, 2026 15:33

add numpy for metrics calculation; remove tool.hatch.build

7773b8a

update README to reflect changes

7343f5b

shorten documentation, no code change

ff2a230

fix UGP prompt loading, add enums for args

6163439

remove system prompt, not in original repo

e1ea6bf

ss8319 wants to merge 7 commits into MedARC-AI:main from ss8319:casereportbench
