Casereportbench environment #99

Open
ss8319 wants to merge 7 commits into MedARC-AI:main from ss8319:casereportbench

ss8319 commented Jan 20, 2026

This PR implements the CaseReportBench environment for dense clinical information extraction from case reports.

Metric Replication: Implemented Token Set Ratio (TSR), BLEU-1/4, ROUGE-L, Omission, and Hallucination metrics exactly as found in the author's eval_metrics.py.
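For readers unfamiliar with the metrics, here is a minimal sketch of one of them, ROUGE-L, computed as an F1 score over the longest common subsequence of whitespace tokens. The function names, the lowercasing, the whitespace tokenization, and the balanced F1 weighting are assumptions for illustration; the authors' eval_metrics.py may rely on a library with different tokenization or weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction):
    """ROUGE-L F1 over lowercased whitespace tokens (illustrative only)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that drops a token from the reference keeps full precision but loses recall, which is why the score falls below 1.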

Prompts: Extracted DSPy prompts.

Two items remain ambiguous in the source repository; I have opened an issue for them in the original repo:

Missing Lab_Image Prompt: This category appears in the author's evaluation scripts but has no corresponding DSPy signature in the source code. It has been excluded from this implementation, so the scope covers 13 extraction categories instead of 14.

Missing Preprocessing Logic: The original repository references a preprocessing_llm_output.py file that is not included in the public repo. In its place, I used medarc_verifiers.parsers.JSONParser together with flattening logic based on the paper's description to extract structured data from model outputs reliably.
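As a rough illustration of what "parse, then flatten" can look like, here is a sketch that uses only the standard library. It does not reproduce medarc_verifiers.parsers.JSONParser or the authors' missing script; the function names, the dotted-path keys, the "; " list separator, and the example category names are all assumptions.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: string} pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        # Joining list items with "; " is an assumption for illustration.
        flat[prefix] = "; ".join(str(item) for item in value)
    else:
        flat[prefix] = "" if value is None else str(value)
    return flat


def parse_model_output(raw):
    """Parse the outermost {...} span in raw model text and flatten it.

    Returns an empty dict when no valid JSON object is found.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        return flatten(json.loads(raw[start:end + 1]))
    except json.JSONDecodeError:
        return {}
```

Returning an empty dict on failure keeps scoring code simple, at the cost of treating unparseable output the same as an empty extraction.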

UCP/UGP prompting methods
Faithful DSPy prompt reproduction

ss8319 commented Jan 24, 2026

This commit adds a few changes:
Implemented the Zero-Shot (ZS) and Few-Shot (FS) prompts as in the paper.
Extracted the ZERO_SHOT prompt from the full FS prompt; it is the FS prompt without the examples.
Added an FS/ZS toggle (implemented and verified).
Added the UCP/UGP prompting methods.
Faithful DSPy prompt reproduction.
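The relationship between the two prompt types (ZS is FS minus the examples) can be sketched as a single builder with a toggle. The function name, the argument names, and the "Examples:" and "Case report:" section labels are hypothetical and are not taken from the PR code or the DSPy prompts.

```python
def build_prompt(instructions, examples, case_report, few_shot=True):
    """Assemble a prompt; with few_shot=False the examples are omitted."""
    parts = [instructions.strip()]
    if few_shot and examples:
        parts.append("Examples:")
        parts.extend(example.strip() for example in examples)
    parts.append("Case report:\n" + case_report.strip())
    return "\n\n".join(parts)
```

Deriving both variants from one template means the instructions cannot drift apart between the ZS and FS runs.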

ss8319 commented Jan 24, 2026

@warner-benjamin Adding a short summary for your reference.

There are two types of prompting, Zero-Shot (ZS) and Few-Shot (FS), and three methods: Unified Global Prompting (UGP), Uniform Category-Specific Prompting (UCP), and Filtered Category-Specific Prompting (FCSP).

ZS contains only instructions, while FS contains instructions plus output examples. There is also ZS-CoT, which adds the phrase "Let's think step by step" to the prompt. The paper suggests skipping it, as it did not significantly improve clinical extraction; it was only applied to Qwen2.5:32B.

UGP is the baseline approach: one large prompt combines all 13 categories into a single LLM call.
UCP treats each category as a separate model call, resulting in 13 calls per case report.
FCSP is very hard to reproduce and is probably not reproducible. Like UCP, it makes one call per category (14 in the original setup, 13 in this implementation, which excludes Lab_Image). However, instead of sending the whole report, it sends only the relevant text segment to the corresponding category prompt.

I am not sure whether the input segments are available in the HF dataset. The authors don't provide the inputs for FCSP prompting.
This code seems to be related to this task, but it is not clear how to use it.
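To make the difference in call counts concrete, here is a minimal sketch of the UGP and UCP call patterns. The `call_llm` callable, the function names, and the prompt wording are hypothetical placeholders rather than the PR's actual code; FCSP is left out because its per-category input segments are not available.

```python
def run_ugp(report, categories, call_llm):
    """UGP: one prompt covering every category, so a single LLM call."""
    prompt = (
        "Extract the following categories: "
        + ", ".join(categories)
        + "\n\n"
        + report
    )
    return call_llm(prompt)


def run_ucp(report, categories, call_llm):
    """UCP: one LLM call per category, each receiving the full report."""
    return {
        category: call_llm(f"Extract the category: {category}\n\n{report}")
        for category in categories
    }
```

With 13 categories, `run_ugp` invokes `call_llm` once and `run_ucp` invokes it 13 times per case report.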
