This repository contains the code and evaluation scripts for the paper:
Leveraging Hierarchical Organization for Medical Multi-Document Summarization
Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
National Tsing Hua University · University of Washington · Allen Institute for AI
arXiv:2510.23104
This work explores whether hierarchical organization of inputs can improve the quality of medical multi-document summarization (MDS) compared to traditional flat summarization.
We introduce and compare three settings:
- Plain-MDS: Concatenate all source claims or documents directly.
- Hierarchical-MDS (HMDS): Inject hierarchical category labels and structure before summarization.
- Recursive-HMDS: Recursively generate intermediate summaries for each sub-category, then combine them into a higher-level summary.
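The recursive setting can be sketched as a bottom-up traversal of the hierarchy. This is a minimal illustration, not the paper's implementation: the nested-dict node format and the `summarize` callable (which would wrap an LLM call) are assumptions.

```python
def recursive_hmds(node, summarize):
    """Bottom-up summarization over one hierarchy node.

    node: {"label": str, "claims": [str], "children": [node, ...]}
    summarize: callable(label, texts) -> str, e.g. a wrapped LLM call.
    """
    # Summarize each sub-category first (intermediate summaries).
    child_summaries = [recursive_hmds(child, summarize)
                       for child in node.get("children", [])]
    # Combine this node's own claims with its children's summaries
    # into a higher-level summary.
    texts = node.get("claims", []) + child_summaries
    return summarize(node["label"], texts)
```

In the Plain and HMDS settings, by contrast, a single `summarize` call sees all claims at once (with or without the injected category labels).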
Our findings show that:
- Human experts prefer hierarchical summaries for clarity and understandability.
- Hierarchical approaches preserve factuality, coverage, and coherence while improving human preference.
- GPT-4-simulated evaluations align with human judgments in factuality and complexity but diverge on subjective aspects (clarity, relevance).
Experiments are conducted on the CHIME dataset (Hsu et al., 2024), which contains:
- Expert-written Cochrane review abstracts
- One-sentence claims extracted from included studies
- Hierarchical structures generated by LLMs and curated by experts
Each hierarchy forms a tree with multi-level nodes and an average depth of 2.5.
We sample 30 topics, each with ~25 studies, and evaluate summaries using both LLM-generated and expert-curated hierarchies.
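For concreteness, hierarchy depth can be computed as below. This is an illustrative sketch: the dict-based tree representation and the convention that a leaf has depth 1 are assumptions, not the CHIME data format.

```python
def tree_depth(node):
    """Depth of one hierarchy tree; a leaf node counts as depth 1."""
    children = node.get("children", [])
    if not children:
        return 1
    return 1 + max(tree_depth(child) for child in children)

def average_depth(trees):
    """Mean depth across a collection of topic hierarchies."""
    return sum(tree_depth(tree) for tree in trees) / len(trees)
```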
We experiment with:
- GPT-4 (0613)
- Claude 3 (Opus 20240229)
- Mistral-7B (Instruct v0.2)
Key configurations:
- Output limit: 1024 tokens
- Temperature and top-p: model defaults
- Evaluation across Plain, Hierarchical, and Recursive setups
We employ both automated metrics and expert evaluations:
- ROUGE-L (overlap)
- BERTScore (semantic similarity)
- Pyramid & Reversed-Pyramid Scores (coverage and factuality)
- FIZZ (fact-level entailment consistency)
- Prompt-based LLM evaluation with GPT-4 for informativeness, coherence, and faithfulness
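As a reference point for the overlap metric, ROUGE-L's F1 can be computed from the longest common subsequence (LCS) of reference and candidate tokens. This sketch uses plain whitespace tokenization; published results typically rely on a standard ROUGE package with its own tokenization and stemming.

```python
def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 from LCS length, with naive whitespace tokenization."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # Dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```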
Three tasks were designed:
- Task 1: Pairwise comparison across systems for preference, clarity, understandability, complexity, relevance.
- Task 2: Within-model comparison between hierarchical and non-hierarchical setups.
- Task 3: Likert-scale scoring for factuality, coverage, coherence against source abstracts.
Experts (MS/PhD in Biomedicine) annotated 30 topics, working 55 + 50 hours across tasks, compensated at USD 25/hour.
- Hierarchical MDS boosts clarity and understandability (especially for Mistral-7B).
- Larger models (GPT-4, Claude 3) show smaller gains, suggesting hierarchy benefits smaller LLMs more.
- Automated metrics fail to capture subjective dimensions; GPT-4 alignment with human judgments is moderate (κ ≈ 0.35–0.5) on factuality/coverage.
- Human experts strongly prefer hierarchical outputs even when factuality is similar.
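The κ agreement figures above are chance-corrected agreement scores; a minimal sketch of Cohen's κ for two raters is below. This is a generic illustration, assuming the standard two-rater formulation, not the exact agreement protocol used in the paper.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two parallel lists of categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement under independent marginal label distributions.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```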
| Setting | Model | Preference ↑ | Clarity ↑ | Understandability ↑ | Complexity ↑ | Relevance ↑ |
|---|---|---|---|---|---|---|
| Plain MDS | GPT-4 | 0.54 | 0.57 | 0.51 | 0.46 | 0.32 |
| HMDS | GPT-4 | 0.57 | 0.52 | 0.52 | 0.57 | 0.50 |
| Plain MDS | Claude 3 | 0.81 | 0.81 | 0.79 | 0.81 | 0.16 |
| HMDS | Claude 3 | 0.58 | 0.76 | 0.55 | 0.88 | 0.11 |
| Recursive HMDS | Mistral-7B | 0.58 | 0.62 | 0.47 | 0.62 | 0.27 |