This repository contains the code and evaluation scripts for the paper:
Leveraging Hierarchical Organization for Medical Multi-Document Summarization
Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
National Tsing Hua University · University of Washington · Allen Institute for AI
arXiv:2510.23104
This work explores whether hierarchical organization of inputs can improve the quality of medical multi-document summarization (MDS) compared to traditional flat summarization.
We introduce and compare three settings:
- Plain-MDS: Concatenate all source claims or documents directly.
- Hierarchical-MDS (HMDS): Inject hierarchical category labels and structure before summarization.
- Recursive-HMDS: Recursively generate intermediate summaries for each sub-category, then combine them into a higher-level summary.
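The recursive setting can be sketched as a bottom-up traversal of the hierarchy. This is a minimal illustration, not the paper's implementation: the nested-dict node format and the `summarize` callable (which would wrap an LLM call) are assumptions.

```python
def recursive_hmds(node, summarize):
    """Bottom-up summarization over one hierarchy node.

    node: {"label": str, "claims": [str], "children": [node, ...]}
    summarize: callable(label, texts) -> str, e.g. a wrapped LLM call.
    """
    # Summarize each sub-category first (intermediate summaries).
    child_summaries = [recursive_hmds(child, summarize)
                       for child in node.get("children", [])]
    # Combine this node's own claims with its children's summaries
    # into a higher-level summary.
    texts = node.get("claims", []) + child_summaries
    return summarize(node["label"], texts)
```

In the Plain and HMDS settings, by contrast, a single `summarize` call sees all claims at once (with or without the injected category labels).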
Our findings show that:
- Human experts prefer hierarchical summaries for clarity and understandability.
- Hierarchical approaches preserve factuality, coverage, and coherence while improving human preference.
- GPT-4-simulated evaluations align with human judgments in factuality and complexity but diverge on subjective aspects (clarity, relevance).
Experiments are conducted on the CHIME dataset (Hsu et al., 2024), which contains:
- Expert-written Cochrane review abstracts
- One-sentence claims extracted from included studies
- Hierarchical structures generated by LLMs and curated by experts
Each hierarchy forms a tree with multi-level nodes and an average depth of 2.5.
We sample 30 topics, each with ~25 studies, and evaluate summaries using both LLM-generated and expert-curated hierarchies.
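For concreteness, hierarchy depth can be computed as below. This is an illustrative sketch: the dict-based tree representation and the convention that a leaf has depth 1 are assumptions, not the CHIME data format.

```python
def tree_depth(node):
    """Depth of one hierarchy tree; a leaf node counts as depth 1."""
    children = node.get("children", [])
    if not children:
        return 1
    return 1 + max(tree_depth(child) for child in children)

def average_depth(trees):
    """Mean depth across a collection of topic hierarchies."""
    return sum(tree_depth(tree) for tree in trees) / len(trees)
```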
We experiment with:
- GPT-4 (0613)
- Claude 3 (Opus 20240229)
- Mistral-7B (Instruct v0.2)
Key configurations:
- Output limit: 1024 tokens
- Temperature and top-p: model defaults
- Evaluation across Plain, Hierarchical, and Recursive setups
We employ both automated metrics and expert evaluations:
- ROUGE-L (overlap)
- BERTScore (semantic similarity)
- Pyramid & Reversed-Pyramid Scores (coverage and factuality)
- FIZZ (fact-level entailment consistency)
- Prompt-based LLM evaluation with GPT-4 for informativeness, coherence, and faithfulness
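As a reference point for the overlap metric, ROUGE-L's F1 can be computed from the longest common subsequence (LCS) of reference and candidate tokens. This sketch uses plain whitespace tokenization; published results typically rely on a standard ROUGE package with its own tokenization and stemming.

```python
def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 from LCS length, with naive whitespace tokenization."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # Dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```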
Three tasks were designed:
- Task 1: Pairwise comparison across systems for preference, clarity, understandability, complexity, relevance.
- Task 2: Within-model comparison between hierarchical and non-hierarchical setups.
- Task 3: Likert-scale scoring for factuality, coverage, coherence against source abstracts.
Experts (MS/PhD in Biomedicine) annotated 30 topics, working 55 + 50 hours across tasks, compensated at USD 25/hour.
- Hierarchical MDS boosts clarity and understandability (especially for Mistral-7B).
- Larger models (GPT-4, Claude 3) show smaller gains, suggesting hierarchy benefits smaller LLMs more.
- Automated metrics fail to capture subjective dimensions; GPT-4 alignment with human judgments is moderate (κ ≈ 0.35–0.5) on factuality/coverage.
- Human experts strongly prefer hierarchical outputs even when factuality is similar.
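The κ agreement figures above are chance-corrected agreement scores; a minimal sketch of Cohen's κ for two raters is below. This is a generic illustration, assuming the standard two-rater formulation, not the exact agreement protocol used in the paper.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two parallel lists of categorical labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement under independent marginal label distributions.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```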
| Setting | Model | Preference ↑ | Clarity ↑ | Understandability ↑ | Complexity ↑ | Relevance ↑ |
|---|---|---|---|---|---|---|
| Plain MDS | GPT-4 | 0.54 | 0.57 | 0.51 | 0.46 | 0.32 |
| HMDS | GPT-4 | 0.57 | 0.52 | 0.52 | 0.57 | 0.50 |
| Plain MDS | Claude 3 | 0.81 | 0.81 | 0.79 | 0.81 | 0.16 |
| HMDS | Claude 3 | 0.58 | 0.76 | 0.55 | 0.88 | 0.11 |
| Recursive HMDS | Mistral-7B | 0.58 | 0.62 | 0.47 | 0.62 | 0.27 |