Leveraging Hierarchical Organization for Medical Multi-Document Summarization (HMDS)


This repository contains the code and evaluation scripts for the paper:

Leveraging Hierarchical Organization for Medical Multi-Document Summarization
Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
National Tsing Hua University · University of Washington · Allen Institute for AI
arXiv:2510.23104


🧠 Overview

This work explores whether hierarchical organization of inputs can improve the quality of medical multi-document summarization (MDS) compared to traditional flat summarization.

We introduce and compare three settings:

  1. Plain-MDS – Concatenate all source claims or documents directly.
  2. Hierarchical-MDS (HMDS) – Inject hierarchical category labels and structures before summarization.
  3. Recursive-HMDS – Generate intermediate summaries for each sub-category recursively and then combine them into a higher-level summary.
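
The three settings above can be sketched as input-construction strategies. This is a minimal illustration, not the repository's actual API: the tree schema (`label`, `claims`, `children`) and all function names are assumptions.

```python
# Hypothetical sketch of the three settings; names and schema are illustrative.

def plain_mds(claims):
    # Plain-MDS: concatenate all source claims directly.
    return "\n".join(claims)

def hierarchical_mds(tree, depth=0):
    # HMDS: inject the category label of each node before its claims,
    # preserving the hierarchy as nested headings.
    lines = ["#" * (depth + 1) + " " + tree["label"]]
    lines += tree.get("claims", [])
    for child in tree.get("children", []):
        lines.append(hierarchical_mds(child, depth + 1))
    return "\n".join(lines)

def recursive_hmds(tree, summarize):
    # Recursive-HMDS: summarize each sub-category first, then combine the
    # intermediate summaries into a higher-level summary.
    parts = list(tree.get("claims", []))
    parts += [recursive_hmds(child, summarize) for child in tree.get("children", [])]
    return summarize(tree["label"], parts)
```

In the Recursive-HMDS setting, `summarize` would be a call to the underlying LLM; here it is left abstract so the control flow stays visible.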

Our findings show that:

  • Human experts prefer hierarchical summaries for clarity and understandability.
  • Hierarchical approaches preserve factuality, coverage, and coherence while improving human preference.
  • GPT-4-simulated evaluations align with human judgments in factuality and complexity but diverge on subjective aspects (clarity, relevance).

🧩 Dataset

Experiments are conducted on the CHIME dataset (Hsu et al., 2024), which contains:

  • Expert-written Cochrane review abstracts
  • One-sentence claims extracted from included studies
  • Hierarchical structures generated by LLMs and curated by experts

Each hierarchy forms a tree with multi-level nodes and an average depth of 2.5.
We sample 30 topics, each with ~25 studies, and evaluate summaries using both LLM-generated and expert-curated hierarchies.
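
A CHIME-style hierarchy and its depth (averaging 2.5 in the dataset) might be measured as below; the dict representation is an assumption for illustration, not the dataset's actual format.

```python
# Hypothetical tree representation; the real CHIME format may differ.

def depth(node):
    # A leaf node has depth 1; each level of children adds one.
    children = node.get("children", [])
    if not children:
        return 1
    return 1 + max(depth(c) for c in children)

hierarchy = {
    "label": "Intervention",
    "children": [
        {"label": "Drug A", "children": [{"label": "Dosage"}]},
        {"label": "Drug B"},
    ],
}
print(depth(hierarchy))  # → 3
```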


🧪 Models & Settings

We experiment with:

  • GPT-4 (0613)
  • Claude 3 (Opus 20240229)
  • Mistral-7B (Instruct v0.2)

Key configurations:

  • Output limit: 1024 tokens
  • Temperature and top-p: model defaults
  • Evaluation across Plain, Hierarchical, and Recursive setups
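
The experimental grid above can be written down as a small configuration sketch. The keys and model identifier strings are illustrative assumptions; temperature and top-p are left unset to fall back to each model's defaults, as in the paper's setup.

```python
# Hedged sketch of the generation settings; keys are illustrative.
GENERATION_CONFIG = {
    "max_output_tokens": 1024,  # output limit from the setup above
    "temperature": None,        # None = use the model's default
    "top_p": None,              # None = use the model's default
}

SETTINGS = ["plain", "hierarchical", "recursive"]
MODELS = ["gpt-4-0613", "claude-3-opus-20240229", "mistral-7b-instruct-v0.2"]

# Every (setting, model) pair is evaluated.
runs = [(s, m) for s in SETTINGS for m in MODELS]
print(len(runs))  # → 9
```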

📊 Evaluation

We employ both automated metrics and expert evaluations:

Automated Metrics

  • ROUGE-L (overlap)
  • BERTScore (semantic similarity)
  • Pyramid & Reversed-Pyramid Scores (coverage and factuality)
  • FIZZ (fact-level entailment consistency)
  • LLM-prompt-based evaluation with GPT-4 for informativeness, coherence, and faithfulness
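
As a concrete reference for the first metric in the list, here is a toy ROUGE-L sketch computed as LCS-based F1 over whitespace tokens. The actual evaluations would use a standard ROUGE package with stemming and tokenization; this version only illustrates the computation.

```python
# Toy ROUGE-L: longest-common-subsequence F1 over whitespace tokens.

def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```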

Human Expert Evaluation

Three tasks were designed:

  1. Task 1 – Pairwise comparison across systems for preference, clarity, understandability, complexity, relevance.
  2. Task 2 – Within-model comparison between hierarchical and non-hierarchical setups.
  3. Task 3 – Likert-scale scoring for factuality, coverage, coherence against source abstracts.

Experts (MS/PhD in Biomedicine) annotated 30 topics, working 55 + 50 hours across tasks, compensated at USD 25/hour.

Key Findings

  • Hierarchical MDS boosts clarity and understandability (especially for Mistral-7B).
  • Larger models (GPT-4, Claude 3) show smaller gains, suggesting hierarchy benefits smaller LLMs more.
  • Automated metrics fail to capture subjective dimensions; GPT-4 alignment κ ≈ 0.35–0.5 on factuality/coverage.
  • Human experts strongly prefer hierarchical outputs even when factuality is similar.

📈 Results Summary

| Setting | Model | Preference ↑ | Clarity ↑ | Understandability ↑ | Complexity ↓ | Relevance ↑ |
|---|---|---|---|---|---|---|
| Plain MDS | GPT-4 | 0.54 | 0.57 | 0.51 | 0.46 | 0.32 |
| HMDS | GPT-4 | 0.57 | 0.52 | 0.52 | 0.57 | 0.50 |
| Plain MDS | Claude 3 | 0.81 | 0.81 | 0.79 | 0.81 | 0.16 |
| HMDS | Claude 3 | 0.58 | 0.76 | 0.55 | 0.88 | 0.11 |
| Recursive HMDS | Mistral-7B | 0.58 | 0.62 | 0.47 | 0.62 | 0.27 |
