Inspiration & Reference: This project is a scaled-down, edge-optimized implementation inspired by the architectural and ethical principles outlined in Fisher et al., "HEAL-Summ: a lightweight and ethical framework for accessible summarization of health information."
HEAL-Summ-lite is a decoupled, multi-stage NLP pipeline designed to summarize complex public health advisories (such as CDC notices) into accessible, plain language. Building on the core concepts of the original HEAL-Summ framework, this "lite" version is specifically engineered for edge deployment, utilizing small-parameter models and deterministic heuristics rather than computationally expensive LLM-as-a-judge systems.
The pipeline consists of a 3.8B parameter generator followed by two deterministic safety gates:
- Generator (`microsoft/Phi-4-mini-instruct`, 3.8B): Selected because it dominates its weight class on the Open LLM Leaderboard (IFEval: 73.78, BBH: 38.74). Furthermore, as demonstrated in the HEAL-Summ literature, the Phi family consistently achieves the most accessible Flesch-Kincaid Grade Levels (FKGL).
- Gate 1 | Readability (FKGL): A deterministic check using the Flesch-Kincaid Grade Level formula to ensure summaries meet a strict `<8.0` threshold. Failures do not abort the process; instead, they trigger an autonomous agentic retry loop instructing the LLM to simplify the text.
- Gate 2 | Hallucination Check (NEHR): A CPU-bound Extended Named Entity Hallucination Risk (NEHR) check using spaCy. It extracts numbers (`NUM`) and entities (`ORG`, `GPE`). To counter brittle statistical NER tagging, it employs a deterministic substring fallback: any extracted entity that appears literally anywhere in the source text (case-insensitive) is cleared. Only strictly fabricated strings trigger a human-review flag.
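The two gates can be sketched as below. The syllable counter is a rough vowel-group heuristic and the entity list is passed in as plain strings (in the pipeline they would come from spaCy, e.g. `ORG`/`GPE` spans plus `NUM` tokens); both simplifications are assumptions for illustration, not the exact implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; a dictionary-based counter
    would be more accurate (assumption for this sketch)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)


def passes_readability(summary: str, threshold: float = 8.0) -> bool:
    """Gate 1: deterministic FKGL check against the <8.0 threshold."""
    return fkgl(summary) < threshold


def nehr_flags(summary_entities: set[str], source_text: str) -> set[str]:
    """Gate 2: set-difference rule with the substring fallback.
    An entity extracted from the summary is cleared if it appears
    literally anywhere in the source (case-insensitive); anything
    left over is a strictly fabricated string for human review."""
    src = source_text.lower()
    return {e for e in summary_entities if e.lower() not in src}
```

For example, against the source "The WHO reported 3 cases in Guinea.", a summary entity set `{"WHO", "Atlantis"}` flags only `"Atlantis"`; `"WHO"` is cleared by the substring fallback even if the NER tagger mislabels it.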
- Draft-Embedded Agentic Retries: To ensure 100% reproducibility, the generator uses deterministic greedy decoding (`do_sample=False`). Because greedy decoding typically causes retry loops to output identical text, the pipeline dynamically embeds the failed draft into the retry prompt, forcing the model to explicitly "edit" its mistakes rather than regenerate from scratch.
- Zero-Tolerance Risk Mitigation: In public health communications, missing or fabricated numbers are the highest-risk vector. The NEHR gate enforces a strict 0% tolerance set-difference rule for entity hallucination.
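The draft-embedded retry can be sketched as follows. The prompt wording is an illustrative assumption, and `generate` stands in for a greedy-decoded (`do_sample=False`) call to the model; neither is the pipeline's exact code.

```python
from typing import Callable


def build_retry_prompt(source: str, failed_draft: str, fkgl_score: float) -> str:
    """Embed the failed draft so greedy decoding edits it rather than
    reproducing the same text (prompt text is an assumption)."""
    return (
        f"Your previous summary scored FKGL {fkgl_score:.1f} (target: below 8.0).\n"
        "Edit it: keep every number and organization, shorten long sentences, "
        "and replace complex words with simpler ones.\n\n"
        f"PREVIOUS DRAFT:\n{failed_draft}\n\n"
        f"SOURCE ADVISORY:\n{source}\n\n"
        "REVISED SUMMARY:"
    )


def summarize_with_retries(
    source: str,
    generate: Callable[[str], str],  # greedy-decoded model call
    score: Callable[[str], float],   # FKGL scorer
    max_retries: int = 3,
) -> str:
    """Generate, then retry while the readability gate fails,
    embedding each failed draft into the next prompt."""
    draft = generate(f"Summarize in plain language:\n\n{source}")
    for _ in range(max_retries):
        s = score(draft)
        if s < 8.0:
            break
        draft = generate(build_retry_prompt(source, draft, s))
    return draft
```

Because the failed draft is in the prompt, a deterministic model sees a different input on each retry, so identical-output loops are avoided without sampling.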
- The FKGL Domain Clash: Despite the retry loop enforcing `<8` words per sentence and simpler synonyms, the pipeline struggles to consistently reach the `<8.0` threshold on dense epidemiological texts (e.g., Ebola and Marburg notices). FKGL heavily penalizes syllables, and stripping out all polysyllabic medical terminology (e.g., "hemorrhagic", "incubation") risks degrading clinical accuracy.
- Word-to-Digit Conversion: The NEHR check suffers from formatting false positives. If the LLM correctly summarizes "three days" as "3 days", the system flags it because the character "3" has zero substring presence in the source text.
- Domain-Specific Readability: Explore alternative readability metrics (such as the SMOG index or health-specific formulas) that do not artificially inflate scores due to necessary medical terminology.
- Pipeline Upgrades: Add a word-to-digit normalization pre-processing step to fix the remaining false positives in the NEHR check.
- Semantic Verification: To catch intrinsic semantic hallucinations (where numbers are correct but relationships are distorted), upgrade the safety heuristic to a claim-level factual consistency encoder like MiniCheck (Tang et al., 2024).
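The word-to-digit normalization step could be prototyped with a small lookup table. The mapping below covers only single-word numbers and is purely illustrative; a production pre-processor would need to handle teens, tens, and compounds like "twenty-one".

```python
import re

# Minimal illustrative mapping (assumption: real coverage would be wider).
_WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

_PATTERN = re.compile(r"\b(" + "|".join(_WORD_TO_DIGIT) + r")\b", re.IGNORECASE)


def normalize_numbers(text: str) -> str:
    """Rewrite number words as digits so a summary's '3 days' gains a
    literal substring match against a source that says 'three days'."""
    return _PATTERN.sub(lambda m: _WORD_TO_DIGIT[m.group(0).lower()], text)
```

Applied to the source text before the NEHR substring fallback, this clears the "three days" vs. "3 days" false positive; word boundaries (`\b`) keep embedded matches like the "one" in "alone" untouched.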