Evaluating LLM Agents for Genetic Testing Insurance Workflows

This repository is a standalone export of `RESCUE-n8n/eval/insurance` (insurance_agent branch from https://github.com/stormliucong/RESCUE-n8n/tree/main/eval/insurance) for reproducibility.

📁 This folder contains the source code, experimental outputs, and evaluation files associated with our study: Evaluating Large Language Model Agents for Genetic Testing Insurance Workflows: An End-to-End Assessment of Retrieval and Reliability.

Requirements & Reproducibility Notes

Python 3.10+
OpenAI API key
Perplexity API key
SentenceTransformer model (all-MiniLM-L6-v2)
OpenAI text-embedding-3-small
Publicly available insurance policy documents (see the Insurance Policy Documents section below)

⚠️ Due to policy version updates over time, retrieval results may differ slightly if policies have been modified after the study period.

Installation

Clone the repository

git clone https://github.com/CptAswadu/LLMInsuranceWorkflow
cd LLMInsuranceWorkflow

Install dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set API keys

Create a .env file in the project root:

touch .env

Then add your keys to it:

OPEN_AI_API_KEY=<your_openai_key_here>
PERPLEXITY_API_KEY=<your_perplexity_key_here>
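Before launching any experiment, it can help to confirm that both keys are actually readable. The following is a minimal sketch using only the standard library. The helper names (`parse_env_text`, `missing_keys`, `load_env_file`) are hypothetical and not part of this repository, and the repository's own scripts may load the .env file differently.

```python
import os
from pathlib import Path

# Key names taken from the .env template above.
REQUIRED_KEYS = ("OPEN_AI_API_KEY", "PERPLEXITY_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read a .env file and export its entries without overriding existing variables."""
    values = parse_env_text(Path(path).read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Calling `missing_keys(load_env_file())` from the project root returns an empty list when both keys are set.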

🧠 Purpose

This study systematically evaluates the reliability of web-search-enabled LLM agents in supporting insurance workflows for genetic testing. We assess performance across four sequential tasks:

In-Network Insurance Provider Retrieval
Policy Document Retrieval
Patient-Policy Match
LLM Agent for Answering Relevant Questions

The goal is to quantify retrieval sensitivity, ranking robustness, and downstream decision accuracy.

📂 Folder Descriptions

`codes/`

Description: Contains all the scripts and notebooks for this research.

`dataset/`

Description: Contains all data sources except insurance policies utilized for this research.

`results/`

Description: Contains all the experiment results for this research.

🧪 Experimental Tasks

In-Network Provider Retrieval

Objective: Evaluate whether LLM agents can correctly identify GeneDx in-network payers.
Input: Prompt-based queries (model × prompt × iteration).
Ground Truth: dataset/In-Network_Providers_Update.csv
Output: results/name_retrieval/
Evaluation: GPT-4o-based matching and computation ('codes/name_retrieval/execute_analysis.py').
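As an illustration of the sensitivity metric for this task, the sketch below computes the fraction of ground-truth payers recovered by an agent. It deliberately swaps the GPT-4o-based matching used in the study for plain normalized string equality, so it will undercount matches that differ by more than case or punctuation. The function names are hypothetical and do not come from `execute_analysis.py`.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def retrieval_sensitivity(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of ground-truth payers that appear among the predicted names.

    Assumption: exact match after normalization stands in for the
    GPT-4o-based matching described above.
    """
    truth = {normalize_name(name) for name in ground_truth}
    if not truth:
        raise ValueError("ground_truth must not be empty")
    found = {normalize_name(name) for name in predicted}
    return len(truth & found) / len(truth)
```

A fuzzy or LLM-based matcher can be substituted for `normalize_name` equality without changing the sensitivity calculation itself.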

Policy Document Retrieval

Objective: Assess whether LLM agents retrieve the correct genetic testing coverage policies.
Input: Insurance provider name + prompt configuration.
Output: Retrieved policy documents (PDF/HTML) + MD5 verification results.
Output Location: results/policy_retrieval/
Evaluation: MD5-based ground-truth comparison ('codes/policy_retrieval/assess.py').
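The MD5-based comparison can be sketched as follows: hash the retrieved file and check whether the digest is among the known ground-truth digests. This is an illustrative outline only. The function names are hypothetical, and the actual logic in `assess.py` may differ (for example in how HTML pages are handled).

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of an in-memory document."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_ground_truth(digest: str, expected_digests: set[str]) -> bool:
    """True if the digest equals any ground-truth digest (case-insensitive)."""
    return digest.lower() in {expected.lower() for expected in expected_digests}
```

Because MD5 is sensitive to any byte-level change, a policy that a payer has re-published with even a minor revision will no longer match its ground-truth digest.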

Patient-Policy Match

Objective: Retrieve the most relevant policy document for a given synthetic patient case.
Input: Synthetic patient case + policy embedding corpus (789 policies).
Output: Ranked policy candidates (Top-K) + reranking artifacts.
Output Location: results/patient_policy_match/
Evaluation: Match rate and reranking analysis ('codes/analysis_figures/match_rate_analysis.ipynb').
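To make the Top-K match rate concrete, here is a small sketch that ranks policies by cosine similarity to a case embedding and then counts how often the ground-truth policy appears in the top K. It uses plain Python lists in place of real embeddings and omits the reranking step. All names and data structures are assumptions for illustration, not the notebook's actual code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_policies(case_vec: list[float], policy_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k policies most similar to the case embedding."""
    ranked = sorted(
        policy_vecs,
        key=lambda policy_id: cosine_similarity(case_vec, policy_vecs[policy_id]),
        reverse=True,
    )
    return ranked[:k]


def top_k_match_rate(ranked: dict[str, list[str]], truth: dict[str, str], k: int) -> float:
    """Fraction of cases whose ground-truth policy ID is among the first k candidates."""
    if not truth:
        raise ValueError("truth must not be empty")
    hits = sum(1 for case_id, gold in truth.items() if gold in ranked.get(case_id, [])[:k])
    return hits / len(truth)
```

In practice the vectors would come from one of the embedding models listed under Requirements, and `truth` would be derived from the ground-truth file.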

LLM Agent for Answering Relevant Questions

Objective: Evaluate downstream insurance QA accuracy under document conditioning.
Input: Patient case ± policy document.
Settings:
- Baseline (no document)
- All-Correct (ground-truth document)
- All-Incorrect (high-similarity incorrect document)
Output Location: results/LLM_QnA/
Evaluation: Accuracy, Adjusted_Accuracy, case-level statistical analysis between settings, question-wise statistical analysis ('codes/analysis_figures/Analysis.ipynb').
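For orientation, plain accuracy per setting could be computed along these lines. The record fields (`setting`, `predicted`, `expected`) are hypothetical and may not match the result files in results/LLM_QnA/. Adjusted_Accuracy and the statistical comparisons are not reproduced here, since their definitions live in the notebook and manuscript rather than in this README.

```python
from collections import defaultdict


def accuracy_by_setting(records: list[dict]) -> dict[str, float]:
    """Compute the fraction of correct answers for each experimental setting.

    Each record is assumed to carry 'setting', 'predicted', and 'expected'
    fields; answers are compared case-insensitively after trimming whitespace.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record in records:
        setting = record["setting"]
        total[setting] += 1
        predicted = str(record["predicted"]).strip().lower()
        expected = str(record["expected"]).strip().lower()
        if predicted == expected:
            correct[setting] += 1
    return {setting: correct[setting] / total[setting] for setting in total}
```

Comparing the resulting values across Baseline, All-Correct, and All-Incorrect gives a first view of how much the supplied document helps or misleads the agent.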

⚠️ The LLM Agent for Answering Relevant Questions task depends on the outputs of the Patient-Policy Match task when running document-conditioned settings (matched, unmatched, all_correct, all_incorrect).

The Patient-Policy Match task must therefore be run first to generate the required policy-document assignments.

The 'Baseline' setting (patient narrative only, without policy documents) can be executed independently.

📄 Insurance Policy Documents

A total of 789 publicly available genetic testing policy documents were used for embedding and retrieval experiments.

These documents are not redistributed in this repository due to size and licensing considerations.

In this study, three different document sets were used for different evaluation tasks:

insurance_policy/
Full corpus (789 documents) used for the main embedding and retrieval experiments.
insurance_ret/
A subset of the corpus used for the policy document retrieval evaluation.
insurance_answer/
A small document set used only for testing and debugging the patient–policy match and LLM QA tasks.

To request access to the collected policy documents, please email Dr. Cong Liu (Cong.Liu@childrens.harvard.edu).

All policy documents are publicly accessible via the official payer websites:

Aetna
Blue Cross Blue Shield Federal Employee Program (BCBS FEP)
Cigna
UnitedHealthcare

The experiments were conducted using the policy snapshot available at the time of study.

To reproduce embedding-based experiments:

Download the relevant genetic testing policies from payer websites.
Place them under a local directory.

Ground-truth MD5 mappings between synthetic cases and policy documents are provided in dataset/final_ground_truth.json.

📊 Evaluation Datasets

All evaluation datasets are located in the dataset/ folder:

'In-Network_Providers_Update.csv': ground truth for the In-Network Provider Retrieval task
'qna_free_text_sample.json': patient narrative samples for the Patient-Policy Match and LLM Agent for Answering Relevant Questions tasks
'final_ground_truth.json': ground-truth answers for the Patient-Policy Match and LLM Agent for Answering Relevant Questions tasks

📘 Related Manuscript

This project supports the manuscript titled:

"Evaluating Large Language Model Agents for Genetic Testing Insurance Workflows: An End-to-End Assessment of Retrieval and Reliability"

Key contributions:

Evaluation of real-time LLM-based retrieval
Task-specific prompting and evaluation metrics
Assessment of LLM agent performance on an end-to-end genetic testing insurance workflow
Quantification of failure modes: retrieval sensitivity and abstention rate

@misc{kim2026evaluating,
  title        = {Evaluating Large Language Model Agents for Genetic Testing Insurance Workflows: An End-to-End Assessment of Retrieval and Reliability},
  author       = {Kim, Junyoung and Ravi, Kamalakkannan and Liu, Cong},
  year         = {2026},
  eprint       = {arXiv:XXXX.XXXXX},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
}
