A monorepo containing a hybrid deterministic + LLM evaluation framework for clinical SOAP notes, with a React dashboard for visualization.
Frontend Dashboard: https://soap-evaluation.vercel.app/
```
SOAP_Evaluation/
├── backend/                 # Python/FastAPI backend (deploy to Render)
│   ├── src/                 # Python source code
│   ├── requirements.txt
│   ├── Dockerfile           # For Render deployment
│   └── .env.example         # Backend environment variables
├── frontend/                # React dashboard (deploy to Vercel)
│   ├── src/                 # React source code
│   ├── package.json
│   └── .env.local.example   # Frontend environment variables
└── README.md
```
This evaluation framework assesses the quality of generated SOAP notes by comparing them against:
- Transcripts: Original doctor-patient dialogues
- Reference SOAP notes: Ground truth SOAP notes from the dataset (optional, not available in production mode)
- Generated SOAP notes: Notes to be evaluated
Two modes:
- Evaluation mode (default): Uses transcript + generated note + reference note for comprehensive evaluation
- Production mode: Uses only transcript + generated note (no reference available)
The framework flags three categories of issues:

- Missing Critical Findings - Important facts present in the reference/transcript but omitted from the generated note
- Hallucinated / Unsupported Facts - Statements in the generated note that are not grounded in the transcript/reference
- Clinical Accuracy Issues - Clinically incorrect or misleading content
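For a concrete sense of what gets flagged, a single issue record might look like the following. This is purely hypothetical: the field names are assumptions for illustration, not the backend's actual schema (see the per-note output files described below for the real format):

```python
# Hypothetical shape of one flagged issue; check per_note.jsonl for the
# real field names.
issue = {
    "category": "hallucinated_fact",
    "severity": "major",
    "span": "Patient reports a history of atrial fibrillation.",
    "explanation": "This condition is never mentioned in the transcript.",
}
```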
We use the Hugging Face dataset omi-health/medical-dialogue-to-soap-summary as our source of dialogues and SOAP notes:

- The dataset's `dialogue` column provides the doctor-patient transcript
- The dataset's `soap` column provides the reference SOAP note
- Generated notes are created by synthetically corrupting the reference notes (dropping sentences, truncating details) to simulate model output, as in the sketch below
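A minimal sketch of this setup, assuming the `datasets` library is installed; the `corrupt_note` helper is illustrative, not the framework's exact corruption logic:

```python
# Load the source dataset and simulate an imperfect generated note.
import random
from datasets import load_dataset

ds = load_dataset("omi-health/medical-dialogue-to-soap-summary", split="test")

def corrupt_note(note: str, drop_rate: float = 0.2, seed: int = 0) -> str:
    """Drop a fraction of sentences to simulate lossy model output."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in note.split(".") if s.strip()]
    kept = [s for s in sentences if rng.random() > drop_rate]
    return ". ".join(kept) + "."

example = ds[0]
transcript = example["dialogue"]     # doctor-patient transcript
reference = example["soap"]          # ground-truth SOAP note
generated = corrupt_note(reference)  # simulated note to evaluate
```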
Evaluation runs in two layers:

- Deterministic Layer - Fast, cheap metrics without LLM calls:
  - SOAP structure detection (presence of S:, O:, A:, P: sections)
  - Coverage detection (sentence-level matching between reference and generated notes)
  - Hallucination rate detection (sentences in the generated note not found in the source)
  - Negation/uncertainty-aware metrics: enhanced deterministic metrics that correctly handle negated statements (e.g., "denies fever", "no chest pain") and uncertain statements (e.g., "possible pneumonia", "rule out PE"), reducing false positives in hallucination and coverage detection (see the sketch after these lists)
  - Always computed, regardless of LLM usage
- LLM-as-a-Judge Layer - Nuanced clinical evaluation:
  - Uses OpenAI's GPT-4o-mini (configurable) to review notes
  - Identifies specific issues with categories, severity, and spans
  - Provides detailed scores for coverage, faithfulness, and accuracy

Each note receives four scores:

- Coverage (0.0-1.0): How well the note covers important facts from the transcript/reference
- Faithfulness (0.0-1.0): How closely the note sticks to the transcript/reference (absence of hallucinations)
- Accuracy (0.0-1.0): Clinical correctness and safety
- Overall Quality (0.0-1.0): Weighted combination (40% coverage + 30% faithfulness + 30% accuracy); see the sketch below
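Below is a minimal sketch of how the negation-aware matching and the weighted overall score could be computed. The cue lists, the naive period-based sentence splitting, and the 0.6 word-overlap threshold are simplifying assumptions for illustration, not the framework's actual implementation:

```python
# Illustrative cue lists; the real implementation may use a richer lexicon.
NEGATION_CUES = ("no ", "denies ", "denied ", "without ", "negative for ")
UNCERTAINTY_CUES = ("possible ", "rule out ", "suspected ", "likely ")

def polarity(sentence: str) -> str:
    """Classify a sentence as negated, uncertain, or affirmed."""
    s = sentence.lower()
    if any(cue in s for cue in NEGATION_CUES):
        return "negated"
    if any(cue in s for cue in UNCERTAINTY_CUES):
        return "uncertain"
    return "affirmed"

def sentence_facts(text: str):
    """Represent each sentence as (content-word set, polarity)."""
    facts = []
    for sent in (s.strip() for s in text.split(".") if s.strip()):
        words = frozenset(w for w in sent.lower().split() if len(w) > 3)
        if words:
            facts.append((words, polarity(sent)))
    return facts

def hallucination_rate(generated: str, source: str) -> float:
    """Fraction of generated sentences with no polarity-consistent match
    in the source (word overlap >= 0.6 AND the same polarity)."""
    src = sentence_facts(source)
    gen = sentence_facts(generated)
    if not gen:
        return 0.0
    def grounded(words, pol):
        return any(pol == sp and len(words & sw) / len(words) >= 0.6
                   for sw, sp in src)
    return sum(not grounded(w, p) for w, p in gen) / len(gen)

def overall_quality(coverage: float, faithfulness: float, accuracy: float) -> float:
    """Weighted combination: 40% coverage + 30% faithfulness + 30% accuracy."""
    return 0.4 * coverage + 0.3 * faithfulness + 0.3 * accuracy
```

With polarity in the match key, a note sentence like "denies fever" is grounded only by source sentences that also negate fever; it is no longer falsely matched against an affirmed "fever", which is exactly the false positive the negation-aware metrics are meant to remove.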
Prerequisites:

- Python 3.11+
- Node.js 18+
- OpenAI API key (for LLM evaluation)
Backend setup:

1. Navigate to the backend directory:
   ```bash
   cd backend
   ```
2. Set up environment variables:
   ```bash
   cp .env.example .env
   # Edit .env and set your OPENAI_API_KEY and other configuration
   ```
3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
4. Run evaluation:
   ```bash
   python -m src.run_eval_env
   ```
5. Start the API server:
   ```bash
   uvicorn src.api.app:app --reload --host 0.0.0.0 --port 8000
   ```

The backend API will be available at http://localhost:8000.
Frontend setup:

1. Navigate to the frontend directory:
   ```bash
   cd frontend
   ```
2. Set up environment variables:
   ```bash
   cp .env.local.example .env.local
   # Edit .env.local if needed (defaults to http://localhost:8000)
   ```
3. Install dependencies:
   ```bash
   npm install
   ```
4. Start the development server:
   ```bash
   npm run dev
   ```

The frontend will be available at http://localhost:5173.
Backend deployment (Render):

1. Create a new Web Service on Render:
   - Connect your GitHub repository
   - Set Root Directory to `backend`
   - Choose Docker as the build method (or use the Dockerfile)

2. Set environment variables in the Render dashboard, copying from `backend/.env.example`:

   ```
   USE_LLM=true
   NUM_EXAMPLES=50
   PRODUCTION_MODE=false
   DATASET_NAME=omi-health/medical-dialogue-to-soap-summary
   DATASET_SPLIT=test
   BACKEND_PORT=8000
   # Set FRONTEND_ORIGIN after deploying the frontend
   FRONTEND_ORIGIN=https://your-frontend.vercel.app
   OPENAI_API_KEY=your_key_here
   OPENAI_MODEL=gpt-4o-mini
   OPENAI_TEMPERATURE=0.0
   OUTPUT_DIR=results
   ```

3. Deploy:
   - Render will build the Docker image
   - On container start, it runs the evaluation and then starts the API server
   - Note the HTTPS URL (e.g., https://your-backend.onrender.com)
Frontend deployment (Vercel):

1. Create a new Vercel project:
   - Connect your GitHub repository
   - Set Root Directory to `frontend`
   - Framework Preset: Vite (React + Vite)

2. Set environment variables in the Vercel dashboard (use your Render backend URL):

   ```
   VITE_API_BASE_URL=https://your-backend.onrender.com
   ```

3. Build settings (should auto-detect):
   - Build Command: `npm run build`
   - Output Directory: `dist`
   - Install Command: `npm install`

4. Deploy:
   - Vercel will build and deploy the React app
   - Note the Vercel URL (e.g., https://your-app.vercel.app)

5. Update backend CORS:
   - Go back to the Render dashboard
   - Update `FRONTEND_ORIGIN` to your Vercel URL (e.g., https://your-app.vercel.app)
   - Redeploy the backend if needed
All backend configuration is in backend/.env (or set via the Render dashboard):

- `USE_LLM`: Enable/disable the LLM judge (true/false)
- `NUM_EXAMPLES`: Number of examples to evaluate
- `PRODUCTION_MODE`: Production mode, no reference notes (true/false)
- `DATASET_NAME`: Hugging Face dataset name
- `DATASET_SPLIT`: Dataset split to use (default: "test")
- `OPENAI_API_KEY`: Your OpenAI API key
- `OPENAI_MODEL`: OpenAI model to use (default: "gpt-4o-mini")
- `OPENAI_TEMPERATURE`: Temperature for the LLM (default: 0.0)
- `BACKEND_PORT`: Backend API port (default: 8000)
- `FRONTEND_ORIGIN`: Frontend origin for CORS (comma-separated for multiple origins)
- `OUTPUT_DIR`: Output directory for results (default: "results")
Frontend configuration is in frontend/.env.local (or set via the Vercel dashboard):

- `VITE_API_BASE_URL`: Backend API base URL (default: "http://localhost:8000")
The backend provides the following endpoints:
- `GET /api/summary` - Get evaluation summary statistics
- `GET /api/notes` - Get a list of notes with optional filtering
  - Query params: `min_quality`, `max_quality`, `hallucination_only`, `missing_critical_only`, `major_issues_only`
- `GET /api/notes/{example_id}` - Get detailed information for a specific note
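A quick way to exercise these endpoints from Python. The base URL and filter values are examples, and `requests` is assumed to be installed:

```python
import requests

BASE = "http://localhost:8000"  # or your Render URL

# Aggregate statistics for the whole evaluation run.
summary = requests.get(f"{BASE}/api/summary").json()

# Only low-quality notes that contain hallucinations.
notes = requests.get(
    f"{BASE}/api/notes",
    params={"max_quality": 0.5, "hallucination_only": "true"},
).json()
print(len(notes), "notes matched the filters")

# Full detail for one note (substitute a real example_id from `notes`).
# detail = requests.get(f"{BASE}/api/notes/{example_id}").json()
```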
After running evaluation, results are written to `backend/results/`:

- `per_note.jsonl` - One JSON object per line with detailed results for each note
- `summary.json` - Aggregated metrics and statistics
- `summary.csv` - The same aggregated metrics in CSV format
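Because `per_note.jsonl` holds one JSON object per line, it is easy to load for ad-hoc analysis. A minimal sketch; only the one-object-per-line format is taken from the description above:

```python
import json
from pathlib import Path

# Read every per-note result into a list of dicts.
records = [
    json.loads(line)
    for line in Path("backend/results/per_note.jsonl").read_text().splitlines()
    if line.strip()
]
print(f"Loaded {len(records)} evaluated notes")
```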
The backend uses CORS middleware to allow requests from the frontend. The `FRONTEND_ORIGIN` environment variable controls which origins are allowed. For production:

- Set `FRONTEND_ORIGIN` to your Vercel app URL (e.g., https://your-app.vercel.app)
- Multiple origins can be comma-separated: `https://app1.vercel.app,https://app2.vercel.app`
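For reference, this is roughly how a FastAPI app can honor a comma-separated `FRONTEND_ORIGIN`. `CORSMiddleware` and its parameters are the real Starlette/FastAPI API, but the exact wiring in `src/api/app.py` may differ:

```python
import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Split FRONTEND_ORIGIN on commas so multiple origins are supported.
origins = [
    o.strip()
    for o in os.getenv("FRONTEND_ORIGIN", "http://localhost:5173").split(",")
    if o.strip()
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_methods=["GET"],  # the documented endpoints are all GET
    allow_headers=["*"],
)
```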