The Surgical Agentic Framework Demo is a multimodal agentic AI framework tailored for surgical procedures. It supports:
- Speech-to-Text: Real-time audio is captured and transcribed by Whisper.
- VLM/LLM-based Conversational Agents: A selector agent decides which specialized agent to invoke:
  - ChatAgent for general Q&A,
  - NotetakerAgent to record specific notes,
  - AnnotationAgent to automatically annotate progress in the background,
  - PostOpNoteAgent to summarize all data into a final post-operative note.
- Text-to-Speech: The system can speak the AI's responses aloud when TTS is enabled, using either a local TTS model (Coqui) or the ElevenLabs API.
- Computer Vision / Multimodal Features: Supported via a fine-tuned VLM (Vision-Language Model) served by vLLM.
- Video Upload and Processing: Support for uploading and analyzing surgical videos.
- Post-Operation Note Generation: Automatic generation of structured post-operative notes based on the procedure data.
- Microphone: The user clicks "Start Mic" in the web UI, or types a question.
- Whisper ASR: Transcribes speech into text (via `servers/whisper_online_server.py`).
- SelectorAgent: Receives text from the UI, corrects it if needed, and decides which agent should handle it:
  - ChatAgent (general Q&A about the procedure)
  - NotetakerAgent (records a note with timestamp + optional image frame)
- In the background, AnnotationAgent also generates structured "annotations" every 10 seconds.
- NotetakerAgent: If chosen, logs the note in a JSON file.
- AnnotationAgent: Runs automatically, storing procedure annotations in `procedure_..._annotations.json`.
- PostOpNoteAgent (optional final step): Summarizes the entire procedure, reading from both the annotation JSON and the notetaker JSON, producing a final structured post-op note.
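This dispatch flow can be pictured as a small routing loop. Below is a minimal sketch; the class names and the keyword heuristic are hypothetical stand-ins for the real LLM-driven implementations in `agents/`:

```python
# Minimal sketch of the dispatch flow described above. Class names and the
# keyword heuristic are illustrative; the real SelectorAgent
# (agents/selector_agent.py) uses the LLM itself to classify each utterance.

class ChatAgent:
    def respond(self, text: str) -> str:
        return f"[chat answer to: {text!r}]"

class NotetakerAgent:
    def respond(self, text: str) -> str:
        return f"[note recorded: {text!r}]"

class SelectorAgent:
    def route(self, text: str) -> str:
        # Toy stand-in for the LLM-based routing decision.
        return "notetaker" if text.lower().startswith("take a note") else "chat"

agents = {"chat": ChatAgent(), "notetaker": NotetakerAgent()}
selector = SelectorAgent()

for utterance in [
    "Take a note: The gallbladder is severely inflamed.",
    "What are the next steps after dissecting the cystic duct?",
]:
    target = selector.route(utterance)
    print(f"{target}: {agents[target].respond(utterance)}")
```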
- Python 3.12 or higher
- Node.js 14.x or higher
- CUDA-compatible GPU (recommended) for model inference
- Microphone for voice input (optional)
- 16GB+ VRAM recommended
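To confirm the GPU and its VRAM up front, you can query the driver directly (`nvidia-smi` ships with the NVIDIA driver):

```bash
# List GPU name and total VRAM; look for 16 GiB or more
nvidia-smi --query-gpu=name,memory.total --format=csv
```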
- Clone or Download this repository:

```bash
git clone https://github.com/Project-MONAI/VLM-Surgical-Agent-Framework.git
cd VLM-Surgical-Agent-Framework
```
- Setup vLLM (Optional)
vLLM is already configured in the project scripts. If you need to set up a custom vLLM server, see https://docs.vllm.ai/en/latest/getting_started/installation.html
- Install Dependencies:

```bash
conda create -n surgical_agent_framework python=3.12
conda activate surgical_agent_framework
pip install -r requirements.txt
```

Note for Linux (PyAudio build): If `pip install pyaudio` fails with a missing-header error like `portaudio.h: No such file or directory`, install the PortAudio development package first, then rerun pip install:

```bash
sudo apt-get update && sudo apt-get install -y portaudio19-dev
pip install -r requirements.txt
```
- Install Node.js dependencies (for UI development). Before installing, verify your Node/npm versions (Node ≥ 14; 18 LTS recommended):

```bash
node -v && npm -v
npm install
```
- Models Folder:
  - Where to put things:
    - LLM checkpoints live in `models/llm/`
    - Whisper (speech-to-text) checkpoints live in `models/whisper/` (they are downloaded automatically at runtime the first time you invoke Whisper).
  - Default LLM: This repository is pre-configured for NVIDIA Llama-3.2-11B-Vision-Surgical-CholecT50, a surgical-domain fine-tuned variant of Llama 3.2-11B. You may replace it with a fine-tuned VLM of your choosing. Download the default model from Hugging Face with `huggingface-cli`:

```bash
# Download the checkpoint into the expected folder
huggingface-cli download nvidia/Llama-3.2-11B-Vision-Surgical-CholecT50 \
  --local-dir models/llm/Llama-3.2-11B-Vision-Surgical-CholecT50 \
  --local-dir-use-symlinks False
```
  - Serving engine: All LLMs are served through vLLM for streaming. Change the model path once in `configs/global.yaml` under `model_name`; both the agents and `scripts/run_vllm_server.sh` read it. You can override the path at runtime with the `VLLM_MODEL_NAME` environment variable. To enable auto-download when the model folder is missing, set `model_repo` in `configs/global.yaml` (or export `MODEL_REPO`). A sketch of these keys follows the folder layout below.
  - Resulting folder layout:

```
models/
├── llm/
│   └── Llama-3.2-11B-Vision-Surgical-CholecT50/   ← LLM model files
└── whisper/                                       ← Whisper models (auto-downloaded)
```
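For reference, the keys discussed under "Serving engine" might look like this in `configs/global.yaml`. This is only a sketch of those keys; the real file likely contains more settings, and its exact layout may differ:

```yaml
# configs/global.yaml (sketch: only the keys discussed above)
model_name: models/llm/Llama-3.2-11B-Vision-Surgical-CholecT50  # path served by vLLM
model_repo: nvidia/Llama-3.2-11B-Vision-Surgical-CholecT50      # optional Hugging Face repo for auto-download
```

At runtime you can point the stack at a different checkpoint without editing the file, e.g. `VLLM_MODEL_NAME=models/llm/my-model ./scripts/start_app.sh` (assuming the scripts read the variable as described above).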
If you want to adapt the framework to a different procedure (e.g., appendectomy, colectomy), you can fine‑tune a VLM and plug it into this stack with only config file changes. See:
- FINETUNE.md — end‑to‑end guide covering:
- Data curation and scene metadata
- Visual‑instruction data generation (teacher–student)
- Packing data in LLaVA‑style format
- Training (LoRA/QLoRA) and validation
- Exporting and serving with vLLM, and updating configs
- Setup:
  - Edit `scripts/start_app.sh` if you need to change ports.
  - Edit `scripts/run_vllm_server.sh` if you need to change quantization or VRAM utilization (4-bit quantization requires ~10GB VRAM). Model selection is controlled via `configs/global.yaml`.
- Create necessary directories:

```bash
mkdir -p annotations uploaded_videos
```
For easier deployment and isolation, you can use Docker containers instead of the traditional installation:

```bash
cd docker
./run-surgical-agents.sh
```
This will automatically download models, build all necessary containers, and start the services. See docker/README.md for detailed Docker deployment instructions.
- Run the full stack with all services:

```bash
npm start
```

Or using the script directly:

```bash
./scripts/start_app.sh
```
What it does:
- Builds the CSS with Tailwind
- Starts the vLLM server with the model on port 8000
- Waits 45 seconds for the model to load
- Starts Whisper (`servers/whisper_online_server.py`) on port 43001 (for ASR)
- Waits 5 seconds
- Launches `python servers/app.py` (the main Flask + WebSockets application)
- Waits for all processes to complete
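The sequence above is roughly equivalent to the following simplified shell sketch (the Tailwind input/output paths are assumed; see `scripts/start_app.sh` for the real commands and flags):

```bash
#!/usr/bin/env bash
# Simplified sketch of the start_app.sh sequence (ports per the list above).
npx tailwindcss -i web/static/tailwind-custom.css -o web/static/styles.css  # build CSS (paths assumed)

./scripts/run_vllm_server.sh &               # vLLM server on port 8000
sleep 45                                     # give the model time to load

python servers/whisper_online_server.py &    # Whisper ASR on port 43001
sleep 5

python servers/app.py &                      # main Flask + WebSockets application
wait                                         # wait for all processes to complete
```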
For UI development with hot-reloading CSS changes:

```bash
npm run dev:web
```
This starts:
- The CSS watch process for automatic Tailwind compilation
- The web server only (no LLM or Whisper)
For full stack development:

```bash
npm run dev:full
```
This is the same as production mode but also watches for CSS changes.
You can also use the development script for faster startup during development:

```bash
./scripts/dev.sh
```
- Open your browser at http://127.0.0.1:8050. You should see the Surgical Agentic Framework Demo interface:
  - A sample video (`sample_video.mp4`)
  - A chat console
  - A "Start Mic" button to begin ASR
- Try speaking or typing:
  - If you say "Take a note: The gallbladder is severely inflamed," the system routes you to NotetakerAgent.
  - If you say "What are the next steps after dissecting the cystic duct?", it routes you to ChatAgent.
- Background Annotations:
  - Meanwhile, AnnotationAgent writes a file like `procedure_2025_01_18__10_25_03_annotations.json` in the annotations folder every 10 seconds with structured timeline data.
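The exact schema is defined by the AnnotationAgent and its config; as a purely hypothetical illustration, one timeline entry might look like:

```json
{
  "timestamp": "2025-01-18 10:25:13",
  "annotation": "Dissection of the cystic duct in progress"
}
```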
- Click on the "Upload Video" button to add your own surgical videos
- Browse the video library by clicking "Video Library"
- Select a video to analyze
- Use the chat interface to ask questions about the video or create annotations
After accumulating annotations and notes during a procedure:
- Click the "Generate Post-Op Note" button
- The system will analyze all annotations and notes
- A structured post-operation note will be generated with:
  - Procedure information
  - Key findings
  - Procedure timeline
  - Complications
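The field names below are hypothetical and merely mirror the four sections listed above; the actual structure is produced by the PostOpNoteAgent:

```json
{
  "procedure_information": { "procedure": "laparoscopic cholecystectomy", "date": "2025-01-18" },
  "key_findings": ["Severely inflamed gallbladder"],
  "procedure_timeline": [{ "time": "10:25:13", "event": "Cystic duct dissected" }],
  "complications": []
}
```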
Common issues and solutions:
- WebSocket Connection Errors:
  - Check firewall settings to ensure ports 49000 and 49001 are open
  - Ensure no other applications are using these ports (a quick check is sketched after this list)
  - If you experience frequent timeouts, adjust the WebSocket configuration in `servers/web_server.py`
- Model Loading Errors:
  - Verify model paths are correct in configuration files
  - Ensure you have sufficient GPU memory for the models
  - Check the log files for specific error messages
- Audio Transcription Issues:
  - Verify your microphone is working correctly
  - Check that the Whisper server is running
  - Adjust microphone settings in your browser
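For the port conflicts mentioned above, here is a quick Linux check for processes already bound to the WebSocket ports (assuming the iproute2 `ss` tool is installed):

```bash
# Show any process already listening on the WebSocket ports
ss -ltnp | grep -E ':(49000|49001)\b'
```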
The framework supports both local and cloud-based TTS options:
Benefits: Private, GPU-accelerated, Offline-capable
The TTS service uses a high-quality English VITS model (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) (`tts_models/en/ljspeech/vits`) that downloads automatically on first use. The model is stored persistently in `./tts-service/models/` and remains available across container restarts.
For cloud-based premium quality TTS:
- Configure your ElevenLabs API key in the web interface
- No local storage or GPU resources required
A brief overview:
```
surgical_agentic_framework/
├── agents/                          <-- Agent implementations
│   ├── annotation_agent.py
│   ├── base_agent.py
│   ├── chat_agent.py
│   ├── notetaker_agent.py
│   ├── post_op_note_agent.py
│   └── selector_agent.py
├── configs/                         <-- Configuration files
│   ├── annotation_agent.yaml
│   ├── chat_agent.yaml
│   ├── global.yaml
│   ├── notetaker_agent.yaml
│   ├── post_op_note_agent.yaml
│   └── selector.yaml
├── models/                          <-- Model files
│   ├── llm/                         <-- LLM model files
│   │   └── Llama-3.2-11B-Vision-Surgical-CholecT50/
│   └── whisper/                     <-- Whisper models (downloaded at runtime)
├── scripts/                         <-- Shell scripts for starting services
│   ├── dev.sh                       <-- Development script for quick startup
│   ├── run_vllm_server.sh
│   ├── start_app.sh                 <-- Main script to launch everything
│   └── start_web_dev.sh             <-- Web UI development script
├── servers/                         <-- Server implementations
│   ├── app.py                       <-- Main application server
│   ├── uploaded_videos/             <-- Storage for uploaded videos
│   ├── web_server.py                <-- Web interface server
│   └── whisper_online_server.py     <-- Whisper ASR server
├── utils/                           <-- Utility classes and functions
│   ├── chat_history.py
│   ├── logging_utils.py
│   └── response_handler.py
├── web/                             <-- Web interface assets
│   ├── static/                      <-- CSS, JS, and other static assets
│   │   ├── audio.js
│   │   ├── bootstrap.bundle.min.js
│   │   ├── bootstrap.css
│   │   ├── chat.css
│   │   ├── jquery-3.6.3.min.js
│   │   ├── main.js
│   │   ├── nvidia-logo.png
│   │   ├── styles.css
│   │   ├── tailwind-custom.css
│   │   └── websocket.js
│   └── templates/
│       └── index.html
├── annotations/                     <-- Stored procedure annotations
├── uploaded_videos/                 <-- Uploaded video storage
├── README.md                        <-- This file
├── package.json                     <-- Node.js dependencies and scripts
├── postcss.config.js                <-- PostCSS configuration for Tailwind
├── tailwind.config.js               <-- Tailwind CSS configuration
├── vite.config.js                   <-- Vite build configuration
└── requirements.txt                 <-- Python dependencies
```