@@ -4,54 +4,42 @@ Multi-agent medical diagnosis environment for evaluating LLMs on clinical diagno
44
55## Quick Start
66
7- ### 1. Install
8-
97``` bash
8+ # Install
109uv pip install -e .
11- ```
12-
13- ### 2. Set API Keys
1410
15- ``` bash
16- # Required for helper agents (patient, measurement, moderator)
11+ # Set API key
1712export OPENAI_API_KEY="your-key"
18-
19- # Optional: For using other models
20- export ANTHROPIC_API_KEY="your-key"
21- export MISTRAL_API_KEY="your-key"
22-
2313```
2414
25- ---
26-
27- ## Running Evaluations
15+ ## Usage
2816
29- ### Basic Command Structure
17+ ### Basic Command
3018
3119``` bash
3220uv run --active -m verifiers.scripts.eval \
33- -m MODEL_NAME \ # Doctor model (evaluated)
34- -b API_BASE_URL \ # Doctor API endpoint
35- -k API_KEY_VAR \ # Doctor API key variable name
21+ -m MODEL_NAME \
22+ -b API_BASE_URL \
23+ -k API_KEY_VAR \
3624 agentclinic \
37- -n NUM_CASES \ # Number of cases to evaluate
38- --rollouts-per-example 3 \ # Rollouts
39- --max-concurrent 2 \ # Parallel requests
40- -T 0.0 \ # Temperature
41- -s \ # Save results
25+ -n NUM_CASES \
26+ --rollouts-per-example 3 \
27+ --max-concurrent 2 \
28+ -T 0.0 \
29+ -s \
4230 --env-args '{
4331 "dataset_path": "DATASET.jsonl",
4432 "patient_model": "MODEL",
33+ "patient_base_url": "URL",
34+ "patient_api_key": "KEY",
4535 "measurement_model": "MODEL",
46- "moderator_model": "MODEL"
36+ "measurement_base_url": "URL",
37+ "moderator_model": "MODEL",
38+ "moderator_base_url": "URL"
4739 }'
4840```
4941
50- ---
51-
52- ## Examples
53-
54- ### MedQA with GPT-4o-mini (all agents)
42+ ### OpenAI Example (MedQA)
5543
5644``` bash
5745export OPENAI_API_KEY="your-key"
@@ -61,15 +49,15 @@ uv run --active -m verifiers.scripts.eval \
6149 -b https://api.openai.com/v1 \
6250 -k OPENAI_API_KEY \
6351 agentclinic \
64- -n 50 \
65- --rollouts-per-example 3 \
52+ -n 10 \
53+ --rollouts-per-example 2 \
6654 --max-concurrent 2 \
6755 -T 0.0 \
6856 -s \
6957 --env-args '{"dataset_path": "agentclinic_medqa_extended.jsonl"}'
7058```
7159
72- ### NEJM with GPT-4o-mini (all agents)
60+ ### OpenAI Example (NEJM)
7361
7462``` bash
7563export OPENAI_API_KEY="your-key"
@@ -79,117 +67,64 @@ uv run --active -m verifiers.scripts.eval \
7967 -b https://api.openai.com/v1 \
8068 -k OPENAI_API_KEY \
8169 agentclinic \
82- -n 50 \
83- --rollouts-per-example 3 \
70+ -n 10 \
71+ --rollouts-per-example 2 \
8472 --max-concurrent 2 \
8573 -T 0.0 \
8674 -s \
8775 --env-args '{"dataset_path": "agentclinic_nejm_extended.jsonl"}'
8876```
8977
90- ### Mistral Large (all agents)
78+ ### Mixed Providers (Mistral Doctor + OpenAI Helpers)
9179
9280``` bash
9381export MISTRAL_API_KEY="your-key"
82+ export OPENAI_API_KEY="your-key"
9483
9584uv run --active -m verifiers.scripts.eval \
9685 -m mistral-large-latest \
9786 -b https://api.mistral.ai/v1 \
9887 -k MISTRAL_API_KEY \
9988 agentclinic \
100- -n 50 \
101- --rollouts-per-example 3 \
89+ -n 10 \
90+ --rollouts-per-example 2 \
10291 --max-concurrent 2 \
10392 -T 0.0 \
10493 -s \
10594 --env-args '{
106- "dataset_path": "agentclinic_nejm_extended.jsonl",
107- "patient_model": "mistral-large-latest",
108- "patient_backend": "mistral",
109- "measurement_model": "mistral-large-latest",
110- "measurement_backend": "mistral",
111- "moderator_model": "mistral-large-latest",
112- "moderator_backend": "mistral"
95+ "dataset_path": "agentclinic_medqa_extended.jsonl",
96+ "patient_model": "gpt-4o-mini",
97+ "patient_base_url": "https://api.openai.com/v1",
98+ "patient_api_key": "'$OPENAI_API_KEY'",
99+ "measurement_model": "gpt-4o-mini",
100+ "measurement_base_url": "https://api.openai.com/v1",
101+ "measurement_api_key": "'$OPENAI_API_KEY'",
102+ "moderator_model": "gpt-4o-mini",
103+ "moderator_base_url": "https://api.openai.com/v1",
104+ "moderator_api_key": "'$OPENAI_API_KEY'"
113105 }'
114106```
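
Splicing `$OPENAI_API_KEY` into a single-quoted JSON string, as in the example above, is easy to get wrong: one stray space or quote splits the argument. A minimal sketch of one alternative, assuming bash: build the JSON with a here-document inside a helper function, then pass the result to `--env-args`. The function name `build_env_args` is hypothetical and not part of the repository; the keys and values mirror the example.

```bash
# Hypothetical helper (not part of agentclinic): prints the helper-agent
# JSON for --env-args, using one API key for all three helper agents.
build_env_args() {
  local key="$1"
  local base_url="https://api.openai.com/v1"
  # Unquoted EOF delimiter, so ${key} and ${base_url} are expanded.
  cat <<EOF
{
  "dataset_path": "agentclinic_medqa_extended.jsonl",
  "patient_model": "gpt-4o-mini",
  "patient_base_url": "${base_url}",
  "patient_api_key": "${key}",
  "measurement_model": "gpt-4o-mini",
  "measurement_base_url": "${base_url}",
  "measurement_api_key": "${key}",
  "moderator_model": "gpt-4o-mini",
  "moderator_base_url": "${base_url}",
  "moderator_api_key": "${key}"
}
EOF
}

# Usage: ENV_ARGS="$(build_env_args "$OPENAI_API_KEY")"
# then pass it quoted: --env-args "$ENV_ARGS"
```

This sketch does no JSON escaping, so it assumes the key contains no double quotes or backslashes.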
115107
116- ### Claude 3.5 Sonnet (doctor) + GPT-4o-mini (helpers)
108+ ## Configuration
117109
118- ``` bash
119- export ANTHROPIC_API_KEY="your-key"
120- export OPENAI_API_KEY="your-key"
110+ ### Agent Parameters
121111
122- uv run --active -m verifiers.scripts.eval \
123- -m claude-3-5-sonnet-20241022 \
124- -b https://api.anthropic.com/v1 \
125- -k ANTHROPIC_API_KEY \
126- agentclinic \
127- -n 50 \
128- --rollouts-per-example 3 \
129- --max-concurrent 2 \
130- -T 0.0 \
131- -s \
132- --env-args '{"dataset_path": "agentclinic_medqa_extended.jsonl"}'
133- ```
134-
135-
136-
137- ---
138-
139- ## Configuration Options
140-
141- ### Agent Configuration
142-
143- Configure each agent separately via `--env-args`:
144-
145- ``` json
146- {
147- "dataset_path": "agentclinic_medqa_extended.jsonl",
148-
149- // Patient Agent
150- "patient_model": "gpt-4o-mini",
151- "patient_backend": "auto",
152- "patient_api_key": null,
153- "patient_api_base": null,
154-
155- // Measurement Agent
156- "measurement_model": "gpt-4o-mini",
157- "measurement_backend": "auto",
158- "measurement_api_key": null,
159- "measurement_api_base": null,
160-
161- // Moderator/Judge Agent
162- "moderator_model": "gpt-4o-mini",
163- "moderator_backend": "auto",
164- "moderator_api_key": null,
165- "moderator_api_base": null,
166-
167- // Other options
168- "max_turns": 20,
169- "use_think": false
170- }
171- ```
172-
173- ### Supported Backends
174-
175- - `openai` - OpenAI models (gpt-4, gpt-4o-mini, etc.)
176- - `anthropic` - Anthropic models (claude-3-5-sonnet, etc.)
177- - `mistral` - Mistral models (mistral-large-latest, etc.)
178- - `gemini` - Google Gemini models
179- - `vllm` - Local vLLM server
180- - `auto` - Auto-detect from model name (default)
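112+ Each helper agent (patient, measurement, moderator) is configured via `--env-args` with the keys `<agent>_model`, `<agent>_base_url`, and `<agent>_api_key`, where `<agent>` is `patient`, `measurement`, or `moderator`.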
181113
182114### Datasets
183115
184116- **MedQA Extended** (214 cases): `agentclinic_medqa_extended.jsonl`
185117- **NEJM Extended** (120 cases): `agentclinic_nejm_extended.jsonl`
186118
187- ---
119+ ### Other Options
120+
121+ - `max_turns`: Maximum conversation turns (default: 20)
122+ - `use_think`: Enable chain-of-thought prompting (default: false)
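
As a sketch, and assuming these options go in the same `--env-args` JSON object as `dataset_path` (as they did in the configuration block this change removes), the following overrides both defaults. The values `10` and `true` are illustrative only, and `MODEL_NAME`, `API_BASE_URL`, and `API_KEY_VAR` are the placeholders from the basic command.

```bash
# Illustrative values only: cap the dialogue at 10 turns and enable thinking.
ENV_ARGS='{"dataset_path": "agentclinic_medqa_extended.jsonl", "max_turns": 10, "use_think": true}'
printf '%s\n' "$ENV_ARGS"

# Pass the variable quoted so the JSON stays a single argument:
# uv run --active -m verifiers.scripts.eval \
#   -m MODEL_NAME -b API_BASE_URL -k API_KEY_VAR \
#   agentclinic -n 10 -s \
#   --env-args "$ENV_ARGS"
```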
188123
189124## Agent Roles
190125
191- - **Doctor** (evaluated model): Asks questions, orders tests, makes diagnosis
192- - **Patient** (helper): Simulates patient responses based on case data
193- - **Measurement** (helper): Returns test results from case data
194- - **Moderator** (helper): Judges if diagnosis matches ground truth
126+ - **Doctor** (evaluated): Asks questions, orders tests, makes diagnosis
127+ - **Patient** (helper): Simulates patient responses
128+ - **Measurement** (helper): Returns test results
129+ - **Moderator** (helper): Judges diagnosis accuracy
195130