Commit 9082f61

Aymane authored and committed
Refactor AgentClinic to use AsyncOpenAI clients
1 parent 645d139 commit 9082f61

3 files changed (+151, -422 lines)

environments/agentclinic/README.md

Lines changed: 48 additions & 113 deletions
@@ -4,54 +4,42 @@ Multi-agent medical diagnosis environment for evaluating LLMs on clinical diagno
 
 ## Quick Start
 
-### 1. Install
-
 ```bash
+# Install
 uv pip install -e .
-```
-
-### 2. Set API Keys
 
-```bash
-# Required for helper agents (patient, measurement, moderator)
+# Set API key
 export OPENAI_API_KEY="your-key"
-
-# Optional: For using other models
-export ANTHROPIC_API_KEY="your-key"
-export MISTRAL_API_KEY="your-key"
-
 ```
 
----
-
-## Running Evaluations
+## Usage
 
-### Basic Command Structure
+### Basic Command
 
 ```bash
 uv run --active -m verifiers.scripts.eval \
-  -m MODEL_NAME \ # Doctor model (evaluated)
-  -b API_BASE_URL \ # Doctor API endpoint
-  -k API_KEY_VAR \ # Doctor API key variable name
+  -m MODEL_NAME \
+  -b API_BASE_URL \
+  -k API_KEY_VAR \
   agentclinic \
-  -n NUM_CASES \ # Number of cases to evaluate
-  --rollouts-per-example 3 \ # Rollouts
-  --max-concurrent 2 \ # Parallel requests
-  -T 0.0 \ # Temperature
-  -s \ # Save results
+  -n NUM_CASES \
+  --rollouts-per-example 3 \
+  --max-concurrent 2 \
+  -T 0.0 \
+  -s \
   --env-args '{
     "dataset_path": "DATASET.jsonl",
     "patient_model": "MODEL",
+    "patient_base_url": "URL",
+    "patient_api_key": "KEY",
     "measurement_model": "MODEL",
-    "moderator_model": "MODEL"
+    "measurement_base_url": "URL",
+    "moderator_model": "MODEL",
+    "moderator_base_url": "URL"
   }'
 ```
 
----
-
-## Examples
-
-### MedQA with GPT-4o-mini (all agents)
+### OpenAI Example (MedQA)
 
 ```bash
 export OPENAI_API_KEY="your-key"
@@ -61,15 +49,15 @@ uv run --active -m verifiers.scripts.eval \
   -b https://api.openai.com/v1 \
   -k OPENAI_API_KEY \
   agentclinic \
-  -n 50 \
-  --rollouts-per-example 3 \
+  -n 10 \
+  --rollouts-per-example 2 \
   --max-concurrent 2 \
   -T 0.0 \
   -s \
   --env-args '{"dataset_path": "agentclinic_medqa_extended.jsonl"}'
 ```
 
-### NEJM with GPT-4o-mini (all agents)
+### OpenAI Example (NEJM)
 
 ```bash
 export OPENAI_API_KEY="your-key"
@@ -79,117 +67,64 @@ uv run --active -m verifiers.scripts.eval \
   -b https://api.openai.com/v1 \
   -k OPENAI_API_KEY \
   agentclinic \
-  -n 50 \
-  --rollouts-per-example 3 \
+  -n 10 \
+  --rollouts-per-example 2 \
   --max-concurrent 2 \
   -T 0.0 \
   -s \
   --env-args '{"dataset_path": "agentclinic_nejm_extended.jsonl"}'
 ```
 
-### Mistral Large (all agents)
+### Mixed Providers (Mistral Doctor + OpenAI Helpers)
 
 ```bash
 export MISTRAL_API_KEY="your-key"
+export OPENAI_API_KEY="your-key"
 
 uv run --active -m verifiers.scripts.eval \
   -m mistral-large-latest \
   -b https://api.mistral.ai/v1 \
   -k MISTRAL_API_KEY \
   agentclinic \
-  -n 50 \
-  --rollouts-per-example 3 \
+  -n 10 \
+  --rollouts-per-example 2 \
   --max-concurrent 2 \
   -T 0.0 \
   -s \
   --env-args '{
-    "dataset_path": "agentclinic_nejm_extended.jsonl",
-    "patient_model": "mistral-large-latest",
-    "patient_backend": "mistral",
-    "measurement_model": "mistral-large-latest",
-    "measurement_backend": "mistral",
-    "moderator_model": "mistral-large-latest",
-    "moderator_backend": "mistral"
+    "dataset_path": "agentclinic_medqa_extended.jsonl",
+    "patient_model": "gpt-4o-mini",
+    "patient_base_url": "https://api.openai.com/v1",
+    "patient_api_key": "'$OPENAI_API_KEY'",
+    "measurement_model": "gpt-4o-mini",
+    "measurement_base_url": "https://api.openai.com/v1",
+    "measurement_api_key": "'$OPENAI_API_KEY'",
+    "moderator_model": "gpt-4o-mini",
+    "moderator_base_url": "https://api.openai.com/v1",
+    "moderator_api_key": "'$OPENAI_API_KEY'"
   }'
 ```
 
-### Claude 3.5 Sonnet (doctor) + GPT-4o-mini (helpers)
+## Configuration
 
-```bash
-export ANTHROPIC_API_KEY="your-key"
-export OPENAI_API_KEY="your-key"
+### Agent Parameters
 
-uv run --active -m verifiers.scripts.eval \
-  -m claude-3-5-sonnet-20241022 \
-  -b https://api.anthropic.com/v1 \
-  -k ANTHROPIC_API_KEY \
-  agentclinic \
-  -n 50 \
-  --rollouts-per-example 3 \
-  --max-concurrent 2 \
-  -T 0.0 \
-  -s \
-  --env-args '{"dataset_path": "agentclinic_medqa_extended.jsonl"}'
-```
-
-
-
----
-
-## Configuration Options
-
-### Agent Configuration
-
-Configure each agent separately via `--env-args`:
-
-```json
-{
-  "dataset_path": "agentclinic_medqa_extended.jsonl",
-
-  // Patient Agent
-  "patient_model": "gpt-4o-mini",
-  "patient_backend": "auto",
-  "patient_api_key": null,
-  "patient_api_base": null,
-
-  // Measurement Agent
-  "measurement_model": "gpt-4o-mini",
-  "measurement_backend": "auto",
-  "measurement_api_key": null,
-  "measurement_api_base": null,
-
-  // Moderator/Judge Agent
-  "moderator_model": "gpt-4o-mini",
-  "moderator_backend": "auto",
-  "moderator_api_key": null,
-  "moderator_api_base": null,
-
-  // Other options
-  "max_turns": 20,
-  "use_think": false
-}
-```
-
-### Supported Backends
-
-- `openai` - OpenAI models (gpt-4, gpt-4o-mini, etc.)
-- `anthropic` - Anthropic models (claude-3-5-sonnet, etc.)
-- `mistral` - Mistral models (mistral-large-latest, etc.)
-- `gemini` - Google Gemini models
-- `vllm` - Local vLLM server
-- `auto` - Auto-detect from model name (default)
+Each agent (patient, measurement, moderator) can be configured via `--env-args`:
 
 ### Datasets
 
 - **MedQA Extended** (214 cases): `agentclinic_medqa_extended.jsonl`
 - **NEJM Extended** (120 cases): `agentclinic_nejm_extended.jsonl`
 
----
+### Other Options
+
+- `max_turns`: Maximum conversation turns (default: 20)
+- `use_think`: Enable chain-of-thought prompting (default: false)
 
 ## Agent Roles
 
-- **Doctor** (evaluated model): Asks questions, orders tests, makes diagnosis
-- **Patient** (helper): Simulates patient responses based on case data
-- **Measurement** (helper): Returns test results from case data
-- **Moderator** (helper): Judges if diagnosis matches ground truth
+- **Doctor** (evaluated): Asks questions, orders tests, makes diagnosis
+- **Patient** (helper): Simulates patient responses
+- **Measurement** (helper): Returns test results
+- **Moderator** (helper): Judges diagnosis accuracy
 
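The mixed-provider example in the new README splices `$OPENAI_API_KEY` into the JSON by closing and reopening the shell's single quotes, which can break if a value contains quotes or spaces. A minimal sketch of one alternative: build the `--env-args` payload with Python's `json` module and quote it with `shlex.quote`. The helper name `build_env_args` is hypothetical. Only the key names (`dataset_path`, `<agent>_model`, `<agent>_base_url`, `<agent>_api_key`) are taken from the diff.

```python
import json
import os
import shlex

# Helper agents named in the README.
AGENTS = ("patient", "measurement", "moderator")


def build_env_args(dataset_path, model, base_url, api_key):
    """Return a JSON string for --env-args (hypothetical helper).

    Every helper agent gets the same model, endpoint, and key; only the
    key names come from the README diff.
    """
    args = {"dataset_path": dataset_path}
    for agent in AGENTS:
        args[f"{agent}_model"] = model
        args[f"{agent}_base_url"] = base_url
        args[f"{agent}_api_key"] = api_key
    return json.dumps(args)


if __name__ == "__main__":
    payload = build_env_args(
        "agentclinic_medqa_extended.jsonl",
        "gpt-4o-mini",
        "https://api.openai.com/v1",
        os.environ.get("OPENAI_API_KEY", ""),
    )
    # shlex.quote makes the JSON safe to paste after --env-args in a POSIX shell.
    print("--env-args", shlex.quote(payload))
```

The printed argument can be appended to the `uv run ... agentclinic` command in place of the hand-written JSON. Note that it contains the API key in plain text, so it should not be logged or committed.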

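The commit title mentions AsyncOpenAI clients, and the new README gives each helper agent its own `*_model`, `*_base_url`, and `*_api_key` keys. The sketch below shows how such keys might be resolved into per-agent client settings. It is an illustration under stated assumptions, not the environment's actual code: `AgentSettings` and `resolve_agent_settings` are hypothetical names, the `gpt-4o-mini` default is borrowed from the removed configuration block, and the fallback to the `OPENAI_API_KEY` environment variable is assumed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentSettings:
    """Connection settings for one helper agent (hypothetical type)."""
    model: str
    base_url: Optional[str]
    api_key: Optional[str]


def resolve_agent_settings(
    env_args: Mapping[str, object],
    agent: str,
    default_model: str = "gpt-4o-mini",  # assumption: default from the old README
) -> AgentSettings:
    """Read <agent>_model, <agent>_base_url, and <agent>_api_key from env_args.

    Assumed behavior: a missing model falls back to default_model, and a
    missing key falls back to the OPENAI_API_KEY environment variable.
    """
    model = env_args.get(f"{agent}_model") or default_model
    base_url = env_args.get(f"{agent}_base_url")
    api_key = env_args.get(f"{agent}_api_key") or os.environ.get("OPENAI_API_KEY")
    return AgentSettings(str(model), base_url, api_key)
```

With the `openai` package installed, the result could be passed to a client as `AsyncOpenAI(api_key=settings.api_key, base_url=settings.base_url)`, since that constructor accepts both keyword arguments.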