Commit 99448fc

Minor fixes for local (non-Docker) evals (#5604)
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
1 parent b03d03d commit 99448fc

5 files changed, 15 additions and 18 deletions

packages/evals/README.md

Lines changed: 8 additions & 8 deletions
@@ -26,26 +26,26 @@ echo "OPENROUTER_API_KEY=sk-or-v1-[...]" > packages/evals/.env.local
 Start the evals service:
 
 ```sh
-docker compose -f packages/evals/docker-compose.yml --profile server --profile runner up --build --scale runner=0
+pnpm evals
 ```
 
-The initial build process can take a minute or two. Upon success you should see output indicating that a web service is running on [localhost:3000](http://localhost:3000/):
-<img width="1182" alt="Screenshot 2025-06-05 at 12 05 38 PM" src="https://github.com/user-attachments/assets/34f25a59-1362-458c-aafa-25e13cdb2a7a" />
+The initial build process can take a minute or two. Upon success you should see output indicating that a web service is running on localhost:3000:
+<img width="1182" src="https://github.com/user-attachments/assets/34f25a59-1362-458c-aafa-25e13cdb2a7a" />
 
 Additionally, you'll find in Docker Desktop that database and redis services are running:
-<img width="1283" alt="Screenshot 2025-06-05 at 12 07 09 PM" src="https://github.com/user-attachments/assets/ad75d791-9cc7-41e3-8168-df7b21b49da2" />
+<img width="1283" src="https://github.com/user-attachments/assets/ad75d791-9cc7-41e3-8168-df7b21b49da2" />
 
 Navigate to [localhost:3446](http://localhost:3446/) in your browser and click the 🚀 button.
 
 By default a evals run will run all programming exercises in [Roo Code Evals](https://github.com/RooCodeInc/Roo-Code-Evals) repository with the Claude Sonnet 4 model and default settings. For basic configuration you can specify the LLM to use and any subset of the exercises you'd like. For advanced configuration you can import a Roo Code settings file which will allow you to run the evals with Roo Code configured any way you'd like (this includes custom modes, a footgun prompt, etc).
 
-<img width="1053" alt="Screenshot 2025-06-05 at 12 08 06 PM" src="https://github.com/user-attachments/assets/2367eef4-6ae9-4ac2-8ee4-80f981046486" />
+<img width="1053" src="https://github.com/user-attachments/assets/2367eef4-6ae9-4ac2-8ee4-80f981046486" />
 
 After clicking "Launch" you should find that a "controller" container has spawned as well as `N` "task" containers where `N` is the value you chose for concurrency:
-<img width="1283" alt="Screenshot 2025-06-05 at 12 13 29 PM" src="https://github.com/user-attachments/assets/024413e2-c886-4272-ab59-909b4b114e7c" />
+<img width="1283" src="https://github.com/user-attachments/assets/024413e2-c886-4272-ab59-909b4b114e7c" />
 
 The web app's UI should update in realtime with the results of the eval run:
-<img width="1053" alt="Screenshot 2025-06-05 at 12 14 52 PM" src="https://github.com/user-attachments/assets/6fe3b651-0898-4f14-a231-3cc8d66f0e1f" />
+<img width="1053" src="https://github.com/user-attachments/assets/6fe3b651-0898-4f14-a231-3cc8d66f0e1f" />
 
 ## Resource Usage
 
@@ -60,7 +60,7 @@ CPU Limit = 2 * concurrency
 
 The memory and CPU limits can be set from the "Resources" section of the Docker Desktop settings:
 
-<img width="996" alt="Screenshot 2025-06-06 at 8 54 24 AM" src="https://github.com/user-attachments/assets/a1cbb27d-b09c-450c-9fa8-b662c0537d48" />
+<img width="996" src="https://github.com/user-attachments/assets/a1cbb27d-b09c-450c-9fa8-b662c0537d48" />
 
 ## Stopping
 

packages/evals/docker-compose.yml

Lines changed: 0 additions & 4 deletions
@@ -17,8 +17,6 @@ services:
   db:
     container_name: evals-db
     image: postgres:15.4
-    # expose:
-    # - 5432
     ports:
       - "${EVALS_DB_PORT:-5432}:5432"
     volumes:
@@ -40,8 +38,6 @@ services:
   redis:
     container_name: evals-redis
     image: redis:7-alpine
-    # expose:
-    # - 6379
     ports:
       - "${EVALS_REDIS_PORT:-6379}:6379"
     volumes:
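
Note on the `ports` entries kept as context in both hunks: Compose's `${VAR:-default}` interpolation uses the default when the variable is unset or empty. The sketch below mimics that rule in TypeScript purely for illustration. `withDefault` and `publishedPort` are hypothetical names and are not part of this commit or of Compose itself.

```typescript
// Illustrative only: mimics Compose's `${VAR:-default}` rule, where the
// default applies when the variable is unset or set to an empty string.
function withDefault(value: string | undefined, fallback: string): string {
    return value === undefined || value === "" ? fallback : value
}

// Builds the "host:container" string for a published port.
// `hostPort` stands in for a value such as EVALS_DB_PORT (hypothetical wiring).
function publishedPort(hostPort: string | undefined, containerPort: number): string {
    return `${withDefault(hostPort, String(containerPort))}:${containerPort}`
}

console.log(publishedPort(undefined, 5432)) // 5432:5432
console.log(publishedPort("5433", 5432)) // 5433:5432
```

Setting `EVALS_DB_PORT` or `EVALS_REDIS_PORT` before starting the services would therefore change only the host side of the mapping. The container side stays fixed.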

packages/evals/src/cli/runEvals.ts

Lines changed: 3 additions & 3 deletions
@@ -20,16 +20,16 @@ export const runEvals = async (runId: number) => {
         throw new Error(`Run ${run.id} has no tasks.`)
     }
 
+    const containerized = isDockerContainer()
+
     const logger = new Logger({
-        logDir: `/var/log/evals/runs/${run.id}`,
+        logDir: containerized ? `/var/log/evals/runs/${run.id}` : `/tmp/evals/runs/${run.id}`,
         filename: `controller.log`,
         tag: getTag("runEvals", { run }),
     })
 
     logger.info(`running ${tasks.length} task(s)`)
 
-    const containerized = isDockerContainer()
-
     if (!containerized) {
         await resetEvalsRepo({ run, cwd: EVALS_REPO_PATH })
     }
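
Note: the same `containerized ? ... : ...` expression for `logDir` appears in both `runEvals.ts` and `runTask.ts`. A minimal sketch of how the choice could be factored into one place follows. The helper name `getRunLogDir` is hypothetical and does not exist in this commit. Only the two base paths are taken from the diff.

```typescript
// Hypothetical helper (not part of the commit): picks the log directory for a run.
// Inside Docker the diff keeps /var/log/evals; for local runs it uses /tmp/evals.
function getRunLogDir(runId: number, containerized: boolean): string {
    const base = containerized ? "/var/log/evals" : "/tmp/evals"
    return `${base}/runs/${runId}`
}

console.log(getRunLogDir(42, true)) // /var/log/evals/runs/42
console.log(getRunLogDir(42, false)) // /tmp/evals/runs/42
```

In the real code the second argument would presumably come from `isDockerContainer()`, as it does in the diff. It is passed explicitly here so the sketch runs on its own.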

packages/evals/src/cli/runTask.ts

Lines changed: 3 additions & 2 deletions
@@ -44,10 +44,12 @@ export const processTask = async ({ taskId, logger }: { taskId: number; logger?:
     const run = await findRun(task.runId)
     await registerRunner({ runId: run.id, taskId })
 
+    const containerized = isDockerContainer()
+
     logger =
         logger ||
         new Logger({
-            logDir: `/var/log/evals/runs/${run.id}`,
+            logDir: containerized ? `/var/log/evals/runs/${run.id}` : `/tmp/evals/runs/${run.id}`,
             filename: `${language}-${exercise}.log`,
             tag: getTag("runTask", { run, task }),
         })
@@ -298,7 +300,6 @@ export const runTask = async ({ run, task, publish, logger }: RunTaskOptions) =>
                 ...run.settings, // Allow the provided settings to override `openRouterApiKey`.
             },
             text: prompt,
-            newTab: true,
         },
     })
 

packages/types/src/global-settings.ts

Lines changed: 1 addition & 1 deletion
@@ -177,7 +177,7 @@ export const EVALS_SETTINGS: RooCodeSettings = {
     apiProvider: "openrouter",
     openRouterUseMiddleOutTransform: false,
 
-    lastShownAnnouncementId: "may-29-2025-3-19",
+    lastShownAnnouncementId: "jul-09-2025-3-23-0",
 
     pinnedApiConfigs: {},
 