Skip to content

Commit cd58afa

Browse files
authored
🤖 ci: replace colons in TB artifact names with hyphens (#488)
## Problem The nightly Terminal-Bench workflow runs successfully but fails at the artifact upload step because artifact names contain colons (from model names like `anthropic:claude-sonnet-4-5`). GitHub Actions artifact names cannot contain colons due to filesystem restrictions (NTFS compatibility). ## Solution ### 1. Fixed Artifact Names Use the `replace()` function in the artifact name template to convert colons to hyphens: - `anthropic:claude-sonnet-4-5` → `anthropic-claude-sonnet-4-5` - `openai:gpt-5-codex` → `openai-gpt-5-codex` ### 2. Added Results Logging **Important**: Since the previous run had 720 files that failed to upload, added a new step to print `results.json` to workflow logs before artifact upload: ```yaml - name: Print results summary if: always() run: | # Outputs full results.json # Plus per-task summary: task_id: ✓ PASS / ✗ FAIL ``` This ensures task-level results are preserved in logs even if artifact upload fails. ### 3. Added Model Verification Logging Added logging in `agentSessionCli.ts` to confirm model names: ```typescript console.error(`[cmux-cli] Using model: ${model}`); ``` ## Investigation: Identical 42.50% Accuracy During the manual run (#18913267878), **both models achieved exactly 42.50% accuracy** (34/80 tasks). Investigation revealed: ### Facts: - Both models had 24 timeouts (out of 360s limit) - 50% overlap: Only 12 tasks timed out for both models - Each model attempted 56 non-timeout tasks and passed 34 (60.7% pass rate) - Results stored in separate timestamped directories (`runs/2025-10-29__15-29-47` vs `15-29-29`) - **720 files were ready to upload but artifact upload failed** ### Code Path Verification: Traced model parameter through the entire chain: 1. ✅ Workflow → `TB_ARGS: --agent-kwarg model_name=<model>` 2. ✅ Makefile → Passes `$TB_ARGS` to terminal-bench 3. ✅ cmux_agent.py → Constructor accepts `model_name`, sets `CMUX_MODEL` env var 4. ✅ cmux-run.sh → Passes `--model "${CMUX_MODEL}"` to CLI 5. ✅ agentSessionCli.ts → Parses `--model` flag and uses it **The code is correct.** The identical scores are statistically unlikely but possible with offsetting timeout patterns. ### Next Steps: With the new results logging, the next benchmark run will show: - ✅ Model name used (in stderr logs) - ✅ Full results.json (in workflow logs) - ✅ Per-task pass/fail breakdown (in workflow logs) - ✅ Artifacts uploaded successfully (with fixed names) This allows full verification that models produce different task-level results. ## Testing The next nightly run (tonight at 00:00 UTC) will: - Successfully upload artifacts with names like: - `terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>` - `terminal-bench-results-openai-gpt-5-codex-<run_id>` - Show task-level results in workflow logs (survives even if upload fails) - Confirm each model in logs: `[cmux-cli] Using model: <model_name>` --- _Generated with `cmux`_
1 parent cc13e60 commit cd58afa

File tree

2 files changed

+26
-1
lines changed

2 files changed

+26
-1
lines changed

.github/workflows/terminal-bench.yml

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -108,11 +108,30 @@ jobs:
108108
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
109109
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
110110

111+
- name: Print results summary
112+
if: always()
113+
run: |
114+
echo "=== Terminal-Bench Results Summary ==="
115+
if [ -f "$(find runs -name 'results.json' | head -1)" ]; then
116+
RESULTS_FILE=$(find runs -name 'results.json' | head -1)
117+
echo "Results file: $RESULTS_FILE"
118+
echo ""
119+
echo "Full results.json:"
120+
cat "$RESULTS_FILE" | jq '.' || cat "$RESULTS_FILE"
121+
echo ""
122+
echo "Per-task summary:"
123+
cat "$RESULTS_FILE" | jq -r '.trials[] | "\(.task_id): \(if .resolved then "✓ PASS" else "✗ FAIL" end)"' 2>/dev/null || echo "Failed to parse task details"
124+
else
125+
echo "No results.json found in runs/"
126+
ls -la runs/
127+
fi
128+
111129
- name: Upload benchmark results
112130
if: always()
113131
uses: actions/upload-artifact@v4
114132
with:
115-
name: terminal-bench-results-${{ inputs.model_name && format('{0}-{1}', inputs.model_name, github.run_id) || format('{0}', github.run_id) }}
133+
# Replace colons with hyphens to avoid GitHub artifact name restrictions
134+
name: terminal-bench-results-${{ inputs.model_name && replace(format('{0}-{1}', inputs.model_name, github.run_id), ':', '-') || format('{0}', github.run_id) }}
116135
path: |
117136
runs/
118137
if-no-files-found: warn

src/debug/agentSessionCli.ts

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -187,6 +187,12 @@ async function main(): Promise<void> {
187187
const emitJsonStreaming = values["json-streaming"] === true;
188188

189189
const suppressHumanOutput = emitJsonStreaming || emitFinalJson;
190+
191+
// Log model selection for terminal-bench verification
192+
if (!suppressHumanOutput) {
193+
console.error(`[cmux-cli] Using model: ${model}`);
194+
}
195+
190196
const humanStream = process.stdout;
191197
const writeHuman = (text: string) => {
192198
if (suppressHumanOutput) {

0 commit comments

Comments
 (0)