Commit 3663942

Merge pull request #19 from allenai/cross-bench-eval-v2
Cross-benchmark evaluation: DimSpec system + LIBERO/CALVIN reproduction fixes
2 parents 2f1bcd7 + e917ee2 commit 3663942

File tree

238 files changed (+2319 −1170 lines)

Note: large commits have some content hidden by default; two added files below appear without their paths.


.claude/skills/run-evaluation/SKILL.md

Lines changed: 13 additions & 0 deletions

@@ -249,6 +249,19 @@ vla-eval merge -c configs/libero_spatial.yaml -o results/libero_spatial.json
 vla-eval test --all
 ```
 
+### Parallel evaluations of different models
+
+Shard result files are named by benchmark + shard ID (e.g.
+`LIBEROBenchmark_libero_spatial_shard0of10.json`). If two evals use the
+same benchmark config, shard count, and output directory, they will
+collide. The orchestrator prevents this with a file lock — the second
+eval will **fail immediately** with `FileExistsError` rather than
+silently overwriting results.
+
+If you hit this error, either:
+- Use **different output directories** (modify `output_dir` in the config), or
+- Use **different shard counts** (e.g. `--num-shards 10` vs `--num-shards 8`).
+
 ### Troubleshooting
 
 | Problem | Solution |
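The fail-fast file lock described in the SKILL.md addition can be sketched with Python's exclusive-create open mode, one standard way to get exactly this behavior. The orchestrator's real implementation isn't shown in this diff; `reserve_shard` and the `{"status": "running"}` payload are illustrative, only the filename pattern comes from the docs above.

```python
import json
import os
import tempfile

def reserve_shard(output_dir, benchmark, suite, shard, num_shards):
    """Claim a shard result file atomically; fail fast on collision.

    The filename pattern mirrors the docs, e.g.
    LIBEROBenchmark_libero_spatial_shard0of10.json
    """
    path = os.path.join(
        output_dir, f"{benchmark}_{suite}_shard{shard}of{num_shards}.json"
    )
    # mode "x" = exclusive create: raises FileExistsError if the file
    # already exists, instead of silently truncating/overwriting it.
    with open(path, "x") as f:
        json.dump({"status": "running"}, f)
    return path

out_dir = tempfile.mkdtemp()
reserve_shard(out_dir, "LIBEROBenchmark", "libero_spatial", 0, 10)
try:
    # Second eval with the same config, shard count, and output dir:
    reserve_shard(out_dir, "LIBEROBenchmark", "libero_spatial", 0, 10)
except FileExistsError:
    print("collision detected, failing fast")
```

Because `open(..., "x")` is atomic at the filesystem level, two concurrent evals cannot both win the race for the same shard file.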

.dockerignore

Lines changed: 1 addition & 0 deletions

@@ -7,4 +7,5 @@
 !docker/calvin_validation_data/
 !docker/init_states/
 !docker/*_entrypoint.sh
+!docker/*.patch
 

configs/model_servers/cogact/cogact.yaml

Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
 # Weight: CogACT/CogACT-Base (HuggingFace)
 # Output: 16 actions × 7-DoF (future_action_window_size=15 → 16 steps)
 #
-# Usage: vla-eval serve --config configs/model_servers/cogact/cogact.yaml
 #
 # Available checkpoints:
 # CogACT/CogACT-Small (action_model_type: DiT-S)

configs/model_servers/groot/groot.yaml

Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
 # Weight: nvidia/GR00T-N1.6-3B (HuggingFace)
 # Action chunking enabled (chunk_size=16).
 #
-# Usage: vla-eval serve --config configs/model_servers/groot/groot.yaml
 #
 # Available embodiment_tags for foundation model: GR1, ROBOCASA_PANDA_OMRON, BEHAVIOR_R1_PRO
 # Fine-tuned checkpoints may support: LIBERO_PANDA, OXE_GOOGLE, OXE_WIDOWX, UNITREE_G1
(new file; path hidden in the commit view)

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+# GR00T N1.6 — SimplerEnv Google Robot (official NVIDIA checkpoint)
+script: "src/vla_eval/model_servers/groot.py"
+args:
+  model_path: nvidia/GR00T-N1.6-fractal
+  embodiment_tag: OXE_GOOGLE
+  chunk_size: 16
+  port: 8000
(new file; path hidden in the commit view)

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# GR00T N1.6 — SimplerEnv WidowX (official NVIDIA checkpoint)
+script: "src/vla_eval/model_servers/groot.py"
+args:
+  model_path: nvidia/GR00T-N1.6-bridge
+  embodiment_tag: OXE_WIDOWX
+  image_resolution: 256
+  chunk_size: 16
+  bridge_rotation: true
+  port: 8000

configs/model_servers/oft/libero_spatial.yaml

Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
 # Weight: moojink/openvla-7b-oft-finetuned-libero-spatial (HuggingFace)
 # Action chunking enabled (parallel decoding, 26× faster than OpenVLA).
 #
-# Usage: vla-eval serve --config configs/model_servers/oft/libero_spatial.yaml
 extends: _base.yaml
 args:
   pretrained_checkpoint: moojink/openvla-7b-oft-finetuned-libero-spatial
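The `extends: _base.yaml` key above is a config-inheritance mechanism: a child config inherits the base file's settings and overrides the keys it declares. A minimal sketch of one way such a key might be resolved, assuming a shallow merge where child keys win and nested `args` dicts are combined (the merge semantics and the base file's contents here are illustrative assumptions, not vla-eval's actual code):

```python
def resolve_extends(config, load):
    """Merge a config dict onto the base named by its `extends` key.

    `load` maps a filename to a config dict. Child keys win; the
    nested `args` mappings are merged shallowly. Mutates `config`
    by popping the `extends` key.
    """
    base_name = config.pop("extends", None)
    if base_name is None:
        return config  # no inheritance; return as-is
    base = dict(load(base_name))
    merged_args = {**base.get("args", {}), **config.get("args", {})}
    merged = {**base, **config}  # child's top-level keys override
    if merged_args:
        merged["args"] = merged_args
    return merged

# Hypothetical base-file contents for illustration only:
files = {
    "_base.yaml": {
        "script": "src/vla_eval/model_servers/oft.py",  # assumed path
        "args": {"port": 8000},
    },
}
cfg = resolve_extends(
    {"extends": "_base.yaml",
     "args": {"pretrained_checkpoint":
              "moojink/openvla-7b-oft-finetuned-libero-spatial"}},
    files.__getitem__,
)
print(cfg["args"]["port"])  # 8000 — inherited from the base config
```

A shallow merge like this keeps override behavior predictable: per-checkpoint configs only need to state what differs from the base.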

configs/model_servers/openvla/openvla.yaml

Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
 # Weight: openvla/openvla-7b (HuggingFace)
 # No action chunking (chunk_size=1), returns 7-dim actions.
 #
-# Usage: vla-eval serve --config configs/model_servers/openvla/openvla.yaml
 #
 # For task-specific unnormalization, set unnorm_key to match the
 # training dataset (e.g. "bridge_orig" for BridgeData V2).

configs/model_servers/pi0/libero.yaml

Lines changed: 0 additions & 1 deletion

@@ -1,7 +1,6 @@
 # π₀.5 model server — LIBERO (direct OpenPI inference)
 # Loads the policy checkpoint directly; no external server needed.
 #
-# Usage: vla-eval serve --config configs/model_servers/pi0/libero.yaml
 #
 # Available config_name values (see openpi repo):
 # pi0_fast_libero, pi05_libero, pi0_fast_droid, pi05_droid, ...

configs/model_servers/pi0/libero_fast.yaml

Lines changed: 0 additions & 1 deletion

@@ -1,6 +1,5 @@
 # π₀-FAST model server — LIBERO (direct OpenPI inference)
 #
-# Usage: vla-eval serve --config configs/model_servers/pi0/libero_fast.yaml
 #
 # Pi0-FAST uses FAST tokenizer — a different, lower-performing variant than Pi0.5.
 # For Pi0.5 (96.85% on LIBERO), use libero.yaml instead.

0 commit comments