
Cross-benchmark evaluation: DimSpec system + LIBERO/CALVIN reproduction fixes#19

Merged
MilkClouds merged 46 commits into main from cross-bench-eval-v2 on Apr 1, 2026
Conversation

Collaborator

@MilkClouds MilkClouds commented Mar 30, 2026

Summary

Cross-benchmark evaluation infrastructure and reproduction fixes for X-VLA, DB-CogACT, and GR00T across LIBERO and CALVIN. SimplerEnv reproductions are WIP (pending migration to simpler_env.make()).

Key changes

  1. DimSpec interface convention system (specs.py) — every model server and benchmark declares action/observation format. Orchestrator validates at startup, catching convention mismatches before GPU hours are wasted.

  2. X-VLA rot6d convention — LIBERO uses contiguous rot6d, CALVIN/SimplerEnv use interleaved. Profile-level rot6d_convention field selects the right functions at init time.

  3. Docker patch isolation — X-VLA ManiSkill2 patches moved to Dockerfile.simpler_xvla, keeping base Dockerfile.simpler clean.

  4. Filelock for shard results — prevent silent overwrites when multiple evals use the same output directory.

  5. SimplerEnv infrastructure (WIP) — base-relative EE pose, accumulate_success, prepackaged_config, pass_rotation_raw, image_size resize. All implemented but SimplerEnv needs migration to simpler_env.make() for correct visual domain matching before results can be verified.
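To make the DimSpec idea concrete, here is a minimal sketch of the convention-check pattern. Only `DimSpec` and `is_compatible()` are named in this PR; the fields and string conventions below are illustrative assumptions, not the actual `specs.py` contents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimSpec:
    """Declares one interface convention, e.g. rotation format (illustrative fields)."""
    dim: int          # number of values in this action/observation slot
    convention: str   # e.g. "rot6d_interleaved" vs "rot6d_contiguous" (hypothetical tags)

    def is_compatible(self, other: "DimSpec") -> bool:
        # Checked once at orchestrator startup, before any episode runs.
        return self.dim == other.dim and self.convention == other.convention

# Server and benchmark each declare their specs; the orchestrator compares them.
server_rot = DimSpec(dim=6, convention="rot6d_interleaved")
bench_rot = DimSpec(dim=6, convention="rot6d_contiguous")
mismatch = not server_rot.is_compatible(bench_rot)
assert mismatch  # flagged at startup instead of after hours of GPU time
```

The payoff is the fail-fast property: a convention mismatch surfaces as a startup error rather than a silently degraded success rate.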

Reproduction Results

| # | Pair | Reported | Result | Status |
|---|------|----------|--------|--------|
| 1 | X-VLA × LIBERO (4 suites) | 98.1% | 97.4% | ✅ Reproduced |
| 2 | X-VLA × CALVIN | 4.43 | 4.30 | ✅ Reproduced |
| 3 | X-VLA × SimplerEnv | 95.8% | | WIP |
| 4 | DB-CogACT × LIBERO (4 suites) | 94.9% | 94.7% | ✅ Reproduced |
| 5 | DB-CogACT × CALVIN | 4.06 | 4.02 | ✅ Reproduced |
| 6 | DB-CogACT × SimplerEnv | 69.5% | | WIP |
| 7 | GR00T × LIBERO (4 suites) | 97.0% | 94.9% | ✅ Approximate |
| 8 | GR00T × SimplerEnv | 57.1% | | WIP |
| 9 | GR00T × CALVIN | | | No checkpoint |

SimplerEnv WIP: the benchmark currently uses build_maniskill2_env() with explicit parameters instead of simpler_env.make(), which applies prepackaged_config for the correct visual domain. Migration is planned as a follow-up.

Result data

All verified JSON results archived in docs/reproductions/data/.

Test plan

  • make test — 212 passed, 1 skipped
  • make check — ruff + ty clean
  • X-VLA LIBERO: 97.4% (all 4 suites, 2000 episodes)
  • X-VLA CALVIN: 4.30 avg chain (1000 sequences)
  • DB-CogACT LIBERO: 94.7% (all 4 suites, 2000 episodes)
  • DB-CogACT CALVIN: 4.02 avg chain (1000 sequences)
  • GR00T LIBERO: 94.9% (no regression from code changes)

🤖 Generated with Claude Code

MilkClouds and others added 30 commits March 30, 2026 14:27
…results

Verified changes only (cherry-picked from run-cross-benchmark-evals):

SimplerEnv benchmark:
- Add image_resolution param (resize in make_obs before sending to model)
- Remove euler2axangle conversion (pass rotation directly to env.step,
  matching official eval pipelines). DB-CogACT regression test passes.

GR00T model server:
- Request image_resolution=256 via get_observation_params (model trained
  on 256x256, was receiving 480x640)

New configs:
- xvla/simpler_widowx.yaml, xvla/simpler_google_robot.yaml, xvla/robotwin.yaml
- groot/simpler_widowx.yaml, groot/simpler_google_robot.yaml

Result data (preliminary, not yet fully reproduced):
- X-VLA CALVIN: avg_len 3.97 (reported 4.43) — EP_LEN=360, needs retest
- GR00T SimplerEnv WidowX: 25.0% (reported 57.1%) — Eggplant 96% reproduced

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive line-by-line comparison of all 10 model×benchmark pairs
against official eval code. Covers image, state, rotation, gripper,
chunking, action mode, and episode config for each pair.

Pairs 1-7 verified correct. Pairs 8-10 have identified discrepancies
(rot6d convention, missing state, EP_LEN, euler_offset, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a structured spec system to prevent convention mismatch bugs
between model servers and benchmarks (rotation format, gripper polarity,
action mode, state format, etc.).

- specs.py: DimSpec frozen dataclass with validate() and is_compatible()
  methods, plus predefined constants for common conventions
- base classes: get_action_spec() / get_observation_spec() on ModelServer
  and Benchmark (raises NotImplementedError — subclasses must override)
- orchestrator: benchmark spec validation at evaluation start
- rotation.py: add axisangle_to_rot6d_interleaved utility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every concrete ModelServer and Benchmark subclass now declares its
action and observation format via get_action_spec() / get_observation_spec().
This makes convention choices (gripper polarity, rotation format, action
mode) explicit and machine-comparable at evaluation start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
X-VLA CALVIN (Pair 8):
- rot6d: switch to interleaved convention (matches official eval)
- euler_state: CALVIN state[3:6] correctly interpreted as euler, not axis-angle
- gripper_threshold: 0.5 → 0.8, with profile-required fields (no default)
- absolute_action + ep_len=720: added to CALVIN obs_params
- simpler_widowx profile: new profile with output_action_dim=7,
  gripper_threshold=0.7, gripper_close_above=False (Bridge domain)
- euler_offset: new parameter for coordinate frame correction

SimplerEnv benchmark:
- send_state: new parameter to extract EEF state from ManiSkill2 obs

GR00T SimplerEnv (Pair 10):
- bridge_rotation: new parameter applying quat→bridge-euler transform
- bridge_rotation=true in simpler_widowx config

Configs:
- simpler_xvla_widowx.yaml: max_episode_steps=1200 (vs 120 default)
- simpler_groot_widowx.yaml: max_episode_steps=10000

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pair 8: add Issue 6 (absolute_action missing from CALVIN obs_params)
- Pair 8: correct Issue 3 (threshold only, not comparison direction)
- Pair 9: correct Issue 2 (argparse crash, not silently ignored)
- Pair 9: add gripper direction note (Bridge domain uses < threshold)
- Mark all fixed issues with FIXED annotations
- Update summary table (Pair 8 blockers 5 → 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…AMLs

Move max_episode_steps out of the shared simpler_all_tasks.yaml and
into model server obs_params (X-VLA) / config args (GR00T). The
orchestrator auto-merges unset params, so:
- CogACT: no obs_params → SimplerEnvBenchmark default (120) applies
- X-VLA simpler_widowx: obs_params sends 1200
- GR00T simpler_widowx: config arg sends 10000

Delete the redundant simpler_xvla_widowx.yaml and
simpler_groot_widowx.yaml standalone configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collapse multi-line `from vla_eval.specs import (...)` to single-line
where they fit within the 119-char line limit (9 files). Multi-line
imports in files exceeding 119 chars are left as-is.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DimSpec: add to_dict() / from_dict() for wire serialization
- serve.py: include action_spec and observation_spec in HELLO response
- orchestrator.py: deserialize server specs from HELLO, cross-validate
  against benchmark specs via check_specs(), log mismatch warnings
- base classes: tighten return type hint to dict[str, DimSpec]

Replaces the TODO with full implementation. Convention mismatches
(gripper polarity, rotation format, action mode, missing state) are
now detected before the first episode runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
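The to_dict()/from_dict() round-trip described above can be sketched as follows. This is a hedged illustration of the wire-serialization pattern, assuming the illustrative two-field DimSpec from earlier; the real dataclass has more fields.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DimSpec:
    dim: int
    convention: str

    def to_dict(self) -> dict:
        # Plain dict → JSON-serializable, safe to embed in the HELLO response.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "DimSpec":
        # Reconstructed on the orchestrator side for cross-validation.
        return cls(**d)

spec = DimSpec(dim=6, convention="rot6d_interleaved")
wire = spec.to_dict()
assert DimSpec.from_dict(wire) == spec  # lossless round-trip
```

Because the dataclass is frozen, equality after the round-trip is value equality, which is exactly what check_specs() needs to compare server and benchmark declarations.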
- groot.py: hoist cv2/transforms3d imports out of per-obs loops,
  move bridge default_rot to class constant
- xvla.py: move rotation imports to module level (already loaded)
- simpler/benchmark.py: delete dead _euler2axangle function,
  fix get_observation_spec to reflect send_state
- cogact/pi0/rtc: use RAW instead of GRIPPER_RAW for full action specs
- specs.py: warn on empty key intersection in check_specs
- orchestrator.py: narrow spec validation except, log at warning level
- configs: fix simpler_google profile reference to "simpler"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace absolute paths (/mnt/harbor/..., /tmp/official-eval-code/...)
with repo-relative paths and external source references. Local node
names and paths should not appear in an open-source repository.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`obs.get("controller_states") or obs.get("states")` crashes with
"The truth value of an array with more than one element is ambiguous"
when the observation contains numpy arrays. Use explicit None checks.

Verified: X-VLA CALVIN 2-sequence test passes (1 SUCCESS at 375 steps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
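The numpy truthiness trap fixed here is easy to reproduce. A minimal sketch (the `obs` dict and values are toy data, not the real ManiSkill2 observation):

```python
import numpy as np

obs = {"controller_states": np.array([0.1, 0.2, 0.3]), "states": np.zeros(3)}

# Buggy pattern: `or` calls bool() on the first operand, and bool() on a
# multi-element numpy array raises "The truth value of an array with more
# than one element is ambiguous".
crashed = False
try:
    state = obs.get("controller_states") or obs.get("states")
except ValueError:
    crashed = True
assert crashed

# Fixed pattern: explicit None checks never touch array truthiness.
state = obs.get("controller_states")
if state is None:
    state = obs.get("states")
assert np.allclose(state, [0.1, 0.2, 0.3])
```

Note the bug only fires when the first key is present and holds an array, which is why it can survive testing against observations that lack `controller_states`.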
…e EE

SimplerEnv benchmark now sends raw base_pose and tcp_pose from
ManiSkill2 obs when send_state=True. X-VLA model server computes
base-relative EE position (Pose(base).inv() * Pose(tcp)), matching
the official eval which feeds [ee_pos_wrt_base, identity_rot6d, zeros]
as proprio input.

Also fixes numpy-unsafe `or` chain in _obs_state_array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
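The Pose(base).inv() * Pose(tcp) composition can be illustrated with plain 4x4 homogeneous transforms. The frames below are toy values; the real code uses sapien Pose objects, but the math is the same.

```python
import numpy as np

def pose_to_mat(pos: np.ndarray, rotmat: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from position + rotation matrix."""
    T = np.eye(4)
    T[:3, :3] = rotmat
    T[:3, 3] = pos
    return T

# Toy frames: base at the origin, TCP 0.5 m forward and 0.2 m up (no rotation).
base = pose_to_mat(np.zeros(3), np.eye(3))
tcp = pose_to_mat(np.array([0.5, 0.0, 0.2]), np.eye(3))

# Base-relative EE pose: inv(base) @ tcp is the matrix analogue of
# Pose(base).inv() * Pose(tcp) from the commit message.
ee_in_base = np.linalg.inv(base) @ tcp
ee_pos_wrt_base = ee_in_base[:3, 3]
assert np.allclose(ee_pos_wrt_base, [0.5, 0.0, 0.2])
```

The resulting position is what feeds the `[ee_pos_wrt_base, identity_rot6d, zeros]` proprio vector described above.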
Root cause: X-VLA WidowX outputs absolute base-frame EE targets, not
deltas. The official eval uses a modified SimplerEnv (255isWhite fork)
with `arm_pd_ee_target_base_pose` control mode (use_delta=False,
frame=base). Our env used delta control, causing the robot to fly to
workspace boundaries immediately.

Changes:
- docker: patch ManiSkill2 to add absolute EE control mode
  (from 255isWhite/SimplerEnv fork commit 54ae2e0e)
- simpler/benchmark.py: add control_mode parameter (obs_params override)
- xvla.py: simpler_widowx profile sends control_mode via obs_params

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pair 11: DB-CogACT x RoboTwin 2.0 (config ready, per-task checkpoint)
- Pair 12: X-VLA x RoboTwin 2.0 (action dim + checkpoint blockers)
- Pair 13: StarVLA x RoboTwin 2.0 (Protocol B, config needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Line-level pipeline verification for 3 RoboTwin pairs:
- Pair 11 (DB-CogACT): pipeline aligned, config-only fixes needed
  (test_num, expert_check)
- Pair 12 (X-VLA): 3 BLOCKERS — action_type ee vs qpos, state source
  endpose vs joint_action, missing EE→quat action conversion.
  Benchmark built for CogACT's qpos space, X-VLA needs EE space.
- Pair 13 (StarVLA): Protocol B vs A mismatch, config not created

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrote all 3 RoboTwin audit entries from scratch with proper evidence:

Pair 11 (DB-CogACT): 18-item table, all verified. No code-level issues.
  Config-only: test_num=1→100, skip_expert_check. Ready to evaluate.

Pair 12 (X-VLA): 18-item table, 3 BLOCKERS found:
  - action_type='ee' vs 'qpos' (IK vs direct joint)
  - State key mismatch: benchmark sends "joint_state", X-VLA reads
    "states"/"state" → proprio is zeros (not even wrong format)
  - Missing 20D→16D EE action conversion (rot6d→quat + gripper)
  - Latent: gripper_threshold 0.5 vs official 0.7

Pair 13 (StarVLA): 3 BLOCKERS — model server single-arm only,
  config not created, checkpoint not identified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Checkpoint: StarVLA/Qwen3-VL-OFT-Robotwin2 (HuggingFace)
- Official eval: model2robotwin_interface.py — 14D qpos with action
  reordering [0,1,2,3,4,5,12,6,7,8,9,10,11,13], absolute mode,
  joint state from obs["joint_action"]["vector"]
- Remove false "checkpoint not identified" blocker
- Update to 2 real blockers: model server single-arm only + config needed
- Remove incorrect "Protocol mismatch" — eval is per-task regardless

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pair 12: add missed seed convention discrepancy (100000 vs 2000),
  note preserve_env_grippers silent failure after endpose fix
Pair 13: fix unnorm_key "new_embodiment" → "robotwin",
  add BLOCKER 3 (state key mismatch: "joint_state" vs "states"),
  add chunk_size (16), note official eval pops state before server send,
  note benchmark make_obs key inconsistency with its own observation spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

transforms3d is not available in GR00T's uv environment, causing
ModuleNotFoundError at runtime. Use quat_to_matrix + matrix_to_euler_xyz
from vla_eval.rotation instead, with wxyz→xyzw quaternion conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard result filenames are determined by benchmark name + shard ID
only — running two models against the same benchmark config with the
same num-shards writes to identical filenames, silently overwriting
each other. Add warning to run-evaluation skill with workarounds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard result files are named by benchmark + shard ID only. Two evals
using the same config + shard count silently overwrite each other.

- Add filelock dependency
- Acquire file lock at eval start (fail-fast with timeout=0)
- Check for existing result file before starting (FileExistsError)
- Release lock after successful save
- Lock is auto-released on process crash (OS-level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
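The PR uses the filelock package with timeout=0; the same fail-fast semantics can be sketched with stdlib primitives alone. This is an illustration of the collision behavior, not the orchestrator's actual code.

```python
import os
import tempfile

result_path = os.path.join(tempfile.mkdtemp(), "shard_0.json")
lock_path = result_path + ".lock"

def acquire_or_fail(path: str) -> int:
    # O_CREAT | O_EXCL makes creation atomic: a second eval pointed at the
    # same output directory fails immediately instead of silently
    # overwriting the first eval's shard results.
    return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

fd = acquire_or_fail(lock_path)          # first eval: acquires the lock
collided = False
try:
    acquire_or_fail(lock_path)           # second eval: fails fast
except FileExistsError:
    collided = True
assert collided

os.close(fd)
os.remove(lock_path)                     # released after a successful save
```

filelock adds the OS-level auto-release on crash mentioned above, which a bare lock file does not give you; that is the reason for taking the dependency.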
Replace "silently overwrite" warning with description of the new
fail-fast FileExistsError behavior from filelock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filelock already prevents collision. Sequential execution is obvious.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- xvla.py: simpler profile gets output_action_dim=7 (was None → 20D
  crash on SimplerEnv assert). get_action_spec reports ROTATION_EULER
  when euler_offset is active (was reporting AA → false mismatch).
- simpler/benchmark.py: gripper spec CLOSE_NEG → CLOSE_POS (>0.5→+1
  means +1=close, which is close-positive convention).
- specs.py: check_specs warns when benchmark expects action keys the
  server doesn't declare (catches partial-spec gaps).
- orchestrator.py: simplify filelock code — remove redundant variables,
  derive lock path from output path directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The euler2axangle removal (commit 06e85c7) broke DB-CogACT SimplerEnv
(72.2% → 36.5%). ManiSkill2's delta control mode expects axis-angle
rotation, not euler. Absolute EE control (X-VLA simpler_widowx) passes
euler directly.

Branch on control_mode_override: default delta → euler→axangle via
rotation.py; absolute override → pass euler as-is.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
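The control-mode branch can be sketched as below. The euler→axis-angle conversion is written out in numpy for self-containment (the repo's rotation.py presumably has its own implementation); `format_rotation` and the override value are illustrative names.

```python
import numpy as np

def euler_to_axangle(euler_xyz) -> np.ndarray:
    """Minimal XYZ-euler (radians) → axis-angle vector, via rotation matrix."""
    rx, ry, rz = euler_xyz
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    angle = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return axis / (2 * np.sin(angle)) * angle

def format_rotation(euler, control_mode_override=None) -> np.ndarray:
    # Default delta control (ManiSkill2) expects axis-angle;
    # the absolute EE override passes euler through unchanged.
    if control_mode_override is None:
        return euler_to_axangle(euler)
    return np.asarray(euler, dtype=float)
```

A pure z-rotation makes the two paths easy to check: `euler_to_axangle([0, 0, pi/2])` gives `[0, 0, pi/2]`, while the override branch returns its input verbatim.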
DB-CogACT: LIBERO 94.8%, CALVIN 4.02, SimplerEnv 36.5%
X-VLA: CALVIN 4.30, SimplerEnv 69.8%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LIBERO uses contiguous rot6d layout, while CALVIN/SimplerEnv use
interleaved. The previous change to interleaved-only broke LIBERO
(97.2% → 21.8%).

Add rot6d_convention field to _XVLABenchmarkProfile. LIBERO profile
sets "contiguous", all others default to "interleaved". The rot6d
encode/decode functions are selected at init time based on the profile.

Verified: X-VLA LIBERO spatial 97.8% (target 97.2%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
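The two rot6d layouts differ only in element ordering, which is exactly why the mixup cost 75 points of success rate while producing no errors. A sketch of the two encodings (the precise layouts below are my reading of "contiguous" vs "interleaved", not copied from the repo):

```python
import numpy as np

def rot6d_contiguous(R: np.ndarray) -> np.ndarray:
    # Assumed layout: first column, then second column: [c0, c1]
    return np.concatenate([R[:, 0], R[:, 1]])

def rot6d_interleaved(R: np.ndarray) -> np.ndarray:
    # Assumed layout: the same two columns, element pairs alternating
    # row by row: [c0[0], c1[0], c0[1], c1[1], c0[2], c1[2]]
    return R[:, :2].reshape(-1)

R = np.eye(3)
contig = rot6d_contiguous(R)       # [1, 0, 0, 0, 1, 0]
inter = rot6d_interleaved(R)       # [1, 0, 0, 1, 0, 0]
assert not np.allclose(contig, inter)  # same rotation, different wire format
```

Both are valid 6D rotation encodings of the identity; feeding one where the other is expected decodes to a garbage rotation, hence the profile-level `rot6d_convention` switch.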
- Remove xvla-libero/ (pre rot6d-fix, replaced by verified 97.8% result)
- Remove dbcogact-simpler/ (invalid — ran with broken Docker patch)
- Remove sv-q25groot-libero/ (failed experiment, 29.6%, no server info)
- Rename xvla-libero-v2/ → xvla-libero/ (verified result)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The image_size parameter and cv2.resize were removed from SimplerEnv
benchmark, causing 640x480 images to be sent directly to models that
expect 224x224. This caused DB-CogACT SimplerEnv to drop from 70.8%
to 48.9%.

Restore image_size as an optional parameter on SimplerEnvBenchmark,
auto-negotiated via get_observation_params() from the model server.
Add image_resolution parameter to dexbotic/cogact.py (default 224).

Verified: DB-CogACT SimplerEnv seed 0 = 70.8% (matches previous
reproduction exactly).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MilkClouds and others added 16 commits March 31, 2026 16:07
The original xvla_absolute_ee.patch modified sink camera position,
robot init qpos/height/y, and left debug prints. Split into:

- xvla_absolute_ee.patch: control mode + euler interpretation only
  (safe for all models, applied at build time)
- xvla_sink_camera.patch: sink camera/init alignment with Bridge
  dataset (X-VLA only)
- Dockerfile.simpler_xvla: extends simpler:latest with sink patch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all ManiSkill2 patches out of Dockerfile.simpler into
Dockerfile.simpler_xvla. The base simpler image is now clean —
no model-specific modifications.

Dockerfile.simpler_xvla applies both xvla_absolute_ee.patch
(absolute EE control) and xvla_sink_camera.patch (sink camera
alignment) on top of the base image.

Verified: DB-CogACT SimplerEnv 70.8% (seed 0) with patch-free image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- X-VLA LIBERO spatial: 97.8% (re-verified with rot6d convention fix)
- X-VLA CALVIN: 4.30 avg chain (new, reported 4.43)
- X-VLA SimplerEnv: 69.8% seed 0 (partial, reported 95.8%)
- GR00T SimplerEnv: marked as not reproduced (needs SimPolicyWrapper)
- Add data/ paths for all results
- DB-CogACT SimplerEnv: clarify 3-seed average

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GR00T SimplerEnv was 0% because state was not reaching the model
server (eef_pos missing from ManiSkill2 obs). NVIDIA's ManiSkill2
fork (youliangtan/ManiSkill2_real2sim) adds eef_pos as base-relative
EE pose: inv(base_mat) @ tcp_mat → [pos3, quat4_wxyz, gripper_openness].

Compute this in SimplerEnv benchmark when eef_pos is not available,
using base_pose + tcp_pose + get_gripper_closedness(). Also add
pass_rotation_raw to GR00T obs_params (skip euler→axangle conversion,
matching official eval which passes rotation directly to env.step).

Verified: GR00T SimplerEnv PutCarrot 58.3% (official 65.5%),
PutSpoon 45.8%/25.0% (official 64.5%), StackGreenCube 0% (official 5.5%).
Previously all 0%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
….1%)

PutEggplant 41.7% (official 93.0%) completes the 4-task evaluation.
Full results: StackGreen 0%, PutCarrot 58.3%, PutSpoon 45.8%,
PutEggplant 41.7%. Gap from tcp_pose vs ee_gripper_link + gripper
closedness calculation difference.

LIBERO: no regression (separate benchmark code path).
CALVIN: no checkpoint available, cannot evaluate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…data/

db-cogact.md referenced results from older eval runs whose JSON data
was already deleted. Remove it and update reproduced-performance.md
to match current data/:

- DB-CogACT LIBERO: 95.2% → 94.7% (current data)
- DB-CogACT CALVIN: 4.05 → 4.02
- DB-CogACT SimplerEnv: 72.2% (3-seed) → 70.8% (seed 0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spatial 97.8%, Object 98.6%, Goal 98.0%, Long (LIBERO-10) 95.2% (reported 98.1%).
All suites reproduced within expected variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The "Usage: vla-eval serve --config <this file>" comment just repeats
the file path — zero information.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary debugging scripts for X-VLA SimplerEnv — not needed in PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add assert for self._env before unwrapped access (ty unresolved-attribute).
Mark X-VLA SimplerEnv and GR00T SimplerEnv as WIP in reproduced-performance
(rerun in progress, previous numbers not yet verified with latest fixes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add accumulate_success param to SimplerEnvBenchmark: OR-accumulates
success across episode steps, matching GR00T official eval (rollout_policy.py).
Remove max_episode_steps from GR00T obs_params (benchmark config decides).

Results (no overlay, accumulate_success, 300 steps):
  PutSpoon 70.8% (official 64.5%), PutCarrot 45.8% (65.5%),
  StackGreen 0.0% (5.5%), PutEggplant 4.2% (93.0%).
  4-task avg 30.2% (official 62.1%).

PutSpoon exceeds official. PutEggplant gap from missing rgb_overlay
(sink camera env needs visual matching). Previous run with overlay
scored 41.7% on PutEggplant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
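The accumulate_success semantics reduce to an OR over per-step success flags. A minimal sketch (function name and the toy episode are illustrative):

```python
def episode_success(step_successes: list[bool], accumulate: bool = True) -> bool:
    """OR-accumulate success across steps (GR00T-style) vs final-step only."""
    if accumulate:
        return any(step_successes)        # success at any step counts
    return bool(step_successes[-1])       # only the terminal state counts

# Toy episode: object placed correctly at step 3, then knocked away.
steps = [False, False, True, False]
assert episode_success(steps, accumulate=True) is True
assert episode_success(steps, accumulate=False) is False
```

The two conventions disagree exactly on episodes like the toy one above, so mismatching them against the official eval skews success rates in either direction.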
- Header: branch name, date updated
- Pair 1: rot6d convention now profile-level, score 97.4%
- Pair 10: state/proprio FIXED, bridge rotation FIXED,
  accumulate_success FIXED, gripper resolved. Remaining gaps:
  n_action_steps=8 (currently 16) and max_episode_steps=504
  (currently 300) identified as primary suspects for 30.2% vs 62.1%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 3 monolithic docs (reproduced-performance.md, reported-performance.md,
cross-bench-audit.md) with per-codebase files that each serve as a single
source of truth: checkpoints, configs, reported scores, reproduced results,
pipeline audit findings, and configuration notes.

New structure:
- dexbotic.md, xvla.md, groot.md, openpi.md, starvla.md (per-codebase)
- common-pitfalls.md (reproduction failure taxonomy from audit findings)
- running-guide.md (execution guide + supply/demand + measurement protocol)
- README.md (summary matrix + verdict criteria + file index)

Each benchmark section now includes an inline config table with checkpoint
path, server/benchmark YAML paths, and result JSON paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GR00T SimplerEnv: use 4-task subset (57.1%) as reported baseline, not
  7-task (62.1%), since our eval covers the 4-task set
- X-VLA CALVIN: add footnote that per-step values sum to 4.41 vs
  official avg_len 4.43
- running-guide: add missing "Partial" verdict level to match README
- dexbotic: normalize suite names (drop "LIBERO-" prefix, "10" → "Long")
- openpi: fix GitHub issue links — distinguish official openpi #799 from
  third-party SimplerEnv-OpenVLA #13/#28, clarify HaomingSong checkpoints
  are NOT official Pi0 checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add prepackaged_config parameter to SimplerEnvBenchmark. When True,
uses simpler_env's built-in env configuration (camera, lighting,
scene) instead of explicit parameters. Without it, pixel diff=33
vs official env, causing 0% on sink tasks.

Results (prepackaged_config, 504 steps, chunk_size=1, accumulate_success):
  PutSpoon 66.7% (official 64.5%), PutCarrot 54.2% (65.5%),
  StackGreen 4.2% (5.5%), PutEggplant 20.8% (93.0%).
  4-task avg 36.5% (official 57.1%).

PutSpoon exceeds official. StackGreen near-matches. PutEggplant gap
from NVIDIA's custom ManiSkill2 fork (youliangtan/ManiSkill2_real2sim)
having additional sink camera customizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SimplerEnv benchmark uses build_maniskill2_env() with explicit params instead of simpler_env.make(), which applies prepackaged_config for correct visual-domain matching. All SimplerEnv results were therefore produced in a different visual domain from the official eval.

Remove unverified SimplerEnv result data (dbcogact-simpler, groot-simpler,
xvla-simpler). Mark all SimplerEnv entries as WIP pending migration to
simpler_env.make().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MilkClouds MilkClouds changed the title from "Cross-benchmark evaluation: DimSpec system + X-VLA/GR00T reproduction fixes" to "Cross-benchmark evaluation: DimSpec system + LIBERO/CALVIN reproduction fixes" on Apr 1, 2026
@MilkClouds MilkClouds marked this pull request as ready for review April 1, 2026 19:22
@MilkClouds MilkClouds merged commit 3663942 into main Apr 1, 2026
5 checks passed
@MilkClouds MilkClouds deleted the cross-bench-eval-v2 branch April 1, 2026 20:14