
Cross-benchmark evaluation: DimSpec system + LIBERO/CALVIN reproduction fixes#19

Merged
MilkClouds merged 46 commits into main from cross-bench-eval-v2 on Apr 1, 2026
Conversation

Collaborator

@MilkClouds MilkClouds commented Mar 30, 2026

Summary

Cross-benchmark evaluation infrastructure and reproduction fixes for X-VLA, DB-CogACT, and GR00T across LIBERO and CALVIN. SimplerEnv reproductions are WIP (pending migration to simpler_env.make()).

Key changes

  1. DimSpec interface convention system (specs.py) — every model server and benchmark declares action/observation format. Orchestrator validates at startup, catching convention mismatches before GPU hours are wasted.

  2. X-VLA rot6d convention — LIBERO uses contiguous rot6d, CALVIN/SimplerEnv use interleaved. Profile-level rot6d_convention field selects the right functions at init time.

  3. Docker patch isolation — X-VLA ManiSkill2 patches moved to Dockerfile.simpler_xvla, keeping base Dockerfile.simpler clean.

  4. Filelock for shard results — prevent silent overwrites when multiple evals use the same output directory.

  5. SimplerEnv infrastructure (WIP) — base-relative EE pose, accumulate_success, prepackaged_config, pass_rotation_raw, image_size resize. All implemented but SimplerEnv needs migration to simpler_env.make() for correct visual domain matching before results can be verified.
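To make the DimSpec idea concrete, here is a minimal sketch of the convention-check pattern. Only `DimSpec` and `is_compatible()` are named in this PR; the fields and string conventions below are illustrative assumptions, not the actual `specs.py` contents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimSpec:
    """Declares one interface convention, e.g. rotation format (illustrative fields)."""
    dim: int          # number of values in this action/observation slot
    convention: str   # e.g. "rot6d_interleaved" vs "rot6d_contiguous" (hypothetical tags)

    def is_compatible(self, other: "DimSpec") -> bool:
        # Checked once at orchestrator startup, before any episode runs.
        return self.dim == other.dim and self.convention == other.convention

# Server and benchmark each declare their specs; the orchestrator compares them.
server_rot = DimSpec(dim=6, convention="rot6d_interleaved")
bench_rot = DimSpec(dim=6, convention="rot6d_contiguous")
mismatch = not server_rot.is_compatible(bench_rot)
assert mismatch  # flagged at startup instead of after hours of GPU time
```

The payoff is the fail-fast property: a convention mismatch surfaces as a startup error rather than a silently degraded success rate.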

Reproduction Results

| # | Pair | Reported | Result | Status |
|---|------|----------|--------|--------|
| 1 | X-VLA × LIBERO (4 suites) | 98.1% | 97.4% | ✅ Reproduced |
| 2 | X-VLA × CALVIN | 4.43 | 4.30 | ✅ Reproduced |
| 3 | X-VLA × SimplerEnv | 95.8% | | WIP |
| 4 | DB-CogACT × LIBERO (4 suites) | 94.9% | 94.7% | ✅ Reproduced |
| 5 | DB-CogACT × CALVIN | 4.06 | 4.02 | ✅ Reproduced |
| 6 | DB-CogACT × SimplerEnv | 69.5% | | WIP |
| 7 | GR00T × LIBERO (4 suites) | 97.0% | 94.9% | ✅ Approximate |
| 8 | GR00T × SimplerEnv | 57.1% | | WIP |
| 9 | GR00T × CALVIN | | | No checkpoint |

SimplerEnv WIP: the benchmark currently uses build_maniskill2_env() with explicit parameters instead of simpler_env.make(), which applies prepackaged_config for the correct visual domain. Migration is planned as a follow-up.

Result data

All verified JSON results archived in docs/reproductions/data/.

Test plan

  • make test — 212 passed, 1 skipped
  • make check — ruff + ty clean
  • X-VLA LIBERO: 97.4% (all 4 suites, 2000 episodes)
  • X-VLA CALVIN: 4.30 avg chain (1000 sequences)
  • DB-CogACT LIBERO: 94.7% (all 4 suites, 2000 episodes)
  • DB-CogACT CALVIN: 4.02 avg chain (1000 sequences)
  • GR00T LIBERO: 94.9% (no regression from code changes)

🤖 Generated with Claude Code

MilkClouds and others added 30 commits March 30, 2026 14:27
…results

Verified changes only (cherry-picked from run-cross-benchmark-evals):

SimplerEnv benchmark:
- Add image_resolution param (resize in make_obs before sending to model)
- Remove euler2axangle conversion (pass rotation directly to env.step,
  matching official eval pipelines). DB-CogACT regression test passes.

GR00T model server:
- Request image_resolution=256 via get_observation_params (model trained
  on 256x256, was receiving 480x640)

New configs:
- xvla/simpler_widowx.yaml, xvla/simpler_google_robot.yaml, xvla/robotwin.yaml
- groot/simpler_widowx.yaml, groot/simpler_google_robot.yaml

Result data (preliminary, not yet fully reproduced):
- X-VLA CALVIN: avg_len 3.97 (reported 4.43) — EP_LEN=360, needs retest
- GR00T SimplerEnv WidowX: 25.0% (reported 57.1%) — Eggplant 96% reproduced

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive line-by-line comparison of all 10 model×benchmark pairs
against official eval code. Covers image, state, rotation, gripper,
chunking, action mode, and episode config for each pair.

Pairs 1-7 verified correct. Pairs 8-10 have identified discrepancies
(rot6d convention, missing state, EP_LEN, euler_offset, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a structured spec system to prevent convention mismatch bugs
between model servers and benchmarks (rotation format, gripper polarity,
action mode, state format, etc.).

- specs.py: DimSpec frozen dataclass with validate() and is_compatible()
  methods, plus predefined constants for common conventions
- base classes: get_action_spec() / get_observation_spec() on ModelServer
  and Benchmark (raises NotImplementedError — subclasses must override)
- orchestrator: benchmark spec validation at evaluation start
- rotation.py: add axisangle_to_rot6d_interleaved utility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every concrete ModelServer and Benchmark subclass now declares its
action and observation format via get_action_spec() / get_observation_spec().
This makes convention choices (gripper polarity, rotation format, action
mode) explicit and machine-comparable at evaluation start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
X-VLA CALVIN (Pair 8):
- rot6d: switch to interleaved convention (matches official eval)
- euler_state: CALVIN state[3:6] correctly interpreted as euler, not axis-angle
- gripper_threshold: 0.5 → 0.8, with profile-required fields (no default)
- absolute_action + ep_len=720: added to CALVIN obs_params
- simpler_widowx profile: new profile with output_action_dim=7,
  gripper_threshold=0.7, gripper_close_above=False (Bridge domain)
- euler_offset: new parameter for coordinate frame correction

SimplerEnv benchmark:
- send_state: new parameter to extract EEF state from ManiSkill2 obs

GR00T SimplerEnv (Pair 10):
- bridge_rotation: new parameter applying quat→bridge-euler transform
- bridge_rotation=true in simpler_widowx config

Configs:
- simpler_xvla_widowx.yaml: max_episode_steps=1200 (vs 120 default)
- simpler_groot_widowx.yaml: max_episode_steps=10000

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pair 8: add Issue 6 (absolute_action missing from CALVIN obs_params)
- Pair 8: correct Issue 3 (threshold only, not comparison direction)
- Pair 9: correct Issue 2 (argparse crash, not silently ignored)
- Pair 9: add gripper direction note (Bridge domain uses < threshold)
- Mark all fixed issues with FIXED annotations
- Update summary table (Pair 8 blockers 5 → 6)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…AMLs

Move max_episode_steps out of the shared simpler_all_tasks.yaml and
into model server obs_params (X-VLA) / config args (GR00T). The
orchestrator auto-merges unset params, so:
- CogACT: no obs_params → SimplerEnvBenchmark default (120) applies
- X-VLA simpler_widowx: obs_params sends 1200
- GR00T simpler_widowx: config arg sends 10000

Delete the redundant simpler_xvla_widowx.yaml and
simpler_groot_widowx.yaml standalone configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collapse multi-line `from vla_eval.specs import (...)` to single-line
where they fit within the 119-char line limit (9 files). Multi-line
imports in files exceeding 119 chars are left as-is.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DimSpec: add to_dict() / from_dict() for wire serialization
- serve.py: include action_spec and observation_spec in HELLO response
- orchestrator.py: deserialize server specs from HELLO, cross-validate
  against benchmark specs via check_specs(), log mismatch warnings
- base classes: tighten return type hint to dict[str, DimSpec]

Replaces the TODO with full implementation. Convention mismatches
(gripper polarity, rotation format, action mode, missing state) are
now detected before the first episode runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
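The to_dict()/from_dict() round-trip described above can be sketched as follows. This is a hedged illustration of the wire-serialization pattern, assuming the illustrative two-field DimSpec from earlier; the real dataclass has more fields.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DimSpec:
    dim: int
    convention: str

    def to_dict(self) -> dict:
        # Plain dict → JSON-serializable, safe to embed in the HELLO response.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "DimSpec":
        # Reconstructed on the orchestrator side for cross-validation.
        return cls(**d)

spec = DimSpec(dim=6, convention="rot6d_interleaved")
wire = spec.to_dict()
assert DimSpec.from_dict(wire) == spec  # lossless round-trip
```

Because the dataclass is frozen, equality after the round-trip is value equality, which is exactly what check_specs() needs to compare server and benchmark declarations.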
- groot.py: hoist cv2/transforms3d imports out of per-obs loops,
  move bridge default_rot to class constant
- xvla.py: move rotation imports to module level (already loaded)
- simpler/benchmark.py: delete dead _euler2axangle function,
  fix get_observation_spec to reflect send_state
- cogact/pi0/rtc: use RAW instead of GRIPPER_RAW for full action specs
- specs.py: warn on empty key intersection in check_specs
- orchestrator.py: narrow spec validation except, log at warning level
- configs: fix simpler_google profile reference to "simpler"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace absolute paths (/mnt/harbor/..., /tmp/official-eval-code/...)
with repo-relative paths and external source references. Local node
names and paths should not appear in an open-source repository.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`obs.get("controller_states") or obs.get("states")` crashes with
"The truth value of an array with more than one element is ambiguous"
when the observation contains numpy arrays. Use explicit None checks.

Verified: X-VLA CALVIN 2-sequence test passes (1 SUCCESS at 375 steps).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
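The numpy truthiness trap fixed here is easy to reproduce. A minimal sketch (the `obs` dict and values are toy data, not the real ManiSkill2 observation):

```python
import numpy as np

obs = {"controller_states": np.array([0.1, 0.2, 0.3]), "states": np.zeros(3)}

# Buggy pattern: `or` calls bool() on the first operand, and bool() on a
# multi-element numpy array raises "The truth value of an array with more
# than one element is ambiguous".
crashed = False
try:
    state = obs.get("controller_states") or obs.get("states")
except ValueError:
    crashed = True
assert crashed

# Fixed pattern: explicit None checks never touch array truthiness.
state = obs.get("controller_states")
if state is None:
    state = obs.get("states")
assert np.allclose(state, [0.1, 0.2, 0.3])
```

Note the bug only fires when the first key is present and holds an array, which is why it can survive testing against observations that lack `controller_states`.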
…e EE

SimplerEnv benchmark now sends raw base_pose and tcp_pose from
ManiSkill2 obs when send_state=True. X-VLA model server computes
base-relative EE position (Pose(base).inv() * Pose(tcp)), matching
the official eval which feeds [ee_pos_wrt_base, identity_rot6d, zeros]
as proprio input.

Also fixes numpy-unsafe `or` chain in _obs_state_array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
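The Pose(base).inv() * Pose(tcp) composition can be illustrated with plain 4x4 homogeneous transforms. The frames below are toy values; the real code uses sapien Pose objects, but the math is the same.

```python
import numpy as np

def pose_to_mat(pos: np.ndarray, rotmat: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from position + rotation matrix."""
    T = np.eye(4)
    T[:3, :3] = rotmat
    T[:3, 3] = pos
    return T

# Toy frames: base at the origin, TCP 0.5 m forward and 0.2 m up (no rotation).
base = pose_to_mat(np.zeros(3), np.eye(3))
tcp = pose_to_mat(np.array([0.5, 0.0, 0.2]), np.eye(3))

# Base-relative EE pose: inv(base) @ tcp is the matrix analogue of
# Pose(base).inv() * Pose(tcp) from the commit message.
ee_in_base = np.linalg.inv(base) @ tcp
ee_pos_wrt_base = ee_in_base[:3, 3]
assert np.allclose(ee_pos_wrt_base, [0.5, 0.0, 0.2])
```

The resulting position is what feeds the `[ee_pos_wrt_base, identity_rot6d, zeros]` proprio vector described above.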
Root cause: X-VLA WidowX outputs absolute base-frame EE targets, not
deltas. The official eval uses a modified SimplerEnv (255isWhite fork)
with `arm_pd_ee_target_base_pose` control mode (use_delta=False,
frame=base). Our env used delta control, causing the robot to fly to
workspace boundaries immediately.

Changes:
- docker: patch ManiSkill2 to add absolute EE control mode
  (from 255isWhite/SimplerEnv fork commit 54ae2e0e)
- simpler/benchmark.py: add control_mode parameter (obs_params override)
- xvla.py: simpler_widowx profile sends control_mode via obs_params

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pair 11: DB-CogACT x RoboTwin 2.0 (config ready, per-task checkpoint)
- Pair 12: X-VLA x RoboTwin 2.0 (action dim + checkpoint blockers)
- Pair 13: StarVLA x RoboTwin 2.0 (Protocol B, config needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Line-level pipeline verification for 3 RoboTwin pairs:
- Pair 11 (DB-CogACT): pipeline aligned, config-only fixes needed
  (test_num, expert_check)
- Pair 12 (X-VLA): 3 BLOCKERS — action_type ee vs qpos, state source
  endpose vs joint_action, missing EE→quat action conversion.
  Benchmark built for CogACT's qpos space, X-VLA needs EE space.
- Pair 13 (StarVLA): Protocol B vs A mismatch, config not created

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrote all 3 RoboTwin audit entries from scratch with proper evidence:

Pair 11 (DB-CogACT): 18-item table, all verified. No code-level issues.
  Config-only: test_num=1→100, skip_expert_check. Ready to evaluate.

Pair 12 (X-VLA): 18-item table, 3 BLOCKERS found:
  - action_type='ee' vs 'qpos' (IK vs direct joint)
  - State key mismatch: benchmark sends "joint_state", X-VLA reads
    "states"/"state" → proprio is zeros (not even wrong format)
  - Missing 20D→16D EE action conversion (rot6d→quat + gripper)
  - Latent: gripper_threshold 0.5 vs official 0.7

Pair 13 (StarVLA): 3 BLOCKERS — model server single-arm only,
  config not created, checkpoint not identified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Checkpoint: StarVLA/Qwen3-VL-OFT-Robotwin2 (HuggingFace)
- Official eval: model2robotwin_interface.py — 14D qpos with action
  reordering [0,1,2,3,4,5,12,6,7,8,9,10,11,13], absolute mode,
  joint state from obs["joint_action"]["vector"]
- Remove false "checkpoint not identified" blocker
- Update to 2 real blockers: model server single-arm only + config needed
- Remove incorrect "Protocol mismatch" — eval is per-task regardless

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pair 12: add missed seed convention discrepancy (100000 vs 2000),
  note preserve_env_grippers silent failure after endpose fix
Pair 13: fix unnorm_key "new_embodiment" → "robotwin",
  add BLOCKER 3 (state key mismatch: "joint_state" vs "states"),
  add chunk_size (16), note official eval pops state before server send,
  note benchmark make_obs key inconsistency with its own observation spec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

transforms3d is not available in GR00T's uv environment, causing
ModuleNotFoundError at runtime. Use quat_to_matrix + matrix_to_euler_xyz
from vla_eval.rotation instead, with wxyz→xyzw quaternion conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard result filenames are determined by benchmark name + shard ID
only — running two models against the same benchmark config with the
same num-shards writes to identical filenames, silently overwriting
each other. Add warning to run-evaluation skill with workarounds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard result files are named by benchmark + shard ID only. Two evals
using the same config + shard count silently overwrite each other.

- Add filelock dependency
- Acquire file lock at eval start (fail-fast with timeout=0)
- Check for existing result file before starting (FileExistsError)
- Release lock after successful save
- Lock is auto-released on process crash (OS-level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
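The PR uses the filelock package with timeout=0; the same fail-fast semantics can be sketched with stdlib primitives alone. This is an illustration of the collision behavior, not the orchestrator's actual code.

```python
import os
import tempfile

result_path = os.path.join(tempfile.mkdtemp(), "shard_0.json")
lock_path = result_path + ".lock"

def acquire_or_fail(path: str) -> int:
    # O_CREAT | O_EXCL makes creation atomic: a second eval pointed at the
    # same output directory fails immediately instead of silently
    # overwriting the first eval's shard results.
    return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

fd = acquire_or_fail(lock_path)          # first eval: acquires the lock
collided = False
try:
    acquire_or_fail(lock_path)           # second eval: fails fast
except FileExistsError:
    collided = True
assert collided

os.close(fd)
os.remove(lock_path)                     # released after a successful save
```

filelock adds the OS-level auto-release on crash mentioned above, which a bare lock file does not give you; that is the reason for taking the dependency.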
Replace "silently overwrite" warning with description of the new
fail-fast FileExistsError behavior from filelock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filelock already prevents collision. Sequential execution is obvious.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- xvla.py: simpler profile gets output_action_dim=7 (was None → 20D
  crash on SimplerEnv assert). get_action_spec reports ROTATION_EULER
  when euler_offset is active (was reporting AA → false mismatch).
- simpler/benchmark.py: gripper spec CLOSE_NEG → CLOSE_POS (>0.5→+1
  means +1=close, which is close-positive convention).
- specs.py: check_specs warns when benchmark expects action keys the
  server doesn't declare (catches partial-spec gaps).
- orchestrator.py: simplify filelock code — remove redundant variables,
  derive lock path from output path directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The euler2axangle removal (commit 06e85c7) broke DB-CogACT SimplerEnv
(72.2% → 36.5%). ManiSkill2's delta control mode expects axis-angle
rotation, not euler. Absolute EE control (X-VLA simpler_widowx) passes
euler directly.

Branch on control_mode_override: default delta → euler→axangle via
rotation.py; absolute override → pass euler as-is.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
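The control-mode branch can be sketched as below. The euler→axis-angle conversion is written out in numpy for self-containment (the repo's rotation.py presumably has its own implementation); `format_rotation` and the override value are illustrative names.

```python
import numpy as np

def euler_to_axangle(euler_xyz) -> np.ndarray:
    """Minimal XYZ-euler (radians) → axis-angle vector, via rotation matrix."""
    rx, ry, rz = euler_xyz
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    angle = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return axis / (2 * np.sin(angle)) * angle

def format_rotation(euler, control_mode_override=None) -> np.ndarray:
    # Default delta control (ManiSkill2) expects axis-angle;
    # the absolute EE override passes euler through unchanged.
    if control_mode_override is None:
        return euler_to_axangle(euler)
    return np.asarray(euler, dtype=float)
```

A pure z-rotation makes the two paths easy to check: `euler_to_axangle([0, 0, pi/2])` gives `[0, 0, pi/2]`, while the override branch returns its input verbatim.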
DB-CogACT: LIBERO 94.8%, CALVIN 4.02, SimplerEnv 36.5%
X-VLA: CALVIN 4.30, SimplerEnv 69.8%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LIBERO uses contiguous rot6d layout, while CALVIN/SimplerEnv use
interleaved. The previous change to interleaved-only broke LIBERO
(97.2% → 21.8%).

Add rot6d_convention field to _XVLABenchmarkProfile. LIBERO profile
sets "contiguous", all others default to "interleaved". The rot6d
encode/decode functions are selected at init time based on the profile.

Verified: X-VLA LIBERO spatial 97.8% (target 97.2%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
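The two rot6d layouts differ only in element ordering, which is exactly why the mixup cost 75 points of success rate while producing no errors. A sketch of the two encodings (the precise layouts below are my reading of "contiguous" vs "interleaved", not copied from the repo):

```python
import numpy as np

def rot6d_contiguous(R: np.ndarray) -> np.ndarray:
    # Assumed layout: first column, then second column: [c0, c1]
    return np.concatenate([R[:, 0], R[:, 1]])

def rot6d_interleaved(R: np.ndarray) -> np.ndarray:
    # Assumed layout: the same two columns, element pairs alternating
    # row by row: [c0[0], c1[0], c0[1], c1[1], c0[2], c1[2]]
    return R[:, :2].reshape(-1)

R = np.eye(3)
contig = rot6d_contiguous(R)       # [1, 0, 0, 0, 1, 0]
inter = rot6d_interleaved(R)       # [1, 0, 0, 1, 0, 0]
assert not np.allclose(contig, inter)  # same rotation, different wire format
```

Both are valid 6D rotation encodings of the identity; feeding one where the other is expected decodes to a garbage rotation, hence the profile-level `rot6d_convention` switch.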
- Remove xvla-libero/ (pre rot6d-fix, replaced by verified 97.8% result)
- Remove dbcogact-simpler/ (invalid — ran with broken Docker patch)
- Remove sv-q25groot-libero/ (failed experiment, 29.6%, no server info)
- Rename xvla-libero-v2/ → xvla-libero/ (verified result)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The image_size parameter and cv2.resize were removed from SimplerEnv
benchmark, causing 640x480 images to be sent directly to models that
expect 224x224. This caused DB-CogACT SimplerEnv to drop from 70.8%
to 48.9%.

Restore image_size as an optional parameter on SimplerEnvBenchmark,
auto-negotiated via get_observation_params() from the model server.
Add image_resolution parameter to dexbotic/cogact.py (default 224).

Verified: DB-CogACT SimplerEnv seed 0 = 70.8% (matches previous
reproduction exactly).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MilkClouds and others added 16 commits March 31, 2026 16:07
The original xvla_absolute_ee.patch modified sink camera position,
robot init qpos/height/y, and left debug prints. Split into:

- xvla_absolute_ee.patch: control mode + euler interpretation only
  (safe for all models, applied at build time)
- xvla_sink_camera.patch: sink camera/init alignment with Bridge
  dataset (X-VLA only)
- Dockerfile.simpler_xvla: extends simpler:latest with sink patch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all ManiSkill2 patches out of Dockerfile.simpler into
Dockerfile.simpler_xvla. The base simpler image is now clean —
no model-specific modifications.

Dockerfile.simpler_xvla applies both xvla_absolute_ee.patch
(absolute EE control) and xvla_sink_camera.patch (sink camera
alignment) on top of the base image.

Verified: DB-CogACT SimplerEnv 70.8% (seed 0) with patch-free image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- X-VLA LIBERO spatial: 97.8% (re-verified with rot6d convention fix)
- X-VLA CALVIN: 4.30 avg chain (new, reported 4.43)
- X-VLA SimplerEnv: 69.8% seed 0 (partial, reported 95.8%)
- GR00T SimplerEnv: marked as not reproduced (needs SimPolicyWrapper)
- Add data/ paths for all results
- DB-CogACT SimplerEnv: clarify 3-seed average

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GR00T SimplerEnv was 0% because state was not reaching the model
server (eef_pos missing from ManiSkill2 obs). NVIDIA's ManiSkill2
fork (youliangtan/ManiSkill2_real2sim) adds eef_pos as base-relative
EE pose: inv(base_mat) @ tcp_mat → [pos3, quat4_wxyz, gripper_openness].

Compute this in SimplerEnv benchmark when eef_pos is not available,
using base_pose + tcp_pose + get_gripper_closedness(). Also add
pass_rotation_raw to GR00T obs_params (skip euler→axangle conversion,
matching official eval which passes rotation directly to env.step).

Verified: GR00T SimplerEnv PutCarrot 58.3% (official 65.5%),
PutSpoon 45.8%/25.0% (official 64.5%), StackGreenCube 0% (official 5.5%).
Previously all 0%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
….1%)

PutEggplant 41.7% (official 93.0%) completes the 4-task evaluation.
Full results: StackGreen 0%, PutCarrot 58.3%, PutSpoon 45.8%,
PutEggplant 41.7%. Gap from tcp_pose vs ee_gripper_link + gripper
closedness calculation difference.

LIBERO: no regression (separate benchmark code path).
CALVIN: no checkpoint available, cannot evaluate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…data/

db-cogact.md referenced results from older eval runs whose JSON data
was already deleted. Remove it and update reproduced-performance.md
to match current data/:

- DB-CogACT LIBERO: 95.2% → 94.7% (current data)
- DB-CogACT CALVIN: 4.05 → 4.02
- DB-CogACT SimplerEnv: 72.2% (3-seed) → 70.8% (seed 0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Spatial 97.8%, Object 98.6%, Goal 98.0%, Long (LIBERO-10) 95.2% (reported 98.1%).
All suites reproduced within expected variance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The "Usage: vla-eval serve --config <this file>" comment just repeats
the file path — zero information.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary debugging scripts for X-VLA SimplerEnv — not needed in PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add assert for self._env before unwrapped access (ty unresolved-attribute).
Mark X-VLA SimplerEnv and GR00T SimplerEnv as WIP in reproduced-performance
(rerun in progress, previous numbers not yet verified with latest fixes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add accumulate_success param to SimplerEnvBenchmark: OR-accumulates
success across episode steps, matching GR00T official eval (rollout_policy.py).
Remove max_episode_steps from GR00T obs_params (benchmark config decides).

Results (no overlay, accumulate_success, 300 steps):
  PutSpoon 70.8% (official 64.5%), PutCarrot 45.8% (65.5%),
  StackGreen 0.0% (5.5%), PutEggplant 4.2% (93.0%).
  4-task avg 30.2% (official 62.1%).

PutSpoon exceeds official. PutEggplant gap from missing rgb_overlay
(sink camera env needs visual matching). Previous run with overlay
scored 41.7% on PutEggplant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
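The accumulate_success semantics reduce to an OR over per-step success flags. A minimal sketch (function name and the toy episode are illustrative):

```python
def episode_success(step_successes: list[bool], accumulate: bool = True) -> bool:
    """OR-accumulate success across steps (GR00T-style) vs final-step only."""
    if accumulate:
        return any(step_successes)        # success at any step counts
    return bool(step_successes[-1])       # only the terminal state counts

# Toy episode: object placed correctly at step 3, then knocked away.
steps = [False, False, True, False]
assert episode_success(steps, accumulate=True) is True
assert episode_success(steps, accumulate=False) is False
```

The two conventions disagree exactly on episodes like the toy one above, so mismatching them against the official eval skews success rates in either direction.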
- Header: branch name, date updated
- Pair 1: rot6d convention now profile-level, score 97.4%
- Pair 10: state/proprio FIXED, bridge rotation FIXED,
  accumulate_success FIXED, gripper resolved. Remaining gaps:
  n_action_steps=8 (currently 16) and max_episode_steps=504
  (currently 300) identified as primary suspects for 30.2% vs 62.1%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 3 monolithic docs (reproduced-performance.md, reported-performance.md,
cross-bench-audit.md) with per-codebase files that each serve as a single
source of truth: checkpoints, configs, reported scores, reproduced results,
pipeline audit findings, and configuration notes.

New structure:
- dexbotic.md, xvla.md, groot.md, openpi.md, starvla.md (per-codebase)
- common-pitfalls.md (reproduction failure taxonomy from audit findings)
- running-guide.md (execution guide + supply/demand + measurement protocol)
- README.md (summary matrix + verdict criteria + file index)

Each benchmark section now includes an inline config table with checkpoint
path, server/benchmark YAML paths, and result JSON paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GR00T SimplerEnv: use 4-task subset (57.1%) as reported baseline, not
  7-task (62.1%), since our eval covers the 4-task set
- X-VLA CALVIN: add footnote that per-step values sum to 4.41 vs
  official avg_len 4.43
- running-guide: add missing "Partial" verdict level to match README
- dexbotic: normalize suite names (drop "LIBERO-" prefix, "10" → "Long")
- openpi: fix GitHub issue links — distinguish official openpi #799 from
  third-party SimplerEnv-OpenVLA #13/#28, clarify HaomingSong checkpoints
  are NOT official Pi0 checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add prepackaged_config parameter to SimplerEnvBenchmark. When True,
uses simpler_env's built-in env configuration (camera, lighting,
scene) instead of explicit parameters. Without it, pixel diff=33
vs official env, causing 0% on sink tasks.

Results (prepackaged_config, 504 steps, chunk_size=1, accumulate_success):
  PutSpoon 66.7% (official 64.5%), PutCarrot 54.2% (65.5%),
  StackGreen 4.2% (5.5%), PutEggplant 20.8% (93.0%).
  4-task avg 36.5% (official 57.1%).

PutSpoon exceeds official. StackGreen near-matches. PutEggplant gap
from NVIDIA's custom ManiSkill2 fork (youliangtan/ManiSkill2_real2sim)
having additional sink camera customizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SimplerEnv benchmark uses build_maniskill2_env() with explicit params instead of simpler_env.make(), which applies prepackaged_config for correct visual-domain matching. All SimplerEnv results were therefore produced in a different visual domain from the official eval.

Remove unverified SimplerEnv result data (dbcogact-simpler, groot-simpler,
xvla-simpler). Mark all SimplerEnv entries as WIP pending migration to
simpler_env.make().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MilkClouds MilkClouds changed the title from "Cross-benchmark evaluation: DimSpec system + X-VLA/GR00T reproduction fixes" to "Cross-benchmark evaluation: DimSpec system + LIBERO/CALVIN reproduction fixes" on Apr 1, 2026
@MilkClouds MilkClouds marked this pull request as ready for review April 1, 2026 19:22
@MilkClouds MilkClouds merged commit 3663942 into main Apr 1, 2026
5 checks passed
@MilkClouds MilkClouds deleted the cross-bench-eval-v2 branch April 1, 2026 20:14