[WIP do not review] #408
Conversation
- build_eval_overrides: always force perturbations to 0 (safe: no obs
shape change). Add a clean= kwarg that additionally zeros road-segment
dropout and flips traffic_light_behavior to 1 (stop at red).
- _swap_policy_obs_counts: context manager that temporarily aligns the
live training policy's obs_{lane,boundary}_segment_count with the eval
env. The GigaFlow encoder's lane/boundary encoders are shared MLPs +
max-pool over segments — weights are count-invariant, only slicing
depends on these counts. So the same training policy runs correctly on
a clean env with zero dropout (larger obs buffer) once we swap.
- eval_multi_scenarios accepts clean= and wraps the forward loop with
the swap when clean is True.
- Inline eval call site (multi_scenario_eval) reads eval.clean_eval from
the config and plumbs clean= all the way through.
- drive.ini: add eval.clean_eval=True default.
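The swap described above can be sketched as a plain save/restore context manager. This is a minimal sketch based on the PR description, not the actual pufferl.py code: the attribute names on `policy` and `vecenv` are assumptions, and the real implementation also has to touch the obs-buffer slicing.

```python
import contextlib

@contextlib.contextmanager
def _swap_policy_obs_counts(policy, vecenv):
    """Temporarily align the policy's segment counts with the eval env.

    Safe only because the encoder weights are count-invariant (shared MLP
    + max-pool over segments); only obs-buffer slicing reads these counts.
    Attribute names here are assumptions taken from the PR text.
    """
    fields = ("obs_lane_segment_count", "obs_boundary_segment_count")
    saved = {f: getattr(policy, f) for f in fields}
    try:
        for f in fields:
            setattr(policy, f, getattr(vecenv, f))
        yield policy
    finally:
        # Always restore training-time counts, even if the rollout errors.
        for f, v in saved.items():
            setattr(policy, f, v)

# Tiny stand-ins for the real policy/env objects:
class _Obj:
    def __init__(self, lane, boundary):
        self.obs_lane_segment_count = lane
        self.obs_boundary_segment_count = boundary

policy = _Obj(48, 24)    # training-time counts (dropout > 0)
eval_env = _Obj(80, 40)  # clean eval env (zero dropout)

with _swap_policy_obs_counts(policy, eval_env):
    inside = policy.obs_lane_segment_count  # eval-env count while inside
after = policy.obs_lane_segment_count       # restored on exit
```

The `finally` block is the important part: if the eval rollout raises, the live training policy still gets its original counts back before training resumes.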
Fixes "no validation metrics": the inline render path (which had been running on multi_scenario_render_interval) only populates global_infos when an episode completes, but render_max_steps << scenario_length, so episodes never end. Use multi_scenario_eval=True to get real numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The _sys_instr.stderr.write instrumentation was added as debug tracing for the render path's close_client sequence. It was also dropped into eval_multi_scenarios by mistake, but _sys_instr is only imported inside eval_multi_scenarios_render, so the non-render eval path crashes with NameError whenever it runs to completion. Fix: remove the instrumentation from the non-render eval function.
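The failure mode is plain Python scoping: an import inside one function binds the name locally to that function only, so a sibling function that references the same name raises NameError at runtime. A stripped-down repro (function bodies are hypothetical stand-ins for the real render/eval functions):

```python
def render_path():
    # Import is local to this function; the name does not leak out.
    import sys as _sys_instr
    _sys_instr.stderr.write("render debug\n")

def eval_path():
    # Pasted instrumentation, but _sys_instr was never imported here.
    _sys_instr.stderr.write("eval debug\n")  # NameError at runtime

render_path()  # works fine
try:
    eval_path()
    err = ""
except NameError as e:
    err = str(e)
```

Because the bad line sits at the end of the eval function, the crash only fires when an eval actually runs to completion, which is why it slipped through.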
Pull request overview
Adds a “clean” inline evaluation mode for Drive to reduce evaluation-time noise (dropout/perturbations) and enforce red-light stops, while allowing reuse of the live training policy by temporarily aligning its road-segment slicing counts with the eval environment.
Changes:
- Introduces `clean_eval` plumbing from config → `build_eval_overrides(..., clean=...)` → `eval_multi_scenarios(..., clean=...)`.
- Adds the `_swap_policy_obs_counts` context manager and wraps the eval rollout loop to handle mismatched road-segment counts between training and clean eval.
- Forces robustness perturbations (partner blindness / phantom braking) to zero during eval and removes stale `_sys_instr.stderr.write` debug lines that could crash.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pufferlib/pufferl.py` | Adds clean-eval overrides, the policy obs-count swap context manager, and threads `clean` through inline + CLI evaluation paths. |
| `pufferlib/config/ocean/drive.ini` | Adds `eval.clean_eval = True` default and documents its intent for controlled validation metrics. |
```python
# Clean eval may use different road-dropout than training. The shared
# training policy's obs slicing needs to be aligned with this env; see
# _swap_policy_obs_counts.
swap_ctx = _swap_policy_obs_counts(policy, vecenv) if clean else contextlib.nullcontext()
with swap_ctx, tqdm(total=num_scenarios, desc="Processing scenarios", disable=quiet) as pbar:
```
New behavior depends on _swap_policy_obs_counts preventing forward-time crashes when training/eval road-segment dropouts differ (obs buffer size changes). There are no automated tests covering this path (e.g., training policy built with dropout>0, eval env with dropout=0, clean=True), so regressions here are likely to be caught only at runtime. Adding a focused test that constructs a Drive policy/env pair with mismatched segment counts and verifies eval_multi_scenarios(clean=True) runs at least one short rollout without shape/index errors would help lock in this guarantee.
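The reviewer's point can be made concrete without the real env: the obs buffer is flat and the policy slices it using its stored segment counts, so a stale count cuts the lane block at the wrong offset and misaligns everything downstream. A pure-Python illustration (the per-segment feature width and the slicing helper are made up for the sketch, not Drive's actual layout):

```python
FEAT = 7  # features per road segment (made-up width)

def slice_lane_block(obs, segment_count, feat=FEAT):
    """Cut the lane-segment block off the front of a flat obs buffer."""
    n = segment_count * feat
    lane = [obs[i * feat:(i + 1) * feat] for i in range(segment_count)]
    rest = obs[n:]
    return lane, rest

# Env with zero dropout emits 80 segments; the policy was built for 48.
obs = list(range(80 * FEAT + 10))  # 80 segments + 10 trailing features

lane_ok, rest_ok = slice_lane_block(obs, 80)    # counts aligned
lane_bad, rest_bad = slice_lane_block(obs, 48)  # stale training count
```

With the stale count the "rest" of the buffer silently absorbs 32 segments' worth of lane data, which is exactly the kind of misalignment a focused mismatched-counts test would catch before runtime.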
```python
clean: if True, run a "clean" eval — zero road-segment dropout and
    enforce red-light stops. Only safe when the policy is rebuilt
    from the eval env (standalone eval / render_scenario.py). Inline
    eval during training reuses the live training policy, whose
    encoder was built for the training obs shape; zeroing dropout
    there changes the obs shape and triggers a CUDA device-side
```
The clean docstring now says clean eval is only safe when the policy is rebuilt and that inline eval will crash due to an obs-shape mismatch. In this PR, inline eval explicitly passes clean=clean_eval and relies on _swap_policy_obs_counts to make this safe, so the docstring is misleading/outdated. Please update the docstring to describe the current safety model (i.e., inline eval is safe when using the swap, and any remaining limitations).
```python
# Dropout changes the obs shape. Only safe when the policy is
# rebuilt from the eval env (standalone eval / render_scenario).
# NEVER pass clean=True from an inline-eval call site — the live
# training policy's encoder was built for the training obs shape.
```
The inline comment under if clean: says "NEVER pass clean=True from an inline-eval call site", but train() now does exactly that and eval_multi_scenarios contains logic to support it via _swap_policy_obs_counts. This comment should be revised (or removed) because it contradicts the new behavior and could cause future callers to avoid the supported path or misdiagnose failures.
Suggested change:
```diff
-# Dropout changes the obs shape. Only safe when the policy is
-# rebuilt from the eval env (standalone eval / render_scenario).
-# NEVER pass clean=True from an inline-eval call site — the live
-# training policy's encoder was built for the training obs shape.
+# Dropout changes the obs shape. Standalone eval / render_scenario
+# are safe because the policy is rebuilt from the eval env. Inline
+# eval may also pass clean=True, but only when the caller updates
+# the live policy's expected obs counts to match the eval env.
```
```python
eval_overrides = build_eval_overrides(
    simulation_mode=tmp_args["eval_simulation"],
    num_agents=num_agents_eval,
    num_scenarios=tmp_args["num_scenarios"],
    map_dir=map_dir,
    num_carla_maps=tmp_args.get("num_carla_maps", 8),
    clean=clean_from_config,
)
args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
clean = clean or clean_from_config
```
eval_multi_scenarios has a clean parameter, but when args is None the env overrides are built using clean_from_config only (ignoring the passed clean value). This can lead to inconsistent behavior where clean=True affects swapping but not the env configuration, or callers cannot force clean=True programmatically unless they also set config. Consider making clean default to None and resolving precedence explicitly (e.g., parameter overrides config), and pass the resolved value into build_eval_overrides.
Suggested change:
```diff
-eval_overrides = build_eval_overrides(
-    simulation_mode=tmp_args["eval_simulation"],
-    num_agents=num_agents_eval,
-    num_scenarios=tmp_args["num_scenarios"],
-    map_dir=map_dir,
-    num_carla_maps=tmp_args.get("num_carla_maps", 8),
-    clean=clean_from_config,
-)
-args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
-clean = clean or clean_from_config
+resolved_clean = clean or clean_from_config
+eval_overrides = build_eval_overrides(
+    simulation_mode=tmp_args["eval_simulation"],
+    num_agents=num_agents_eval,
+    num_scenarios=tmp_args["num_scenarios"],
+    map_dir=map_dir,
+    num_carla_maps=tmp_args.get("num_carla_maps", 8),
+    clean=resolved_clean,
+)
+args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
+clean = resolved_clean
```
Runs inline eval on replay scenarios (nuPlan mini train bins) alongside the gigaflow multi_scenario_eval. Policy controls only the SDC (control_sdc_only) while other agents follow logged trajectories. Metrics log under metric_prefix "validation_replay" so they're distinguishable from the gigaflow "validation" eval.
- drive.ini: add [eval].replay_eval + replay_map_dir + replay_num_scenarios + replay_scenario_length (201 for nuPlan duration_s=20 bins) + replay_control_mode + replay_init_steps.
- pufferl.py: sibling block in _train that overrides the default WOMD replay scenario_length (91) on top of build_eval_overrides and calls eval_multi_scenarios with metric_prefix="validation_replay". Shares the clean= plumbing and _swap_policy_obs_counts — the replay env with clean=True has a different obs shape from training, and the swap keeps the live training policy usable.
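Based on the keys listed above, the new `[eval]` block might look like the following in drive.ini. Only `replay_scenario_length = 201` is stated in the commit; the map path and the other values are illustrative placeholders, not the shipped defaults:

```ini
[eval]
replay_eval = True
; placeholder path — point at the nuPlan mini train bins
replay_map_dir = /path/to/nuplan_mini_train_bins
replay_num_scenarios = 16          ; illustrative
replay_scenario_length = 201       ; nuPlan duration_s=20 bins
replay_control_mode = control_sdc_only
replay_init_steps = 0              ; illustrative
```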
When --no-render is set, calls eval_multi_scenarios across every .bin in --map-dir instead of rendering a single scenario to mp4. Produces evaluation_summary.csv in --output-dir.
Replaces the inline dropout/perturbation-zero overrides with build_eval_overrides(clean=True) — same effect, but centralizes the clean-eval logic in one place.
For replay metrics we leave offroad_behavior / collision_behavior at the eval default (=1, terminate on infraction) so the SDC is penalized per normal eval rules. The render path still forces them to 0 so the video shows the full trajectory even when the policy is far off.
draw_scene gated the waypoint draw loop on obs_only==0, so BEV view (which uses obs_only=1) had no visible goals. Drop the gate — the goal trail is the main reason to watch a BEV render.
Extends _swap_policy_obs_counts to also swap max_partner_observations and max_traffic_control_observations — they're both shared-MLP + max-pool encoders, so swapping the count on the live training policy is safe and lets the policy consume a wider obs buffer at eval.
build_eval_overrides(clean=True) now sets max_partner_observations=32 (training default 16). In BEV render the extra partner observations show up as more visible vehicles, matching the clean lane behavior.
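Why swapping these counts is weight-safe can be shown in miniature: a shared per-item encoder followed by max-pool produces a fixed-size output for any item count, and adding items can only raise the pooled values. A pure-Python sketch (a single linear layer stands in for the MLP; the layer sizes are made up and this is not the GigaFlow encoder):

```python
import random

def encode(segments, W):
    """Shared 'MLP' (one linear layer here) per segment, then max-pool.

    The output length equals len(W) no matter how many segments come in,
    which is why the count can be swapped on a live policy without
    touching any weights.
    """
    hidden = [[sum(w_i * x_i for w_i, x_i in zip(w, seg)) for w in W]
              for seg in segments]
    return [max(col) for col in zip(*hidden)]  # max-pool over segments

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]  # 4 -> 8

few = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
many = few + [[random.uniform(-1, 1) for _ in range(4)] for _ in range(16)]

out_few = encode(few, W)    # 16 items -> 8-dim output
out_many = encode(many, W)  # 32 items -> still 8-dim output
```

Since `many` is a superset of `few`, each pooled value can only stay the same or grow, matching the intuition that a wider obs buffer surfaces strictly more context (more visible vehicles in the BEV render).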
utils.py: run_driving_behaviours_eval_in_subprocess now passes
eval_mode=1, traffic_light_behavior=1, zero dropout, zero perturbations,
and max_partner_observations=32 — matches build_eval_overrides(clean=True).
Previously the subprocess re-parsed drive.ini and inherited whatever
defaults were there, so eval_mode stayed 0 (randomized TL cycle) and
training-time CLI overrides quietly dropped.
driving_behaviours_eval.ini: rebuilt around the nuPlan mini-train bins labeled under /scratch/ev2237/data/nuplan/categories/<class>/. Eleven sections (hard_stop, highway_straight, lane_change, merge, parked_cars, roundabout, stopped_traffic, traffic_light_{green,stop}, unprotected_{left,right}). Scenario length 201 for nuPlan duration_s=20.
Training can override dt for curriculum/speed experiments; eval needs to stay at 10Hz so replay-env simulation matches the logged trajectory sample rate. Otherwise waypoints drift against the SDC's actual path.
- build_eval_overrides (inline + standalone + render_scenario): dt=0.1 added to common_env so it flows through regardless of clean mode.
- run_driving_behaviours_eval_in_subprocess: --env.dt 0.1 added to the subprocess cmd so it overrides whatever drive.ini default / CLI override the parent process had.
pufferl.py:
- eval_multi_scenarios_render takes a clean= kwarg and wraps the rollout loop with _swap_policy_obs_counts when set. The standalone entry now reads eval.clean_eval from the config.
- _render_driving_behaviours builds overrides with clean=True and passes clean=True to eval_multi_scenarios_render. Matches the metric-eval subprocess, so the mp4s reflect the same clean conditions the wandb scalars do (no more flashing BEVs from inherited dropout).
- _train multi_scenario_render block: same — reads eval.clean_eval, plumbs to build_eval_overrides + eval_multi_scenarios_render.

drive.h:
- compute_metrics score threshold was hardcoded >=4, but num_target_waypoints=3 caps num_goals_reached at 3, so score was always 0. Changed to >=3. Removes the TODO/FIXME comments.
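The threshold bug is a few lines of arithmetic. This sketch is not the drive.h code (the function shape is hypothetical), it just shows why a >=4 threshold is unreachable when the goal count is capped at 3:

```python
NUM_TARGET_WAYPOINTS = 3  # caps num_goals_reached in the scenario

def score(num_goals_reached, threshold):
    # Goal count can never exceed the number of target waypoints.
    capped = min(num_goals_reached, NUM_TARGET_WAYPOINTS)
    return 1 if capped >= threshold else 0

# Old threshold (>=4): unreachable, score is always 0.
old = [score(n, 4) for n in range(6)]
# New threshold (>=3): fires once all 3 waypoints are reached.
new = [score(n, 3) for n in range(6)]
```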
Summary
Inline clean eval: zero perturbations and road-segment dropout, enforce red-light stops. Logs `validation/*` metrics under controlled conditions so progress isn't confounded by training-time noise.
- `build_eval_overrides(clean=True)` zeros `lane_segment_dropout`, `boundary_segment_dropout`, `partner_blindness_prob`, `phantom_braking_{prob,trigger_prob}`, and flips `traffic_light_behavior` to 1 (stop at red). Perturbations are always zeroed at eval regardless of `clean` since they don't change obs shape.
- `_swap_policy_obs_counts` context manager temporarily aligns the live training policy's `obs_{lane,boundary}_segment_count` with the clean eval env. The GigaFlow encoder is truly count-invariant (shared MLP + max-pool over segments) — only the obs buffer slicing depends on these counts. Safe swap.
- `eval_multi_scenarios` takes `clean=` and wraps the forward loop with the swap.
- `_train` reads `eval.clean_eval` (new ini default: True) and plumbs it through.
- Removes `_sys_instr.stderr.write` debug lines in `eval_multi_scenarios` that crashed on NameError when the eval completed (they were only imported in the render function).

Also noted: the inline render path was logging zero metrics because `render_max_steps` (e.g. 300) is much less than `scenario_length` (3000), so no episode completes within the cap and `global_infos` stays empty. The metric-only `eval_multi_scenarios` doesn't have this problem.

Test plan
- `multi_scenario_eval=True`, `clean_eval=True` (default). Policy was built at dropout=0.4 (obs_lane=48) and successfully forwarded through the clean eval env (obs_lane=80) via the swap, completing all 4 scenarios in ~8s with populated metrics (`avg_distance_per_infraction = 7.47`, `red_light_violation_rate = 0.01`).
- Checked `experiments/.../validation/epoch_N/gigaflow/`; the `validation/*` tab populates.

🤖 Generated with Claude Code