[WIP do not review] #408
Conversation
- build_eval_overrides: always force perturbations to 0 (safe: no obs
shape change). Add a clean= kwarg that additionally zeros road-segment
dropout and flips traffic_light_behavior to 1 (stop at red).
- _swap_policy_obs_counts: context manager that temporarily aligns the
live training policy's obs_{lane,boundary}_segment_count with the eval
env. The GigaFlow encoder's lane/boundary encoders are shared MLPs +
max-pool over segments — weights are count-invariant, only slicing
depends on these counts. So the same training policy runs correctly on
a clean env with zero dropout (larger obs buffer) once we swap.
- eval_multi_scenarios accepts clean= and wraps the forward loop with
the swap when clean is True.
- Inline eval call site (multi_scenario_eval) reads eval.clean_eval from
the config and plumbs clean= all the way through.
- drive.ini: add eval.clean_eval=True default.
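The swap described above can be sketched as a plain save/restore context manager. This is a minimal sketch based on the PR description, not the actual pufferl.py code: the attribute names on `policy` and `vecenv` are assumptions, and the real implementation also has to touch the obs-buffer slicing.

```python
import contextlib

@contextlib.contextmanager
def _swap_policy_obs_counts(policy, vecenv):
    """Temporarily align the policy's segment counts with the eval env.

    Safe only because the encoder weights are count-invariant (shared MLP
    + max-pool over segments); only obs-buffer slicing reads these counts.
    Attribute names here are assumptions taken from the PR text.
    """
    fields = ("obs_lane_segment_count", "obs_boundary_segment_count")
    saved = {f: getattr(policy, f) for f in fields}
    try:
        for f in fields:
            setattr(policy, f, getattr(vecenv, f))
        yield policy
    finally:
        # Always restore training-time counts, even if the rollout errors.
        for f, v in saved.items():
            setattr(policy, f, v)

# Tiny stand-ins for the real policy/env objects:
class _Obj:
    def __init__(self, lane, boundary):
        self.obs_lane_segment_count = lane
        self.obs_boundary_segment_count = boundary

policy = _Obj(48, 24)    # training-time counts (dropout > 0)
eval_env = _Obj(80, 40)  # clean eval env (zero dropout)

with _swap_policy_obs_counts(policy, eval_env):
    inside = policy.obs_lane_segment_count  # eval-env count while inside
after = policy.obs_lane_segment_count       # restored on exit
```

The `finally` block is the important part: if the eval rollout raises, the live training policy still gets its original counts back before training resumes.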
Fixes "no validation metrics": the inline render path (which had been running on multi_scenario_render_interval) only populates global_infos when an episode completes, but render_max_steps << scenario_length, so episodes never end. Use multi_scenario_eval=True to get real numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The _sys_instr.stderr.write instrumentation was added as debug tracing for the render path's close_client sequence. It was also dropped into eval_multi_scenarios by mistake, but _sys_instr is only imported inside eval_multi_scenarios_render, so the non-render eval path crashes with NameError whenever it runs to completion. Fix: remove the instrumentation from the non-render eval function.
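The failure mode is plain Python scoping: an import inside one function binds the name locally to that function only, so a sibling function that references the same name raises NameError at runtime. A stripped-down repro (function bodies are hypothetical stand-ins for the real render/eval functions):

```python
def render_path():
    # Import is local to this function; the name does not leak out.
    import sys as _sys_instr
    _sys_instr.stderr.write("render debug\n")

def eval_path():
    # Pasted instrumentation, but _sys_instr was never imported here.
    _sys_instr.stderr.write("eval debug\n")  # NameError at runtime

render_path()  # works fine
try:
    eval_path()
    err = ""
except NameError as e:
    err = str(e)
```

Because the bad line sits at the end of the eval function, the crash only fires when an eval actually runs to completion, which is why it slipped through.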
Pull request overview
Adds a “clean” inline evaluation mode for Drive to reduce evaluation-time noise (dropout/perturbations) and enforce red-light stops, while allowing reuse of the live training policy by temporarily aligning its road-segment slicing counts with the eval environment.
Changes:
- Introduces `clean_eval` plumbing from config → `build_eval_overrides(..., clean=...)` → `eval_multi_scenarios(..., clean=...)`.
- Adds the `_swap_policy_obs_counts` context manager and wraps the eval rollout loop to handle mismatched road-segment counts between training and clean eval.
- Forces robustness perturbations (partner blindness / phantom braking) to zero during eval and removes stale `_sys_instr.stderr.write` debug lines that could crash.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pufferlib/pufferl.py` | Adds clean-eval overrides, the policy obs-count swap context manager, and threads `clean` through inline + CLI evaluation paths. |
| `pufferlib/config/ocean/drive.ini` | Adds `eval.clean_eval = True` default and documents its intent for controlled validation metrics. |
```python
# Clean eval may use different road-dropout than training. The shared
# training policy's obs slicing needs to be aligned with this env; see
# _swap_policy_obs_counts.
swap_ctx = _swap_policy_obs_counts(policy, vecenv) if clean else contextlib.nullcontext()
with swap_ctx, tqdm(total=num_scenarios, desc="Processing scenarios", disable=quiet) as pbar:
```
New behavior depends on _swap_policy_obs_counts preventing forward-time crashes when training/eval road-segment dropouts differ (obs buffer size changes). There are no automated tests covering this path (e.g., training policy built with dropout>0, eval env with dropout=0, clean=True), so regressions here are likely to be caught only at runtime. Adding a focused test that constructs a Drive policy/env pair with mismatched segment counts and verifies eval_multi_scenarios(clean=True) runs at least one short rollout without shape/index errors would help lock in this guarantee.
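The reviewer's point can be made concrete without the real env: the obs buffer is flat and the policy slices it using its stored segment counts, so a stale count cuts the lane block at the wrong offset and misaligns everything downstream. A pure-Python illustration (the per-segment feature width and the slicing helper are made up for the sketch, not Drive's actual layout):

```python
FEAT = 7  # features per road segment (made-up width)

def slice_lane_block(obs, segment_count, feat=FEAT):
    """Cut the lane-segment block off the front of a flat obs buffer."""
    n = segment_count * feat
    lane = [obs[i * feat:(i + 1) * feat] for i in range(segment_count)]
    rest = obs[n:]
    return lane, rest

# Env with zero dropout emits 80 segments; the policy was built for 48.
obs = list(range(80 * FEAT + 10))  # 80 segments + 10 trailing features

lane_ok, rest_ok = slice_lane_block(obs, 80)    # counts aligned
lane_bad, rest_bad = slice_lane_block(obs, 48)  # stale training count
```

With the stale count the "rest" of the buffer silently absorbs 32 segments' worth of lane data, which is exactly the kind of misalignment a focused mismatched-counts test would catch before runtime.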
```python
clean: if True, run a "clean" eval — zero road-segment dropout and
    enforce red-light stops. Only safe when the policy is rebuilt
    from the eval env (standalone eval / render_scenario.py). Inline
    eval during training reuses the live training policy, whose
    encoder was built for the training obs shape; zeroing dropout
    there changes the obs shape and triggers a CUDA device-side
```
The clean docstring now says clean eval is only safe when the policy is rebuilt and that inline eval will crash due to an obs-shape mismatch. In this PR, inline eval explicitly passes clean=clean_eval and relies on _swap_policy_obs_counts to make this safe, so the docstring is misleading/outdated. Please update the docstring to describe the current safety model (i.e., inline eval is safe when using the swap, and any remaining limitations).
```python
# Dropout changes the obs shape. Only safe when the policy is
# rebuilt from the eval env (standalone eval / render_scenario).
# NEVER pass clean=True from an inline-eval call site — the live
# training policy's encoder was built for the training obs shape.
```
The inline comment under if clean: says "NEVER pass clean=True from an inline-eval call site", but train() now does exactly that and eval_multi_scenarios contains logic to support it via _swap_policy_obs_counts. This comment should be revised (or removed) because it contradicts the new behavior and could cause future callers to avoid the supported path or misdiagnose failures.
Suggested change:
```diff
-# Dropout changes the obs shape. Only safe when the policy is
-# rebuilt from the eval env (standalone eval / render_scenario).
-# NEVER pass clean=True from an inline-eval call site — the live
-# training policy's encoder was built for the training obs shape.
+# Dropout changes the obs shape. Standalone eval / render_scenario
+# are safe because the policy is rebuilt from the eval env. Inline
+# eval may also pass clean=True, but only when the caller updates
+# the live policy's expected obs counts to match the eval env.
```
```python
eval_overrides = build_eval_overrides(
    simulation_mode=tmp_args["eval_simulation"],
    num_agents=num_agents_eval,
    num_scenarios=tmp_args["num_scenarios"],
    map_dir=map_dir,
    num_carla_maps=tmp_args.get("num_carla_maps", 8),
    clean=clean_from_config,
)
args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
clean = clean or clean_from_config
```
eval_multi_scenarios has a clean parameter, but when args is None the env overrides are built using clean_from_config only (ignoring the passed clean value). This can lead to inconsistent behavior where clean=True affects swapping but not the env configuration, or callers cannot force clean=True programmatically unless they also set config. Consider making clean default to None and resolving precedence explicitly (e.g., parameter overrides config), and pass the resolved value into build_eval_overrides.
Suggested change:
```diff
-eval_overrides = build_eval_overrides(
-    simulation_mode=tmp_args["eval_simulation"],
-    num_agents=num_agents_eval,
-    num_scenarios=tmp_args["num_scenarios"],
-    map_dir=map_dir,
-    num_carla_maps=tmp_args.get("num_carla_maps", 8),
-    clean=clean_from_config,
-)
-args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
-clean = clean or clean_from_config
+resolved_clean = clean or clean_from_config
+eval_overrides = build_eval_overrides(
+    simulation_mode=tmp_args["eval_simulation"],
+    num_agents=num_agents_eval,
+    num_scenarios=tmp_args["num_scenarios"],
+    map_dir=map_dir,
+    num_carla_maps=tmp_args.get("num_carla_maps", 8),
+    clean=resolved_clean,
+)
+args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
+clean = resolved_clean
```
Runs inline eval on replay scenarios (nuPlan mini train bins) alongside the gigaflow multi_scenario_eval. Policy controls only the SDC (control_sdc_only) while other agents follow logged trajectories. Metrics log under metric_prefix "validation_replay" so they're distinguishable from the gigaflow "validation" eval.
- drive.ini: add [eval].replay_eval + replay_map_dir + replay_num_scenarios + replay_scenario_length (201 for nuPlan duration_s=20 bins) + replay_control_mode + replay_init_steps.
- pufferl.py: sibling block in _train that overrides the default WOMD replay scenario_length (91) on top of build_eval_overrides and calls eval_multi_scenarios with metric_prefix="validation_replay". Shares the clean= plumbing and _swap_policy_obs_counts — the replay env with clean=True has a different obs shape from training, and the swap keeps the live training policy usable.
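Based on the keys listed above, the new `[eval]` block might look like the following in drive.ini. Only `replay_scenario_length = 201` is stated in the commit; the map path and the other values are illustrative placeholders, not the shipped defaults:

```ini
[eval]
replay_eval = True
; placeholder path — point at the nuPlan mini train bins
replay_map_dir = /path/to/nuplan_mini_train_bins
replay_num_scenarios = 16          ; illustrative
replay_scenario_length = 201       ; nuPlan duration_s=20 bins
replay_control_mode = control_sdc_only
replay_init_steps = 0              ; illustrative
```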
When --no-render is set, calls eval_multi_scenarios across every .bin in --map-dir instead of rendering a single scenario to mp4. Produces evaluation_summary.csv in --output-dir.
Replaces the inline dropout/perturbation-zero overrides with build_eval_overrides(clean=True) — same effect, but centralizes the clean-eval logic in one place.
For replay metrics we leave offroad_behavior / collision_behavior at the eval default (=1, terminate on infraction) so the SDC is penalized per normal eval rules. The render path still forces them to 0 so the video shows the full trajectory even when the policy is far off.
draw_scene gated the waypoint draw loop on obs_only==0, so BEV view (which uses obs_only=1) had no visible goals. Drop the gate — the goal trail is the main reason to watch a BEV render.
Extends _swap_policy_obs_counts to also swap max_partner_observations and max_traffic_control_observations — they're both shared-MLP + max-pool encoders, so swapping the count on the live training policy is safe and lets the policy consume a wider obs buffer at eval.
build_eval_overrides(clean=True) now sets max_partner_observations=32 (training default 16). In BEV render the extra partner observations show up as more visible vehicles, matching the clean lane behavior.
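Why swapping these counts is weight-safe can be shown in miniature: a shared per-item encoder followed by max-pool produces a fixed-size output for any item count, and adding items can only raise the pooled values. A pure-Python sketch (a single linear layer stands in for the MLP; the layer sizes are made up and this is not the GigaFlow encoder):

```python
import random

def encode(segments, W):
    """Shared 'MLP' (one linear layer here) per segment, then max-pool.

    The output length equals len(W) no matter how many segments come in,
    which is why the count can be swapped on a live policy without
    touching any weights.
    """
    hidden = [[sum(w_i * x_i for w_i, x_i in zip(w, seg)) for w in W]
              for seg in segments]
    return [max(col) for col in zip(*hidden)]  # max-pool over segments

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]  # 4 -> 8

few = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
many = few + [[random.uniform(-1, 1) for _ in range(4)] for _ in range(16)]

out_few = encode(few, W)    # 16 items -> 8-dim output
out_many = encode(many, W)  # 32 items -> still 8-dim output
```

Since `many` is a superset of `few`, each pooled value can only stay the same or grow, matching the intuition that a wider obs buffer surfaces strictly more context (more visible vehicles in the BEV render).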
utils.py: run_driving_behaviours_eval_in_subprocess now passes
eval_mode=1, traffic_light_behavior=1, zero dropout, zero perturbations,
and max_partner_observations=32 — matches build_eval_overrides(clean=True).
Previously the subprocess re-parsed drive.ini and inherited whatever
defaults were there, so eval_mode stayed 0 (randomized TL cycle) and
training-time CLI overrides quietly dropped.
driving_behaviours_eval.ini: rebuilt around the nuPlan mini-train bins labeled under /scratch/ev2237/data/nuplan/categories/<class>/. Eleven sections (hard_stop, highway_straight, lane_change, merge, parked_cars, roundabout, stopped_traffic, traffic_light_{green,stop}, unprotected_{left,right}). Scenario length 201 for nuPlan duration_s=20.
Training can override dt for curriculum/speed experiments; eval needs to stay at 10Hz so replay-env simulation matches the logged trajectory sample rate. Otherwise waypoints drift against the SDC's actual path.
- build_eval_overrides (inline + standalone + render_scenario): dt=0.1 added to common_env so it flows through regardless of clean mode.
- run_driving_behaviours_eval_in_subprocess: --env.dt 0.1 added to the subprocess cmd so it overrides whatever drive.ini default / CLI override the parent process had.
pufferl.py:
- eval_multi_scenarios_render takes a clean= kwarg and wraps the rollout loop with _swap_policy_obs_counts when set. The standalone entry now reads eval.clean_eval from the config.
- _render_driving_behaviours builds overrides with clean=True and passes clean=True to eval_multi_scenarios_render. Matches the metric-eval subprocess, so the mp4s reflect the same clean conditions the wandb scalars do (no more flashing BEVs from inherited dropout).
- _train multi_scenario_render block: same — reads eval.clean_eval, plumbs to build_eval_overrides + eval_multi_scenarios_render.

drive.h:
- compute_metrics score threshold was hardcoded >=4, but num_target_waypoints=3 caps num_goals_reached at 3, so score was always 0. Changed to >=3. Removes the TODO/FIXME comments.
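The threshold bug is a few lines of arithmetic. This sketch is not the drive.h code (the function shape is hypothetical), it just shows why a >=4 threshold is unreachable when the goal count is capped at 3:

```python
NUM_TARGET_WAYPOINTS = 3  # caps num_goals_reached in the scenario

def score(num_goals_reached, threshold):
    # Goal count can never exceed the number of target waypoints.
    capped = min(num_goals_reached, NUM_TARGET_WAYPOINTS)
    return 1 if capped >= threshold else 0

# Old threshold (>=4): unreachable, score is always 0.
old = [score(n, 4) for n in range(6)]
# New threshold (>=3): fires once all 3 waypoints are reached.
new = [score(n, 3) for n in range(6)]
```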
Summary
Inline clean eval: zero perturbations and road-segment dropout, enforce red-light stops. Logs `validation/*` metrics under controlled conditions so progress isn't confounded by training-time noise.
- `build_eval_overrides(clean=True)` zeros `lane_segment_dropout`, `boundary_segment_dropout`, `partner_blindness_prob`, `phantom_braking_{prob,trigger_prob}`, and flips `traffic_light_behavior` to 1 (stop at red). Perturbations are always zeroed at eval regardless of `clean` since they don't change obs shape.
- `_swap_policy_obs_counts` context manager temporarily aligns the live training policy's `obs_{lane,boundary}_segment_count` with the clean eval env. The GigaFlow encoder is truly count-invariant (shared MLP + max-pool over segments) — only the obs buffer slicing depends on these counts. Safe swap.
- `eval_multi_scenarios` takes `clean=` and wraps the forward loop with the swap.
- `_train` reads `eval.clean_eval` (new ini default: True) and plumbs it through.
- Removes `_sys_instr.stderr.write` debug lines in `eval_multi_scenarios` that crashed on NameError when the eval completed (they were only imported in the render function).

Also noted: the inline render path was logging zero metrics because `render_max_steps` (e.g. 300) is much less than `scenario_length` (3000), so no episode completes within the cap and `global_infos` stays empty. The metric-only `eval_multi_scenarios` doesn't have this problem.

Test plan
- `multi_scenario_eval=True`, `clean_eval=True` (default). Policy was built at dropout=0.4 (obs_lane=48) and successfully forwarded through the clean eval env (obs_lane=80) via the swap, completing all 4 scenarios in ~8s with populated metrics (`avg_distance_per_infraction = 7.47`, `red_light_violation_rate = 0.01`).
- Checked `experiments/.../validation/epoch_N/gigaflow/`; the `validation/*` tab populates.

🤖 Generated with Claude Code