
[WIP do not review]#408

Open
eugenevinitsky wants to merge 11 commits into emerge/temp_training from ev/clean-eval

Conversation

@eugenevinitsky

Summary

Adds an inline clean eval: zeroes perturbations and road-segment dropout and enforces red-light stops. Logs validation/* metrics under controlled conditions so progress isn't confounded by training-time noise.

  • build_eval_overrides(clean=True) zeros lane_segment_dropout, boundary_segment_dropout, partner_blindness_prob, phantom_braking_{prob,trigger_prob}, and flips traffic_light_behavior to 1 (stop at red). Perturbations are always zeroed at eval regardless of clean since they don't change obs shape.
  • _swap_policy_obs_counts context manager temporarily aligns the live training policy's obs_{lane,boundary}_segment_count with the clean eval env. The GigaFlow encoder is truly count-invariant (shared MLP + max-pool over segments) — only the obs buffer slicing depends on these counts. Safe swap.
  • eval_multi_scenarios takes clean= and wraps the forward loop with the swap.
  • Inline eval site in _train reads eval.clean_eval (new ini default: True) and plumbs it through.
  • Removed stale _sys_instr.stderr.write debug lines in eval_multi_scenarios that crashed with a NameError whenever the eval ran to completion (_sys_instr is only imported inside the render function).
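A minimal sketch of the swap described above, assuming the policy and eval env expose the counts as plain attributes (obs_lane_segment_count, obs_boundary_segment_count, per this summary); the actual _swap_policy_obs_counts in pufferlib/pufferl.py may read them from the env config instead:

```python
import contextlib

@contextlib.contextmanager
def swap_policy_obs_counts(policy, eval_env):
    """Temporarily align the policy's segment counts with the eval env.

    Hypothetical sketch: safe only because the encoder is count-invariant
    (shared MLP + max-pool over segments); only obs-buffer slicing depends
    on these counts.
    """
    fields = ("obs_lane_segment_count", "obs_boundary_segment_count")
    saved = {f: getattr(policy, f) for f in fields}
    try:
        for f in fields:
            setattr(policy, f, getattr(eval_env, f))
        yield policy
    finally:
        # Always restore the training-time counts, even if the rollout raises.
        for f in fields:
            setattr(policy, f, saved[f])
```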

Also noted: the inline render path was logging zero metrics because render_max_steps (e.g. 300) is much less than scenario_length (3000), so no episode completes within the cap and global_infos stays empty. The metric-only eval_multi_scenarios doesn't have this problem.

Test plan

  • Short training on emerge2 with dropout=0.4, perturbations>0, multi_scenario_eval=True, clean_eval=True (default). Policy was built at dropout=0.4 (obs_lane=48) and successfully forwarded through the clean eval env (obs_lane=80) via the swap, completing all 4 scenarios in ~8s with populated metrics (avg_distance_per_infraction = 7.47, red_light_violation_rate = 0.01).
  • Eval summary CSV written to experiments/.../validation/epoch_N/gigaflow/.
  • Longer run on cluster with wandb to verify validation/* tab populates.

🤖 Generated with Claude Code

eugenevinitsky and others added 2 commits April 18, 2026 21:38
- build_eval_overrides: always force perturbations to 0 (safe: no obs
  shape change). Add a clean= kwarg that additionally zeros road-segment
  dropout and flips traffic_light_behavior to 1 (stop at red).
- _swap_policy_obs_counts: context manager that temporarily aligns the
  live training policy's obs_{lane,boundary}_segment_count with the eval
  env. The GigaFlow encoder's lane/boundary encoders are shared MLPs +
  max-pool over segments — weights are count-invariant, only slicing
  depends on these counts. So the same training policy runs correctly on
  a clean env with zero dropout (larger obs buffer) once we swap.
- eval_multi_scenarios accepts clean= and wraps the forward loop with
  the swap when clean is True.
- Inline eval call site (multi_scenario_eval) reads eval.clean_eval from
  the config and plumbs clean= all the way through.
- drive.ini: add eval.clean_eval=True default.

Fixes "no validation metrics": the inline render path (which HAD been
running on multi_scenario_render_interval) only populates global_infos
when an episode completes, but render_max_steps << scenario_length so
episodes never end. Use multi_scenario_eval=True to get real numbers.
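The override logic this commit describes can be sketched roughly as follows (a hypothetical standalone form; the real build_eval_overrides also takes simulation_mode, num_agents, map_dir, and so on per the call sites quoted in the review below):

```python
def clean_eval_overrides(clean=False):
    """Sketch of the clean/non-clean split in build_eval_overrides."""
    # Perturbations are always zeroed at eval: they don't change obs shape.
    overrides = {
        "partner_blindness_prob": 0.0,
        "phantom_braking_prob": 0.0,
        "phantom_braking_trigger_prob": 0.0,
    }
    if clean:
        # Dropout changes the obs shape; red lights become hard stops.
        overrides.update(
            lane_segment_dropout=0.0,
            boundary_segment_dropout=0.0,
            traffic_light_behavior=1,  # stop at red
        )
    return overrides
```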

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The _sys_instr.stderr.write instrumentation was added as debug tracing
for the render path's close_client sequence. It was also dropped into
eval_multi_scenarios by mistake — but _sys_instr is only imported
inside eval_multi_scenarios_render, so the non-render eval path
crashes with NameError whenever it runs to completion.

Just remove the instrumentation from the non-render eval function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 19, 2026 01:57

Copilot AI left a comment


Pull request overview

Adds a “clean” inline evaluation mode for Drive to reduce evaluation-time noise (dropout/perturbations) and enforce red-light stops, while allowing reuse of the live training policy by temporarily aligning its road-segment slicing counts with the eval environment.

Changes:

  • Introduces clean_eval plumbing from config → build_eval_overrides(..., clean=...) → eval_multi_scenarios(..., clean=...).
  • Adds _swap_policy_obs_counts context manager and wraps the eval rollout loop to handle mismatched road-segment counts between training and clean eval.
  • Forces robustness perturbations (partner blindness / phantom braking) to zero during eval and removes stale _sys_instr.stderr.write debug lines that could crash.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: pufferlib/pufferl.py
Description: Adds clean-eval overrides, the policy obs-count swap context manager, and threads clean through inline + CLI evaluation paths.

File: pufferlib/config/ocean/drive.ini
Description: Adds eval.clean_eval = True default and documents its intent for controlled validation metrics.


Comment thread pufferlib/pufferl.py
Comment on lines +2264 to +2268
# Clean eval may use different road-dropout than training. The shared
# training policy's obs slicing needs to be aligned with this env; see
# _swap_policy_obs_counts.
swap_ctx = _swap_policy_obs_counts(policy, vecenv) if clean else contextlib.nullcontext()
with swap_ctx, tqdm(total=num_scenarios, desc="Processing scenarios", disable=quiet) as pbar:

Copilot AI Apr 19, 2026


New behavior depends on _swap_policy_obs_counts preventing forward-time crashes when training/eval road-segment dropouts differ (obs buffer size changes). There are no automated tests covering this path (e.g., training policy built with dropout>0, eval env with dropout=0, clean=True), so regressions here are likely to be caught only at runtime. Adding a focused test that constructs a Drive policy/env pair with mismatched segment counts and verifies eval_multi_scenarios(clean=True) runs at least one short rollout without shape/index errors would help lock in this guarantee.

Comment thread pufferlib/pufferl.py
Comment on lines +1882 to +1887
clean: if True, run a "clean" eval — zero road-segment dropout and
enforce red-light stops. Only safe when the policy is rebuilt
from the eval env (standalone eval / render_scenario.py). Inline
eval during training reuses the live training policy, whose
encoder was built for the training obs shape; zeroing dropout
there changes the obs shape and triggers a CUDA device-side

Copilot AI Apr 19, 2026


The clean docstring now says clean eval is only safe when the policy is rebuilt and that inline eval will crash due to an obs-shape mismatch. In this PR, inline eval explicitly passes clean=clean_eval and relies on _swap_policy_obs_counts to make this safe, so the docstring is misleading/outdated. Please update the docstring to describe the current safety model (i.e., inline eval is safe when using the swap, and any remaining limitations).

Comment thread pufferlib/pufferl.py Outdated
Comment on lines +1919 to +1922
# Dropout changes the obs shape. Only safe when the policy is
# rebuilt from the eval env (standalone eval / render_scenario).
# NEVER pass clean=True from an inline-eval call site — the live
# training policy's encoder was built for the training obs shape.

Copilot AI Apr 19, 2026


The inline comment under if clean: says "NEVER pass clean=True from an inline-eval call site", but train() now does exactly that and eval_multi_scenarios contains logic to support it via _swap_policy_obs_counts. This comment should be revised (or removed) because it contradicts the new behavior and could cause future callers to avoid the supported path or misdiagnose failures.

Suggested change:
- # Dropout changes the obs shape. Only safe when the policy is
- # rebuilt from the eval env (standalone eval / render_scenario).
- # NEVER pass clean=True from an inline-eval call site — the live
- # training policy's encoder was built for the training obs shape.
+ # Dropout changes the obs shape. Standalone eval / render_scenario
+ # are safe because the policy is rebuilt from the eval env. Inline
+ # eval may also pass clean=True, but only when the caller updates
+ # the live policy's expected obs counts to match the eval env.

Comment thread pufferlib/pufferl.py
Comment on lines 2176 to +2185
eval_overrides = build_eval_overrides(
simulation_mode=tmp_args["eval_simulation"],
num_agents=num_agents_eval,
num_scenarios=tmp_args["num_scenarios"],
map_dir=map_dir,
num_carla_maps=tmp_args.get("num_carla_maps", 8),
clean=clean_from_config,
)
args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
clean = clean or clean_from_config

Copilot AI Apr 19, 2026


eval_multi_scenarios has a clean parameter, but when args is None the env overrides are built using clean_from_config only (ignoring the passed clean value). This can lead to inconsistent behavior where clean=True affects swapping but not the env configuration, or callers cannot force clean=True programmatically unless they also set config. Consider making clean default to None and resolving precedence explicitly (e.g., parameter overrides config), and pass the resolved value into build_eval_overrides.

Suggested change:
- eval_overrides = build_eval_overrides(
-     simulation_mode=tmp_args["eval_simulation"],
-     num_agents=num_agents_eval,
-     num_scenarios=tmp_args["num_scenarios"],
-     map_dir=map_dir,
-     num_carla_maps=tmp_args.get("num_carla_maps", 8),
-     clean=clean_from_config,
- )
- args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
- clean = clean or clean_from_config
+ resolved_clean = clean or clean_from_config
+ eval_overrides = build_eval_overrides(
+     simulation_mode=tmp_args["eval_simulation"],
+     num_agents=num_agents_eval,
+     num_scenarios=tmp_args["num_scenarios"],
+     map_dir=map_dir,
+     num_carla_maps=tmp_args.get("num_carla_maps", 8),
+     clean=resolved_clean,
+ )
+ args = load_eval_multi_scenarios_config(env_name, model_path, eval_overrides)
+ clean = resolved_clean

eugenevinitsky and others added 9 commits April 19, 2026 19:37
Runs inline eval on replay scenarios (nuPlan mini train bins) alongside
the gigaflow multi_scenario_eval. Policy controls only the SDC
(control_sdc_only) while other agents follow logged trajectories.
Metrics log under metric_prefix "validation_replay" so they're
distinguishable from the gigaflow "validation" eval.

- drive.ini: add [eval].replay_eval + replay_map_dir +
  replay_num_scenarios + replay_scenario_length (201 for nuPlan
  duration_s=20 bins) + replay_control_mode + replay_init_steps.
- pufferl.py: sibling block in _train that overrides the default WOMD
  replay scenario_length (91) on top of build_eval_overrides and calls
  eval_multi_scenarios with metric_prefix="validation_replay". Shares
  the clean= plumbing and _swap_policy_obs_counts — the replay env
  with clean=True has a different obs shape from training, and the
  swap keeps the live training policy usable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When --no-render is set, calls eval_multi_scenarios across every .bin
in --map-dir instead of rendering a single scenario to mp4. Produces
evaluation_summary.csv in --output-dir.

Replaces the inline dropout/perturbation-zero overrides with
build_eval_overrides(clean=True) — same effect, but centralizes the
clean-eval logic in one place.

For replay metrics we leave offroad_behavior / collision_behavior at
the eval default (=1, terminate on infraction) so the SDC is penalized
per normal eval rules. The render path still forces them to 0 so the
video shows the full trajectory even when the policy is far off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
draw_scene gated the waypoint draw loop on obs_only==0, so BEV view
(which uses obs_only=1) had no visible goals. Drop the gate — the
goal trail is the main reason to watch a BEV render.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _swap_policy_obs_counts to also swap max_partner_observations
and max_traffic_control_observations — they're both shared-MLP +
max-pool encoders, so swapping the count on the live training policy
is safe and lets the policy consume a wider obs buffer at eval.

build_eval_overrides(clean=True) now sets max_partner_observations=32
(training default 16). In BEV render the extra partner observations
show up as more visible vehicles, matching the clean lane behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
utils.py: run_driving_behaviours_eval_in_subprocess now passes
eval_mode=1, traffic_light_behavior=1, zero dropout, zero perturbations,
and max_partner_observations=32 — matches build_eval_overrides(clean=True).
Previously the subprocess re-parsed drive.ini and inherited whatever
defaults were there, so eval_mode stayed 0 (randomized TL cycle) and
training-time CLI overrides quietly dropped.

driving_behaviours_eval.ini: rebuilt around the nuPlan mini-train bins
labeled under /scratch/ev2237/data/nuplan/categories/<class>/. Eleven
sections (hard_stop, highway_straight, lane_change, merge, parked_cars,
roundabout, stopped_traffic, traffic_light_{green,stop}, unprotected_
{left,right}). Scenario length 201 for nuPlan duration_s=20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Training can override dt for curriculum/speed experiments; eval needs
to stay at 10Hz so replay-env simulation matches the logged trajectory
sample rate. Otherwise waypoints drift against the SDC's actual path.

- build_eval_overrides (inline + standalone + render_scenario): dt=0.1
  added to common_env so it flows through regardless of clean mode.
- run_driving_behaviours_eval_in_subprocess: --env.dt 0.1 added to
  the subprocess cmd so it overrides whatever drive.ini default / CLI
  override the parent process had.
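A back-of-envelope illustration of the drift this commit fixes (all numbers hypothetical): logged nuPlan bins are sampled at 10 Hz, so waypoint k sits at t = 0.1 * k. If training overrode dt to 0.05 s, the sim at step k is only at t = 0.05 * k, and a constant-speed SDC trails waypoint k by speed * k * (dt_log - dt_sim) meters:

```python
# Hypothetical constant-speed drift estimate; real drift depends on the
# logged trajectory, but the proportionality to (dt_log - dt_sim) holds.
dt_log, dt_sim, speed = 0.1, 0.05, 10.0  # s, s, m/s
steps = 100
drift_m = speed * steps * (dt_log - dt_sim)
print(round(drift_m, 6))  # roughly 50 m behind the logged waypoints
```

Pinning eval to dt=0.1 makes the per-step sim time match the per-waypoint log time, so drift_m is zero by construction.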

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pufferl.py:
- eval_multi_scenarios_render takes a clean= kwarg, wraps the rollout
  loop with _swap_policy_obs_counts when set. Standalone entry now
  reads eval.clean_eval from the config.
- _render_driving_behaviours builds overrides with clean=True and
  passes clean=True to eval_multi_scenarios_render. Matches the
  metric-eval subprocess so the mp4s reflect the same clean conditions
  the wandb scalars do (no more flashing BEVs from inherited dropout).
- _train multi_scenario_render block: same — reads eval.clean_eval,
  plumbs to build_eval_overrides + eval_multi_scenarios_render.

drive.h:
- compute_metrics score threshold was hardcoded >=4, but num_target_
  waypoints=3 caps num_goals_reached at 3, so score was always 0.
  Changed to >=3. Removes the TODO/FIXME comments.
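The threshold bug restated in Python (a sketch of the drive.h logic; the cap and the thresholds follow the commit message, names are illustrative):

```python
def scenario_score(num_goals_reached, num_target_waypoints=3):
    """Compare the old (>= 4) and fixed (>= 3) success criteria."""
    # num_goals_reached is capped at num_target_waypoints upstream, so the
    # old check could never pass; >= num_target_waypoints can.
    capped = min(num_goals_reached, num_target_waypoints)
    old_score = 1 if capped >= 4 else 0  # always 0
    new_score = 1 if capped >= 3 else 0  # 1 when all waypoints reached
    return old_score, new_score
```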

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>