
[WIP] Trajviz: Vulkan offline trajectory renderer #398

Open
eugenevinitsky wants to merge 5 commits into 3.0 from ev/visualize_tooling

Conversation

@eugenevinitsky

Summary

A new headless Vulkan-backed renderer that turns saved Drive trajectories into MP4 videos. Independent of the existing raylib visualizer (scripts/build_ocean.sh visualize) — opt-in via TRAJVIZ=1 python setup.py build_ext --inplace. Drive sim code (drive.{h,c,py}, binding.c) is untouched.

Status: WIP. Builds, runs, renders correctly on real Drive sims. There is more to do (see below) before this is merge-ready.

  • Public Python API: Renderer.render_episode(...) and Renderer.render_batch([...]) (up to 16 episodes per batch)
  • CLI: python -m pufferlib.ocean.drive.trajviz <inputs> --maps-dir ... --out ...
  • Random-rollout smoke test: python -m pufferlib.ocean.drive.trajviz.tools.random_rollout
  • Standalone C harness: pufferlib/ocean/drive/trajviz/tests/test_main.c (no Python required)

Performance

On RTX 4080 + 16-core CPU, 1280×720, 90-frame episodes, both views, libx264 -preset veryfast:

batch_size   wall time   per-episode   ep/s
1            345 ms      345 ms        2.9
4            1094 ms     274 ms        3.7
8            2136 ms     267 ms        3.7

Pure Vulkan + readback floor (encoder bypassed) is ~30 ms / ep ≈ 32 ep/s. The remaining gap is libx264 — NVENC integration (deferred) would close it.

Headline optimizations applied during development:

  • HOST_CACHED readback memory — single biggest win (~6-7×). Default HOST_VISIBLE | HOST_COHERENT on NVIDIA picks write-combined PCIe BAR memory, fast for the GPU but ~250 MB/s for CPU reads. HOST_CACHED puts it in regular RAM (>5 GB/s).
  • LINE_STRIP polylines — one draw call per polyline instead of per segment.
  • Vertically-tiled atlas for batched rendering — N episodes' tiles stacked vertically so each tile's bytes are row-contiguous in the readback buffer (one fwrite per pipe per frame, no row stitching).
  • Threaded fwrite fan-out — each ffmpeg pipe owns a writer thread; per-frame write phase costs max(single fwrite) instead of sum(fwrites).
  • F_SETPIPE_SZ — bumps the kernel pipe buffer up to whatever the per-process limit allows so fwrites don't ping-pong on a 64 KB pipe.
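
The HOST_CACHED point above can be sketched as a pure-Python model of the memory-type scan a Vulkan renderer typically does over `vkGetPhysicalDeviceMemoryProperties`. The bit values match `VkMemoryPropertyFlagBits`; the helper itself is hypothetical, not the trajviz code:

```python
# VkMemoryPropertyFlagBits values (from the Vulkan spec).
HOST_VISIBLE, HOST_COHERENT, HOST_CACHED = 0x2, 0x4, 0x8

def pick_readback_type(memory_type_flags):
    """Prefer HOST_VISIBLE|HOST_CACHED (cacheable system RAM, fast CPU
    reads); fall back to HOST_VISIBLE|HOST_COHERENT, which on NVIDIA is
    often write-combined PCIe BAR memory (~250 MB/s CPU reads)."""
    for want in (HOST_VISIBLE | HOST_CACHED, HOST_VISIBLE | HOST_COHERENT):
        for i, flags in enumerate(memory_type_flags):
            if flags & want == want:
                return i
    raise RuntimeError("no host-visible memory type")

# Hypothetical NVIDIA-like layout: type 1 is the BAR type, type 2 adds
# HOST_CACHED -- the scan picks type 2 for readback.
types = [0x1, HOST_VISIBLE | HOST_COHERENT,
         HOST_VISIBLE | HOST_COHERENT | HOST_CACHED]
assert pick_readback_type(types) == 2
```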

Architecture

__init__.py / _native.c / trajviz.{h,c}
                ↓
   vk_renderer (single)   vk_batch_renderer (tiled atlas)
                ↓                  ↓
   vk_pipeline / vk_context / ffmpeg_pipe (+ writer thread)
                ↓                  ↓
        Vulkan 1.3             ffmpeg subprocess

Two views, matching the existing live raylib path:

  • Top-down = RenderView.FULL_SIM_STATE — orthographic full-map
  • BEV = RenderView.BEV_AGENT_OBS — agent-centric ~100m × 178m window, ego at center facing up
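
The "ego at center facing up" convention above amounts to a 2-D rotation plus translation; a minimal sketch, assuming `ego_heading` is the world-frame yaw in radians (the helper name is hypothetical):

```python
import math

def bev_world_to_view(px, py, ego_x, ego_y, ego_heading):
    """Map a world-frame point into an ego-centric BEV frame:
    ego at the origin, ego's forward direction pointing up (+Y)."""
    dx, dy = px - ego_x, py - ego_y
    phi = math.pi / 2.0 - ego_heading  # rotate the forward vector onto +Y
    vx = math.cos(phi) * dx - math.sin(phi) * dy
    vy = math.sin(phi) * dx + math.cos(phi) * dy
    return vx, vy

# Ego facing +X: a point 10 m ahead lands straight up in the view.
vx, vy = bev_world_to_view(10.0, 0.0, 0.0, 0.0, 0.0)
assert abs(vx) < 1e-9 and abs(vy - 10.0) < 1e-9
```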

Documentation in docs/src/trajviz.md covers prerequisites (apt packages), build, Python and CLI usage, performance tuning (sysctl knobs, HOST_CACHED explanation, batch size guidance), debugging env vars, troubleshooting, and architecture overview.

Why Vulkan, not raylib

  • Headless on Linux clusters with no X server (raylib needs xvfb)
  • Throughput-oriented batching (impossible without command-buffer control)
  • Independent build path so the optional Vulkan dep doesn't pollute the live drive sim build

Why WIP — known gaps before merge

  • No NPC / expert-replay agents — currently only renders the controlled agents from get_sim_trajectories. The other ~18 vehicles in a typical Waymo scenario (the WOSAC "context" tracks) aren't shown. Adding them needs a separate Drive API to expose expert trajectories.
  • No 3D follow-cam: RenderView.AGENT_PERSP (3D car meshes from .glb) is not implemented in trajviz. Top-down + BEV only.
  • CPU-bound by libx264 once batched — NVENC integration (Vulkan video encode or libnvidia-encode + CUDA-Vulkan interop) would unlock the remaining ~12% gap to the pure-GPU ceiling.
  • batch_size capped at 16 — atlas image height grows linearly with batch size. Past ~22 tiles (22 × 720 px rows approaches the common 16384 maxImageDimension2D limit) we'd need multiple atlas passes or a 2-D tile grid.
  • Uniform num_steps per batch — short episodes get padded with zeros. Wastes a tiny bit of GPU work on the trailing zeros.
  • Buffer/image helpers duplicated between vk_renderer.c and vk_batch_renderer.c (~50 lines each). Should be consolidated into a shared helper before merge.

Test plan

  • TRAJVIZ=1 python setup.py build_ext --inplace — clean build, no warnings beyond the standard PyCFunction cast
  • Single-episode render via random_rollout.py — 90-frame MP4 with valid h264, 1280×720, both views
  • Batched render of 4 episodes — 4 valid MP4s, ~270 ms/ep
  • Visual sanity check on extracted frames (ego centered in BEV, Waymo road geometry slides correctly)
  • Render an actual saved trajectories_*.npz from a real training run end-to-end via render_npz
  • Try on a non-NVIDIA GPU (AMD radv, Intel) to confirm HOST_CACHED fallback works
  • CI build with TRAJVIZ=1 so the extension keeps compiling
  • Decide whether to delete notebooks/visualize_trajectories.py once map_io.py is the canonical parser

🤖 Generated with Claude Code

eugenevinitsky and others added 3 commits April 11, 2026 14:43
Ported from ev/yolo (~6 commits) as a single clean change, skipping debug
commits and the per-component reward logging / partner obs changes that
were tangled into the same branch history.

## C side (drive.h, datatypes.h, env_binding.h)

- Agent struct gets four float* buffers (sim_traj_x/y/z/heading) sized to
  episode_length. Allocated in init() after set_active_agents, freed in
  free_agent.
- c_step writes the post-move_dynamics state into position t = timestep-1.
  Cheap: 4 float copies per agent per step.
- c_get_sim_trajectories(env, x, y, z, heading, lengths, ep_len) copies all
  active agents' buffers into output arrays. lengths[i] = env->timestep so
  callers know how much of the buffer is valid for the current episode.
- vec_get_sim_trajectories: Python-facing wrapper that iterates sub-envs
  with agent-offset accounting.
- vec_get_world_mean: (x, y, z) tuple from env 0. Used by save_trajectories
  to lift sim coords back into the source map frame for offline rendering.

## Python side (drive.py)

- Drive.__init__ accepts traj_save_dir kwarg; stores it along with a
  _worker_idx slot that vector.py fills in for multiprocessing workers.
- Cache world_mean after binding.vectorize (and on resample_maps).
- get_sim_trajectories() allocates output arrays and calls the binding.
- notify() writes a per-worker traj_worker_{idx}.npz containing the
  sim trajectories, map_ids, agent_offsets, map_files, world_mean.
  Called from workers via the shm notify mechanism.
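
A hedged sketch of the per-worker file described above — the exact schema is an assumption based on the keys this commit message names:

```python
import os
import tempfile
import numpy as np

def write_worker_npz(tmpdir, worker_idx, x, y, z, heading, lengths,
                     map_ids, map_files, world_mean):
    """Write one traj_worker_{idx}.npz with the keys named above.
    Hypothetical helper; shapes/dtypes are illustrative."""
    path = os.path.join(tmpdir, f"traj_worker_{worker_idx}.npz")
    np.savez(path, x=x, y=y, z=z, heading=heading, lengths=lengths,
             map_ids=map_ids,
             map_files=np.asarray(map_files, dtype=object),
             world_mean=world_mean)
    return path

with tempfile.TemporaryDirectory() as d:
    n_agents, ep_len = 3, 90
    x = np.zeros((n_agents, ep_len), np.float32)
    p = write_worker_npz(d, 0, x, x, x, x,
                         np.full(n_agents, 42, np.int32),
                         np.zeros(1, np.int32), ["map_000.bin"],
                         np.zeros(3, np.float32))
    loaded = np.load(p, allow_pickle=True)
    assert loaded["x"].shape == (3, 90) and loaded["lengths"][0] == 42
```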

## Multiprocessing plumbing (vector.py)

- _worker_process tags each env (or sub-envs of a Serial wrapper) with
  _worker_idx after construction so env.notify() knows which file to write.
- Multiprocessing.save_worker_trajectories() sets the notify flag for all
  workers and spins until they all clear it (workers respond inside their
  step loop).
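
The notify handshake above can be modeled in a few lines; the timeout guard is a hypothetical addition (not in the original), included because a dead worker would otherwise stall the loop forever:

```python
import threading
import time
import numpy as np

def wait_for_workers(notify_flags, timeout_s=30.0, poll_s=0.01):
    """Raise the notify flag for every worker, then poll until each
    worker clears its own flag inside its step loop."""
    notify_flags[:] = True
    deadline = time.monotonic() + timeout_s
    while notify_flags.any():
        if time.monotonic() > deadline:
            raise TimeoutError("worker(s) never cleared notify flag")
        time.sleep(poll_s)

flags = np.ones(4, dtype=bool)
# Simulate workers clearing their flags shortly after the request.
threading.Timer(0.05, lambda: flags.fill(False)).start()
wait_for_workers(flags, timeout_s=2.0)
assert not flags.any()
```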

## Checkpoint integration (pufferl.py)

- save_trajectories() dumps the rolling policy buffers (actions, rewards,
  values, logprobs, terminals, truncations) + C-side trajectories + map
  context into trajectories_{epoch:06d}.npz. Supports both multiprocessing
  (fan out via save_worker_trajectories, stitch worker files) and Serial
  (read driver_env directly).
- save_reproducibility() snapshots the compiled .so, key source files,
  active config, and git commit/diff on the first checkpoint of a run.
- Both called inside the existing checkpoint block in train().
- train() pre-creates data_dir/traj_tmp and threads traj_save_dir into
  args["env"] so workers inherit it automatically via env kwargs.
- Opt-out via `save_trajectories: False` in the train config.

Verified locally: C extension builds, Drive.get_sim_trajectories() returns
correctly-shaped arrays with live sim positions, world_mean binding works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pulled verbatim from ev/yolo (commit af9119a). jupytext percent-format
script that reads the trajectories_<epoch>.npz written by
PuffeRL.save_trajectories and renders agent paths on top of the source
map. Uses world_mean to align sim coords with the map frame.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A new headless Vulkan-backed renderer that turns saved Drive
trajectories into MP4 videos. Independent of the existing raylib
visualizer (`scripts/build_ocean.sh visualize`) — opt-in via
`TRAJVIZ=1 python setup.py build_ext --inplace`. Optional dependency,
won't affect users who don't need it.

## Public surface

- Python: `pufferlib.ocean.drive.trajviz.Renderer`
  - `render_episode(...)` for single-episode rendering
  - `render_batch([...])` for multi-episode batched rendering (up to 16)
  - `render_npz(path, maps_dir, out_dir)` for saved trajectories_*.npz
- CLI: `python -m pufferlib.ocean.drive.trajviz <inputs> --maps-dir ... --out ...`
- Random-rollout smoke test:
  `python -m pufferlib.ocean.drive.trajviz.tools.random_rollout`
- Standalone C harness: `tests/test_main.c` (no Python required)

## Architecture

- `_native.c`     CPython extension shell (numpy unwrap, GIL release)
- `trajviz.{h,c}` public C API: render_episode, render_episodes_batch
- `vk_context`    VkInstance / VkDevice / queue / command pool
- `vk_pipeline`   line + box graphics pipelines, push-constant cameras
- `vk_renderer`   single-episode renderer
- `vk_batch_renderer`  batched renderer with vertically-tiled atlas
                       (per-episode tile bytes are row-contiguous in
                       the readback buffer)
- `ffmpeg_pipe`   pipe-to-ffmpeg + per-pipe writer thread for parallel
                  fan-out fwrites
- `shaders/`      GLSL → SPIR-V (compiled at build time, embedded as
                  uint32_t arrays in a generated `shaders.c`)

## Two views

Matches the existing live raylib path:
- Top-down (RenderView.FULL_SIM_STATE): orthographic full-map
- BEV (RenderView.BEV_AGENT_OBS): agent-centric ~100m × 178m window,
  ego at center facing up

## Performance

On RTX 4080 + 16-core CPU, 1280x720 90-frame episodes with both views,
libx264 -preset veryfast:

  batch_size=1:   345 ms / ep   (2.9 ep/s)
  batch_size=4:  1094 ms total  (274 ms / ep, 3.7 ep/s)
  batch_size=8:  2136 ms total  (267 ms / ep, 3.7 ep/s)

Pure Vulkan + readback floor (no encoder): ~30 ms / ep ≈ 32 ep/s.
The remaining gap is libx264 encoding — NVENC would close it.

Key optimizations applied:
- HOST_CACHED readback memory (6-7× win on its own — uncached PCIe
  BAR reads were ~250 MB/s; cached RAM reads are >5 GB/s)
- LINE_STRIP polylines (one draw per polyline, not per segment)
- Tiled vertical atlas for batched rendering (single submit per frame
  for N episodes; per-tile bytes are row-contiguous for one fwrite
  per pipe per frame)
- Threaded fwrites (per-pipe writer threads, parallel fan-out)
- F_SETPIPE_SZ to fit a frame in one pipe buffer
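
The F_SETPIPE_SZ point can be sketched from Python — a best-effort model, assuming Linux and rawvideo rgb24 frames (the constant is defined manually because not every Python build exposes it):

```python
import contextlib
import fcntl
import os

F_SETPIPE_SZ = 1031  # Linux fcntl command number (assumption)

def frame_bytes(width, height, bytes_per_pixel=3):
    """One rgb24 frame as piped to ffmpeg."""
    return width * height * bytes_per_pixel

def grow_pipe(fd, nbytes):
    """Ask the kernel to round the pipe buffer up to nbytes. Fails
    silently past /proc/sys/fs/pipe-max-size, matching the 'whatever the
    per-process limit allows' behavior described above."""
    with contextlib.suppress(OSError):
        return fcntl.fcntl(fd, F_SETPIPE_SZ, nbytes)
    return None

# A 1280x720 rgb24 frame is ~2.6 MiB -- far beyond the 64 KB default.
assert frame_bytes(1280, 720) == 2_764_800
r, w = os.pipe()
grow_pipe(w, 1 << 20)  # request 1 MiB (may be clamped or refused)
os.close(r); os.close(w)
```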

## Why Vulkan, not raylib

- Headless on Linux clusters with no X server (raylib needs xvfb)
- Throughput-oriented batching (impossible without command-buffer
  control)
- Independent build path so the optional Vulkan dep doesn't pollute
  the live drive sim build

## Documentation

`docs/src/trajviz.md` covers prerequisites (apt packages), build,
Python and CLI usage, performance tuning (sysctl knobs, HOST_CACHED
explanation, batch size guidance), debugging env vars, troubleshooting,
and architecture overview.

## Notes

- `pufferlib/ocean/drive/map_io.py` extracted from
  `notebooks/visualize_trajectories.py` so trajviz and the notebook
  share one .bin map parser. The notebook still works.
- `trajviz/shaders.c` is generated at build time and gitignored.
- Drive sim code (drive.{h,c,py}, binding.c) is untouched.
Copilot AI review requested due to automatic review settings April 11, 2026 22:20

Copilot AI left a comment


Pull request overview

Adds an opt-in, headless Vulkan-backed “trajviz” renderer for offline Drive trajectory visualization (MP4 output), and wires up trajectory capture/saving during training checkpoints to feed the renderer—without changing the live raylib visualizer path.

Changes:

  • Introduces pufferlib.ocean.drive.trajviz (C/Vulkan renderer + CPython extension + CLI/tools) built only when TRAJVIZ=1.
  • Adds C-side sim-trajectory recording and Python/C bindings to retrieve + save per-checkpoint trajectories (including map context) for offline rendering.
  • Adds docs and a shared .bin map parser (map_io.py) to support loading and rendering saved trajectories.

Reviewed changes

Copilot reviewed 35 out of 37 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
setup.py Adds opt-in build path for the trajviz Vulkan CPython extension and shader build step.
pufferlib/vector.py Tags worker envs with _worker_idx; adds save_worker_trajectories() fan-out helper.
pufferlib/pufferl.py Saves trajectories at checkpoint time; adds reproducibility snapshot; threads traj temp dir into env kwargs.
pufferlib/ocean/env_binding.h Adds vectorized bindings to pull sim trajectories and world_mean from C envs.
pufferlib/ocean/drive/drive.py Caches world_mean; adds get_sim_trajectories() and notify() to write per-worker .npz.
pufferlib/ocean/drive/drive.h Allocates/records per-step sim trajectories in C; exposes c_get_sim_trajectories().
pufferlib/ocean/drive/datatypes.h Extends Agent with sim-trajectory buffers and frees them on teardown.
pufferlib/ocean/drive/map_io.py Adds canonical parser/transform helpers for Drive .bin maps for offline tooling.
pufferlib/ocean/drive/trajviz/__init__.py Python Renderer wrapper + render_npz() convenience loader.
pufferlib/ocean/drive/trajviz/__main__.py CLI entry point to render one or more trajectories_*.npz inputs.
pufferlib/ocean/drive/trajviz/_native.c CPython extension: numpy validation + GIL release around render calls.
pufferlib/ocean/drive/trajviz/trajviz.h Public C API for single-episode and batched rendering.
pufferlib/ocean/drive/trajviz/trajviz.c Orchestrates Vulkan renderers + ffmpeg pipes; implements batch tiling path.
pufferlib/ocean/drive/trajviz/vk_context.h Declares Vulkan instance/device/queue lifecycle and error helpers.
pufferlib/ocean/drive/trajviz/vk_context.c Implements Vulkan init (1.3 + dynamic rendering + sync2) and teardown.
pufferlib/ocean/drive/trajviz/vk_pipeline.h Declares shared pipeline/push-constant and instance formats.
pufferlib/ocean/drive/trajviz/vk_pipeline.c Creates Vulkan graphics pipelines for polylines and agent boxes.
pufferlib/ocean/drive/trajviz/vk_renderer.h Declares single-episode renderer (frame slots, readback, ffmpeg drain).
pufferlib/ocean/drive/trajviz/vk_renderer.c Implements per-frame rendering + readback + ffmpeg write for one episode.
pufferlib/ocean/drive/trajviz/vk_batch_renderer.h Declares batched atlas renderer API.
pufferlib/ocean/drive/trajviz/vk_batch_renderer.c Implements batched atlas rendering + threaded pipe write fan-out.
pufferlib/ocean/drive/trajviz/vk_math.h Adds small mat4 helpers for ortho fit and BEV camera.
pufferlib/ocean/drive/trajviz/ffmpeg_pipe.h Declares ffmpeg subprocess pipe + writer-thread API.
pufferlib/ocean/drive/trajviz/ffmpeg_pipe.c Implements popen-based ffmpeg piping and async writer thread.
pufferlib/ocean/drive/trajviz/shaders.h Declares externs for generated SPIR-V blobs.
pufferlib/ocean/drive/trajviz/shaders/build_shaders.sh Builds GLSL → SPIR-V and generates shaders.c.
pufferlib/ocean/drive/trajviz/shaders/polyline.vert Adds GLSL for road polyline vertex stage.
pufferlib/ocean/drive/trajviz/shaders/polyline.frag Adds GLSL for road polyline fragment stage.
pufferlib/ocean/drive/trajviz/shaders/agent_box.vert Adds GLSL for instanced agent quad expansion.
pufferlib/ocean/drive/trajviz/shaders/agent_box.frag Adds GLSL for flat-colored agent box fragment stage.
pufferlib/ocean/drive/trajviz/tools/random_rollout.py End-to-end smoke test that rolls out Drive and renders via trajviz.
pufferlib/ocean/drive/trajviz/tools/__init__.py Marks tools as a package (module discovery).
pufferlib/ocean/drive/trajviz/tests/test_main.c Standalone C harness to validate Vulkan+ffmpeg path without Python.
docs/src/trajviz.md Adds end-user documentation: build/run/tuning/architecture/troubleshooting.
docs/src/SUMMARY.md Links trajviz documentation into the docs sidebar.
notebooks/visualize_trajectories.py Adds/updates notebook for analyzing and plotting saved trajectories.
.gitignore Ignores generated pufferlib/ocean/drive/trajviz/shaders.c.


Comment thread pufferlib/pufferl.py
Comment on lines +720 to +740
if hasattr(self.vecenv, "save_worker_trajectories"):
    traj_tmp = getattr(driver_env, "_traj_save_dir", None) if driver_env else None
    if traj_tmp:
        self.vecenv.save_worker_trajectories()
        worker_files = sorted(glob.glob(os.path.join(traj_tmp, "traj_worker_*.npz")))
        if worker_files:
            all_traj = {}
            map_files = None
            world_mean = None
            for f in worker_files:
                d = np.load(f, allow_pickle=True)
                for k in ("x", "y", "z", "heading", "lengths", "map_ids"):
                    if k in d:
                        all_traj.setdefault(k, []).append(d[k])
                if map_files is None and "map_files" in d:
                    map_files = d["map_files"]
                if world_mean is None and "world_mean" in d:
                    world_mean = d["world_mean"]
            for k, v in all_traj.items():
                key = f"traj_{k}" if k in ("x", "y", "z", "heading", "lengths") else k
                data[key] = np.concatenate(v)

Copilot AI Apr 11, 2026


The multiprocessing stitching path doesn’t include agent_offsets, which render_npz() requires to slice per-env episodes. Simply concatenating per-worker (agent-local) offsets wouldn’t be correct anyway; you likely need to (a) collect each worker’s agent_offsets, (b) shift them by the cumulative agent count, and (c) concatenate to produce a global agent_offsets aligned with the concatenated traj_* arrays.
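
A hedged numpy sketch of the stitch this comment describes (the per-worker offset layout — start index per sub-env, no trailing sentinel — is an assumption for illustration):

```python
import numpy as np

def stitch_agent_offsets(worker_offsets, worker_agent_counts):
    """Shift each worker's agent-local offsets by the running agent
    count, then concatenate into a global agent_offsets array aligned
    with the concatenated traj_* arrays. Hypothetical helper."""
    out, base = [], 0
    for offs, n_agents in zip(worker_offsets, worker_agent_counts):
        out.append(np.asarray(offs) + base)
        base += n_agents
    return np.concatenate(out)

# Worker 0: envs start at agents 0 and 3 (5 agents total).
# Worker 1: envs start at agents 0 and 2 (4 agents total).
g = stitch_agent_offsets([[0, 3], [0, 2]], [5, 4])
assert g.tolist() == [0, 3, 5, 7]
```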

Comment thread pufferlib/pufferl.py
Comment on lines +721 to +726
traj_tmp = getattr(driver_env, "_traj_save_dir", None) if driver_env else None
if traj_tmp:
    self.vecenv.save_worker_trajectories()
    worker_files = sorted(glob.glob(os.path.join(traj_tmp, "traj_worker_*.npz")))
    if worker_files:
        all_traj = {}

Copilot AI Apr 11, 2026


traj_tmp is a shared directory (.../traj_tmp) that is never cleaned, and glob(traj_worker_*.npz) will pick up stale worker files (e.g., if a previous run used more workers, or a crashed worker left an old file). Consider writing into an epoch-scoped subdir, or deleting existing traj_worker_*.npz files before triggering save_worker_trajectories(), and/or validating the expected worker count before stitching.

Comment on lines +123 to +125
PyErr_Fetch(&type, &value, &tb);
PyErr_Format(PyExc_TypeError, "%s: %s", name,
value ? PyUnicode_AsUTF8(PyObject_Str(value)) : "type/shape mismatch");

Copilot AI Apr 11, 2026


as_array() leaks a reference: PyObject_Str(value) creates a new object that isn’t DECREF’d (it’s passed directly into PyUnicode_AsUTF8(...)). Store the PyObject_Str result in a temporary, use it for formatting, then Py_DECREF it to avoid per-call leaks on shape/type errors.

Suggested change
PyErr_Fetch(&type, &value, &tb);
PyErr_Format(PyExc_TypeError, "%s: %s", name,
             value ? PyUnicode_AsUTF8(PyObject_Str(value)) : "type/shape mismatch");
PyObject *value_str = NULL;
const char *message = "type/shape mismatch";
PyErr_Fetch(&type, &value, &tb);
if (value) {
    value_str = PyObject_Str(value);
    if (value_str) {
        const char *utf8 = PyUnicode_AsUTF8(value_str);
        if (utf8)
            message = utf8;
    }
}
PyErr_Format(PyExc_TypeError, "%s: %s", name, message);
Py_XDECREF(value_str);

d->rs.polygonMode = VK_POLYGON_MODE_FILL;
d->rs.cullMode = VK_CULL_MODE_NONE;
d->rs.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
d->rs.lineWidth = 1.5f; /* used for line topology only; ignored for tris */

Copilot AI Apr 11, 2026


rs.lineWidth is set to 1.5f, but Vulkan requires the wideLines device feature to be enabled for line widths != 1.0. Since vk_ctx_init() doesn’t enable wideLines, this can trigger validation errors or pipeline creation failure on some devices. Either set the line width back to 1.0 or explicitly query+enable VkPhysicalDeviceFeatures::wideLines when supported.

Suggested change
d->rs.lineWidth = 1.5f; /* used for line topology only; ignored for tris */
d->rs.lineWidth = 1.0f; /* Vulkan-safe default; wider lines require wideLines */

Comment on lines +434 to +438
static void record_view(VkCommandBuffer cmd, Renderer *r, FrameSlot *slot, RenderTarget *rt, const Mat4 *mvp,
uint32_t num_instances) {
barrier_image(cmd, rt->color.image, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
VK_PIPELINE_STAGE_2_TOP_OF_PIPE_BIT, 0, VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT);

Copilot AI Apr 11, 2026


The image layout barrier uses oldLayout = VK_IMAGE_LAYOUT_UNDEFINED every frame. After the first frame the image will actually be in VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL (set later in the same command buffer), so this will trip validation and is undefined behavior. Track the current layout per render target (or always transition from TRANSFER_SRC_OPTIMAL back to COLOR_ATTACHMENT_OPTIMAL after the copy) and use the real old layout in the next frame’s barrier.

Comment on lines +1010 to +1016
PyArrayObject *x_arr = (PyArrayObject *)PyTuple_GetItem(args, 1);
PyArrayObject *y_arr = (PyArrayObject *)PyTuple_GetItem(args, 2);
PyArrayObject *z_arr = (PyArrayObject *)PyTuple_GetItem(args, 3);
PyArrayObject *heading_arr = (PyArrayObject *)PyTuple_GetItem(args, 4);
PyArrayObject *lengths_arr = (PyArrayObject *)PyTuple_GetItem(args, 5);
int ep_len = (int)PyLong_AsLong(PyTuple_GetItem(args, 6));


Copilot AI Apr 11, 2026


vec_get_sim_trajectories casts tuple items directly to PyArrayObject* and calls PyArray_DATA without validating they’re actually NumPy arrays / contiguous / expected dtype+shape. Other vec_* helpers in this file do PyArray_Check and shape-derived stride math. Add PyArray_Check (and ideally dtype/ndim checks) here too to prevent segfaults on misuse from Python.

Comment thread pufferlib/vector.py
Comment on lines +546 to +556
def save_worker_trajectories(self):
    """Trigger every worker to call env.notify(), then block until all finish.

    Used by PuffeRL.save_trajectories() to fan out a trajectory-save request
    across workers. Each worker's env.notify() writes a per-worker npz and
    clears its own notify flag; we spin until all flags are down.
    """
    self.buf["notify"][:] = True
    while any(self.buf["notify"]):
        time.sleep(0.01)


Copilot AI Apr 11, 2026


save_worker_trajectories() spins indefinitely using Python’s any(self.buf['notify']) over a NumPy array. This is both slower than self.buf['notify'].any() and can hang forever if a worker dies or never clears its flag. Consider using np.any(...)/.any() plus a timeout (and surfacing an error) to avoid deadlocking the training loop.

Comment thread setup.py
Comment on lines +27 to +30
# Opt-in: TRAJVIZ=1 builds the Vulkan trajectory renderer as a CPython
# extension. Requires libvulkan-dev + glslang-tools (apt). See
# docs/trajviz.md for installation. Default off — most users don't need it.
TRAJVIZ = os.getenv("TRAJVIZ", "0") == "1"

Copilot AI Apr 11, 2026


This comment points users to docs/trajviz.md, but the documentation added in this PR lives under docs/src/trajviz.md. Update the path so the install instructions are discoverable from the repo layout.

if (!traj_xyh || !vert_offsets || !poly_meta_offsets || !poly_type_offsets || !agent_lengths) {
snprintf(ctx->last_error, sizeof(ctx->last_error), "null required pointer to render_episodes_batch");
return TRAJVIZ_ERR_BAD_ARG;
}

Copilot AI Apr 11, 2026


trajviz_render_episodes_batch allows all_road_offsets / all_road_types to be NULL (they’re not included in the required-pointer check), but later computes num_polys_s from poly_meta_offsets and can call vk_batch_renderer_set_episode with num_polys_s > 0 and off_s/typ_s == NULL, which is likely to crash. Either require these pointers when any episode has polylines, or validate per-episode and force num_polys_s=0 when offsets/types are absent.

Suggested change
    }
}
{
    int any_episode_has_polylines = 0;
    for (int s = 0; s < batch_size; ++s) {
        if (poly_meta_offsets[s + 1] > poly_meta_offsets[s]) {
            any_episode_has_polylines = 1;
            break;
        }
    }
    if (any_episode_has_polylines && (!all_road_offsets || !all_road_types)) {
        snprintf(ctx->last_error, sizeof(ctx->last_error),
                 "road offset/type arrays are required when any episode has polylines");
        return TRAJVIZ_ERR_BAD_ARG;
    }
}

const uint32_t *off_s = (all_road_offsets && num_polys_plus_1 > 0) ? (all_road_offsets + pm_start) : NULL;

uint32_t pt_start = poly_type_offsets[s];
const uint32_t *typ_s = (all_road_types && num_polys_s > 0) ? (all_road_types + pt_start) : NULL;

Copilot AI Apr 11, 2026


The per-episode road slicing can yield num_polys_s > 0 while off_s/typ_s are NULL (because all_road_offsets / all_road_types are treated as optional). Before calling vk_batch_renderer_set_episode, add a consistency check that offsets/types are present whenever num_polys_s > 0, and fail with a clear error if not.

Suggested change
const uint32_t *typ_s = (all_road_types && num_polys_s > 0) ? (all_road_types + pt_start) : NULL;
if (num_polys_s > 0 && (!off_s || !typ_s)) {
    snprintf(ctx->last_error, sizeof(ctx->last_error),
             "episode %d has %u road polygons but missing %s%s",
             s, num_polys_s,
             !off_s ? "road offsets" : "",
             (!off_s && !typ_s) ? " and road types" : (!typ_s ? "road types" : ""));
    err = TRAJVIZ_ERR_BAD_ARG;
    goto cleanup;
}

Adds an env-var encoder selector to ffmpeg_pipe.c with two choices:

  TRAJVIZ_ENCODER unset (default)        → libx264 -preset veryfast
  TRAJVIZ_ENCODER=nvenc / h264_nvenc     → h264_nvenc -preset p4

libx264 stays the default — counter-intuitively NVENC turned out to be
the wrong fit for trajviz's "spawn one ffmpeg subprocess per output MP4
per render" architecture. Three reasons measured empirically on RTX 4080
+ 16-core CPU:

1. NVENC session creation is ~100 ms per session and we spawn 2N
   ffmpeg processes per render_batch call. For short episodes the
   per-session startup tax dominates wall time.

2. The driver still throttles concurrent NVENC sessions per process
   ("incompatible client key (21)") at batch_size ≥ 8 even though the
   consumer-card cap was officially removed in driver 530+.

3. In steady state, libx264 -preset veryfast and NVENC -preset p4 are
   tied per-frame at 720p (~2.3 ms/frame either way). 16 parallel
   libx264 instances on 16 cores out-throughputs a single NVENC engine
   serializing 16 streams.

Per-episode wall time, libx264 vs nvenc, both views, 1280x720:

   batch_size   T=90      T=500       T=1000
   1            350/790   1162/1540   2203/2284
   4            273/815   1139/1442   5157/5432

NVENC closes the startup gap with longer episodes but never wins on
this hardware. Real NVENC throughput unlocks would require either a
single long-lived ffmpeg with multi-input/multi-output or direct
libnvidia-encode integration with VK_KHR_external_memory_fd — both
larger refactors than v1.

`TRAJVIZ_ENCODER=nvenc` remains as a one-line opt-in for users who
want to experiment or have a single-stream long-episode workload
where the math flips.

docs/src/trajviz.md gets a "Choosing an encoder" section with the
empirical table and the architecture explanation, plus the env var is
added to the debugging knobs list.
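
The session-startup tradeoff above can be put into a toy wall-time model. The numbers are the empirical ones quoted in this commit message (100 ms/session, ~2.3 ms per 720p frame for either encoder); the model itself is a simplification, not a measurement:

```python
def batch_wall_ms(batch_size, frames, session_ms, ms_per_frame,
                  parallel_streams):
    """Toy model: each episode spawns two ffmpeg processes (two views),
    each paying a per-session startup cost; encoding throughput is
    divided across parallel_streams."""
    sessions = 2 * batch_size
    encode = sessions * frames * ms_per_frame / parallel_streams
    return sessions * session_ms + encode

# Short episodes: parallel libx264 pays no session tax, while a single
# NVENC engine pays 100 ms x 2N and serializes the streams.
x264 = batch_wall_ms(4, 90, session_ms=0, ms_per_frame=2.3,
                     parallel_streams=8)
nvenc = batch_wall_ms(4, 90, session_ms=100, ms_per_frame=2.3,
                      parallel_streams=1)
assert nvenc > x264
```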
Each Drive sub-env in a vec computes its own world_mean in set_means()
from its own map's road + agent points, so different maps in a
num_maps>1 vec have different world_means. Empirically these can
differ by 10+ km in source-Waymo coordinates across maps from
different cities.

The previous code had a misleading comment in env_binding.h's
vec_get_world_mean ("All envs in a vec share the same map-centering
convention so env 0 is representative") and saved a single world_mean
(env 0's) into trajectories_*.npz. Any offline tool that loaded a
non-env-0 sub-env's source map and tried to align it with that env's
trajectory was off by (this env's world_mean − env 0's world_mean) —
silently rendering roads in the wrong place.

Fixes:

env_binding.h
  - Replace the misleading comment on vec_get_world_mean with one
    that explains the env-0-only nature and points at the new fn
  - Add vec_get_all_world_means(c_envs, out) that fills a
    (num_envs, 3) float32 array with each sub-env's world_mean

drive.py
  - Drive.get_world_means() Python wrapper, returns (num_envs, 3)
  - Drive.notify() now saves world_means (plural, per-env) into the
    per-worker npz, in addition to the legacy world_mean (singular)

pufferl.py
  - PuffeRL.save_trajectories concatenates world_means across the
    per-worker npz files (matching how it concatenates map_ids /
    agent_offsets) and saves it as a (total_envs, 3) array
  - The serial / native PufferEnv path also saves world_means via
    driver_env.get_world_means()

trajviz/__init__.py
  - render_npz prefers world_means (plural) when present and looks up
    the right per-env value via the env loop. Falls back to the
    legacy single world_mean key with a printed warning when older
    npz files are loaded — those files render with mis-aligned roads
    for non-env-0 sub-envs (but at least they render).
  - Map cache key now includes the world_mean tuple, not just map_id,
    so future heterogeneous-init_mode setups can't trip on a stale
    cached entry.

docs/src/trajviz.md
  - New "Per-env world_means" item in the known limitations section
    explaining the schema change and the back-compat behavior

Verified empirically: Drive(num_maps=3, use_all_maps=True) loaded
map_001.bin / map_500.bin / map_900.bin, the three sub-envs returned
world_means (420, -11042, -193), (-1737, -11850, -60), and (5950,
1706, 3) respectively — max diff 12,748 m.
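
A hedged numpy sketch of the per-env alignment this commit describes: pick the right world_means row for a sub-env and lift its sim-frame trajectory back into the source-map frame. Shapes and the additive convention are assumptions for illustration:

```python
import numpy as np

def to_map_frame(traj_xyz, env_idx, world_means):
    """traj_xyz: (num_steps, 3) sim-frame points for one sub-env.
    world_means: (num_envs, 3) per-env offsets, as saved by notify().
    Hypothetical helper."""
    return traj_xyz + world_means[env_idx]

# Two sub-envs with world_means that differ by kilometers (as measured
# in the verification note above).
world_means = np.array([[420.0, -11042.0, -193.0],
                        [-1737.0, -11850.0, -60.0]], np.float32)
traj = np.zeros((4, 3), np.float32)
aligned = to_map_frame(traj, 1, world_means)
assert np.allclose(aligned[0], [-1737.0, -11850.0, -60.0])
```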