[WIP] Trajviz: Vulkan offline trajectory renderer #398
eugenevinitsky wants to merge 5 commits into 3.0 from
Conversation
Ported from ev/yolo (~6 commits) as a single clean change, skipping debug
commits and the per-component reward logging / partner obs changes that
were tangled into the same branch history.
## C side (drive.h, datatypes.h, env_binding.h)
- Agent struct gets four float* buffers (sim_traj_x/y/z/heading) sized to
episode_length. Allocated in init() after set_active_agents, freed in
free_agent.
- c_step writes the post-move_dynamics state into position t = timestep-1.
Cheap: 4 float copies per agent per step.
- c_get_sim_trajectories(env, x, y, z, heading, lengths, ep_len) copies all
active agents' buffers into output arrays. lengths[i] = env->timestep so
callers know how much of the buffer is valid for the current episode.
- vec_get_sim_trajectories: Python-facing wrapper that iterates sub-envs
with agent-offset accounting.
- vec_get_world_mean: (x, y, z) tuple from env 0. Used by save_trajectories
to lift sim coords back into the source map frame for offline rendering.
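A minimal sketch of the per-step recording described above. The buffer names come from this list; the loop structure and the index names (num_active_agents, active_agent_indices) are assumptions about drive.h:

```c
/* Inside c_step, after move_dynamics: record the post-step pose into
 * slot t = timestep - 1. Four float copies per agent per step. */
int t = env->timestep - 1;
for (int i = 0; i < env->num_active_agents; i++) {
    Agent *a = &env->agents[env->active_agent_indices[i]]; /* hypothetical names */
    a->sim_traj_x[t]       = a->x;
    a->sim_traj_y[t]       = a->y;
    a->sim_traj_z[t]       = a->z;
    a->sim_traj_heading[t] = a->heading;
}
```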
## Python side (drive.py)
- Drive.__init__ accepts traj_save_dir kwarg; stores it along with a
_worker_idx slot that vector.py fills in for multiprocessing workers.
- Cache world_mean after binding.vectorize (and on resample_maps).
- get_sim_trajectories() allocates output arrays and calls the binding.
- notify() writes a per-worker traj_worker_{idx}.npz containing the
sim trajectories, map_ids, agent_offsets, map_files, world_mean.
Called from workers via the shm notify mechanism.
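A sketch of the accessor, assuming output shapes follow from the C API above; attribute names like self.num_agents, self.episode_length, and self.c_envs are assumptions:

```python
import numpy as np

def get_sim_trajectories(self):
    # Allocate (num_agents, episode_length) float32 outputs plus a per-agent
    # valid-length vector, then let the C binding fill them in one call.
    n, T = self.num_agents, self.episode_length
    x, y, z, heading = (np.zeros((n, T), dtype=np.float32) for _ in range(4))
    lengths = np.zeros(n, dtype=np.int32)
    binding.vec_get_sim_trajectories(self.c_envs, x, y, z, heading, lengths, T)
    return x, y, z, heading, lengths
```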
## Multiprocessing plumbing (vector.py)
- _worker_process tags each env (or sub-envs of a Serial wrapper) with
_worker_idx after construction so env.notify() knows which file to write.
- Multiprocessing.save_worker_trajectories() sets the notify flag for all
workers and spins until they all clear it (workers respond inside their
step loop).
## Checkpoint integration (pufferl.py)
- save_trajectories() dumps the rolling policy buffers (actions, rewards,
values, logprobs, terminals, truncations) + C-side trajectories + map
context into trajectories_{epoch:06d}.npz. Supports both multiprocessing
(fan out via save_worker_trajectories, stitch worker files) and Serial
(read driver_env directly).
- save_reproducibility() snapshots the compiled .so, key source files,
active config, and git commit/diff on the first checkpoint of a run.
- Both called inside the existing checkpoint block in train().
- train() pre-creates data_dir/traj_tmp and threads traj_save_dir into
args["env"] so workers inherit it automatically via env kwargs.
- Opt-out via `save_trajectories: False` in the train config.
Verified locally: C extension builds, Drive.get_sim_trajectories() returns
correctly-shaped arrays with live sim positions, world_mean binding works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pulled verbatim from ev/yolo (commit af9119a): a jupytext percent-format script that reads the trajectories_<epoch>.npz written by PuffeRL.save_trajectories and renders agent paths on top of the source map. Uses world_mean to align sim coords with the map frame. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A new headless Vulkan-backed renderer that turns saved Drive
trajectories into MP4 videos. Independent of the existing raylib
visualizer (`scripts/build_ocean.sh visualize`) — opt-in via
`TRAJVIZ=1 python setup.py build_ext --inplace`. Optional dependency,
won't affect users who don't need it.
## Public surface
- Python: `pufferlib.ocean.drive.trajviz.Renderer`
- `render_episode(...)` for single-episode rendering
- `render_batch([...])` for multi-episode batched rendering (up to 16)
- `render_npz(path, maps_dir, out_dir)` for saved trajectories_*.npz
- CLI: `python -m pufferlib.ocean.drive.trajviz <inputs> --maps-dir ... --out ...`
- Random-rollout smoke test:
`python -m pufferlib.ocean.drive.trajviz.tools.random_rollout`
- Standalone C harness: `tests/test_main.c` (no Python required)
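Hypothetical usage of that surface (method names come from this PR; the constructor arguments and file names are assumptions):

```python
from pufferlib.ocean.drive.trajviz import Renderer

# Render every episode in a saved checkpoint file to MP4s.
r = Renderer()
r.render_npz("trajectories_000042.npz", maps_dir="data/maps", out_dir="videos")
```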
## Architecture
- `_native.c` CPython extension shell (numpy unwrap, GIL release)
- `trajviz.{h,c}` public C API: render_episode, render_episodes_batch
- `vk_context` VkInstance / VkDevice / queue / command pool
- `vk_pipeline` line + box graphics pipelines, push-constant cameras
- `vk_renderer` single-episode renderer
- `vk_batch_renderer` batched renderer with vertically-tiled atlas
(per-episode tile bytes are row-contiguous in
the readback buffer)
- `ffmpeg_pipe` pipe-to-ffmpeg + per-pipe writer thread for parallel
fan-out fwrites
- `shaders/` GLSL → SPIR-V (compiled at build time, embedded as
uint32_t arrays in a generated `shaders.c`)
## Two views
Matches the existing live raylib path:
- Top-down (RenderView.FULL_SIM_STATE): orthographic full-map
- BEV (RenderView.BEV_AGENT_OBS): agent-centric ~100m × 178m window,
ego at center facing up
## Performance
On RTX 4080 + 16-core CPU, 1280x720 90-frame episodes with both views,
libx264 -preset veryfast:
batch_size=1: 345 ms / ep (2.9 ep/s)
batch_size=4: 1094 ms total (274 ms / ep, 3.7 ep/s)
batch_size=8: 2136 ms total (267 ms / ep, 3.7 ep/s)
Pure Vulkan + readback floor (no encoder): ~30 ms / ep ≈ 32 ep/s.
The remaining gap is libx264 encoding — NVENC would close it.
Key optimizations applied:
- HOST_CACHED readback memory (6-7× win on its own — uncached PCIe
BAR reads were ~250 MB/s; cached RAM reads are >5 GB/s)
- LINE_STRIP polylines (one draw per polyline, not per segment)
- Tiled vertical atlas for batched rendering (single submit per frame
for N episodes; per-tile bytes are row-contiguous for one fwrite
per pipe per frame)
- Threaded fwrites (per-pipe writer threads, parallel fan-out)
- F_SETPIPE_SZ to fit a frame in one pipe buffer
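The HOST_CACHED win comes down to memory-type selection like the following sketch (standard Vulkan API; the helper name and preference order are assumptions, and a HOST_CACHED-but-not-COHERENT type additionally needs vkInvalidateMappedMemoryRanges before CPU reads):

```c
#include <vulkan/vulkan.h>

/* Prefer HOST_VISIBLE | HOST_CACHED (normal cacheable RAM, fast CPU reads)
 * for the readback buffer; fall back to HOST_VISIBLE | HOST_COHERENT
 * (often write-combined PCIe BAR memory on NVIDIA, ~250 MB/s reads). */
static uint32_t pick_readback_memory_type(VkPhysicalDevice phys, uint32_t type_bits) {
    VkPhysicalDeviceMemoryProperties mp;
    vkGetPhysicalDeviceMemoryProperties(phys, &mp);
    const VkMemoryPropertyFlags prefs[] = {
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    };
    for (int p = 0; p < 2; p++)
        for (uint32_t i = 0; i < mp.memoryTypeCount; i++)
            if ((type_bits & (1u << i)) &&
                (mp.memoryTypes[i].propertyFlags & prefs[p]) == prefs[p])
                return i;
    return UINT32_MAX; /* caller treats as allocation failure */
}
```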
## Why Vulkan, not raylib
- Headless on Linux clusters with no X server (raylib needs xvfb)
- Throughput-oriented batching (impossible without command-buffer
control)
- Independent build path so the optional Vulkan dep doesn't pollute
the live drive sim build
## Documentation
`docs/src/trajviz.md` covers prerequisites (apt packages), build,
Python and CLI usage, performance tuning (sysctl knobs, HOST_CACHED
explanation, batch size guidance), debugging env vars, troubleshooting,
and architecture overview.
## Notes
- `pufferlib/ocean/drive/map_io.py` extracted from
`notebooks/visualize_trajectories.py` so trajviz and the notebook
share one .bin map parser. The notebook still works.
- `trajviz/shaders.c` is generated at build time and gitignored.
- Drive sim code (drive.{h,c,py}, binding.c) is untouched.
Pull request overview
Adds an opt-in, headless Vulkan-backed “trajviz” renderer for offline Drive trajectory visualization (MP4 output), and wires up trajectory capture/saving during training checkpoints to feed the renderer—without changing the live raylib visualizer path.
Changes:
- Introduces `pufferlib.ocean.drive.trajviz` (C/Vulkan renderer + CPython extension + CLI/tools), built only when `TRAJVIZ=1`.
- Adds C-side sim-trajectory recording and Python/C bindings to retrieve and save per-checkpoint trajectories (including map context) for offline rendering.
- Adds docs and a shared `.bin` map parser (map_io.py) to support loading and rendering saved trajectories.
Reviewed changes
Copilot reviewed 35 out of 37 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| setup.py | Adds opt-in build path for the trajviz Vulkan CPython extension and shader build step. |
| pufferlib/vector.py | Tags worker envs with _worker_idx; adds save_worker_trajectories() fan-out helper. |
| pufferlib/pufferl.py | Saves trajectories at checkpoint time; adds reproducibility snapshot; threads traj temp dir into env kwargs. |
| pufferlib/ocean/env_binding.h | Adds vectorized bindings to pull sim trajectories and world_mean from C envs. |
| pufferlib/ocean/drive/drive.py | Caches world_mean; adds get_sim_trajectories() and notify() to write per-worker .npz. |
| pufferlib/ocean/drive/drive.h | Allocates/records per-step sim trajectories in C; exposes c_get_sim_trajectories(). |
| pufferlib/ocean/drive/datatypes.h | Extends Agent with sim-trajectory buffers and frees them on teardown. |
| pufferlib/ocean/drive/map_io.py | Adds canonical parser/transform helpers for Drive .bin maps for offline tooling. |
| pufferlib/ocean/drive/trajviz/__init__.py | Python Renderer wrapper + render_npz() convenience loader. |
| pufferlib/ocean/drive/trajviz/__main__.py | CLI entry point to render one or more trajectories_*.npz inputs. |
| pufferlib/ocean/drive/trajviz/_native.c | CPython extension: numpy validation + GIL release around render calls. |
| pufferlib/ocean/drive/trajviz/trajviz.h | Public C API for single-episode and batched rendering. |
| pufferlib/ocean/drive/trajviz/trajviz.c | Orchestrates Vulkan renderers + ffmpeg pipes; implements batch tiling path. |
| pufferlib/ocean/drive/trajviz/vk_context.h | Declares Vulkan instance/device/queue lifecycle and error helpers. |
| pufferlib/ocean/drive/trajviz/vk_context.c | Implements Vulkan init (1.3 + dynamic rendering + sync2) and teardown. |
| pufferlib/ocean/drive/trajviz/vk_pipeline.h | Declares shared pipeline/push-constant and instance formats. |
| pufferlib/ocean/drive/trajviz/vk_pipeline.c | Creates Vulkan graphics pipelines for polylines and agent boxes. |
| pufferlib/ocean/drive/trajviz/vk_renderer.h | Declares single-episode renderer (frame slots, readback, ffmpeg drain). |
| pufferlib/ocean/drive/trajviz/vk_renderer.c | Implements per-frame rendering + readback + ffmpeg write for one episode. |
| pufferlib/ocean/drive/trajviz/vk_batch_renderer.h | Declares batched atlas renderer API. |
| pufferlib/ocean/drive/trajviz/vk_batch_renderer.c | Implements batched atlas rendering + threaded pipe write fan-out. |
| pufferlib/ocean/drive/trajviz/vk_math.h | Adds small mat4 helpers for ortho fit and BEV camera. |
| pufferlib/ocean/drive/trajviz/ffmpeg_pipe.h | Declares ffmpeg subprocess pipe + writer-thread API. |
| pufferlib/ocean/drive/trajviz/ffmpeg_pipe.c | Implements popen-based ffmpeg piping and async writer thread. |
| pufferlib/ocean/drive/trajviz/shaders.h | Declares externs for generated SPIR-V blobs. |
| pufferlib/ocean/drive/trajviz/shaders/build_shaders.sh | Builds GLSL → SPIR-V and generates shaders.c. |
| pufferlib/ocean/drive/trajviz/shaders/polyline.vert | Adds GLSL for road polyline vertex stage. |
| pufferlib/ocean/drive/trajviz/shaders/polyline.frag | Adds GLSL for road polyline fragment stage. |
| pufferlib/ocean/drive/trajviz/shaders/agent_box.vert | Adds GLSL for instanced agent quad expansion. |
| pufferlib/ocean/drive/trajviz/shaders/agent_box.frag | Adds GLSL for flat-colored agent box fragment stage. |
| pufferlib/ocean/drive/trajviz/tools/random_rollout.py | End-to-end smoke test that rolls out Drive and renders via trajviz. |
| pufferlib/ocean/drive/trajviz/tools/__init__.py | Marks tools as a package (module discovery). |
| pufferlib/ocean/drive/trajviz/tests/test_main.c | Standalone C harness to validate Vulkan+ffmpeg path without Python. |
| docs/src/trajviz.md | Adds end-user documentation: build/run/tuning/architecture/troubleshooting. |
| docs/src/SUMMARY.md | Links trajviz documentation into the docs sidebar. |
| notebooks/visualize_trajectories.py | Adds/updates notebook for analyzing and plotting saved trajectories. |
| .gitignore | Ignores generated pufferlib/ocean/drive/trajviz/shaders.c. |
```python
if hasattr(self.vecenv, "save_worker_trajectories"):
    traj_tmp = getattr(driver_env, "_traj_save_dir", None) if driver_env else None
    if traj_tmp:
        self.vecenv.save_worker_trajectories()
        worker_files = sorted(glob.glob(os.path.join(traj_tmp, "traj_worker_*.npz")))
        if worker_files:
            all_traj = {}
            map_files = None
            world_mean = None
            for f in worker_files:
                d = np.load(f, allow_pickle=True)
                for k in ("x", "y", "z", "heading", "lengths", "map_ids"):
                    if k in d:
                        all_traj.setdefault(k, []).append(d[k])
                if map_files is None and "map_files" in d:
                    map_files = d["map_files"]
                if world_mean is None and "world_mean" in d:
                    world_mean = d["world_mean"]
            for k, v in all_traj.items():
                key = f"traj_{k}" if k in ("x", "y", "z", "heading", "lengths") else k
                data[key] = np.concatenate(v)
```
The multiprocessing stitching path doesn’t include agent_offsets, which render_npz() requires to slice per-env episodes. Simply concatenating per-worker (agent-local) offsets wouldn’t be correct anyway; you likely need to (a) collect each worker’s agent_offsets, (b) shift them by the cumulative agent count, and (c) concatenate to produce a global agent_offsets aligned with the concatenated traj_* arrays.
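A sketch of that stitching, assuming each worker's agent_offsets is a fence-post array starting at 0 and ending at that worker's total agent count (variable names hypothetical):

```python
import numpy as np

pieces, agent_base = [], 0
for i, f in enumerate(worker_files):
    offs = np.load(f, allow_pickle=True)["agent_offsets"]  # local: [0, ..., n_agents]
    shifted = offs + agent_base
    pieces.append(shifted if i == 0 else shifted[1:])  # drop the duplicated boundary
    agent_base += int(offs[-1])
data["agent_offsets"] = np.concatenate(pieces)
```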
```python
traj_tmp = getattr(driver_env, "_traj_save_dir", None) if driver_env else None
if traj_tmp:
    self.vecenv.save_worker_trajectories()
    worker_files = sorted(glob.glob(os.path.join(traj_tmp, "traj_worker_*.npz")))
    if worker_files:
        all_traj = {}
```
traj_tmp is a shared directory (.../traj_tmp) that is never cleaned, and glob(traj_worker_*.npz) will pick up stale worker files (e.g., if a previous run used more workers, or a crashed worker left an old file). Consider writing into an epoch-scoped subdir, or deleting existing traj_worker_*.npz files before triggering save_worker_trajectories(), and/or validating the expected worker count before stitching.
```c
PyErr_Fetch(&type, &value, &tb);
PyErr_Format(PyExc_TypeError, "%s: %s", name,
             value ? PyUnicode_AsUTF8(PyObject_Str(value)) : "type/shape mismatch");
```
as_array() leaks a reference: PyObject_Str(value) creates a new object that isn’t DECREF’d (it’s passed directly into PyUnicode_AsUTF8(...)). Store the PyObject_Str result in a temporary, use it for formatting, then Py_DECREF it to avoid per-call leaks on shape/type errors.
Suggested change:

```c
PyObject *value_str = NULL;
const char *message = "type/shape mismatch";
PyErr_Fetch(&type, &value, &tb);
if (value) {
    value_str = PyObject_Str(value);
    if (value_str) {
        const char *utf8 = PyUnicode_AsUTF8(value_str);
        if (utf8)
            message = utf8;
    }
}
PyErr_Format(PyExc_TypeError, "%s: %s", name, message);
Py_XDECREF(value_str);
```
```c
d->rs.polygonMode = VK_POLYGON_MODE_FILL;
d->rs.cullMode = VK_CULL_MODE_NONE;
d->rs.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
d->rs.lineWidth = 1.5f; /* used for line topology only; ignored for tris */
```
rs.lineWidth is set to 1.5f, but Vulkan requires the wideLines device feature to be enabled for line widths != 1.0. Since vk_ctx_init() doesn’t enable wideLines, this can trigger validation errors or pipeline creation failure on some devices. Either set the line width back to 1.0 or explicitly query+enable VkPhysicalDeviceFeatures::wideLines when supported.
Suggested change:

```c
d->rs.lineWidth = 1.0f; /* Vulkan-safe default; wider lines require wideLines */
```
```c
static void record_view(VkCommandBuffer cmd, Renderer *r, FrameSlot *slot, RenderTarget *rt,
                        const Mat4 *mvp, uint32_t num_instances) {
    barrier_image(cmd, rt->color.image, VK_IMAGE_LAYOUT_UNDEFINED,
                  VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
                  VK_PIPELINE_STAGE_2_TOP_OF_PIPE_BIT, 0,
                  VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
                  VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT);
```
The image layout barrier uses oldLayout = VK_IMAGE_LAYOUT_UNDEFINED every frame. After the first frame the image will actually be in VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL (set later in the same command buffer), so this will trip validation and is undefined behavior. Track the current layout per render target (or always transition from TRANSFER_SRC_OPTIMAL back to COLOR_ATTACHMENT_OPTIMAL after the copy) and use the real old layout in the next frame’s barrier.
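One way to apply that suggestion is to track the layout on the render target; the current_layout field is hypothetical:

```c
/* Use the image's real previous layout instead of UNDEFINED (which discards
 * contents and trips validation once the image has been used). */
barrier_image(cmd, rt->color.image,
              rt->current_layout, /* UNDEFINED on frame 0, TRANSFER_SRC_OPTIMAL after */
              VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
              VK_PIPELINE_STAGE_2_TOP_OF_PIPE_BIT, 0,
              VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
              VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT);
rt->current_layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
/* ...and after the copy-out transition, set it to TRANSFER_SRC_OPTIMAL. */
```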
```c
PyArrayObject *x_arr = (PyArrayObject *)PyTuple_GetItem(args, 1);
PyArrayObject *y_arr = (PyArrayObject *)PyTuple_GetItem(args, 2);
PyArrayObject *z_arr = (PyArrayObject *)PyTuple_GetItem(args, 3);
PyArrayObject *heading_arr = (PyArrayObject *)PyTuple_GetItem(args, 4);
PyArrayObject *lengths_arr = (PyArrayObject *)PyTuple_GetItem(args, 5);
int ep_len = (int)PyLong_AsLong(PyTuple_GetItem(args, 6));
```
vec_get_sim_trajectories casts tuple items directly to PyArrayObject* and calls PyArray_DATA without validating they’re actually NumPy arrays / contiguous / expected dtype+shape. Other vec_* helpers in this file do PyArray_Check and shape-derived stride math. Add PyArray_Check (and ideally dtype/ndim checks) here too to prevent segfaults on misuse from Python.
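A sketch of the missing validation (helper name hypothetical; the checks mirror the NumPy C API calls the other vec_* helpers use):

```c
/* Reject anything that isn't a C-contiguous float32 numpy array before
 * touching PyArray_DATA, so misuse raises TypeError instead of segfaulting. */
static PyArrayObject *expect_f32_array(PyObject *obj, const char *name) {
    if (!PyArray_Check(obj)) {
        PyErr_Format(PyExc_TypeError, "%s must be a numpy array", name);
        return NULL;
    }
    PyArrayObject *arr = (PyArrayObject *)obj;
    if (PyArray_TYPE(arr) != NPY_FLOAT32 || !PyArray_IS_C_CONTIGUOUS(arr)) {
        PyErr_Format(PyExc_TypeError, "%s must be a C-contiguous float32 array", name);
        return NULL;
    }
    return arr;
}
```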
```python
def save_worker_trajectories(self):
    """Trigger every worker to call env.notify(), then block until all finish.

    Used by PuffeRL.save_trajectories() to fan out a trajectory-save request
    across workers. Each worker's env.notify() writes a per-worker npz and
    clears its own notify flag; we spin until all flags are down.
    """
    self.buf["notify"][:] = True
    while any(self.buf["notify"]):
        time.sleep(0.01)
```
save_worker_trajectories() spins indefinitely using Python’s any(self.buf['notify']) over a NumPy array. This is both slower than self.buf['notify'].any() and can hang forever if a worker dies or never clears its flag. Consider using np.any(...)/.any() plus a timeout (and surfacing an error) to avoid deadlocking the training loop.
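A sketch of the timeout-guarded version; the timeout value and error type are assumptions:

```python
import time

self.buf["notify"][:] = True
deadline = time.monotonic() + 30.0  # assumed timeout
while self.buf["notify"].any():     # numpy .any(), cheaper than builtin any()
    if time.monotonic() > deadline:
        stuck = int(self.buf["notify"].sum())
        raise RuntimeError(f"{stuck} worker(s) never cleared the notify flag")
    time.sleep(0.01)
```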
```python
# Opt-in: TRAJVIZ=1 builds the Vulkan trajectory renderer as a CPython
# extension. Requires libvulkan-dev + glslang-tools (apt). See
# docs/trajviz.md for installation. Default off — most users don't need it.
TRAJVIZ = os.getenv("TRAJVIZ", "0") == "1"
```
This comment points users to docs/trajviz.md, but the documentation added in this PR lives under docs/src/trajviz.md. Update the path so the install instructions are discoverable from the repo layout.
```c
if (!traj_xyh || !vert_offsets || !poly_meta_offsets || !poly_type_offsets || !agent_lengths) {
    snprintf(ctx->last_error, sizeof(ctx->last_error), "null required pointer to render_episodes_batch");
    return TRAJVIZ_ERR_BAD_ARG;
}
```
trajviz_render_episodes_batch allows all_road_offsets / all_road_types to be NULL (they’re not included in the required-pointer check), but later computes num_polys_s from poly_meta_offsets and can call vk_batch_renderer_set_episode with num_polys_s > 0 and off_s/typ_s == NULL, which is likely to crash. Either require these pointers when any episode has polylines, or validate per-episode and force num_polys_s=0 when offsets/types are absent.
Suggested change (inserted after the existing required-pointer checks):

```c
{
    int any_episode_has_polylines = 0;
    for (int s = 0; s < batch_size; ++s) {
        if (poly_meta_offsets[s + 1] > poly_meta_offsets[s]) {
            any_episode_has_polylines = 1;
            break;
        }
    }
    if (any_episode_has_polylines && (!all_road_offsets || !all_road_types)) {
        snprintf(ctx->last_error, sizeof(ctx->last_error),
                 "road offset/type arrays are required when any episode has polylines");
        return TRAJVIZ_ERR_BAD_ARG;
    }
}
```
```c
const uint32_t *off_s = (all_road_offsets && num_polys_plus_1 > 0) ? (all_road_offsets + pm_start) : NULL;

uint32_t pt_start = poly_type_offsets[s];
const uint32_t *typ_s = (all_road_types && num_polys_s > 0) ? (all_road_types + pt_start) : NULL;
```
The per-episode road slicing can yield num_polys_s > 0 while off_s/typ_s are NULL (because all_road_offsets / all_road_types are treated as optional). Before calling vk_batch_renderer_set_episode, add a consistency check that offsets/types are present whenever num_polys_s > 0, and fail with a clear error if not.
Suggested change:

```c
const uint32_t *typ_s = (all_road_types && num_polys_s > 0) ? (all_road_types + pt_start) : NULL;
if (num_polys_s > 0 && (!off_s || !typ_s)) {
    snprintf(ctx->last_error, sizeof(ctx->last_error),
             "episode %d has %u road polygons but missing %s%s",
             s, num_polys_s,
             !off_s ? "road offsets" : "",
             (!off_s && !typ_s) ? " and road types" : (!typ_s ? "road types" : ""));
    err = TRAJVIZ_ERR_BAD_ARG;
    goto cleanup;
}
```
Adds an env-var encoder selector to ffmpeg_pipe.c with two choices:
- `TRAJVIZ_ENCODER` unset (default) → libx264 `-preset veryfast`
- `TRAJVIZ_ENCODER=nvenc` / `h264_nvenc` → h264_nvenc `-preset p4`
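A sketch of the selector; the two flag sets are the ones named above, while the helper name and how the results are spliced into the ffmpeg argv are assumptions:

```c
#include <stdlib.h>
#include <string.h>

/* Choose encoder flags from TRAJVIZ_ENCODER; libx264 stays the default. */
static void pick_encoder(const char **codec, const char **preset) {
    const char *enc = getenv("TRAJVIZ_ENCODER");
    if (enc && (strcmp(enc, "nvenc") == 0 || strcmp(enc, "h264_nvenc") == 0)) {
        *codec = "h264_nvenc"; *preset = "p4";
    } else {
        *codec = "libx264"; *preset = "veryfast";
    }
    /* caller splices these into the ffmpeg argv: -c:v <codec> -preset <preset> */
}
```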
libx264 stays the default — counter-intuitively, NVENC turned out to be the
wrong fit for trajviz's "spawn one ffmpeg subprocess per output MP4 per
render" architecture. Three reasons, measured empirically on an RTX 4080 +
16-core CPU:
1. NVENC session creation is ~100 ms per session and we spawn 2N
ffmpeg processes per render_batch call. For short episodes the
per-session startup tax dominates wall time.
2. The driver still throttles concurrent NVENC sessions per process
("incompatible client key (21)") at batch_size ≥ 8 even though the
consumer-card cap was officially removed in driver 530+.
3. In steady state, libx264 -preset veryfast and NVENC -preset p4 are
tied per-frame at 720p (~2.3 ms/frame either way). 16 parallel
libx264 instances on 16 cores out-throughputs a single NVENC engine
serializing 16 streams.
Per-episode wall time (ms), libx264 / nvenc, both views, 1280x720:

| batch_size | T=90 | T=500 | T=1000 |
|---|---|---|---|
| 1 | 350 / 790 | 1162 / 1540 | 2203 / 2284 |
| 4 | 273 / 815 | 1139 / 1442 | 5157 / 5432 |
NVENC closes the startup gap on longer episodes but never wins on this
hardware. Unlocking real NVENC throughput would require either a single
long-lived ffmpeg with multi-input/multi-output or direct libnvidia-encode
integration with VK_KHR_external_memory_fd — both larger refactors than v1.
`TRAJVIZ_ENCODER=nvenc` remains as a one-line opt-in for users who
want to experiment or have a single-stream long-episode workload
where the math flips.
docs/src/trajviz.md gets a "Choosing an encoder" section with the
empirical table and the architecture explanation, plus the env var is
added to the debugging knobs list.
Each Drive sub-env in a vec computes its own world_mean in set_means()
from its own map's road + agent points, so different maps in a
num_maps>1 vec have different world_means. Empirically these can
differ by 10+ km in source-Waymo coordinates across maps from
different cities.
The previous code had a misleading comment in env_binding.h's
vec_get_world_mean ("All envs in a vec share the same map-centering
convention so env 0 is representative") and saved a single world_mean
(env 0's) into trajectories_*.npz. Any offline tool that loaded a
non-env-0 sub-env's source map and tried to align it with that env's
trajectory was off by (this env's world_mean − env 0's world_mean) —
silently rendering roads in the wrong place.
Fixes:
env_binding.h
- Replace the misleading comment on vec_get_world_mean with one
that explains the env-0-only nature and points at the new fn
- Add vec_get_all_world_means(c_envs, out) that fills a
(num_envs, 3) float32 array with each sub-env's world_mean
drive.py
- Drive.get_world_means() Python wrapper, returns (num_envs, 3)
- Drive.notify() now saves world_means (plural, per-env) into the
per-worker npz, in addition to the legacy world_mean (singular)
pufferl.py
- PuffeRL.save_trajectories concatenates world_means across the
per-worker npz files (matching how it concatenates map_ids /
agent_offsets) and saves it as a (total_envs, 3) array
- The serial / native PufferEnv path also saves world_means via
driver_env.get_world_means()
trajviz/__init__.py
- render_npz prefers world_means (plural) when present and looks up
the right per-env value via the env loop. Falls back to the
legacy single world_mean key with a printed warning when older
npz files are loaded — those files render with mis-aligned roads
for non-env-0 sub-envs (but at least they render).
- Map cache key now includes the world_mean tuple, not just map_id,
so future heterogeneous-init_mode setups can't trip on a stale
cached entry.
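The back-compat lookup described above amounts to something like this sketch (variable names hypothetical):

```python
# Prefer per-env world_means; fall back to the legacy single world_mean.
if "world_means" in data:
    wm = tuple(data["world_means"][env_idx])   # (num_envs, 3) -> this env's row
else:
    wm = tuple(data["world_mean"])             # legacy: env 0's mean
    print("trajviz: legacy npz with a single world_mean; roads may be "
          "misaligned for non-env-0 sub-envs")
map_cache_key = (map_id, wm)                   # world_mean is part of the cache key
```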
docs/src/trajviz.md
- New "Per-env world_means" item in the known limitations section
explaining the schema change and the back-compat behavior
Verified empirically: Drive(num_maps=3, use_all_maps=True) loaded
map_001.bin / map_500.bin / map_900.bin, the three sub-envs returned
world_means (420, -11042, -193), (-1737, -11850, -60), and (5950,
1706, 3) respectively — max diff 12,748 m.
## Summary

A new headless Vulkan-backed renderer that turns saved Drive trajectories into MP4 videos. Independent of the existing raylib visualizer (`scripts/build_ocean.sh visualize`) — opt-in via `TRAJVIZ=1 python setup.py build_ext --inplace`. Drive sim code (`drive.{h,c,py}`, `binding.c`) is untouched.

Status: WIP. Builds, runs, renders correctly on real Drive sims. There is more to do (see below) before this is merge-ready.
## Public surface

- `Renderer.render_episode(...)` and `Renderer.render_batch([...])` (up to 16 episodes per batch)
- CLI: `python -m pufferlib.ocean.drive.trajviz <inputs> --maps-dir ... --out ...`
- Random-rollout smoke test: `python -m pufferlib.ocean.drive.trajviz.tools.random_rollout`
- Standalone C harness: `pufferlib/ocean/drive/trajviz/tests/test_main.c` (no Python required)

## Performance
On RTX 4080 + 16-core CPU, 1280×720, 90-frame episodes, both views, libx264 `-preset veryfast`:

| batch_size | total | per episode | throughput |
|---|---|---|---|
| 1 | 345 ms | 345 ms | 2.9 ep/s |
| 4 | 1094 ms | 274 ms | 3.7 ep/s |
| 8 | 2136 ms | 267 ms | 3.7 ep/s |

Pure Vulkan + readback floor (encoder bypassed) is ~30 ms / ep ≈ 32 ep/s. The remaining gap is libx264 — NVENC integration (deferred) would close it.
Headline optimizations applied during development:

- `HOST_CACHED` readback memory — single biggest win (~6-7×). Default `HOST_VISIBLE | HOST_COHERENT` on NVIDIA picks write-combined PCIe BAR memory, fast for the GPU but ~250 MB/s for CPU reads. `HOST_CACHED` puts it in regular RAM (>5 GB/s).
- `LINE_STRIP` polylines — one draw call per polyline instead of per segment.
- Vertically tiled atlas for batched rendering (single submit per frame for N episodes; one `fwrite` per pipe per frame, no row stitching).
- Threaded `fwrite` fan-out — each ffmpeg pipe owns a writer thread; the per-frame write phase costs max(single fwrite) instead of sum(fwrites).
- `F_SETPIPE_SZ` — bumps the kernel pipe buffer up to whatever the per-process limit allows so fwrites don't ping-pong on a 64 KB pipe.

## Architecture
Two views, matching the existing live raylib path:

- `RenderView.FULL_SIM_STATE` — orthographic full-map
- `RenderView.BEV_AGENT_OBS` — agent-centric ~100m × 178m window, ego at center facing up

Documentation in `docs/src/trajviz.md` covers prerequisites (apt packages), build, Python and CLI usage, performance tuning (sysctl knobs, `HOST_CACHED` explanation, batch size guidance), debugging env vars, troubleshooting, and architecture overview.

## Why Vulkan, not raylib

- Headless on Linux clusters with no X server (raylib needs xvfb)
- Throughput-oriented batching (impossible without command-buffer control)
- Independent build path so the optional Vulkan dep doesn't pollute the live drive sim build
## Why WIP — known gaps before merge

- Only the active (controlled) agents recorded by `get_sim_trajectories` are rendered. The other ~18 vehicles in a typical Waymo scenario (the WOSAC "context" tracks) aren't shown. Adding them needs a separate Drive API to expose expert trajectories.
- `RenderView.AGENT_PERSP` (3D car meshes from `.glb`) is not implemented in trajviz. Top-down + BEV only.
- `batch_size` capped at 16 — atlas image height grows linearly. Past ~22 we'd need multiple atlas passes or a 2-D tile grid.
- One shared `num_steps` per batch — short episodes get padded with zeros. Wastes a tiny bit of GPU work on the trailing zeros.
- ~50 lines are duplicated between `vk_renderer.c` and `vk_batch_renderer.c`. Should be consolidated into a shared helper before merge.

## Test plan
- `TRAJVIZ=1 python setup.py build_ext --inplace` — clean build, no warnings beyond the standard `PyCFunction` cast
- `random_rollout.py` — 90-frame MP4 with valid h264, 1280×720, both views
- Rendered `trajectories_*.npz` from a real training run end-to-end via `render_npz`
- Verified the `HOST_CACHED` fallback works
- Keep building with `TRAJVIZ=1` so the extension keeps compiling
- Re-verify `notebooks/visualize_trajectories.py` once `map_io.py` is the canonical parser

🤖 Generated with Claude Code