
Refactor macOS zerocopy normalization boundary #56

Open

yhs0602 wants to merge 9 commits into main from codex/zerocopy-raw-normalize

Conversation

Owner

@yhs0602 yhs0602 commented Mar 19, 2026

Summary

  • keep the macOS zerocopy transport tensor internal and stop exposing the clone workaround as the bridge contract
  • replace the public ZEROCOPY path on macOS with a native Metal normalize op that writes an allocator-managed MPS tensor in RGB + flipped layout
  • fix the Apple bridge tensor metadata to describe the underlying IOSurface correctly as a 3-channel BGR tensor with explicit row/pixel strides
  • preserve the Python fallback normalization path when the native helper is unavailable
  • add regression tests covering the native helper path and the Python fallback path
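
The fallback path mentioned above amounts to a channel reorder plus a vertical flip. A minimal sketch, using NumPy in place of the real torch/MPS tensors (the function name is illustrative, not the PR's actual helper):

```python
import numpy as np

def normalize_zerocopy_frame(raw_bgr: np.ndarray) -> np.ndarray:
    """Reorder BGR -> RGB and flip the y axis, copying out of the raw buffer."""
    rgb = raw_bgr[:, :, [2, 1, 0]]   # fancy indexing copies out of the transport buffer
    return rgb[::-1, :, :].copy()    # flip rows; copy() detaches the flipped view

# 2x2 BGR frame as the transport tensor might store it
raw = np.array(
    [[[10, 20, 30], [40, 50, 60]],
     [[70, 80, 90], [100, 110, 120]]],
    dtype=np.uint8,
)
normalized = normalize_zerocopy_frame(raw)
```

The same two operations are what the native Metal kernel performs on-device, so the fallback and native paths should agree bit-for-bit on uint8 input.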

Verification

  • uv run --extra test pytest tests/python/unit/test_observation_converter.py -v
  • python -m pip install -e . --no-build-isolation

Notes

  • a stable public raw zerocopy mode is intentionally not exposed yet because the upstream PyTorch MPS allocator contract for external buffers is still insufficient for view-safe raw tensor usage
  • importing the built extension locally is still blocked by the existing @rpath/libomp.dylib runtime issue on this machine

Summary by CodeRabbit

  • Documentation

    • Added design spec and implementation plan for macOS zerocopy observation handling.
  • New Features

    • macOS native-backed normalization to ensure consistent RGB + vertical-flip outputs.
  • Refactor

    • Explicit separation between internal raw zerocopy tensor state and normalized public observation boundary.
  • Tests

    • Unit tests validating zerocopy initialization, caching, internal type tracking, and consistent normalized outputs.

@yhs0602 yhs0602 requested a review from Copilot March 19, 2026 14:05

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a native Metal normalization pipeline and refactors zerocopy handling so raw Apple tensors are retained internally and normalized at the public conversion boundary; updates the Python observation converter and JAX paths, adds unit tests for macOS/JAX flows, and integrates Apple C++/Metal normalization into the build.

Changes

  • Documentation & Planning (docs/superpowers/plans/2026-03-19-zerocopy-raw-normalize.md, docs/superpowers/specs/2026-03-19-zerocopy-raw-normalize-design.md)
    Adds plan and design spec describing separation of raw zerocopy tensor state from normalized public observation state, task breakdowns, testing targets, and success criteria.
  • Python: Observation Converter (src/craftground/environment/observation_converter.py)
    Adds the _apple_normalize_tensor import with an ImportError fallback; introduces Apple/CUDA and JAX normalization helper methods; replaces observation_tensor_type bookkeeping with internal_type; routes zerocopy/JAX conversions through the helpers and fixes the JAX fallback.
  • Unit Tests (tests/python/unit/test_observation_converter.py)
    Adds tests stubbing the native module and jax to validate raw caching, internal_type updates, native-normalize invocation when present, fallback normalization, and caching across repeated calls.
  • Build Configuration (CMakeLists.txt)
    Adds src/cpp/ipc_apple_ops.mm to the Apple build sources.
  • C++ IPC Module, cross-platform (src/cpp/ipc.cpp, src/cpp/ipc.h)
    Exports the new Python-callable craftground_native.normalize_apple_mtl_tensor(py_obj), declared in the header; on non-Apple builds it returns None.
  • Apple IPC core and header changes (src/cpp/ipc_apple.h, src/cpp/ipc_apple.mm)
    Declares normalize_apple_mtl_tensor_impl; updates DLPack creation to set strides and a 3-channel layout; improves DLTensor cleanup and bytesPerRow handling.
  • Apple Metal Normalization Implementation (src/cpp/ipc_apple_ops.mm)
    New Objective-C++ file implementing a Metal compute pipeline (flip + BGR→RGB); validates MPS uint8 tensors, dispatches the kernel, and exposes normalize_apple_mtl_tensor_impl to Python. Important review area: Metal pipeline, buffer/stride correctness, and synchronization.
  • Vendored Library Update (pybind11)
    Updates the pybind11 submodule commit pointer used by the project (submodule revision bump).

Sequence Diagram(s)

sequenceDiagram
  participant Python as Python (ObservationConverter)
  participant Native as craftground_native (pybind IPC)
  participant Metal as Metal Compute (MTL)

  Python->>Native: initialize_from_mach_port(ipc_handle)
  Native-->>Python: raw MTL-backed uint8 tensor
  Python->>Python: store raw tensor (last_observations[0]), set internal_type=APPLE_TENSOR
  Python->>Native: normalize_apple_mtl_tensor(raw_tensor)
  alt Native (Apple build) returns tensor
    Native->>Metal: dispatch Metal kernel (flip + BGR->RGB)
    Metal-->>Native: normalized RGB uint8 tensor
    Native-->>Python: normalized tensor
  else Native returns None / absent
    Python->>Python: fallback normalize (_normalize_apple_zerocopy_tensor)
    Python-->>Python: produce normalized tensor
  end
  Python-->>Client: return normalized observation

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble bytes from Apple’s stream,

Raw and cozy, kept unseen.
When callers knock I flip and mend,
BGR hops home to RGB blend.
Tests bound in—our pipeline gleams.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title 'Refactor macOS zerocopy normalization boundary' accurately and specifically describes the main architectural change of moving the normalization logic to a boundary and exposing raw internal tensors differently.
  • Description check: ✅ Passed. The PR description is substantially complete, with clear Summary, Verification, and Notes sections covering the main changes, testing approach, and important context about upstream limitations.



Copilot AI left a comment


Pull request overview

Refactors the macOS zerocopy observation flow so raw BGRA/BGR + unflipped transport tensors are kept internally, while the public ScreenEncodingMode.ZEROCOPY output remains normalized (RGB + flipped) at the converter boundary. This also updates internal tensor-type bookkeeping and adds targeted unit/regression coverage plus design/plan docs.

Changes:

  • Move Apple/CUDA zerocopy normalization into dedicated helper(s) at the convert() return boundary.
  • Fix zerocopy/JAX internal tensor type tracking by consistently using internal_type.
  • Add a macOS zerocopy regression test and document the design/implementation plan.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/craftground/environment/observation_converter.py Stores raw zerocopy tensors internally; normalizes only at the public return boundary; updates internal type tracking.
tests/python/unit/test_observation_converter.py Adds a unit test that asserts raw storage + normalized output for the macOS (mach-port) zerocopy path.
docs/superpowers/specs/2026-03-19-zerocopy-raw-normalize-design.md Captures the rationale and intended boundary between raw transport vs normalized public output.
docs/superpowers/plans/2026-03-19-zerocopy-raw-normalize.md Records the implementation/testing checklist used to land the refactor safely.


Comment on lines 305 to 307
self.last_observations[0] = rgb_array_or_tensor
# drop alpha, flip y axis, and clone
rgb_array_or_tensor = rgb_array_or_tensor.clone()[:, :, [2, 1, 0]].flip(0)
Comment on lines 318 to 322
jax_image = jnp.from_dlpack(cuda_dlpack)
rgb_array_or_tensor = jax_image
rgb_array_or_tensor = rgb_array_or_tensor.clone()[:, :, [2, 1, 0]].flip(0)
self.observation_tensor_type = ObservationTensorType.JAX_NP
self.internal_type = ObservationTensorType.JAX_NP
return jax_image, None
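
The branch quoted above calls torch-only methods (.clone(), .flip()) on a JAX array, which raises AttributeError at runtime. A sketch of array-API-safe equivalents, shown with NumPy so it runs anywhere (jax.numpy mirrors this API; swap the import for `import jax.numpy as jnp` on a JAX install):

```python
import numpy as np  # stand-in for jax.numpy

def normalize_jax_frame(bgra):
    # Drop the alpha channel (matching the CUDA path's [:, :, :3] semantics).
    # If the source were BGR-ordered instead, the reorder variant would be
    # bgra[..., [2, 1, 0]] -- which slice is correct depends on the transport layout.
    rgb = bgra[:, :, :3]
    # Flip the y axis with jnp.flip / np.flip rather than torch's .flip(0).
    return np.flip(rgb, axis=0)

frame = np.arange(2 * 2 * 4, dtype=np.uint8).reshape(2, 2, 4)
out = normalize_jax_frame(frame)
```

Both operations are pure functions on JAX arrays, so no clone() is needed: JAX arrays are immutable and slicing already produces a new value.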

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
docs/superpowers/plans/2026-03-19-zerocopy-raw-normalize.md (1)

13-13: Minor: Use official "macOS" spelling.

Apple's operating system is officially spelled "macOS" (lowercase "mac").

📝 Suggested fix
-## Chunk 1: MacOS Zerocopy Boundary
+## Chunk 1: macOS Zerocopy Boundary
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-03-19-zerocopy-raw-normalize.md` at line 13,
Update the header and any occurrences that use "MacOS" to the official Apple
capitalization "macOS" (e.g., change the heading "## Chunk 1: MacOS Zerocopy
Boundary" to "## Chunk 1: macOS Zerocopy Boundary" and update any other
instances of "MacOS" in this document to "macOS").
tests/python/unit/test_observation_converter.py (1)

21-24: Consider adding error path coverage.

The current test covers the happy path well. Consider adding tests for edge cases in follow-up work:

  • Error handling when initialize_from_mach_port returns None
  • CUDA zerocopy path (different handle length)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/python/unit/test_observation_converter.py` around lines 21 - 24, Add
test cases in tests/python/unit/test_observation_converter.py to cover failure
and CUDA zerocopy branches: create a fake_native where initialize_from_mach_port
returns None and assert the converter (the code exercised by the existing test)
handles/raises the expected error or returns the expected fallback; and create
another fake_native where mtl_tensor_from_cuda_mem_handle receives a CUDA handle
of a different length to exercise the CUDA zerocopy path and assert the
converter's behavior. Reference the existing fake_native SimpleNamespace and the
initialize_from_mach_port and mtl_tensor_from_cuda_mem_handle call sites to
implement these two new tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/craftground/environment/observation_converter.py`:
- Around line 319-322: The JAX branch incorrectly uses tensor methods (.clone(),
.flip()) and returns the original jax_image instead of the processed tensor;
update the block that sets rgb_array_or_tensor to use JAX-safe operations (e.g.,
use jax.numpy to copy/convert and jnp.flip to flip the array) rather than
.clone()/.flip(), change the channel reorder from [:, :, [2, 1, 0]] to [:, :,
:3] to drop alpha (matching CUDA semantics), ensure self.internal_type is set to
ObservationTensorType.JAX_NP, and return rgb_array_or_tensor, None (not
jax_image).
- Around line 306-309: convert_jax_observation() is using PyTorch methods on JAX
arrays (rgb_array_or_tensor.clone() and .flip(0)), which will raise
AttributeError; replace those with JAX equivalents by copying or re-wrapping the
array using jnp.array(...) or .copy() and use jnp.flip(..., axis=0) to flip the
y axis, and perform channel reordering via advanced indexing (e.g., [...,
[2,1,0]]); set self.internal_type = ObservationTensorType.JAX_NP as before and
return the transformed array; also apply the same replacements in the else
branch that performs the same clone/flip operations.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 22924490-8cd7-4aa4-9b17-4c8fefa20703

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8592f and a42b2a5.

📒 Files selected for processing (4)
  • docs/superpowers/plans/2026-03-19-zerocopy-raw-normalize.md
  • docs/superpowers/specs/2026-03-19-zerocopy-raw-normalize-design.md
  • src/craftground/environment/observation_converter.py
  • tests/python/unit/test_observation_converter.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/cpp/ipc_apple.h (1)

6-8: Missing pybind11 include for py::object type.

This header uses py::object but doesn't include the pybind11 header. It relies on the includer having already included pybind11, which is fragile and can cause confusing compilation errors.

♻️ Proposed fix
 #ifndef __IPC_APPLE_H__
 #define __IPC_APPLE_H__
 
+#include <pybind11/pybind11.h>
+namespace py = pybind11;
+
 #define USE_CUSTOM_DL_PACK_TENSOR 1
 
 py::object
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc_apple.h` around lines 6 - 8, The header uses py::object (in
declarations mtl_tensor_from_mach_port and normalize_apple_mtl_tensor_impl) but
doesn't include pybind11, causing fragile builds; add the appropriate pybind11
include (e.g., `#include` <pybind11/pybind11.h> or <pybind11/pytypes.h>) at the
top of src/cpp/ipc_apple.h so the py::object type is defined and the header can
be included independently.
src/cpp/ipc.h (1)

7-10: Missing pybind11 include for py namespace alias.

The header declares namespace py = pybind11; and uses py::object, but doesn't include the pybind11 header. This works only when the includer happens to include pybind11 first.

♻️ Proposed fix
 #include <stdlib.h>
 #include <string>
 
+#include <pybind11/pybind11.h>
 namespace py = pybind11;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc.h` around lines 7 - 10, The header declares namespace py =
pybind11 and uses py::object in initialize_from_mach_port and
normalize_apple_mtl_tensor but doesn't include pybind11; add the proper include
for pybind11 (e.g. include <pybind11/pybind11.h>) at the top of src/cpp/ipc.h so
the py alias and py::object are always defined regardless of include order.
src/cpp/ipc_apple_ops.mm (1)

120-122: Consider using PyTorch's getMTLBufferStorage() helper for consistency with MPS backend patterns.

The direct cast from src.storage().data() to id<MTLBuffer> is supported by PyTorch's MPS backend and documented in their Metal kernel implementations. However, PyTorch provides the getMTLBufferStorage() helper function for this exact operation, which aligns with the backend's recommended approach and improves code clarity. Using the helper would make the intent explicit and align with PyTorch's own metal kernel code patterns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc_apple_ops.mm` around lines 120 - 122, Replace the raw casts to
Metal buffers with PyTorch's helper: instead of casting src.storage().data() and
dst.storage().data() to id<MTLBuffer>, call getMTLBufferStorage(src.storage())
and getMTLBufferStorage(dst.storage()) (keeping use of
getNormalizePipeline(stream->device())/pipeline unchanged) so src_buffer and
dst_buffer are obtained via the helper for clarity and consistency with MPS
backend patterns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpp/ipc_apple_ops.mm`:
- Around line 71-95: Reformat the Metal pipeline creation block to satisfy
clang-format style rules: run clang-format -i on the file or reflow the block so
spacing, alignment and brace placement match project style; specifically
reformat the section that creates and checks
library/newFunctionWithName:@"normalize_bgr_flip", the error assignments to
library_error and pipeline_error, and the calls involving device, options,
pipeline, function, and pipeline_error so that indentation, line breaks and
spacing conform to clang-format (target the code around
newLibraryWithSource:options:error:, newFunctionWithName:, and
newComputePipelineStateWithFunction:error:); then re-run CI to ensure no
remaining formatting violations.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 266c2b26-26b0-4ebe-860d-c43ea28eefb0

📥 Commits

Reviewing files that changed from the base of the PR and between a42b2a5 and 41b460b.

📒 Files selected for processing (8)
  • CMakeLists.txt
  • src/cpp/ipc.cpp
  • src/cpp/ipc.h
  • src/cpp/ipc_apple.h
  • src/cpp/ipc_apple.mm
  • src/cpp/ipc_apple_ops.mm
  • src/craftground/environment/observation_converter.py
  • tests/python/unit/test_observation_converter.py
✅ Files skipped from review due to trivial changes (1)
  • CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/craftground/environment/observation_converter.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cpp/ipc_apple.mm (1)

117-123: ⚠️ Potential issue | 🟠 Major

Validate the IOSurface size at runtime before building the tensor.

Line 148 still uses the caller-supplied width/height. The only check is assert on Lines 120-121, which disappears in release builds. If those values drift, the code can give the tensor the wrong shape and size the no-copy buffer from the wrong height.

Suggested fix
-    assert(iosurface_width == width);
-    assert(iosurface_height == height);
+    if (iosurface_width != static_cast<size_t>(width) ||
+        iosurface_height != static_cast<size_t>(height)) {
+        throw std::runtime_error(
+            "IOSurface dimensions do not match requested tensor shape"
+        );
+    }
     size_t bytesPerRow = IOSurfaceGetBytesPerRow(ioSurface);
-    size_t length = bytesPerRow * height;
+    size_t length = bytesPerRow * iosurface_height;
@@
-    DLManagedTensor *tensor =
-        createDLPackTensorMetal(mtlBuffer, width, height, bytesPerRow);
+    DLManagedTensor *tensor = createDLPackTensorMetal(
+        mtlBuffer, iosurface_width, iosurface_height, bytesPerRow
+    );

Also applies to: 147-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc_apple.mm` around lines 117 - 123, Runtime validation is missing:
don't rely on asserts for iosurface size — check iosurface_width and
iosurface_height against the caller-supplied width/height at runtime and fail
fast if they differ; use IOSurfaceGetWidth/IOSurfaceGetHeight to compute the
tensor shape and the no-copy buffer length (bytesPerRow * iosurface_height)
instead of using the possibly-stale caller `height`/`width`, and handle the
mismatch by returning an error/throwing/early-exit from the function (replace
the assert-based checks around iosurface_width/iosurface_height and update the
calculation of `length` to use `iosurface_height` and `iosurface_width` as the
authoritative dimensions).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpp/ipc_apple_ops.mm`:
- Around line 69-96: The code uses manual retains/releases and alloc/init which
break ARC; update the Metal setup to ARC-safe operations: remove all `[...
release]` and the `retain` on library_error, replace the `[[NSError alloc]
initWithDomain:...]` with the autoreleasing factory `[NSError
errorWithDomain:code:userInfo:]`, and stop passing the static `pipeline_error`
into Metal calls—use a local `NSError *localPipelineError = nil` when calling
`newComputePipelineStateWithFunction:function error:&localPipelineError` then
assign `pipeline_error = localPipelineError` (or `pipeline_error =
[localPipelineError copy]` only if you need an owned copy). Keep references to
`options`, `library`, `function`, and `pipeline` as ARC-managed objects and
remove manual memory management calls around them.
- Around line 124-125: The C-style casts converting storage().data() to
id<MTLBuffer> (variables src_buffer and dst_buffer) are incompatible with ARC;
change those casts to non-retaining Objective-C bridge casts by replacing the
C-style casts with (__bridge id<MTLBuffer>) when recovering the buffers from
src.storage().data() and dst.storage().data() so the conversions are ARC-safe
and do not alter retain counts.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3628f1fe-9f4e-46de-b7ba-2309a7b0df39

📥 Commits

Reviewing files that changed from the base of the PR and between 41b460b and b0fcda3.

📒 Files selected for processing (8)
  • pybind11
  • src/cpp/ipc.cpp
  • src/cpp/ipc.h
  • src/cpp/ipc_apple.h
  • src/cpp/ipc_apple.mm
  • src/cpp/ipc_apple_ops.mm
  • src/craftground/environment/observation_converter.py
  • tests/python/unit/test_observation_converter.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/cpp/ipc.h
  • src/cpp/ipc_apple.h

Comment on lines 105 to 109
PYBIND11_MODULE(craftground_native, m) {
m.doc() = "Craftground Native Module";
m.def("initialize_from_mach_port", &initialize_from_mach_port);
m.def("normalize_apple_mtl_tensor", &normalize_apple_mtl_tensor);
m.def("mtl_tensor_from_cuda_mem_handle", &mtl_tensor_from_cuda_mem_handle);


⚠️ Potential issue | 🔴 Critical

The Apple JAX path still imports a symbol this module never exports.

Line 108 adds the new normalize binding, but src/craftground/environment/observation_converter.py Line 298 still does from .craftground_native import mtl_dlpack_from_mach_port. That name is not bound anywhere in this module, so the first macOS/JAX conversion will fail with ImportError. The new unit test only passes because it injects a fake implementation into sys.modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc.cpp` around lines 105 - 109, The Python import expects a symbol
named mtl_dlpack_from_mach_port but the PYBIND11_MODULE only exports
initialize_from_mach_port, normalize_apple_mtl_tensor, and
mtl_tensor_from_cuda_mem_handle; add an exported binding named
mtl_dlpack_from_mach_port (either by binding the existing C++ function
initialize_from_mach_port under that Python name or by implementing a thin
wrapper function mtl_dlpack_from_mach_port that calls the existing
implementation) in the PYBIND11_MODULE block so observation_converter.py can
import mtl_dlpack_from_mach_port successfully.

Comment on lines 119 to 126
if self.internal_type == ObservationTensorType.APPLE_TENSOR:
obs_1 = self.last_observations[0].clone()[:, :, [2, 1, 0]].flip(0)
obs_1 = self._normalize_apple_zerocopy_tensor(
self.last_observations[0]
)
return (obs_1, None)
elif self.internal_type == ObservationTensorType.CUDA_DLPACK:
obs_1 = self.last_observations[0].clone()[:, :, :3].flip(0)
obs_1 = self._normalize_cuda_zerocopy_tensor(self.last_observations[0])
return (obs_1, None)


⚠️ Potential issue | 🟠 Major

render() now sees the raw zerocopy tensor, not the public RGB output.

Line 120/125 normalizes only the return value. initialize_zerocopy() still leaves self.last_observations[0] pointing at the raw Apple/CUDA buffer, and render() later consumes that cache directly on Lines 168-204. That regresses zerocopy rendering back to unnormalized BGR/RGBA data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/craftground/environment/observation_converter.py` around lines 119 - 126,
The code normalizes only the value returned in the APPLE_TENSOR/CUDA_DLPACK
branches but leaves self.last_observations[0] as the raw zerocopy buffer so
render() later consumes unnormalized data; fix by replacing the raw cache with
the normalized tensor (assign the result of _normalize_apple_zerocopy_tensor or
_normalize_cuda_zerocopy_tensor back into self.last_observations[0]) inside the
branches (affecting initialize_zerocopy / the branch handling for
ObservationTensorType.APPLE_TENSOR and ObservationTensorType.CUDA_DLPACK) so
both the returned tuple and the cached last_observations[0] contain the
normalized RGB(A) tensor.

Comment on lines +15 to +21
raw_bgr = torch.tensor(
[
[[10, 20, 30], [40, 50, 60]],
[[70, 80, 90], [100, 110, 120]],
],
dtype=torch.uint8,
)


⚠️ Potential issue | 🟡 Minor

Model the macOS raw tensor as a 4-byte-strided view in these fixtures.

The Apple bridge now exposes shape[2] == 3 with a physical pixel stride of 4. These compact HWCx3 tensors will still pass even if the stride metadata regresses again.

Example fixture shape
-    raw_bgr = torch.tensor(
+    raw_bgra = torch.tensor(
         [
-            [[10, 20, 30], [40, 50, 60]],
-            [[70, 80, 90], [100, 110, 120]],
+            [[10, 20, 30, 255], [40, 50, 60, 255]],
+            [[70, 80, 90, 255], [100, 110, 120, 255]],
         ],
         dtype=torch.uint8,
     )
+    raw_bgr = raw_bgra[..., :3]
+    assert raw_bgr.stride() == (8, 4, 1)

Apply the same pattern to both Apple zerocopy tests.

Also applies to: 63-69

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/python/unit/test_observation_converter.py` around lines 15 - 21, The
fixture should model macOS' 4-byte-per-pixel layout by creating an underlying
HxWx4 uint8 tensor (add a dummy fourth channel per pixel) and then produce the
HxWx3 view using torch.as_strided with element-strides (H_stride=8, W_stride=4,
C_stride=1) so the resulting raw_bgr has shape (2,2,3) but a physical 4-byte
stride; replace the current raw_bgr creation with this pattern (use the variable
name raw_bgr) and apply the same change to the other Apple zero-copy test
fixture referenced in the file.
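One way to build such a 4-byte-strided view, sketched here with NumPy rather than torch (the element strides and layout are the same; torch's `as_strided` takes analogous size/stride arguments):

```python
import numpy as np

# Underlying BGRA-style buffer: 2x2 pixels, 4 bytes per pixel.
raw_bgra = np.array(
    [
        [[10, 20, 30, 255], [40, 50, 60, 255]],
        [[70, 80, 90, 255], [100, 110, 120, 255]],
    ],
    dtype=np.uint8,
)

# HxWx3 view with a physical pixel stride of 4, matching the Apple
# bridge metadata described in the review comment.
raw_bgr = np.lib.stride_tricks.as_strided(
    raw_bgra, shape=(2, 2, 3), strides=(8, 4, 1)
)

assert raw_bgr.strides == (8, 4, 1)
assert raw_bgr[0, 0].tolist() == [10, 20, 30]
```

Because the view strides over the alpha byte, any code that assumes a compact 3-byte pixel stride will read wrong data against this fixture, which is exactly the regression the test should catch.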


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpp/ipc_apple_ops.mm`:
- Around line 116-118: The kernel assumes channels are contiguous; before
building NormalizeParams and dispatching the kernel, validate the tensor channel
stride and positivity: add TORCH_CHECK(src.stride(2) == 1, "Expected contiguous
channel stride (stride(2) == 1)") and also assert positive strides (e.g.,
TORCH_CHECK(src.stride(i) > 0, "...") for relevant dims) to avoid non-contiguous
memory causing corrupted reads; perform these checks near the existing
TORCH_CHECK calls that reference src.dim(), src.size(2), and
src.storage().data() so the validation runs before NormalizeParams construction
and kernel dispatch.
- Around line 129-138: The NormalizeParams construction narrows 64-bit tensor
metadata (src/dst size, stride, storage_offset) to uint32_t without bounds
checks; add checked conversions before filling NormalizeParams by verifying each
value from src.size(...), src.stride(...), src.storage_offset(),
dst.stride(...), dst.storage_offset() is <= UINT32_MAX (use
std::numeric_limits<uint32_t>::max() or UINT32_MAX), and handle violations
(return an error/throw an exception or log and abort) instead of silently
truncating; after validation, perform the static_casts into NormalizeParams as
before.
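The checked narrowing the prompt asks for can be sketched as follows. This is a Python illustration of the bound check only (the real code would be C++ using TORCH_CHECK; the helper name `to_u32_checked` mirrors the one used later in the review thread):

```python
UINT32_MAX = 2**32 - 1

def to_u32_checked(value, name):
    # Reject values that would silently truncate when narrowed
    # from 64-bit tensor metadata to uint32_t kernel params.
    if not (0 <= value <= UINT32_MAX):
        raise ValueError(f"{name}={value} does not fit in uint32_t")
    return value

assert to_u32_checked(640, "width") == 640
try:
    to_u32_checked(2**33, "src_row_stride")
except ValueError:
    pass
else:
    raise AssertionError("expected overflow to be rejected")
```

Applying this to every size, stride, and storage offset before filling the params struct turns silent corruption into a loud, attributable error.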

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de3db258-1c24-4136-8d6f-db15473c8373

📥 Commits

Reviewing files that changed from the base of the PR and between b0fcda3 and afd68d2.

📒 Files selected for processing (1)
  • src/cpp/ipc_apple_ops.mm


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cpp/ipc_apple_ops.mm (1)

12-21: Pin the host/kernel param ABI with a static_assert.

NormalizeParams is duplicated in C++ and in the embedded Metal source. A small host-side size check would at least fail fast if the CPU layout drifts from the shader contract in a later refactor.

🧭 Suggested hardening
 struct NormalizeParams {
     uint32_t width;
     uint32_t height;
     uint32_t src_row_stride;
     uint32_t src_pixel_stride;
     uint32_t src_offset;
     uint32_t dst_row_stride;
     uint32_t dst_pixel_stride;
     uint32_t dst_offset;
 };
+static_assert(
+    sizeof(NormalizeParams) == 8 * sizeof(uint32_t),
+    "NormalizeParams must stay in sync with the Metal kernel"
+);

Also applies to: 33-42

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc_apple_ops.mm` around lines 12 - 21, Add a compile-time layout
check to pin the host/kernel ABI: immediately after the NormalizeParams struct
definition add a static_assert that sizeof(NormalizeParams) equals the exact
byte size expected by the Metal shader ABI (use the literal byte value that
matches the embedded Metal struct), and include similar static_asserts for the
other duplicated structs mentioned (the ones around lines 33-42) so any future
layout drift fails to compile; place these assertions next to the corresponding
struct declarations and keep the message descriptive (e.g., "NormalizeParams
size must match Metal shader ABI").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cpp/ipc_apple_ops.mm`:
- Around line 119-126: Validation currently checks channel contiguity
(src.stride(2) == 1) and positive strides but misses overlap checks; add checks
before the kernel launch to reject HWC views that would let the kernel read
invalid bytes: if width > 1 require src.stride(1) >= 3, and if height > 1
require src.stride(0) >= ((width > 0 ? (width - 1) : 0) * src.stride(1) + 3);
reference the tensor strides via src.stride(1) and src.stride(0), and the
width/height variables used for the kernel so the code returns an error
(TORCH_CHECK) when these conditions are not met.

---

Nitpick comments:
In `@src/cpp/ipc_apple_ops.mm`:
- Around line 12-21: Add a compile-time layout check to pin the host/kernel ABI:
immediately after the NormalizeParams struct definition add a static_assert that
sizeof(NormalizeParams) equals the exact byte size expected by the Metal shader
ABI (use the literal byte value that matches the embedded Metal struct), and
include similar static_asserts for the other duplicated structs mentioned (the
ones around lines 33-42) so any future layout drift fails to compile; place
these assertions next to the corresponding struct declarations and keep the
message descriptive (e.g., "NormalizeParams size must match Metal shader ABI").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a97610a-a9d2-466b-9555-af14e170a0a6

📥 Commits

Reviewing files that changed from the base of the PR and between afd68d2 and a8ed608.

📒 Files selected for processing (1)
  • src/cpp/ipc_apple_ops.mm

Comment on lines +119 to +126
TORCH_CHECK(
src.stride(2) == 1,
"Expected contiguous channel stride (stride(2) == 1)"
);
TORCH_CHECK(
src.stride(0) > 0 && src.stride(1) > 0,
"Expected positive row and pixel strides"
);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f ipc_apple_ops.mm

Repository: yhs0602/CraftGround

Length of output: 87


🏁 Script executed:

cat -n src/cpp/ipc_apple_ops.mm

Repository: yhs0602/CraftGround

Length of output: 7864


Reject overlapping HWC views in validation checks.

Lines 119–126 enforce channel contiguity (stride(2) == 1) and positive strides, but do not ensure pixels or rows are non-overlapping. The kernel (lines 54–62) reads 3 bytes per pixel assuming stride(1) >= 3 and reads across rows assuming stride(0) >= (width - 1) * stride(1) + 3. Tensors with stride(1) < 3 or insufficient stride(0) will cause the kernel to read the wrong bytes without error.

Add validation before line 148:

  • For width > 1: require stride(1) >= 3
  • For height > 1: require stride(0) >= (width - 1) * stride(1) + 3 (handle width = 0 as 0)
Suggested fix
+    const auto width = to_u32_checked(src.size(1), "width");
+    const auto height = to_u32_checked(src.size(0), "height");
+    const auto src_row_stride =
+        to_u32_checked(src.stride(0), "src_row_stride");
+    const auto src_pixel_stride =
+        to_u32_checked(src.stride(1), "src_pixel_stride");
+    const auto min_row_span =
+        width == 0
+            ? uint64_t{0}
+            : (static_cast<uint64_t>(width) - 1) *
+                  static_cast<uint64_t>(src_pixel_stride) +
+                  3u;
+
+    TORCH_CHECK(
+        width <= 1 || src_pixel_stride >= 3,
+        "Expected pixel stride >= 3 for non-overlapping RGB pixels"
+    );
+    TORCH_CHECK(
+        height <= 1 || static_cast<uint64_t>(src_row_stride) >= min_row_span,
+        "Expected row stride to cover full row without overlap"
+    );
+
     NormalizeParams params{
-        to_u32_checked(src.size(1), "width"),
-        to_u32_checked(src.size(0), "height"),
-        to_u32_checked(src.stride(0), "src_row_stride"),
-        to_u32_checked(src.stride(1), "src_pixel_stride"),
+        width,
+        height,
+        src_row_stride,
+        src_pixel_stride,
         to_u32_checked(src.storage_offset(), "src_offset"),
         to_u32_checked(dst.stride(0), "dst_row_stride"),
         to_u32_checked(dst.stride(1), "dst_pixel_stride"),
         to_u32_checked(dst.storage_offset(), "dst_offset"),
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cpp/ipc_apple_ops.mm` around lines 119 - 126, Validation currently checks
channel contiguity (src.stride(2) == 1) and positive strides but misses overlap
checks; add checks before the kernel launch to reject HWC views that would let
the kernel read invalid bytes: if width > 1 require src.stride(1) >= 3, and if
height > 1 require src.stride(0) >= ((width > 0 ? (width - 1) : 0) *
src.stride(1) + 3); reference the tensor strides via src.stride(1) and
src.stride(0), and the width/height variables used for the kernel so the code
returns an error (TORCH_CHECK) when these conditions are not met.
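The overlap conditions above can be expressed as a small predicate. This is a Python sketch of the checks only (the real code would be TORCH_CHECK calls in the Objective-C++ helper; names mirror the review text):

```python
def validate_hwc_strides(height, width, row_stride, pixel_stride):
    # Each pixel read spans 3 bytes, so pixels must be >= 3 apart.
    if width > 1 and pixel_stride < 3:
        raise ValueError("pixel stride must be >= 3 for non-overlapping RGB")
    # A full row spans (width - 1) pixel strides plus the last 3-byte pixel.
    min_row_span = 0 if width == 0 else (width - 1) * pixel_stride + 3
    if height > 1 and row_stride < min_row_span:
        raise ValueError("row stride must cover the full row without overlap")

# A 4-byte-per-pixel BGRA row layout passes:
validate_hwc_strides(height=2, width=2, row_stride=8, pixel_stride=4)

# Overlapping pixels (stride 2 < 3) are rejected:
try:
    validate_hwc_strides(height=2, width=2, row_stride=8, pixel_stride=2)
except ValueError:
    pass
else:
    raise AssertionError("expected overlapping strides to be rejected")
```

The width == 0 and height <= 1 special cases keep degenerate tensors valid, matching the guards in the suggested fix.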

@yhs0602
Owner Author

yhs0602 commented Mar 20, 2026

From a performance standpoint, the impact of this PR is less about achieving full zero-copy and more about reducing unnecessary intermediate processing in the existing macOS public zerocopy path.

The previous path performed clone() -> channel reordering -> flip() post-processing on the Python side; this PR replaces that with a native Metal 1-pass normalize.
In other words, it keeps the current public output contract (RGB + flipped) while reducing Python-side ad-hoc processing and intermediate temporary allocations.

To summarize:

  • It does not fully resolve the root cause in the raw transport itself.
  • However, the user-facing macOS zerocopy path now produces the final output tensor more directly than before, so overhead is reduced relative to the previous path.
  • In particular, folding the path that combined clone/indexing/flip in Python into a single native GPU-side normalize is the key performance benefit of this PR.

In short, the performance value of this PR is not "copies were brought down to zero" but rather "the normalize cost was reduced to a simpler, more predictable 1-pass path while keeping the current public API."
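Semantically, the normalization being folded into the native op is a channel reorder plus a vertical flip. Sketched with NumPy for illustration only (the real path runs as a Metal kernel against torch MPS tensors):

```python
import numpy as np

# 2x2 BGR image as delivered by the transport buffer (bottom-up).
raw_bgr = np.array(
    [
        [[10, 20, 30], [40, 50, 60]],
        [[70, 80, 90], [100, 110, 120]],
    ],
    dtype=np.uint8,
)

# The old path composed clone()/indexing/flip() on the Python side;
# the single native Metal op writes the same RGB + flipped result
# directly into the destination tensor. The slice below expresses
# that combined transform: reverse rows, reverse channels.
rgb_flipped = raw_bgr[::-1, :, ::-1]

assert rgb_flipped[0, 0].tolist() == [90, 80, 70]
```

The gain comes not from changing this transform but from performing it once on the GPU, without the intermediate tensors the Python composition produced.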
