Motivation

fVDB operations currently exist only within the PyTorch ecosystem. Users who want to deploy fVDB-based models (neural radiance fields, 3D reconstruction, sparse convolution networks, etc.) in ONNX Runtime -- for cross-language inference (C++, C#, Java, JS), hardware-specific EPs, or standardized model interchange -- have no path to do so.
The core challenge is that ONNX graphs transport only tensors between nodes, while fVDB operations pass two non-tensor types: GridBatch (which wraps NanoVDB OnIndexGrid data in a contiguous byte buffer) and JaggedTensor (variable-length batched data). Grid topology is dynamic -- constructed at inference time from input data.
Design Overview
The approach is split into two phases: (1) a custom ONNX operator domain with tensor-based representations of fVDB types, and (2) an optional ONNX Runtime Execution Provider plugin for optimized execution.
Phase 1: Custom ONNX Operator Domain (fvdb)
1.1 Tensor Representations
GridBatch as a tensor bundle ("GridBatch Bundle")
A GridBatch is decomposed into a group of tensors that always travel together through the ONNX graph. This exploits the fact that NanoVDB grids are stored as a single contiguous byte buffer (see GridData::mGridSize in NanoVDB.h and TorchDeviceBuffer in src/fvdb/detail/TorchDeviceBuffer.h):
| Tensor | Dtype | Shape | Source in GridBatchImpl |
|---|---|---|---|
| grid_blob | uint8 | [N] (dynamic) | mGridHdl->data() -- the raw NanoVDB buffer containing all grids packed sequentially |
| grid_byte_offsets | int64 | [B] | Per-grid mCumBytes from GridMetadata (src/fvdb/detail/GridBatchImpl.h ~L30-57) |
| voxel_sizes | float64 | [B, 3] | Per-grid mVoxelSize from GridMetadata |
| origins | float64 | [B, 3] | Per-grid voxel origins from GridMetadata |
| leaf_batch_indices | int32 | [L] (dynamic) | mLeafBatchIndices (GridBatchImpl.h ~L93) |
| batch_offsets | int64 | [B+1] | mBatchOffsets (GridBatchImpl.h ~L94) |
| list_indices | int32 | [M] (dynamic) | mListIndices (GridBatchImpl.h ~L95) |
The grid_blob contains the serialized NanoVDB tree structure. Individual grids within the batch are accessed via pointer arithmetic: reinterpret_cast<const nanovdb::OnIndexGrid*>(blob_ptr + grid_byte_offsets[bi]) (see GridBatchImpl::Accessor::grid() at GridBatchImpl.h ~L151-156).
Alignment requirement: NanoVDB requires 32-byte alignment (NanoVDB.h ~L67-78). CUDA allocators satisfy this (typically 256B+). CPU allocators may not -- the custom op implementations must validate or enforce alignment.
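The addressing scheme and the alignment check can be sketched in a few lines (a Python analogue with made-up grid sizes; the real code does this with raw pointers in C++):

```python
def grid_bytes(grid_blob: bytes, grid_byte_offsets: list, bi: int) -> bytes:
    """Python analogue of blob_ptr + grid_byte_offsets[bi]: slice out grid bi
    from the packed buffer (bounded by the next offset, or the blob end)."""
    start = grid_byte_offsets[bi]
    end = (grid_byte_offsets[bi + 1] if bi + 1 < len(grid_byte_offsets)
           else len(grid_blob))
    return grid_blob[start:end]

def is_nanovdb_aligned(address: int) -> bool:
    """The 32-byte alignment check the custom ops must perform on CPU buffers."""
    return address % 32 == 0

# Two grids packed sequentially (sizes are made up for illustration).
blob = bytes(range(64)) + bytes(32)
assert grid_bytes(blob, [0, 64], 0) == bytes(range(64))
assert grid_bytes(blob, [0, 64], 1) == bytes(32)
assert is_nanovdb_aligned(64) and not is_nanovdb_aligned(40)
```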
JaggedTensor as a tensor bundle ("JaggedTensor Bundle")
A JaggedTensor (src/fvdb/JaggedTensor.h ~L163-192) is decomposed into its constituent tensors:
| Tensor | Dtype | Shape | Source |
|---|---|---|---|
| jdata | varies | [N, *esizes] | mData -- packed values |
| joffsets | int64 | [T+1] | mOffsets -- CSR boundaries |
| jidx | int32 | [N] | mBatchIdx -- per-element batch index |
| jlidx | int32 | [T, ldim] | mListIdx -- list-of-lists indexing |
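To make the CSR encoding concrete, here is a stdlib-only sketch of decomposing a flat batch of variable-length tensors into the jdata / joffsets / jidx fields (jlidx, which only matters for nested list-of-lists batches, is omitted):

```python
def decompose_jagged(lists: list) -> tuple:
    """Flatten a batch of variable-length element lists into the
    (jdata, joffsets, jidx) bundle fields in CSR layout."""
    jdata, joffsets, jidx = [], [0], []
    for b, lst in enumerate(lists):
        jdata.extend(lst)                       # packed values
        joffsets.append(joffsets[-1] + len(lst))  # CSR boundaries
        jidx.extend([b] * len(lst))             # per-element batch index
    return jdata, joffsets, jidx

# Batch of 3 "tensors" with lengths 2, 0, 3.
jdata, joffsets, jidx = decompose_jagged([[1.0, 2.0], [], [3.0, 4.0, 5.0]])
assert joffsets == [0, 2, 2, 5]
assert jidx == [0, 0, 2, 2, 2]
```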
1.2 Custom Operator Schemas
Define ONNX custom ops in the fvdb domain. Each op corresponds to an existing C++ function in src/fvdb/detail/ops/. The ops split into three categories:
Grid Construction Ops (produce a GridBatch Bundle from input tensors):
| ONNX Custom Op | Underlying implementation | Notes |
|---|---|---|
| fvdb.GridFromPoints | BuildGridFromPoints.h | |
| fvdb.GridFromIJK | BuildGridFromIjk.h | |
| fvdb.GridFromMesh | BuildGridFromMesh.h | |
| fvdb.GridFromDense | BuildDenseGrid.h | |
| fvdb.GridFromNanoVDB | New -- lightweight init | For users who provide a pre-built NanoVDB blob as input; derives metadata + index tensors from the blob |
Grid-Consuming Ops (take a GridBatch Bundle + data tensors, produce tensor/JaggedTensor Bundle outputs):
Priority ops for initial implementation (covers the most common inference patterns):
| ONNX Custom Op | Underlying implementation |
|---|---|
| fvdb.SampleTrilinear | SampleGridTrilinear.h |
| fvdb.SampleBezier | SampleGridBezier.h |
| fvdb.SplatTrilinear | SplatIntoGridTrilinear.h |
| fvdb.SplatBezier | SplatIntoGridBezier.h |
| fvdb.PointsInGrid | PointsInGrid.h |
| fvdb.IJKToIndex | IjkToIndex.h |
| fvdb.CoordsInGrid | CoordsInGrid.h |
| fvdb.DownsampleAvgPool | DownsampleGridAvgPool.h |
| fvdb.DownsampleMaxPool | DownsampleGridMaxPool.h |
Grid-to-Grid Ops (produce a new GridBatch Bundle):
| ONNX Custom Op | Underlying implementation |
|---|---|
| fvdb.CoarsenedGrid | BuildCoarseGridFromFine.h |
| fvdb.DilatedGrid | BuildDilatedGrid.h |
| fvdb.ConvGrid | BuildGridForConv.h |
| fvdb.ConvTransposeGrid | BuildGridForConvTranspose.h |
| fvdb.PrunedGrid | BuildPrunedGrid.h |
| fvdb.MergedGrid | BuildMergedGrids.h |
The full op list (~55 ops under src/fvdb/detail/ops/) need not all be implemented at once. The initial set above covers the most common inference workloads.
1.3 Custom Op Library Implementation
Build a shared library (libfvdb_onnx_ops.so / fvdb_onnx_ops.dll) that:
- Exports RegisterCustomOps per the ONNX Runtime custom op library convention
- Registers the fvdb domain using the Ort::CustomOpDomain + Ort::Custom::CreateLiteCustomOp API
- Obtains an Ort::Custom::CudaContext for GPU ops

Each custom op is a thin adapter layer that reconstructs fVDB types from the tensor bundle, calls the existing C++ implementation (no kernel rewrite needed), and decomposes the results.
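The adapter flow can be sketched as follows (a pure-Python toy; every function here is a hypothetical stand-in, since the real step 2 dispatches to the C++ kernels under src/fvdb/detail/ops/):

```python
def reconstruct_grid_batch(grid_blob: bytes, grid_byte_offsets: list) -> dict:
    """Step 1: reassemble a GridBatch-like view from its bundle tensors,
    validating the invariants the ops rely on (here: offsets in range)."""
    assert all(0 <= o <= len(grid_blob) for o in grid_byte_offsets)
    return {"blob": grid_blob, "offsets": grid_byte_offsets}

def grid_count_op(grid_blob: bytes, grid_byte_offsets: list) -> list:
    """A toy grid-consuming custom op: bundle tensors in, plain tensor out."""
    grid = reconstruct_grid_batch(grid_blob, grid_byte_offsets)  # step 1
    batch_size = len(grid["offsets"])                            # step 2 (stand-in)
    return [batch_size]                                          # step 3: decompose

assert grid_count_op(bytes(96), [0, 64]) == [2]
```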
1.4 ONNX Model Export
Provide a Python utility to export fVDB-based PyTorch models to ONNX. Since fVDB ops are currently bound via pybind11 (src/python/Bindings.cpp, PYBIND11_MODULE at L96) rather than torch.library, torch.onnx.export() won't trace them automatically. Options:
- Custom ONNX exporter with symbolic functions: register torch.onnx symbolic handlers for each fVDB op that emit the corresponding fvdb.* custom ONNX nodes
- Manual graph construction: provide an fvdb.export_onnx(model, sample_inputs, path) utility that traces the model and builds the ONNX graph programmatically using onnx.helper
1.5 User-Provided Grids
Users who construct grids externally and pass them in for inference use the fvdb.GridFromNanoVDB op. They provide:
- The raw NanoVDB bytes as a uint8 input tensor
- Voxel sizes as a float64[B, 3] input tensor
- Origins as a float64[B, 3] input tensor
The fvdb.GridFromNanoVDB op derives the remaining metadata (leaf batch indices, batch offsets, list indices) from the NanoVDB buffer at runtime. This mirrors the existing GridBatch constructor that accepts a GridHandle (GridBatch.h ~L30-32).
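The batch-offsets part of that derivation reduces to a prefix sum (a stdlib-only sketch; treating batch_offsets as cumulative per-grid voxel counts is an assumption here, and the real op obtains the counts by parsing the NanoVDB headers):

```python
from itertools import accumulate

def derive_batch_offsets(voxels_per_grid: list) -> list:
    """Prefix-sum per-grid voxel counts into a [B+1] offsets tensor,
    the same CSR convention the JaggedTensor bundle uses."""
    return [0] + list(accumulate(voxels_per_grid))

assert derive_batch_offsets([100, 0, 50]) == [0, 100, 100, 150]
```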
Phase 2: fVDB Execution Provider Plugin

Build an ONNX Runtime EP plugin (libfvdb_ep.so) using the EP ABI kernel-based plugin API introduced in ORT v1.24.

2.1 Architecture

The EP claims subgraphs of fvdb.* custom ops via OrtEp::GetCapability + EpGraphSupportInfo_LookUpKernel. Inside the EP:
- GridBatch objects are maintained as native C++ state, not reconstructed from tensor bundles per-op
- A kernel registry (OrtKernelRegistry + KernelRegistry_AddKernel) contains entries for each fVDB op
- Grid construction ops build GridBatch objects and store them in EP-managed state
- Grid-consuming ops read the GridBatch directly, eliminating the per-op blob interpretation overhead

This is architecturally similar to how the TensorRT EP captures subgraphs and executes them with an opaque internal engine. The relevant ORT APIs: OrtEp::GetCapability, OrtEp::GetKernelRegistry, OrtKernelImpl::Compute, OrtEpApi::KernelRegistry_AddKernel, OrtEpApi::GetEnvConfigEntries.

2.2 Memory Management

The EP controls its own memory allocation. Since fVDB already uses TorchDeviceBuffer (src/fvdb/detail/TorchDeviceBuffer.h), which is a raw uint8_t* + size + device, the EP can either:
- Continue using TorchDeviceBuffer (requires linking libtorch)
- Implement a lightweight buffer type that uses ORT's allocator APIs instead, decoupling from PyTorch at the inference layer
2.3 Benefits Over Phase 1 Alone
- Eliminates per-op GridBatch reconstruction from the tensor bundle
- Enables cross-op optimizations (e.g., fusing grid construction + immediate sample)
- Grid state never materializes as ONNX tensor edges, reducing memory management overhead
- Gives full control over CUDA stream and memory management within the fVDB subgraph
Implementation Plan
Phase 1 milestones:
1. Define the GridBatch Bundle and JaggedTensor Bundle tensor representations
2. Implement fvdb.GridFromPoints and fvdb.GridFromNanoVDB construction ops
3. Implement fvdb.SampleTrilinear and fvdb.PointsInGrid as initial consuming ops
4. Build and test the custom op shared library with ONNX Runtime (CPU + CUDA)
5. Implement the Python export utility
6. Expand op coverage based on user demand
Phase 2 milestones:
1. Scaffold the EP plugin with the ORT EP ABI
2. Implement GetCapability to claim fvdb.* subgraphs
3. Implement native GridBatch state management within the EP
4. Register kernel implementations for all Phase 1 ops
5. Benchmark against Phase 1 (per-op reconstruction) to quantify improvement
Key References
- OrtKernelImpl, OrtKernelRegistry, KernelRegistry_AddKernel, EpGraphSupportInfo_LookUpKernel
- OrtEp::GetCapability, Graph IR APIs, plugin EP infrastructure
- OrtApi::CreateEnvWithOptions, OrtEpApi::GetEnvConfigEntries, EP Plugin API overview