06 Mar 20:21

e6c3ba2

v1.12.0 Latest

Warp v1.12.0

Warp v1.12 adds experimental hardware-accelerated texture sampling on CUDA GPUs, extends tile programming with element-wise arithmetic operators and differentiable FFT, and broadens JAX interoperability with jax.vmap support. This release also introduces subscript-style type hints for better IDE integration, new quaternion and approximate-math builtins, B-spline shape functions in warp.fem, and a collection of utility and diagnostics APIs.

New features

Hardware-accelerated textures

Experimental. This API may change without a formal deprecation cycle.

Warp v1.12 introduces wp.Texture1D, wp.Texture2D, and wp.Texture3D classes that leverage CUDA texture memory for hardware-accelerated interpolation directly inside Warp kernels. On GPU, texture reads are routed through dedicated texture units that perform filtered lookups in a single instruction, making them ideal for rendering, volume sampling, signed-distance-field queries, and simulation lookup tables. On CPU, a software fallback provides identical semantics so the same kernel code runs on both devices.

import warp as wp
import numpy as np

wp.init()

# 64x64 single-channel height map
data = np.random.rand(64, 64).astype(np.float32)

# Create a 2D texture with bilinear filtering
tex = wp.Texture2D(data, filter_mode=wp.Texture.FILTER_LINEAR)

@wp.kernel
def sample_texture(tex: wp.Texture2D, coords: wp.array[wp.vec2f], out: wp.array[float]):
    i = wp.tid()
    # Coordinates are in [0, 1]; bilinear interpolation is automatic
    out[i] = wp.texture_sample(tex, coords[i], dtype=float)

coords = wp.array(np.random.rand(1024, 2).astype(np.float32), dtype=wp.vec2f)
result = wp.zeros(1024, dtype=float)
wp.launch(sample_texture, dim=1024, inputs=[tex, coords, result])

print(f"Sampled {result.shape[0]} points, range: [{result.numpy().min():.4f}, {result.numpy().max():.4f}]")
# Example output: Sampled 1024 points, range: [0.0069, 0.9793]

Key capabilities:

1D / 2D / 3D texture classes (wp.Texture1D, wp.Texture2D, wp.Texture3D) with matching wp.texture_sample() overloads that accept scalar, vec2f, or vec3f coordinates.
Filter modes: FILTER_POINT for nearest-neighbor sampling and FILTER_LINEAR for linear (1D), bilinear (2D), or trilinear (3D) interpolation.
Address modes: ADDRESS_WRAP, ADDRESS_CLAMP, ADDRESS_MIRROR, and ADDRESS_BORDER control how out-of-range texture coordinates are handled, configurable per axis.
Array interop: Texture objects provide copy_from_array() and copy_to_array() methods to transfer data between wp.array objects and texture memory. A cuda_surface property exposes the CUDA surface handle for advanced interop.
Broad dtype support: Textures accept integer and floating-point data types with 1, 2, or 4 channels. Integer types are automatically normalized to floating-point values on read.
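
The filter and address modes can be made concrete with a small CPU reference. The sketch below is plain Python and is not Warp's implementation. It assumes the usual CUDA convention for normalized coordinates, in which texel centers sit at (i + 0.5) / N. The helper names (`resolve_index`, `sample_linear_1d`) and the string mode names are illustrative stand-ins for the ADDRESS_* constants.

```python
import math


def resolve_index(i, n, mode):
    """Map a possibly out-of-range texel index to a valid one (None means border)."""
    if mode == "clamp":
        return min(max(i, 0), n - 1)
    if mode == "wrap":
        return i % n
    if mode == "mirror":
        period = 2 * n
        j = i % period
        return j if j < n else period - 1 - j
    if mode == "border":
        return i if 0 <= i < n else None
    raise ValueError(f"unknown address mode: {mode}")


def sample_linear_1d(texels, u, mode="clamp", border_value=0.0):
    """Linearly filtered lookup at normalized coordinate u (assumed texel-center convention)."""
    n = len(texels)
    x = u * n - 0.5          # shift so that texel centers land on integers
    i0 = math.floor(x)       # left neighbor
    t = x - i0               # interpolation weight toward the right neighbor

    def fetch(i):
        j = resolve_index(i, n, mode)
        return border_value if j is None else texels[j]

    return (1.0 - t) * fetch(i0) + t * fetch(i0 + 1)


texels = [0.0, 10.0, 20.0, 30.0]
inside = sample_linear_1d(texels, 0.5)                   # halfway between texels 1 and 2
edge_clamp = sample_linear_1d(texels, 0.0, "clamp")      # left neighbor clamps to texel 0
edge_wrap = sample_linear_1d(texels, 0.0, "wrap")        # left neighbor wraps to texel 3
```

At the left edge the two address modes diverge: clamping blends texel 0 with itself, while wrapping blends texel 3 with texel 0.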

Subscript-style type hints

When annotating kernel parameters with call-syntax forms like wp.array(dtype=float), static type checkers such as Pyright and Pylance flag these as errors because the expressions look like constructor calls rather than type annotations. Warp v1.12 adds subscript-style alternatives that are recognized as valid generic aliases (#1216):

# Before (flagged as error by Pyright/Pylance):
@wp.kernel
def my_kernel(a: wp.array(dtype=float), b: wp.array2d(dtype=wp.vec3)):
    ...

# After (clean subscript syntax):
@wp.kernel
def my_kernel(a: wp.array[float], b: wp.array2d[wp.vec3]):
    ...

The subscript syntax is supported for all array dimensionalities (wp.array[dtype] through wp.array4d[dtype]) as well as wp.tile[dtype] for tile-typed arguments.

Warp's static type checking compatibility is being improved incrementally, and you may encounter other Pyright/Pylance diagnostics that are not yet resolved. If you run into type checking issues, please report them as sub-issues of #549.

Diagnostics utility

The new wp.print_diagnostics() function displays a comprehensive snapshot of the Warp build and runtime environment (software versions, CUDA information, build flags, and available devices) in a single call (#1221). Two companion helpers, wp.get_cuda_toolkit_version() and wp.get_cuda_driver_version(), return the CUDA toolkit and driver versions as integer tuples (#1172). Together these are useful for debugging environment issues, capturing context in CI logs, and providing system information when filing bug reports.

Quaternion and spatial helpers

Warp v1.12 adds quaternion and spatial transformation helpers: wp.quat_from_euler(), wp.quat_to_euler(), wp.transform_twist(), and wp.transform_wrench() (#1237). The Euler conversion functions accept axis indices (0 = X, 1 = Y, 2 = Z) so you can specify arbitrary rotation-order conventions such as ZYX or XYZ, making them suitable for robotics and animation pipelines:

euler = wp.vec3(0.0, wp.PI / 4.0, 0.0)
q = wp.quat_from_euler(euler, 2, 1, 0)  # ZYX convention
print(q)  # [0.0, 0.3826834559440613, 0.0, 0.9238795042037964]
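
The printed quaternion can be checked by hand. Only the Y angle is non-zero in this example, so the rotation reduces to a single axis-angle rotation about Y. In Warp's (x, y, z, w) component order, that quaternion is (0, sin(θ/2), 0, cos(θ/2)). The check below uses only the Python standard library and does not call Warp:

```python
import math

theta = math.pi / 4.0  # the Y angle from the example above

# Axis-angle quaternion for a rotation about Y, in (x, y, z, w) order
q_expected = (0.0, math.sin(theta / 2.0), 0.0, math.cos(theta / 2.0))

print([round(c, 4) for c in q_expected])  # [0.0, 0.3827, 0.0, 0.9239]
```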

Approximate math intrinsics

wp.div_approx() and wp.inverse_approx() expose GPU hardware fast-math instructions (div.approx.f32 and rcp.approx.ftz.f64) for approximate floating-point division and reciprocal, offering higher throughput at reduced precision (#1199). Only floating-point types are supported. On CPU, both functions fall back to exact arithmetic so the same kernel code runs correctly on either device.

Marching cubes lookup tables

The internal marching cubes lookup tables are now exposed as public class attributes on wp.MarchingCubes: CUBE_CORNER_OFFSETS, EDGE_TO_CORNERS, CASE_TO_TRI_RANGE, and TRI_LOCAL_INDICES (#1151). These tables enable custom marching cubes implementations for advanced use cases such as sparse volume extraction or procedural mesh generation without having to duplicate the standard lookup data.

Graph coloring API

wp.utils.graph_coloring_assign(), wp.utils.graph_coloring_balance(), and wp.utils.graph_coloring_get_groups() are now part of the public API (#1145). These graph coloring utilities were originally introduced in warp.sim in v1.5.0 for use with VBDIntegrator and were removed along with the warp.sim module in v1.10.0. They are now reintroduced as standalone functions in wp.utils, independent of any physics module. They partition a graph into independent color groups, which is useful for parallel constraint solving, conflict-free mesh updates, and other tasks that require concurrent writes to non-adjacent elements.
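
To illustrate what "independent color groups" means, here is a minimal greedy coloring in plain Python. It is a conceptual sketch only: the function names are hypothetical, and it does not reproduce the algorithms or signatures of the wp.utils functions.

```python
def greedy_coloring(num_nodes, edges):
    """Assign each node the smallest color not used by any already-colored neighbor."""
    adjacency = [set() for _ in range(num_nodes)]
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    colors = [-1] * num_nodes
    for node in range(num_nodes):
        used = {colors[nb] for nb in adjacency[node] if colors[nb] >= 0}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def color_groups(colors):
    """Collect node indices per color; nodes within one group share no edge."""
    groups = {}
    for node, color in enumerate(colors):
        groups.setdefault(color, []).append(node)
    return [groups[c] for c in sorted(groups)]


# A chain of four nodes: 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]
colors = greedy_coloring(4, edges)
groups = color_groups(colors)
print(groups)  # [[0, 2], [1, 3]]
```

Each group can then be processed in one parallel pass, because no two nodes in the same group are adjacent.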

Tile programming enhancements

Tile arithmetic operators

Tiles now support native Python * and / operators for element-wise multiplication and division, including broadcast between tiles and scalar constants (#1006, #1009). The supported forms are tile * tile, tile * constant, constant * tile for multiplication, and tile / tile, tile / constant, constant / tile for division. All combinations are differentiable and work with scalar, vector, and matrix element types.

import warp as wp

TILE_SIZE = wp.constant(64)

@wp.kernel
def scale_and_normalize(
    a: wp.array[float],
    b: wp.array[float],
    out: wp.array[float],
):
    i = wp.tid()
    ta = wp.tile_load(a, shape=TILE_SIZE, offset=i * TILE_SIZE)
    tb = wp.tile_load(b, shape=TILE_SIZE, offset=i * TILE_SIZE)

    product = ta * tb          # element-wise multiply
    scaled = product * 0.5     # broadcast scalar multiply
    result = scaled / tb       # element-wise divide

    wp.tile_store(out, result, offset=i * TILE_SIZE)

N = 256
a = wp.ones(N, dtype=float)
b = wp.full(N, value=2.0, dtype=float)
out = wp.zeros(N, dtype=float)
wp.launch_tiled(scale_and_normalize, dim=[N // 64], inputs=[a, b, out], block_dim=64)

print(out.numpy()[:8])  # [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]

`wp.tile_from_thread()`

wp.tile_from_thread() broadcasts a scalar or vector value held by a single thread to a shared tile visible to all threads in the block (#1178). This is useful when one thread computes a value (e.g., a reduction result or a loop-invariant parameter) that the entire block needs to use in subsequent tile operations. The function accepts a thread_idx argument to specify which thread's value is broadcast, and supports both "shared" and "register" storage modes.

Differentiable FFT

wp.tile_fft() and wp.tile_ifft() now support reverse-mode automatic differentiation when recorded on a wp.Tape() (#1138). Warp automatically provides the correct gradient implementations for both transforms, so gradients propagate seamlessly through frequency-domain operations. This enables end-to-end gradient computation through pipelines that mix spatial and spectral steps, which is useful for differentiable signal processing, spectral methods, and PDE solvers.
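
The idea behind a reverse-mode FFT gradient is that the adjoint of a discrete Fourier transform is another Fourier transform with the opposite sign in the exponent. The standard-library sketch below checks that identity numerically with a naive DFT. It is an illustration of the math, not Warp's gradient code, and it makes no claim about the normalization convention Warp uses.

```python
import cmath


def dft(x, sign=-1):
    """Naive unnormalized DFT; sign=-1 is the forward transform, sign=+1 its adjoint."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def inner(a, b):
    """Complex inner product <a, b> = sum(a_i * conj(b_i))."""
    return sum(ai * bi.conjugate() for ai, bi in zip(a, b))


x = [1 + 0j, 2 - 1j, 0.5j, -3 + 0j]      # input signal
g = [0.5 + 0j, 1j, -1 + 2j, 2 + 0j]      # incoming gradient in the frequency domain

lhs = inner(dft(x), g)                   # <F x, g>
rhs = inner(x, dft(g, sign=+1))          # <x, F^H g>
print(abs(lhs - rhs) < 1e-9)             # True
```

Because the two inner products agree, back-propagating `g` through the forward transform amounts to applying the opposite-sign transform to `g`.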

MathDx GEMM toggle

Setting wp.config.enable_mathdx_gemm = False (or passing "enable_mathdx_gemm": False as a module option) disables cuBLASDx for wp.tile_matmul(), falling back to an optimized scalar GEMM implementation (#1228). This avoids the slow link-time optimization (LTO) step required by libmathdx during development iteration, while keeping libmathdx available for operations that have no scalar fallback, such as Cholesky factorization and FFT. The scalar fallback may be slower than cuBLASDx depending on tile sizes, data types, and block_dim, so this option is primarily intended for faster compile–edit–run cycles during development rather than production use.

Accelerated tile load/store

Shared-memory tile loads and stores via wp.tile_load() / wp.tile_store() have been accelerated for non-power-of-two tile sizes (#1239). The improvement is most pronounced when source arrays fit within the GPU L2 cache, reducing...

Contributors

clatim, nawedume, and 3 other contributors

03 Feb 21:20

github-actions

v1.11.1

173e179

Warp v1.11.1

Warp v1.11.1 is a bugfix release following v1.11.0. For a complete list of changes, see the changelog.

Highlights

This is primarily a bugfix release with no major new features. Key fixes include:

Tile Operations: Fixed wp.tile_matmul() sometimes producing NaN results when using the c = wp.tile_matmul(a, b) form due to reading uninitialized output memory (#1180). Also fixed tile multiplication with scalar constants when one operand is a vector or matrix type (#1175), and enabled scalar, vector, and matrix arguments in wp.tile_map() (#1136).
Code Generation: Fixed wp.static() incorrectly resolving loop variables to same-named global Python variables when used for static loop unrolling in kernels (e.g., wp.static(i) inside for i in range(n) would use a global i if one existed, instead of the loop iteration value) (#1139). Also fixed a segfault in conditional expressions (ternary if/else) when one branch accesses an array element and the other branch is taken.
CUDA Graphs: Fixed CUDA graphs with multiple temporary allocations using more memory than necessary. Previously, memory freed during graph capture wasn't properly sequenced for reuse by later allocations, causing memory to accumulate (e.g., three sequential 1GB allocations would consume 3GB instead of reusing the same 1GB).
Developer Experience: Fixed @wp.func decorated functions showing generic _Wrapped types in Pyright/Pylance instead of their actual signatures on Python 3.10+. Also fixed multiple issues with IDE autocomplete stubs that caused type checker errors in mypy and Pyright, including incorrect @overload usage, shadowed bool type references, and missing Literal[] syntax for integer type parameters.
Documentation: Added missing docstrings across the API, including built-in functions, type constructors, constants, and warp.fem symbols (#1159). Added example_particle_repulsion.py demonstrating how to use wp.grad() (#1137).

Announcements

Upcoming removals

The following feature is deprecated and will be removed in v1.12:

Constructing matrices from vectors via wp.matrix(): The ability to construct matrices by passing vectors to wp.matrix() has been deprecated at both Python and kernel scopes. The replacement depends on the scope because the old behavior was inconsistent: use wp.matrix_from_cols() at kernel scope, where vectors were interpreted as columns, and wp.matrix_from_rows() at Python scope, where vectors were interpreted as rows:

# Kernel scope (vectors were interpreted as columns)
# Deprecated (will be removed in v1.12):
m = wp.mat33(col0, col1, col2)
# Use instead:
m = wp.matrix_from_cols(col0, col1, col2)

# Python scope (vectors were interpreted as rows)
# Deprecated (will be removed in v1.12):
m = wp.mat22(wp.vec2(1, 2), wp.vec2(3, 4))
# Use instead:
m = wp.matrix_from_rows(wp.vec2(1, 2), wp.vec2(3, 4))

Acknowledgments

We thank the following contributors:

@Adityakk9031 for fixing the JAX FFI deadlock when using cached graphs across multiple GPUs (#1181).
@Cucchi01 for fixing bsr_get_diag() not zeroing the output buffer when provided (#1170).
@liblaf for fixing the inverted verbose flag in wp.capture_debug_dot_print() (#1202).

Contributors

liblaf, Cucchi01, and Adityakk9031

02 Jan 13:26

github-actions

v1.11.0

8a3c350

Warp v1.11.0

Warp v1.11 introduces group-aware spatial queries for multi-world workloads, provides new options for managing JIT compilation overhead, and expands differentiation capabilities with wp.grad(). This release also includes expanded tile operations, the unpack operator in kernels, C++ integration examples, and a major API cleanup clarifying public versus internal interfaces.

New features

Group-aware spatial queries

Warp v1.11 introduces group-aware construction and queries for wp.Bvh and wp.Mesh data structures, enabling efficient spatial queries across multiple independent environments. This feature allows you to build a single acceleration structure containing geometry from multiple worlds or scenes, then query each world independently without traversing primitives from other worlds.

When constructing a BVH or Mesh, assign each primitive to a group using the groups parameter. Warp builds isolated sub-trees for each group within a unified structure:

# Build a BVH in Python containing multiple worlds
lowers = wp.array(...)  # Shape bounds for all worlds
uppers = wp.array(...)
world_ids = wp.array([0, 0, 1, 1, 2, 2, ...], dtype=int)

bvh = wp.Bvh(lowers, uppers, groups=world_ids)

@wp.kernel
def raycast_world(
    bvh_id: wp.uint64,
    world_id: int,
    ray_origin: wp.vec3,
    ray_dir: wp.vec3
):
    # Get the root node for this world's sub-tree
    root = wp.bvh_get_group_root(bvh_id, world_id)
    
    # Query only intersects geometry from this world
    query = wp.bvh_query_ray(bvh_id, ray_origin, ray_dir, root)
    
    # Process hits
    shape_idx = int(0)
    while wp.bvh_query_next(query, shape_idx):
        # Handle intersection with shape_idx
        pass

# Launch kernel to query world 2
wp.launch(raycast_world, dim=1, inputs=[bvh.id, 2, origin, direction])

This example shows a single-world query for clarity. For production use, launch multiple threads in parallel, each querying its assigned world from arrays of world IDs and ray parameters. See Newton's raytrace implementation for a real-world example of parallel multi-world raycasting.

Key features

Group construction: Pass a groups array during construction to organize primitives into isolated sub-trees
Group-restricted queries: All query functions accept an optional root parameter to limit traversal to a specific group
Helper functions: wp.bvh_get_group_root() and wp.mesh_get_group_root() retrieve sub-tree roots for each group

Thanks to @StafaH for implementing this feature.

Geometry query enhancements

Warp v1.11 adds several new query functions and improvements for spatial queries:

wp.mesh_query_ray_anyhit(): Fast any-hit query that returns immediately upon finding any intersection, useful for shadow ray calculations in rendering
wp.mesh_query_ray_count_intersections(): Counts all ray-triangle intersections along a ray path
wp.mesh_query_point_sign_parity(): Point-in-mesh query using perturbed ray casting with majority voting for improved robustness in challenging cases
max_dist parameter: wp.bvh_query_next() now accepts a maximum distance to filter intersections, useful for early ray termination
Tiled query functions: Cooperative thread-block queries for use in tiled kernels (wp.bvh_query_aabb_tiled(), wp.bvh_query_ray_tiled(), wp.mesh_query_aabb_tiled(), etc.)
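
The intersection-count and parity queries rest on a simple idea: a point is inside a closed surface when a ray cast from it crosses the surface an odd number of times. The sketch below shows a 2D analogue with a polygon instead of a triangle mesh. The function names are hypothetical, and the perturbed rays and majority voting mentioned above are deliberately left out.

```python
def count_crossings(point, polygon):
    """Count how many polygon edges a ray from `point` toward +x crosses."""
    px, py = point
    crossings = 0
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        # The edge straddles the horizontal line through the point
        if (y0 > py) != (y1 > py):
            x_hit = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_hit > px:
                crossings += 1
    return crossings


def is_inside(point, polygon):
    """Odd crossing count means the point is inside."""
    return count_crossings(point, polygon) % 2 == 1


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(is_inside((1.0, 1.0), square))   # True
print(is_inside((3.0, 1.0), square))   # False
```

A single ray can give the wrong parity when it grazes an edge or vertex, which is why casting several perturbed rays and taking a majority vote improves robustness.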

Evaluate the gradients of functions

wp.grad() directly evaluates the gradient of a Warp function at specific input values, computing gradients inline during the forward pass. This is useful for computing forces from energy functions or when implementing custom adjoints that need to call auto-generated gradients of subfunctions, avoiding the need to manually code the entire adjoint chain. This contrasts with wp.Tape(), which records an entire computation graph for reverse-mode automatic differentiation across multiple kernel launches. This feature was implemented in response to community feedback (#125).

import warp as wp
import numpy as np

k = 1.0

@wp.func
def compute_energy(x: float):
    return 0.5 * k * x * x

@wp.kernel
def compute_force(x: wp.array(dtype=float), U: wp.array(dtype=float), F: wp.array(dtype=float)):
    i = wp.tid()
    U[i] = compute_energy(x[i])
    F[i] = -wp.grad(compute_energy)(x[i])

N = 5
x = wp.array(np.arange(N, dtype=np.float32), dtype=float)
U = wp.zeros_like(x)
F = wp.zeros_like(x)

wp.launch(compute_force, N, inputs=[x], outputs=[U, F])

print(U.numpy())  # Energy: [0.  0.5 2.  4.5 8. ]
print(F.numpy())  # Force:  [ 0. -1. -2. -3. -4.]

`wp.tile_map()` supports n-ary maps (up to n=8)

User-defined functions that accept up to 8 arguments may now be used as tile mapping functions. An equivalent number of tiles must be passed to wp.tile_map(). For example:

@wp.func
def weighted_sum(a: float, b: float, c: float):
    return 0.5 * a + 0.3 * b + 0.2 * c

@wp.kernel
def compute():

    a = wp.tile_arange(0.0, 1.0, 0.1, dtype=float)
    b = wp.tile_ones(shape=10, dtype=float)
    c = wp.tile_arange(1.0, 2.0, 0.1, dtype=float)

    s = wp.tile_map(weighted_sum, a, b, c)

    print(s)

wp.launch_tiled(compute, dim=[1], inputs=[], block_dim=16)

Generate tiles of random numbers

wp.tile_randf() and wp.tile_randi() have been introduced to generate tiles of random floats and ints, respectively. These functions accept optional lower and upper bound arguments to control the range of generated values. This snippet generates 4x4 tensors of random floats using 2x2 tiles:

TILE_M, TILE_N = 2, 2
M, N = 2, 2
seed = 42

@wp.kernel
def rand_kernel(seed: int, x: wp.array2d(dtype=float)):
    i, j = wp.tid()
    rng = wp.rand_init(seed, i * TILE_M + j)
    t = wp.tile_randf(shape=(TILE_M, TILE_N), rng=rng)
    wp.tile_store(x, t, offset=(i * TILE_M, j * TILE_N))

x = wp.zeros(shape=(M * TILE_M, N * TILE_N), dtype=float)
wp.launch_tiled(rand_kernel, dim=[M, N], inputs=[seed, x], block_dim=32)
print(x.numpy())

Alpha and Beta scalings in `wp.tile_matmul()`

Optional alpha and beta scaling arguments have been added to wp.tile_matmul() builtins.

Previous Behavior	Updated Behavior
`out = A * B + out`	`out = alpha * A * B + beta * out`
`out = A * B`	`out = alpha * A * B`
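
The accumulate form in the first row of the table can be written out as a plain-Python reference. This is a sketch of the arithmetic only, with a hypothetical function name. It assumes that alpha = 1 and beta = 1 reproduce the previous behavior, as the table implies.

```python
def matmul_scaled(a, b, out, alpha=1.0, beta=1.0):
    """Return alpha * (a @ b) + beta * out for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * out[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]          # identity, so A @ B == A
C = [[10.0, 10.0], [10.0, 10.0]]

print(matmul_scaled(A, B, C))                        # [[11.0, 12.0], [13.0, 14.0]]
print(matmul_scaled(A, B, C, alpha=2.0, beta=0.5))   # [[7.0, 9.0], [11.0, 13.0]]
```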

In-place variants of Cholesky decomposition and linear solvers

wp.tile_cholesky_inplace(), wp.tile_cholesky_solve_inplace(), wp.tile_lower_solve_inplace(), and wp.tile_upper_solve_inplace() give the same results as their non-inplace counterparts, but overwrite input memory rather than allocate additional output memory, thereby halving shared memory usage. This is particularly beneficial in memory-constrained kernels where shared memory is limited. A standard example using Cholesky decomposition and the Cholesky solver looks like:

TILE_M = 4  # example tile size: gA is (TILE_M, TILE_M), gy is (TILE_M,)

@wp.kernel()
def tile_math_cholesky_inplace(
    gA: wp.array2d(dtype=wp.float64),
    gy: wp.array1d(dtype=wp.float64),
):
    i, j = wp.tid()
    # Load A and y into shared memory
    a = wp.tile_load(gA, shape=(TILE_M, TILE_M), storage="shared")
    y = wp.tile_load(gy, shape=TILE_M, storage="shared")
    # Factor A = L L^T in place; a now holds L
    wp.tile_cholesky_inplace(a)
    # Solve L L^T x = y in place; y now holds the solution x
    wp.tile_cholesky_solve_inplace(a, y)
    # Store L and x
    wp.tile_store(gA, a)
    wp.tile_store(gy, y)
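
For readers who want to see what "in place" means numerically, the following plain-Python sketch factors a small symmetric positive-definite matrix and solves a system while overwriting its inputs. The function names are hypothetical and the code is a textbook reference, not Warp's tile implementation. Zeroing the upper triangle is a choice made here for readability, not a statement about what Warp stores there.

```python
import math


def cholesky_inplace(a):
    """Overwrite the lower triangle of `a` with L such that L L^T equals the original matrix."""
    n = len(a)
    for j in range(n):
        for k in range(j):
            a[j][j] -= a[j][k] * a[j][k]
        a[j][j] = math.sqrt(a[j][j])
        for i in range(j + 1, n):
            for k in range(j):
                a[i][j] -= a[i][k] * a[j][k]
            a[i][j] /= a[j][j]
            a[j][i] = 0.0  # clear the upper triangle (illustrative choice)


def cholesky_solve_inplace(l, y):
    """Solve L L^T x = y, overwriting `y` with the solution x."""
    n = len(y)
    for i in range(n):                     # forward substitution: L z = y
        for k in range(i):
            y[i] -= l[i][k] * y[k]
        y[i] /= l[i][i]
    for i in reversed(range(n)):           # back substitution: L^T x = z
        for k in range(i + 1, n):
            y[i] -= l[k][i] * y[k]
        y[i] /= l[i][i]


A = [[4.0, 2.0], [2.0, 3.0]]
y = [2.0, 5.0]
cholesky_inplace(A)            # A now holds L = [[2, 0], [1, sqrt(2)]]
cholesky_solve_inplace(A, y)   # y now holds x = [-0.5, 2.0]
```

No second matrix or vector is allocated at any point, which is the property the in-place tile variants exploit to reduce shared-memory usage.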

Performance improvements

JIT-compile time improvements

Warp v1.11 brings three changes that aim to reduce the time to compile and load modules:

Precompiled headers

The CUDA C++ files that are generated from the Python modules all include the same set of header files. Warp now leverages NVRTC precompiled headers to cache the result of parsing these headers and reuse it for subsequent modules.

The first module that gets compiled incurs a 50 ms overhead to create the precompiled header, but every subsequent module in the same Python session gains 50-500 ms in compile time, with larger modules seeing the greatest benefit. The precompiled header is stored in a temporary directory and cached for the lifetime of the Python process. Each new Python process must recreate the precompiled header, as PCH files cannot be shared across processes due to internal memory layout requirements.

This feature is enabled by default, but can be disabled using wp.config.use_precompiled_headers=False.

Note for source builds: Precompiled headers require building Warp against CUDA Toolkit 12.8 or newer. Users installing from PyPI automatically have this feature because the Warp libraries on PyPI are now built against CUDA Toolkit 12.9.1.

For more details, see the NVRTC PCH documentation.

Optimization level control

By default, the CUDA Runtime Compiler performs a high level of optimizations on GPU kernels, favoring runtime performance at the cost of longer compilation times. Warp v1.11 introduces the wp.config.optimization_level setting to control this tradeoff. When set to None (the default), Warp uses level 3, which corresponds to maximum runtime optimization.

This setting controls GPU kernel compilation and accepts values from 0 to 3:

Level 3 (default): Maximum runtime performance, longest compile times
Level 2: Balanced tradeoff, can reduce initial compile times by up to 30%
Levels 0-1: Faster compilation, but may offer diminishing returns compared to level 2

The setting can be configu...

Contributors

RSchwan and StafaH

01 Dec 13:13

github-actions

v1.10.1

7e719ed

Warp v1.10.1

Warp v1.10.1 is a bugfix release following v1.10.0. For a complete list of changes, see the changelog.

Highlights

This is primarily a bugfix release with no major new features. Key fixes include:

Module reuse with module="unique": Fixed kernels using @wp.kernel(module="unique") to properly reuse existing module objects when the kernel is defined multiple times, avoiding unnecessary module creation overhead.
Kernel-local arrays: Fixed several issues with arrays created using wp.zeros() inside kernels, including .ptr access, indexing for subarrays, and accepting single integers for the shape parameter.
Custom gradients: Fixed a code-generation ordering bug that could prevent custom gradient functions (@wp.func_grad) from compiling when used with nested function calls.
FEM improvements: Fixed invalid reads when using wp.fem.TemporaryStore during tape capture and resolved reference cycles in wp.fem.Temporary and wp.fem.ShapeBasisSpace.

Announcements

Upcoming removals

The following feature is deprecated and will be removed in v1.11 (planned for January 2026):

graph_compatible parameter in jax_callable(): The boolean graph_compatible flag has been deprecated in favor of the new graph_mode parameter which accepts GraphMode enum values. Use GraphMode.JAX, GraphMode.WARP, or GraphMode.NONE instead.

# Deprecated (v1.10.1, will be removed in v1.11)
jax_func = wp.jax_experimental.jax_callable(func, graph_compatible=True)

# Use instead
from warp.jax_experimental import GraphMode
jax_func = wp.jax_experimental.jax_callable(func, graph_mode=GraphMode.JAX)

Platform support

Python 3.8: We plan to drop support for Python 3.8 (end-of-life since October 2024) starting with v1.11.
CUDA Toolkit: Starting with v1.11, the default pre-built wheels published on PyPI will be built with CUDA Toolkit 12.9 instead of 12.8. This does not change driver requirements but enables new compiler options to control the tradeoff between kernel compilation speed and runtime performance. We plan a second transition to CUDA Toolkit 13.x in mid-2026.

Acknowledgments

We thank the following contributors:

@mehdiataei for fixing loop unrolling with wp.static() expressions that prevented certain code patterns from compiling correctly.

Contributors

mehdiataei

02 Nov 14:11

github-actions

v1.10.0

c19d0de

Warp v1.10.0

Warp v1.10 expands JAX integration with automatic differentiation support and multi-device jax.pmap() compatibility. The tile programming model has been enhanced with axis-specific reductions, component-level indexing, and convenience functions for creating tiles.

Performance has been significantly improved in several areas: BVH operations now support in-place rebuilding for CUDA graphs and configurable leaf sizes, built-in function calls from Python are up to 70× faster, and additional sparse matrix and FEM operations can now be captured in CUDA graphs.

Additional usability improvements include negative indexing and slicing for arrays, atomic bitwise operations, and new built-in functions including error functions and type casting.

Important: This release removes the warp.sim module (deprecated since v1.8), which has been superseded by the Newton physics engine. See the Announcements section below for migration guidance and other upcoming changes.

For a complete list of changes, see the full changelog.

New features

JAX automatic differentiation (experimental)

Warp now supports experimental automatic differentiation with JAX, allowing kernels to participate in JAX automatic differentiation workflows. This feature is contributed by @mehdiataei and builds on earlier work by @jaro-sevcik. It enables computing gradients through Warp kernels using jax.grad() by passing enable_backward=True to jax_kernel().

Key capabilities include:

Single and multiple output kernels: Compute gradients for kernels with one or more output arrays
Static input auto-detection: Scalar inputs are automatically treated as static (non-differentiable) arguments
Vector and matrix arrays: Arrays of composite types like wp.vec2 or wp.mat22 are fully supported
Multi-device execution: Compatible with jax.pmap() for distributed forward and backward passes across multiple GPUs

import jax
import jax.numpy as jnp
import warp as wp
from warp.jax_experimental import jax_kernel

@wp.kernel
def my_kernel(a: wp.array(dtype=float), out: wp.array(dtype=float)):
    i = wp.tid()
    out[i] = a[i] ** 2.0

# Enable automatic differentiation
jax_func = jax_kernel(my_kernel, num_outputs=1, enable_backward=True)

# Compute gradients through the kernel
input_array = jnp.arange(4, dtype=jnp.float32)  # example input
grad_fn = jax.grad(lambda a: jnp.sum(jax_func(a)[0]))
gradient = grad_fn(input_array)  # gradient: [2*a[0], 2*a[1], ...]

This feature is experimental and has some current limitations. See the JAX Automatic Differentiation documentation for complete examples, usage details, and limitations.

Multi-device JAX support with `jax.pmap()`

Warp now properly supports jax.pmap() and jax.shard_map() for multi-device parallel execution, thanks to fixes contributed by @chaserileyroberts. Previously, device targeting issues prevented Warp callables from working correctly within these JAX primitives—JAX would invoke callbacks from multiple threads targeting different devices, but Warp would always execute on the default device. The fix ensures proper device coordination by extracting device ordinals from XLA FFI and adding thread synchronization for concurrent callbacks, enabling efficient data-parallel workflows across multiple GPUs.

In-place BVH rebuilding with CUDA graph support

A new wp.Bvh.rebuild() method enables rebuilding BVH hierarchies in-place without allocating new memory. This complements the existing refit() method and is particularly useful when primitive distributions change significantly.

CUDA graph capture: Unlike creating a new BVH, rebuild() reuses existing buffers, making it safe to capture in CUDA graphs. Previously captured graphs that include queries on the BVH remain valid after rebuilding, enabling high-performance repeated updates without graph re-capture overhead.

Construction algorithms: On CUDA devices, in-place rebuild supports "lbvh" only. On CPU, "sah" and "median" are supported. Defaults are chosen automatically based on the device.

Tile programming enhancements

The tile programming model has been enhanced with new capabilities to make tile-based computations more expressive and convenient:

Axis-specific reductions

The tile-reduction functions wp.tile_reduce() and wp.tile_sum() now support an optional axis parameter, enabling reductions along a specific dimension of a tile rather than reducing the entire tile to a single value. This enhancement brings NumPy-like axis semantics to tile operations.

import numpy as np
import warp as wp

@wp.kernel
def tile_reduce_axis(x: wp.array2d(dtype=float), y: wp.array(dtype=float)):
    a = wp.tile_load(x, shape=(4, 8), storage="shared")
    # Sum along axis 0, reducing shape from (4, 8) to (8,)
    b = wp.tile_sum(a, axis=0)
    wp.tile_store(y, b)


x = wp.array(np.arange(32).reshape(4, 8), dtype=float)
# x = [[ 0.  1.  2.  3.  4.  5.  6.  7.]
#      [ 8.  9. 10. 11. 12. 13. 14. 15.]
#      [16. 17. 18. 19. 20. 21. 22. 23.]
#      [24. 25. 26. 27. 28. 29. 30. 31.]]
y = wp.zeros(8, dtype=float)

wp.launch_tiled(tile_reduce_axis, dim=(1,), inputs=[x], outputs=[y], block_dim=32)
# y = [48. 52. 56. 60. 64. 68. 72. 76.]  (column sums)

Component-level indexing

Tiles of composite types (vectors, matrices, quaternions) now support component-level indexing and assignment. You can directly index into individual components using extended indexing syntax:

Vector components: tile[i][1] extracts the second component of a vector at position i
Matrix elements: tile[i][1, 1] accesses the element at row 1, column 1 of a matrix at position i

This provides more convenient and expressive syntax for working with structured data in tiles.

Creating tiles filled with a constant value

The new wp.tile_full() function provides a convenient way to create tiles initialized with a constant value, similar to NumPy's np.full():

# Create an 8x8 tile filled with 3.14
tile = wp.tile_full(shape=(8, 8), value=3.14, dtype=float)

New example

The new example_tile_mcgp.py example demonstrates tile-based Monte Carlo methods by implementing a walk-on-spheres algorithm for solving Laplace's equation on volumetric domains.

Performance improvements

Built-in function calls from Python

Calling Warp built-in functions from Python scope (e.g., wp.normalize(), wp.transform_identity(), matrix arithmetic like mat * mat) is now significantly faster thanks to optimizations in overload resolution. Previously, each function call would iterate through all overloads, attempt argument binding, and pack parameters into C types until finding a match. Now, Warp caches the resolved overload and parameter packing strategy based on argument types using @functools.lru_cache, eliminating redundant resolution overhead on subsequent calls.
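
The caching strategy described above can be sketched with a toy dispatcher. Everything here is hypothetical (the overload table, `resolve_overload`, `call_builtin`) and far simpler than Warp's real resolution logic. It only shows how keying an `lru_cache` on the argument types lets repeated calls skip the overload search.

```python
import functools

# Toy overload table: builtin name -> list of (argument types, implementation)
OVERLOADS = {
    "add": [
        ((int, int), lambda a, b: a + b),
        ((float, float), lambda a, b: a + b),
        ((str, str), lambda a, b: a + b),
    ],
}


@functools.lru_cache(maxsize=None)
def resolve_overload(name, arg_types):
    """Linear search over overloads; runs once per (name, argument types) pair."""
    for signature, impl in OVERLOADS[name]:
        if signature == arg_types:
            return impl
    raise TypeError(f"no overload of {name} for {arg_types}")


def call_builtin(name, *args):
    """Look up the cached overload by argument types, then call it."""
    impl = resolve_overload(name, tuple(type(a) for a in args))
    return impl(*args)


first = call_builtin("add", 1, 2)     # cache miss: searches the overload list
second = call_builtin("add", 3, 4)    # cache hit: same argument types
info = resolve_overload.cache_info()
print(first, second, info.hits, info.misses)  # 3 7 1 1
```

The cache key must be hashable, which is one reason a scheme like this fits poorly with arbitrary lists or tuples passed as arguments.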

In microbenchmarks, repeated wp.mat44 multiplication at Python scope is up to 70× faster (~570 μs → ~8 μs), while operations like wp.transform_identity() see 3-4× speedups (~100 μs → ~30 μs). The magnitude of improvement varies by operation complexity, with greater gains for operations requiring more expensive overload resolution.

Breaking change: As part of this optimization, support for passing lists, tuples, and other non-Warp array arguments to built-in functions has been removed. Calls like wp.normalize([1.0, 2.0, 3.0]) must now be written as wp.normalize(wp.vec3(1.0, 2.0, 3.0)). This simplifies the function call path and removes expensive sequence-flattening logic that was incompatible with efficient caching.

Configurable BVH leaf size

wp.Bvh and wp.Mesh now expose tunable leaf_size and bvh_leaf_size parameters, respectively, allowing users to control the number of primitives stored in each leaf node for performance optimization. The optimal leaf size depends on the query workload:

Intersection queries (ray casting, AABB overlap): Smaller leaf sizes (e.g., 1) are generally optimal, reducing unnecessary primitive checks
Closest point queries: Larger leaf sizes (e.g., 4-8) can improve performance by checking more primitives together and reducing traversal overhead
Mixed workloads: Moderate values (e.g., 4) provide a balanced trade-off

Behavior change: The default leaf_size for wp.Bvh has changed from 4 (hardcoded) to 1, optimizing for intersection queries which are more common. wp.Mesh retains a default bvh_leaf_size of 4 as a compromise between intersection and closest-point query performance. Users performing primarily closest-point queries may benefit from explicitly setting larger leaf sizes.
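
To make the trade-off concrete, here is a minimal pure-Python sketch of how leaf size changes the shape of a tree. It uses a simple median split over a list of primitives, which is an assumption made for illustration and not Warp's BVH builder; the `build()` helper is hypothetical. Smaller leaves give a deeper tree with more nodes to traverse but fewer primitives to test per leaf, while larger leaves give a shallower tree with more primitive tests per visited leaf.

```python
def build(prims, leaf_size):
    """Return (node_count, depth) of a median-split tree over `prims`."""
    if len(prims) <= leaf_size:
        return 1, 1  # a single leaf node
    mid = len(prims) // 2
    left_nodes, left_depth = build(prims[:mid], leaf_size)
    right_nodes, right_depth = build(prims[mid:], leaf_size)
    return 1 + left_nodes + right_nodes, 1 + max(left_depth, right_depth)


prims = list(range(1024))  # stand-in for 1024 primitives
for leaf_size in (1, 4, 8):
    nodes, depth = build(prims, leaf_size)
    print(f"leaf_size={leaf_size}: nodes={nodes}, depth={depth}")
# leaf_size=1: nodes=2047, depth=11
# leaf_size=4: nodes=511, depth=9
# leaf_size=8: nodes=255, depth=8
```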

Sparse matrix operations with CUDA graphs

Sparse matrix operations in warp.sparse can now be captured in CUDA graphs for allocation-free execution. Operations like bsr_axpy(), bsr_assign(), and bsr_set_transpose() preserve matrix topology when using masked=True, while bsr_mm() adds a new max_new_nnz parameter that allows specifying an upper bound on new non-zero blocks for flexible graph capture when sparsity patterns vary within known bounds.

FEM operations with CUDA graphs

Building warp.fem geometry and function space partitions can now be captured in CUDA graphs by specifying upper bounds on partition sizes: max_cell_count and max_side_count for ExplicitGeometryPartition, and max_node_count for make_space_partition(). Additionally, building fields and restrictions is now synchronization-free by default.

Language enhancements

Array indexing and slicing improvements

Warp arrays now support negative in...

Contributors

chaserileyroberts, jaro-sevcik, and 5 other contributors

01 Oct 07:37

github-actions

v1.9.1

c60ce15

v1.9.1

Warp v1.9.1

Warp 1.9.1 is a bugfix release that follows our recent feature update. For a full list of changes, see the changelog.

Highlights

GPU Compatibility: Support for older NVIDIA GPU architectures (Maxwell, Pascal, Volta) was unintentionally dropped in the pre-built wheels distributed for Warp 1.9.0 on PyPI. These architectures have been added back.
Documentation Improvements: We have corrected the documentation for wp.mesh_query_aabb() and wp.mesh_query_aabb_next(), added a caveat concerning the use of __cuda_array_interface__ on a system with multiple GPUs, and fixed the labeling of built-in functions that were incorrectly labeled as differentiable.
Corrected Slice Behavior: Empty slices (e.g. arr[i:i]) are now handled correctly at the Python scope, returning an empty array instead of raising an error.
Tile Stability and Correctness: A critical memory management issue with shared tiles has been fixed to prevent unpredictable crashes and memory leaks. Additionally, functions like wp.copy() and wp.where() now work with tiles and compute correct gradients (adjoints).
Tuple Type Hints: Resolved a TypeError that occurred when using modern tuple type hints (e.g., tuple[int, int]) with @wp.func-decorated functions on Python 3.9 and 3.10.

Announcements

Known limitations

CPU Kernels on ARM: Launching CPU kernels on Linux ARM systems, such as NVIDIA Jetson Thor and Grace Hopper, may result in segmentation faults. A fix for this issue is planned for the v1.10 release. GPU kernels are not affected.

Upcoming removals

The following features have been deprecated in prior releases and will be removed in v1.10 (early November):

warp.sim - Use the Newton engine.
Constructing a wp.matrix() from column vectors - Use wp.matrix_from_rows() or wp.matrix_from_cols() instead.
wp.select() - Use wp.where() instead (note: different argument order).
wp.matrix(pos, quat, scale) - Use wp.transform_compose() instead.
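
The argument-order difference between wp.select() and wp.where() mentioned in the list above is easy to get wrong when migrating. The sketch below uses plain-Python stand-ins, not the Warp built-ins, that mirror the two argument orders: `select(cond, value_if_false, value_if_true)` versus `where(cond, value_if_true, value_if_false)`.

```python
def select(cond, value_if_false, value_if_true):
    """Plain-Python stand-in mirroring the deprecated wp.select() order."""
    return value_if_true if cond else value_if_false


def where(cond, value_if_true, value_if_false):
    """Plain-Python stand-in mirroring the wp.where() order."""
    return value_if_true if cond else value_if_false


x = -2.5

# Clamp negative values to zero, written both ways.
old_style = select(x < 0.0, x, 0.0)  # false branch first
new_style = where(x < 0.0, 0.0, x)   # true branch first

print(old_style, new_style)  # 0.0 0.0
```

When porting, swapping the second and third arguments is the only change needed; leaving them in place silently inverts the branch.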

Platform support

We plan to drop support for Intel macOS (x86-64) in a future release (tentatively planned for v1.10).

Acknowledgments

We thank the following contributors for their valuable contributions to this release:

@RSchwan for a major contribution that fixed memory management issues with tiles and enabled functions like wp.copy() and wp.where() to work correctly with tile arguments (#777).
@liblaf for reporting issues related to GPU architecture compatibility (#960, #966) and code generation for wp.map() (#953).

Contributors

RSchwan and liblaf

05 Sep 03:54

github-actions

v1.9.0

d4440b4

v1.9.0

Warp 1.9 ships with a rewritten marching cubes implementation, compatibility with the CUDA 13 toolkit, and new functions for ahead-of-time module compilation. The programming model has also been enhanced with more flexible indexing for composite types, direct IntEnum support, and the ability to initialize local arrays in kernels.

New Features

Differentiable marching cubes

A fully differentiable wp.MarchingCubes implementation, contributed by @mikacuy and @nmwsharp, has been added. This version is written entirely in Warp, replacing the previous native CUDA C++ implementation and enabling it to run on both CPU and GPU devices. The implementation also addresses a long-standing off-by-one bug (#324). For more details, see the updated documentation.

Functions for module compilation and loading

We have added wp.compile_aot_module() and wp.load_aot_module() for more flexible ahead-of-time (AOT) compilation.

These functions include a strip_hash=True argument, which removes the unique hashes from compiled module and function names. This change makes it possible to distribute pre-compiled modules without shipping the original Python source code.

See the documentation on ahead-of-time compilation workflows for more details. In future releases, we plan to continue to expand Warp's support for ahead-of-time workflows.

CUDA 13 Support

CUDA Toolkit 13.0 was released in early August.

PyPI Distribution: Warp wheels on PyPI and NVIDIA PyPI will continue to be built with CUDA 12.8 to provide a transition period for users upgrading their CUDA drivers.

CUDA 13.0 Compatibility: Users requiring Warp compiled against CUDA 13.x have two options:

Build Warp from source
Install pre-built wheels from GitHub releases

Driver Compatibility: CUDA 12.8 Warp wheels can run on systems with CUDA 13.x drivers thanks to CUDA's backward compatibility.

Performance Improvements

Graph-capturable linear solvers

The iterative linear solvers in warp.optim.linear (CG, BiCGSTAB, GMRES) are now fully compatible with CUDA graph capture. This adds support for device-side convergence checking via wp.capture_while(), enabling full CUDA graph capture when check_every=0. Users can now choose between traditional host-side convergence checks or fully graph-capturable device-side termination.
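
The cost of host-side convergence checks can be pictured with a small pure-Python sketch. It deliberately swaps in Jacobi iteration as a stand-in for CG, BiCGSTAB, and GMRES, and it is not code from warp.optim.linear. The `check_every` name is borrowed from the text, and this sketch only models values of 1 or greater (periodic host-side checks), not the device-side path used when check_every=0.

```python
def jacobi_solve(A, b, tol=1e-10, max_iters=100, check_every=5):
    """Solve A x = b, testing convergence only every `check_every` iterations."""
    n = len(b)
    x = [0.0] * n
    checks = 0
    for it in range(1, max_iters + 1):
        # One Jacobi sweep: every component is updated from the previous iterate.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if it % check_every == 0:
            # In a GPU solver this is where the host would synchronize
            # with the device to read back the residual norm.
            checks += 1
            residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
            if sum(r * r for r in residual) ** 0.5 < tol:
                break
    return x, it, checks


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

x1, iters1, checks1 = jacobi_solve(A, b, check_every=1)
x5, iters5, checks5 = jacobi_solve(A, b, check_every=5)
print(iters1, checks1, iters5, checks5)
```

Checking less often may run a few extra iterations past convergence, but it reduces the number of points at which the host has to wait for a result.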

Automatic tiling for sparse linear algebra

warp.sparse now supports arbitrary-sized blocks and can leverage tile-based computations for certain matrix types. The system automatically chooses between tiled and non-tiled execution using heuristics based on matrix characteristics (block sizes, sparsity patterns, and workload dimensions). Note that the heuristic for choosing between tiled and non-tiled variants is still being refined, and that it can be manually overridden by providing the tile_size parameter to bsr_mm or bsr_mv.

Automatic tiling for finite element quadrature

warp.fem.integrate now leverages tile-based computations for quadrature point accumulation, with automatic tile size selection based on workload characteristics. The system automatically chooses between tiled and non-tiled execution to optimize performance based on the integration problem size and complexity.

Programming Model Updates

Slice and negative indexing improvements for composite types

We have enhanced the support for slice operations and negative indexing across all composite types (vectors, matrices, quaternions, and transforms).

m = wp.matrix_from_rows(
    wp.vec3(1.0, 2.0, 3.0),
    wp.vec3(4.0, 5.0, 6.0),
    wp.vec3(7.0, 8.0, 9.0),
)
subm = m[:-1, 1:]
print(subm)
# [[2.0, 3.0],
#  [5.0, 6.0]]

Support for IntEnum and IntFlag inside kernels

It is now possible to directly reference IntEnum and IntFlag values inside Warp functions and kernels. Previously, workarounds involving wp.static() were required.

from enum import IntEnum

import warp as wp

class JointType(IntEnum):
    PRISMATIC = 0
    REVOLUTE = 1
    BALL = 2

@wp.kernel
def count_revolute_joints(
    joint_types: wp.array(dtype=JointType),
    counter: wp.array(dtype=int)
):
    tid = wp.tid()
    joint = joint_types[tid]

    # No longer requires wp.static(JointType.REVOLUTE.value)
    if joint == JointType.REVOLUTE:
        wp.atomic_add(counter, 0, 1)

Improved support for wp.array() views inside kernels

This enhancement allows kernels to create array views by accessing the ptr attribute of an array.

@wp.kernel
def kernel_array_from_ptr(arr_orig: wp.array2d(dtype=wp.float32)):
    arr = wp.array(ptr=arr_orig.ptr, shape=(2, 3), dtype=wp.float32)
    arr[0, 0] = 1.0
    arr[0, 1] = 2.0
    arr[0, 2] = 3.0

Additionally, these in-kernel views now support dynamic shapes and struct types.

Support for initializing fixed-size arrays inside kernels

It is now possible to allocate local arrays of a fixed size in kernels using wp.zeros(). The resulting arrays are allocated in registers, providing fast access and avoiding global memory overhead.

Previously, developers needed to create vectors to achieve a similar capability, e.g. v = wp.vector(length=8, dtype=float), but this came with various limitations.

@wp.kernel
def kernel_with_local_array():
    local_arr = wp.zeros(8, dtype=wp.float32)  # Allocated in registers
    # ... use local_arr

Indexed tile operations

Warp now provides three new indexed tile operations that enable more flexible memory access patterns beyond simple contiguous tile operations. These functions allow you to load, store, and perform atomic operations on tiles using custom index mappings along specified axes.

wp.tile_load_indexed() - Load tiles with custom index mapping along a specified axis
wp.tile_store_indexed() - Store tiles with custom index mapping along a specified axis
wp.tile_atomic_add_indexed() - Perform atomic additions with custom index mapping along a specified axis

x = wp.array(
    [
        [0.77395605, 0.43887844, 0.85859792, 0.69736803],
        [0.09417735, 0.97562235, 0.7611397, 0.78606431],
        [0.12811363, 0.45038594, 0.37079802, 0.92676499],
    ],
    dtype=float,
)

indices = wp.array([0, 2], dtype=int)


@wp.kernel
def indexed_data_lookup(data: wp.array2d(dtype=float), indices: wp.array(dtype=int)):
    # [0 2] = tile(shape=(2), storage=shared)
    indices_tile = wp.tile_load(indices, shape=(2,))

    # [[0.773956 0.438878 0.858598 0.697368]
    #  [0.128114 0.450386 0.370798 0.926765]] = tile(shape=(2,4), storage=register)
    data_rows_tile = wp.tile_load_indexed(data, indices_tile, axis=0, shape=(2, 4))
    print(data_rows_tile)

    # [[0.773956 0.858598]
    #  [0.0941774 0.76114]
    #  [0.128114 0.370798]] = tile(shape=(3,2), storage=register)
    data_columns_tile = wp.tile_load_indexed(data, indices_tile, axis=1, shape=(3, 2))


wp.launch_tiled(indexed_data_lookup, dim=1, inputs=[x, indices], block_dim=2)
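
As a reference for what the indexed loads in the example above gather, the following pure-Python sketch reproduces the same row and column selections on nested lists. The `load_indexed()` helper is hypothetical and only models the indexing semantics along one axis; it says nothing about tile storage or how the work is distributed across threads.

```python
def load_indexed(data, indices, axis):
    """Gather rows (axis=0) or columns (axis=1) of a 2D nested list."""
    if axis == 0:
        return [list(data[i]) for i in indices]
    if axis == 1:
        return [[row[j] for j in indices] for row in data]
    raise ValueError("axis must be 0 or 1")


x = [
    [0.77395605, 0.43887844, 0.85859792, 0.69736803],
    [0.09417735, 0.97562235, 0.7611397, 0.78606431],
    [0.12811363, 0.45038594, 0.37079802, 0.92676499],
]
indices = [0, 2]

rows = load_indexed(x, indices, axis=0)  # rows 0 and 2, shape (2, 4)
cols = load_indexed(x, indices, axis=1)  # columns 0 and 2, shape (3, 2)

print(rows)
print(cols)
```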

Fixed nested matrix component support

Warp now properly supports writing to individual matrix elements stored within struct fields. Previously, operations like struct.matrix[1, 2] = value would result in a compile-time error.

@wp.struct
class MatStruct:
    m: wp.mat44

@wp.kernel
def kernel_nested_mat(out: wp.array(dtype=MatStruct)):
    s = MatStruct()
    s.m[1, 2] = 3.0  # This now works correctly (no longer raises a WarpCodegenError)
    s.m[2][2] = 5.0  # This has also been fixed (used to silently fail)
    out[0] = s

Announcements

Known limitations

Early testing on NVIDIA Jetson Thor indicates that launching CPU kernels may sometimes result in segmentation faults. GPU kernel launches are unaffected. We believe this can be resolved by building Warp from source against LLVM/Clang version 18 or newer.

Upcoming removals

The following features have been deprecated in prior releases and will be removed in v1.10 (early November):

warp.sim - Use the Newton engine.
Constructing a wp.matrix() from column vectors - Use wp.matrix_from_rows() or wp.matrix_from_cols() instead.
wp.select() - Use wp.where() instead (note: different argument order).
wp.matrix(pos, quat, scale) - Use wp.transform_compose() instead.

Platform support

We plan to drop support for Intel macOS (x86-64) in a future release (tentatively planned for v1.10).

Acknowledgments

We thank the following contributors for their valuable contributions to this release:

@liblaf for fixing an issue with using warp.jax_experimental.ffi.jax_callable() with a function annotated with the -> None return type (#893).
@matthewdcong for providing an updated version of NanoVDB compatible with CUDA 13 (#888).
@YuyangLee for contributing an early prototype that helped shape the strip_hash=True option for the new ahead-of-time compilation functions (#661).

Full Changelog

For a curated list of all changes in this release, please see the v1.9.0 section in CHANGELOG.md.

Contributors

matthewdcong, mikacuy, and 3 other contributors

20 Aug 15:59

shi-eric

v1.9.0rc1

d641e89

v1.9.0rc1 Pre-release

Release candidate for Isaac Lab testing.

01 Aug 17:41

github-actions

v1.8.1

ad1092b

v1.8.1

As expected for a patch release, this release primarily contains bug fixes.

However, to support the adoption of Warp by the MuJoCo MJX physics engine, it also includes new features and deprecations limited to the jax_experimental module. We are flagging this deviation from our standard versioning practices to ensure clarity. Normal versioning practices will resume with the next release.

Full Changelog

Deprecated

This is the final release that will provide builds for or support the CUDA 11.x Toolkit and driver. Starting with v1.9.0, Warp will require CUDA 12.x or newer.
Deprecate the graph_compatible boolean flag in jax_callable() in favor of the new graph_mode argument, which takes a GraphMode enum value (#848).

Added

Add documentation for creating and manipulating Warp structured arrays using NumPy (#852).
Add documentation for wp.indexedarray() (#468).
Support input-output aliasing in JAX FFI (#815).
Support capturing jax_callable() using Warp via the new graph_mode parameter (GraphMode.WARP), enabling capture of graphs with conditional nodes that cannot be used as subgraphs in a JAX capture (#848).

Fixed

Fix tape.zero() to correctly reset gradient arrays in nested structs (#807).
Fix incorrect adjoints for div(scalar, vec), div(scalar, mat), and div(scalar, quat), and other miscellaneous issues with adjoints (#831).
Fix a module-hashing issue for functions or kernels using static expressions that cannot be resolved at the time of declaration (#830).
Fix a bug in which changes to wp.config.mode were not being picked up after module initialization (#856).
Fix a bug where CUDA modules could get prematurely unloaded when conditional graph nodes are used.
Fix compile time regression for kernels using matmul, Cholesky, and FFT solvers by upgrading to libmathdx 0.2.2 (#809).
Fix potential uninitialized memory issues in wp.tile_sort() (#836).
Fix wp.tile_min() and wp.tile_argmin() to return correct values for large tiles with low occupancy (#725).
Fix codegen errors associated with adjoint of wp.tile_sum() when using shared tiles (#822).
Fix driver entry point error for cuDeviceGetUuid caused by using an incorrect version (#851).
Fix an issue that caused Warp to request PTX generation from NVRTC for architectures unsupported by the compiler (#858).
Fix a regression where wp.sparse.bsr_from_triplets() ignored the prune_numerical_zeros=False setting (#832).
Fix missing cloth-body contact in wp.sim.VBDIntegrator with handle_self_contact=False (#862).
Fix a bug causing potential infinite loops in the color balancing calculation (#816).
Fix box-box collision by computing the contact normal at the closest point of approach instead of at the center of the source box (#839).
Fix the OpenGL renderer not correctly displaying colors for box shapes (#810).
Fix a bug in OpenGLRenderer where meshes with different scale attributes were incorrectly instanced, causing them all to be rendered with the same scale (#828).
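
For readers checking the adjoint fix for div(scalar, vec) listed above against their own code, the expected gradients can be worked out by hand. The sketch below assumes the division is component-wise, y_i = s / v_i, and uses hypothetical pure-Python helpers rather than Warp built-ins. Under that assumption the adjoints are adj_s = sum_i adj_y_i / v_i and adj_v_i = -s * adj_y_i / v_i^2, which the code compares with a central finite difference.

```python
def div_scalar_vec(s, v):
    """Component-wise scalar / vector division (assumed semantics)."""
    return [s / vi for vi in v]


def adjoint_div_scalar_vec(s, v, adj_y):
    """Propagate the output adjoint adj_y back to s and v."""
    adj_s = sum(ai / vi for ai, vi in zip(adj_y, v))
    adj_v = [-s * ai / (vi * vi) for ai, vi in zip(adj_y, v)]
    return adj_s, adj_v


def loss(s, v, w):
    """Weighted sum of the outputs, so d(loss)/dy_i = w_i."""
    return sum(wi * yi for wi, yi in zip(w, div_scalar_vec(s, v)))


s, v, w = 2.0, [1.0, 2.0, 4.0], [1.0, 1.0, 1.0]
adj_s, adj_v = adjoint_div_scalar_vec(s, v, w)

# Central finite differences with respect to s and to v[1].
eps = 1e-6
fd_s = (loss(s + eps, v, w) - loss(s - eps, v, w)) / (2 * eps)
v_plus = [v[0], v[1] + eps, v[2]]
v_minus = [v[0], v[1] - eps, v[2]]
fd_v1 = (loss(s, v_plus, w) - loss(s, v_minus, w)) / (2 * eps)

print(adj_s, adj_v)
```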

01 Jul 18:32

github-actions

v1.8.0

dc693d8

v1.8.0

Changelog

[1.8.0] - 2025-07-01

Added

Add wp.map() to map a function over arrays and add math operators for Warp arrays (docs, #694).
Add support for dynamic control flow in CUDA graphs, see wp.capture_if() and wp.capture_while() (docs, #597).
Add wp.capture_debug_dot_print() to write a DOT file describing the structure of a captured CUDA graph (#746).
Add the Device.sm_count property to get the number of streaming multiprocessors on a CUDA device (#584).
Add wp.block_dim() to query the number of threads in the current block inside a kernel (#695).
Add wp.atomic_cas() and wp.atomic_exch() built-ins for atomic compare-and-swap and exchange operations (#767).
Add support for profiling GPU runtime module compilation using the global wp.config.compile_time_trace setting or the module-level "compile_time_trace" option. When used, JSON files in the Trace Event format will be written in the kernel cache, which can be opened in a viewer like chrome://tracing/ (docs, #609).
Add support for returning multiple values from native functions like wp.svd3() and wp.quat_to_axis_angle() (#503).
Add support for passing tiles to user wp.func functions (#682).
Add wp.tile_squeeze() to remove axes of length one (#662).
Add wp.tile_reshape() to reshape a tile (#663).
Add wp.tile_astype() to return a new tile with the same data but a different data type (#683).
Add support for in-place tile add and subtract operations (#518).
Add support for in-place tile-component addition and subtraction (#659).
Add support for 2D solves using wp.tile_cholesky_solve() (#773).
Add wp.tile_scan_inclusive() and wp.tile_scan_exclusive() for performing inclusive and exclusive scans over tiles (#731).
Support attribute indexing for quaternions on the right-hand side of expressions (#625).
Add wp.transform_compose() and wp.transform_decompose() for converting between transforms and 4x4 matrices with 3D scale information (#576).
Add various wp.transform syntax operations for loading and storing (#710).
Add the as_spheres parameter to UsdRenderer.render_points() in order to choose whether to render the points as USD spheres using a point instancer or as simple USD points (#634).
Add support for animating visibility of objects in the USD renderer (#598).
Add wp.sim.VBDIntegrator.rebuild_bvh() to rebuild the BVH used for detecting self-contacts.
Add damping terms to wp.sim.VBDIntegrator collisions, with strength controlled by Model.soft_contact_kd.
Improve consistency of the wp.fem.lookup() operator across geometries and add filtering parameters (#618).
Add two examples demonstrating shape optimization using warp.fem: fem/example_elastic_shape_optimization.py and fem/example_darcy_ls_optimization.py (#698).
Add a py.typed marker file (per PEP 561) to the package to formally support static type checking by downstream users (#780).
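
Among the additions above, the difference between wp.tile_scan_inclusive() and wp.tile_scan_exclusive() is worth spelling out. The sketch below shows the two conventions on a plain Python list, assuming the scans are additive prefix sums; the helper names are hypothetical and this is not the tile implementation.

```python
from itertools import accumulate


def scan_inclusive(values):
    """out[i] = values[0] + ... + values[i]"""
    return list(accumulate(values))


def scan_exclusive(values):
    """out[i] = values[0] + ... + values[i - 1], with out[0] = 0"""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


data = [1, 2, 3, 4]
print(scan_inclusive(data))  # [1, 3, 6, 10]
print(scan_exclusive(data))  # [0, 1, 3, 6]
```

The exclusive form is the one typically used to turn per-item counts into write offsets, since each entry is the total of everything before it.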

Removed

Remove wp.mlp() (deprecated in v1.6.0). Use tile primitives instead.
Remove wp.autograd.plot_kernel_jacobians() (deprecated in v1.4.0). Use wp.autograd.jacobian_plot() instead.
Remove the length and owner keyword arguments from wp.array() constructor (deprecated in v1.6.0). Use the shape and deleter keywords instead.
Remove the kernel keyword argument from wp.autograd.jacobian() and wp.autograd.jacobian_fd() (deprecated in v1.6.0). Use the function keyword argument instead.
Remove the outputs keyword argument from wp.autograd.jacobian_plot() (deprecated in v1.6.0).

Changed

Deprecate the warp.sim module (planned for removal in v1.10). It will be superseded by the upcoming Newton library, a separate package with a new API. Migrating will require code changes; a future guide will be provided (current draft). See the GitHub announcement for details (#735).
Deprecate the wp.matrix(pos, quat, scale) built-in function. Use wp.transform_compose() instead (#576).
Improve support for tuples in kernels (#506).
Return a constant value from len() where possible.
Rename the internal function wp.types.type_length() to wp.types.type_size().
Rename wp.tile_cholesky_solve() input parameters to align with its docstring (#726).
Change wp.tile_upper_solve() and wp.tile_lower_solve() to use libmathdx 0.2.1 TRSM solver (#773).
Skip adjoint compilation for wp.tile_matmul() if enable_backward is disabled (#644).
Allow tile reductions to work with non-scalar tile types (#771).
Permit data-type preservation with preserve_type=True when tiling a value across the block with wp.tile() (#772).
Make wp.sparse.bsr_[set_]from_triplets differentiable with respect to the input triplet values (#760).
Expose new warp.fem operators: node_count, node_index, element_coordinates, element_closest_point.
Change wp.sim.VBDIntegrator rigid-body-contact handling to use only the shape's friction coefficient, rather than averaging the shape's and the cloth's coefficients.
Limit usage of the wp.assign_copy() hidden built-in to the kernel scope.
Describe the distinction between inputs and outputs arguments in the Kernel documentation.
Reduce the overhead of wp.launch() by avoiding costly native API calls (#774).
Improve error reporting when calling @wp.func-decorated functions from the Python scope (#521).

Fixed

Fix missing documentation for geometric structs (#674).
Fix the type annotations in various tile functions (#714).
Fix incorrect stride initialization in tiles returned from functions taking transposed tiles as input (#722).
Fix adjoint generation for user functions that return a tile (#749).
Fix tile-based solvers failing to accept and return transposed tiles (#768).
Fix the Formal parameter space overflowed error during wp.sim.VBDIntegrator kernel compilation for the backward pass in CUDA 11 Warp builds. This was resolved by decoupling collision and elasticity evaluations into separate kernels, increasing parallelism and speeding up the solver (#442).
Fix an issue with graph coloring on an empty graph (#509).
Fix an integer overflow bug in the native graph coloring module (#718).
Fix UsdRenderer.render_points() not supporting multiple colors (#634).
Fix an inconsistency in the wp.fem module regarding the orientation of 2D geometry side normals (#629).
Fix premature unloading of CUDA modules used in JAX FFI graph captures (#782).

Releases: NVIDIA/warp
