Conversation

@hpkfft hpkfft commented Sep 29, 2025

This PR adds support for DLPack version 1 and introduces the ndarray framework nb::arrayapi, which returns an object that provides the buffer interface as well as the two DLPack methods __dlpack__() and __dlpack_device__().

Given the following:

using array_t    = nb::ndarray<float, nb::ndim<1>, nb::c_contig>;
using array_np_t = nb::ndarray<float, nb::ndim<1>, nb::c_contig, nb::numpy>;

void init_array(const array_t& a) {
    const std::size_t n = a.shape(0);
    float* ptr = a.data();
    for (std::size_t i = 0; i < n; ++i) ptr[i] = 1.0f;
}

array_np_t create_array_np(std::size_t n) {
    float* ptr = new float[n];
    nb::capsule deleter(ptr, [](void* p) noexcept { delete[] (float*) p; });
    return array_np_t(ptr, {n}, std::move(deleter));
}

NB_MODULE(my_extension, m) {
    m.doc() = "nanobind my_extension module";
    m.def("init_array",      &init_array,      "Initialize array.");
    m.def("create_array_np", &create_array_np, "Create NumPy array.");
}

I measure performance as follows:

    test                 old      new      ratio
    init_array(array)    435 ns   278 ns   1.56
    init_array(numpy)    160 ns   111 ns   1.44
    create_array_np      565 ns   450 ns   1.25

using Python 3.14 and

python3 -m timeit -n 10000000 -r 10 -s "import array, my_extension as me; a = array.array('f', [1,2,3,4,5,6,7,8])" "me.init_array(a)"

python3 -m timeit -n 10000000 -r 10 -s "import numpy as np, my_extension as me; a = np.zeros(8, dtype=np.float32)" "me.init_array(a)"

python3 -m timeit -n 1000000 -r 10 -s "import numpy as np, my_extension as me;" "me.create_array_np(8)"

@wjakob wjakob left a comment

Hi @hpkfft,

This looks great; here is a first batch of comments from me. I feel like this change also needs some documentation.

If I have a project using nb::ndarray, what do I need to benefit from the new interfaces? Can I opt out? What are the implications on compatibility? These questions are relevant both for code that accepts DLPack-capable objects and for code that returns them.

Thanks!

if (framework == numpy::value) {
try {
static PyObject* const array_str = PyUnicode_FromString("array");
#if PY_VERSION_HEX < 0x03090000
wjakob (Owner):

Curious what's going on here. Is this a performance optimization? Why is it needed? Should we instead improve operator() to dispatch the call more efficiently?

hpkfft (Contributor, Author):

Yes, using PyObject_VectorcallMethod() directly is done purely as a performance optimization. It's faster to customize the call site and use static objects (the pre-made tuple copy_tpl for the kwnames argument).
I don't see how to improve operator() in general, since it has to work at any call site; in other words, it must create a tuple of keyword names at runtime. I suppose there's a way to do something similar to JIT compiling, but that's beyond the scope of this PR.
Philosophically, I think it's OK to use the low-level Python C API from within nanobind itself to squeeze out that last drop of performance.
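
A hedged sketch of the idea (hypothetical statics, not the actual nanobind code): call obj.__dlpack__(copy=True) through the vectorcall protocol (available since Python 3.9), reusing an interned method name and a pre-built kwnames tuple instead of constructing them on every call:

    #include <Python.h>

    static PyObject* call_dlpack_copy(PyObject* obj) {
        // Created once on first use; intentionally leaked.
        static PyObject* const dlpack_str =
            PyUnicode_InternFromString("__dlpack__");
        static PyObject* const copy_tpl =
            Py_BuildValue("(s)", "copy");      // kwnames tuple ("copy",)

        // args[0] is 'self'; keyword values follow the positionals.
        PyObject* args[2] = { obj, Py_True };
        return PyObject_VectorcallMethod(dlpack_str, args,
                                         /* positionals incl. self */ 1,
                                         copy_tpl);
    }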

nb::class_<MyArray>(m, "MyArray")
    // ...
    .def("__dlpack__", [](nb::kwargs kwargs) {
        return nb::ndarray<>( /* ... */ );
    });
hpkfft (Contributor, Author):

I don't think this works (implementing a Python method as a lambda), or I'm missing something interesting.
I don't see how the /* ... */ can access the this pointer (to get a pointer to the actual data in MyArray).
In the new documentation, I added member functions to MyArray and used them by name in the binding.

wjakob (Owner):

You can implement a lambda function that accesses self, either by taking a C++ type as first argument, by taking a nb::handle as first argument, or by taking nb::pointer_and_handle<T> that gives you both.
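
For instance, a hedged sketch of the third variant (MyArray and its values member are hypothetical, and the p/h member names of nb::pointer_and_handle are assumed from nanobind's headers; this is not the documentation example):

    #include <nanobind/nanobind.h>
    #include <nanobind/ndarray.h>
    #include <nanobind/stl/pair.h>
    #include <utility>
    #include <vector>

    namespace nb = nanobind;

    struct MyArray {                       // hypothetical example type
        std::vector<double> values;
    };

    NB_MODULE(my_array_ext, m) {
        nb::class_<MyArray>(m, "MyArray")
            .def(nb::init<>())
            .def("__dlpack__",
                 [](nb::pointer_and_handle<MyArray> self, nb::kwargs) {
                     // self.p is the C++ pointer; self.h is the Python
                     // 'self'. Passing self.h as the owner keeps the
                     // object alive. With no framework annotation, the
                     // returned nd-array becomes an unversioned DLPack
                     // capsule, which is what __dlpack__ must produce.
                     return nb::ndarray<double, nb::ndim<1>>(
                         self.p->values.data(), { self.p->values.size() },
                         self.h);
                 })
            .def("__dlpack_device__", [](MyArray&) {
                // kDLCPU device, device id 0
                return std::make_pair((int) nb::device::cpu::value, 0);
            });
    }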

hpkfft (Contributor, Author):

Thanks! I changed the example to use a lambda, since it's nice to show that these methods can be added in a binding without having to change the C++ class.

wjakob commented Oct 17, 2025

Is this still a draft PR?


Builtin Python ``memoryview`` for CPU-resident data.

.. cpp:class:: arrayapi
wjakob (Owner):

how about array_api?

hpkfft (Contributor, Author):

I slightly prefer arrayapi, without the underscore, copying the style of memview.
In the example code, I think

using arrayapi_t = nb::ndarray<nb::arrayapi, double>;

looks a bit nicer than

using array_api_t = nb::ndarray<nb::array_api, double>;

But I'd be happy to change it if you prefer, so don't hesitate to say so.

wjakob (Owner):

I prefer array_api since these are separate words (always written with a space in public communications). For the memview I did not put a separator because even the Python type does not use one. (Though arguably I should have written the long version "memoryview" to be 100% consistent, oh well...)

wjakob (Owner):

Ah, I probably did not do that because it was already taken.

hpkfft (Contributor, Author):

Done.

 enum class dtype_code : uint8_t {
-    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6
+    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6,
+    Float8_e3m4 = 7, Float8_e4m3 = 8, Float8_e4m3b11fnuz = 9,
wjakob (Owner):

Minor: I would prefer the letters to be uppercase. e.g. Float8_E4M3.

hpkfft commented Oct 17, 2025

If I have a project using nb::ndarray, what do I need to benefit from the new interfaces? Can I opt out? What are the implications on compatibility?

Nothing. No. Only goodness.

When nanobind imports a DLPack-capable object, it first tries to call the object's __dlpack__() method with the keyword argument "max_version" set to (1, 1), indicating that nanobind can accept a versioned tensor. (The minor version is irrelevant.) The object can return either the old unversioned tensor or a versioned tensor; either way, nanobind does the import. If the object cannot accept the kwarg at all (i.e., it raises TypeError), nanobind calls __dlpack__() without any kwargs and imports the unversioned tensor. (In theory, the result could still be versioned, which would be a bug in the producer's code; in practice, an object that doesn't know about "max_version" doesn't know about versioned tensors either.)
If the object is not DLPack-capable, nanobind tries to import using the buffer protocol.
If that doesn't work, nanobind tries to call to_dlpack(obj) on the framework to get an unversioned capsule. [This is very obsolete, but the code was there, so might as well keep it.]
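
In rough terms, the fallback chain could be sketched as follows (a hedged illustration using nanobind's public API, not the literal implementation; get_dlpack_capsule is a hypothetical helper):

    #include <nanobind/nanobind.h>

    namespace nb = nanobind;

    nb::object get_dlpack_capsule(nb::handle obj) {
        nb::object dlpack = obj.attr("__dlpack__");
        try {
            // Preferred path: announce that a versioned tensor is OK.
            return dlpack(nb::arg("max_version") = nb::make_tuple(1, 1));
        } catch (nb::python_error& e) {
            if (!e.matches(PyExc_TypeError))
                throw;
            // Older producer: retry without kwargs; this yields an
            // unversioned "dltensor" capsule.
            return dlpack();
        }
    }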

In the case of a versioned capsule, a flag bit can be set to indicate that the tensor is read-only. Nanobind honors this and creates a read-only nd-array.
In the case of an unversioned capsule, nanobind assumes it's writable. As before, it would be the user's responsibility to know if that's not the case and to refrain from actually writing to it.
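
For reference, a minimal sketch of the read-only check, assuming the dlpack.h header from DLPack 1.0:

    #include <dlpack/dlpack.h>

    // Versioned tensors carry a flags word; unversioned ones do not.
    bool tensor_is_read_only(const DLManagedTensorVersioned* t) {
        return (t->flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0;
    }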

On export, it depends on the framework.

no_framework is unchanged. It continues to return an unversioned capsule for backward compatibility.

Tensorflow is unchanged. An unversioned capsule is passed to tensorflow.experimental.dlpack.from_dlpack(). Their online docs show that's the thing to do.

arrayapi is new. It returns an object of type nanobind.nb_ndarray, which supports both the buffer protocol and the DLPack methods __dlpack__() and __dlpack_device__(). The __dlpack__() method accepts and honors the keyword argument "max_version" and returns a versioned tensor if and only if the value (a tuple[int, int]) has a first component (i.e., major version) of at least 1. (If the value is None, or the keyword argument is missing, that is equivalent to passing a maximum major version of 0.)
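
A hedged sketch of that negotiation rule (want_versioned is a hypothetical helper, not the actual code):

    #include <nanobind/nanobind.h>
    #include <nanobind/stl/pair.h>

    namespace nb = nanobind;

    bool want_versioned(nb::handle max_version) {
        if (!max_version.is_valid() || max_version.is_none())
            return false;   // missing or None: max major version 0
        auto [major, minor] = nb::cast<std::pair<int, int>>(max_version);
        (void) minor;       // the minor version does not matter here
        return major >= 1;  // versioned iff requested major >= 1
    }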

NumPy is unchanged. It first makes a new nanobind.nb_ndarray and then passes it to NumPy, which imports it using the buffer protocol. I did not see a performance improvement in changing to DLPack. Also, numpy.array() supports a "copy" keyword argument, so if a copy is needed, it's done in the same call without having to subsequently call a copy() or clone() function.

memview is unchanged. It uses the buffer protocol on a new nanobind.nb_ndarray object.

PyTorch, JAX, and CuPy: nanobind creates a new nanobind.nb_ndarray object and then passes that to the framework's from_dlpack() function. That's not different per se, but these frameworks can now call our __dlpack__() with a maximum major version of 1 (and any minor version) and get a versioned tensor in return. They can also pass a maximum major version of 0 and get an unversioned tensor, as before. Or, pass max_version=None, or omit the keyword argument, and get an unversioned tensor, as before.

wjakob commented Oct 17, 2025

Beautiful, thank you for this clarification. I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct? But I suppose the impact of that should diminish over time.

wjakob commented Oct 17, 2025

One more potential optimization opportunity. Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons? (this is from the function that checks if an object is an ndarray).

    PyObject *name = nb_type_name((PyObject *) tp);
    check(name, "Could not obtain type name! (1)");

    const char *tp_name = PyUnicode_AsUTF8AndSize(name, nullptr);
    check(tp_name, "Could not obtain type name! (2)");

    bool result =
        // PyTorch
        strcmp(tp_name, "torch.Tensor") == 0 ||
        // XLA
        strcmp(tp_name, "jaxlib.xla_extension.ArrayImpl") == 0 ||
        // Tensorflow
        strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor") == 0 ||
        // Cupy
        strcmp(tp_name, "cupy.ndarray") == 0;

@hpkfft hpkfft marked this pull request as ready for review October 18, 2025 04:50
hpkfft commented Oct 18, 2025

I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct?

Yes, if __dlpack__(max_version=(1, 1)) fails and then __dlpack__() succeeds, we spend some time on the failed first call, a cost that current nanobind does not incur. But that's unavoidable.
The max_version kwarg was added in Python array API standard v2023.12.
Note that a framework could trivially add support for max_version by simply accepting it as a kwarg and then ignoring it.
It's not required to return a versioned tensor when the caller asks for one.
It's always OK to return an unversioned tensor.
It is prohibited to return a versioned tensor unless the max_version is (1, 0) or greater.

But I suppose the impact of that should diminish over time.

Yes.

Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons [in ndarray_check]?

I don't think it would help.

The problem is that the pointer comparison name == something only succeeds if both name and something are the same object, which can be achieved if they have both been interned. We can intern something, but we can't intern name, which is whatever was set as the type name of the object. In other words, if the pointer comparison succeeds, then we know the strings are equal since they are the same object. But even if they are not the same object, they may still be the same UTF8 string.
In nb_ndarray_dlpack(), there is now the following code to check whether key is UTF8 string "max_version":

    if (key == static_pyobjects[pyobj_name::max_version_str] ||
        PyObject_RichCompareBool(key, static_pyobjects[pyobj_name::max_version_str], Py_EQ) == 1) {

This short-circuiting is good, since the pointer comparison is cheap and should be expected to succeed, because keyword argument names used across API boundaries ought to be interned by both sides (in order to support this optimization). [but see footnote 1]
If there are multiple kwnames, each key should be pointer compared to all supported names before doing any RichCompares. Hopefully, all keys pointer compare equal to some expected name and there's no need to do any RichCompares.
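
A hedged sketch of that two-pass strategy (match_kwname is a hypothetical helper, not nanobind's code):

    #include <Python.h>

    // Match one keyword name against a table of interned expected names:
    // cheap pointer comparisons first, full string comparisons only if
    // none of them hit.
    static int match_kwname(PyObject* key, PyObject* const* names, int n) {
        for (int i = 0; i < n; ++i)        // pass 1: pointer identity
            if (key == names[i])
                return i;
        for (int i = 0; i < n; ++i) {      // pass 2: UTF8 string equality
            int eq = PyObject_RichCompareBool(key, names[i], Py_EQ);
            if (eq < 0)
                return -2;                 // comparison raised an exception
            if (eq == 1)
                return i;
        }
        return -1;                         // not a recognized keyword
    }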

Now, consider ndarray_check.
If the result will be true, the common cases (PyObject either has attribute __dlpack__, or supports the buffer protocol, or is a PyCapsule) are all tested first, and the function returns before reaching the existing string comparisons.

If the result will be false, then the pointer compare will be false, and we'll have to do either strcmp or PyObject_RichCompareBool anyway to be sure the strings are not the same UTF8 strings (despite being different PyObjects). And the former, as the code does now, is likely faster than the latter.

The frameworks should implement __dlpack__() from Python array API standard v2021.12.
Then the test we have now,

    if (PyObject_HasAttr(o, static_pyobjects[pyobj_name::dunder_dlpack_str]) ||
        PyObject_CheckBuffer(o))
        return true;

will be fast.

[footnote 1] The current (and past) release of NumPy does not intern "dl_device", "copy", or "max_version", so nanobind does the RichCompare, which succeeds. This is fixed in the development version by numpy/numpy#29875, so nanobind will be a bit faster with the next release of NumPy.

wjakob commented Oct 18, 2025

I don't think it would help.

My assumption was that Python type construction interns type and module names, so that pointer equality would be legal.

hpkfft commented Oct 18, 2025

That doesn't seem to be the case. Using Python 3.11.2 and adding the following to ndarray_check:

    PyObject* tpfoe = 
      PyUnicode_InternFromString("tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (name == tpfoe) ? "is" : "is not");
    int sc = strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (sc == 0) ? "==" : "!=");

I get

name is not tpfoe
name == tpfoe

when running

python3 -m pytest --capture=tee-sys tests/test_tensorflow.py

This commit adds support for the struct ``DLManagedTensorVersioned``
as defined by DLPack version 1.  It also adds the ndarray framework
``nb::array_api``, which returns an object that provides the buffer
interface and the two DLPack methods ``__dlpack__()`` and
``__dlpack_device__()``.