Conversation

@hpkfft hpkfft commented Sep 29, 2025

This PR adds support for DLPack version 1 and introduces the ndarray framework nb::arrayapi, which returns an object that provides the buffer interface as well as the two DLPack methods __dlpack__() and __dlpack_device__().

Given the following:

using array_t    = nb::ndarray<float, nb::ndim<1>, nb::c_contig>;
using array_np_t = nb::ndarray<float, nb::ndim<1>, nb::c_contig, nb::numpy>;

void init_array(const array_t& a) {
    const std::size_t n = a.shape(0);
    float* ptr = a.data();
    for (std::size_t i = 0; i < n; ++i) ptr[i] = 1.0f;
}

array_np_t create_array_np(std::size_t n) {
    float* ptr = new float[n];
    nb::capsule deleter(ptr, [](void* p) noexcept { delete[] (float*) p; });
    return array_np_t(ptr, {n}, std::move(deleter));
}

NB_MODULE(my_extension, m) {
    m.doc() = "nanobind my_extension module";
    m.def("init_array",      &init_array,      "Initialize array.");
    m.def("create_array_np", &create_array_np, "Create NumPy array.");
}

I measure performance as follows:

    test                 old      new      ratio
    init_array(array)    435 ns   278 ns   1.56
    init_array(numpy)    160 ns   111 ns   1.44
    create_array_np      565 ns   450 ns   1.25

using Python 3.14 and

python3 -m timeit -n 10000000 -r 10 -s "import array, my_extension as me; a = array.array('f', [1,2,3,4,5,6,7,8])" "me.init_array(a)"

python3 -m timeit -n 10000000 -r 10 -s "import numpy as np, my_extension as me; a = np.zeros(8, dtype=np.float32)" "me.init_array(a)"

python3 -m timeit -n 1000000 -r 10 -s "import numpy as np, my_extension as me;" "me.create_array_np(8)"

@wjakob wjakob left a comment

Hi @hpkfft,

This looks great; here is a first batch of comments from me. I feel like this change also needs some documentation.

If I have a project using nb::ndarray, what do I need to benefit from the new interfaces? Can I opt out? What are the implications on compatibility? These questions are relevant both for code that accepts DLPack-capable objects and for code that returns them.

Thanks!

if (framework == numpy::value) {
try {
static PyObject* const array_str = PyUnicode_FromString("array");
#if PY_VERSION_HEX < 0x03090000
wjakob (Owner):

Curious what's going on here. Is this a performance optimization? Why is it needed? Should we instead improve operator() to dispatch the call more efficiently?

hpkfft (Contributor, Author):

Yes, using PyObject_VectorcallMethod() directly is done purely as a performance optimization. It's faster to customize the call site and use static objects (the pre-made tuple copy_tpl for the kwnames argument).
I don't see how to improve operator() in general, since it has to work at any call site; in other words, it must create a tuple of keyword names at runtime. I suppose there's a way to do something similar to JIT compiling, but that's beyond the scope of this PR.
Philosophically, I think it's OK to use the low-level Python C API from within nanobind itself to squeeze out that last drop of performance.
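
A hedged sketch of the idea (hypothetical statics, not the actual nanobind code): call obj.__dlpack__(copy=True) through the vectorcall protocol (available since Python 3.9), reusing an interned method name and a pre-built kwnames tuple instead of constructing them on every call:

    #include <Python.h>

    static PyObject* call_dlpack_copy(PyObject* obj) {
        // Created once on first use; intentionally leaked.
        static PyObject* const dlpack_str =
            PyUnicode_InternFromString("__dlpack__");
        static PyObject* const copy_tpl =
            Py_BuildValue("(s)", "copy");      // kwnames tuple ("copy",)

        // args[0] is 'self'; keyword values follow the positionals.
        PyObject* args[2] = { obj, Py_True };
        return PyObject_VectorcallMethod(dlpack_str, args,
                                         /* positionals incl. self */ 1,
                                         copy_tpl);
    }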

nb::class_<MyArray>(m, "MyArray")
    // ...
    .def("__dlpack__", [](nb::kwargs kwargs) {
        return nb::ndarray<>( /* ... */ );
    });
hpkfft (Contributor, Author):

I don't think this works (implementing a Python method as a lambda), or I'm missing something interesting.
I don't see how the /* ... */ can access the this pointer (to get a pointer to the actual data in MyArray).
In the new documentation, I added member functions to MyArray and used them by name in the binding.

wjakob (Owner):

You can implement a lambda function that accesses self, either by taking a C++ type as first argument, by taking a nb::handle as first argument, or by taking nb::pointer_and_handle<T> that gives you both.
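
For instance, a hedged sketch of the third variant (MyArray and its values member are hypothetical, and the p/h member names of nb::pointer_and_handle are assumed from nanobind's headers; this is not the documentation example):

    #include <nanobind/nanobind.h>
    #include <nanobind/ndarray.h>
    #include <nanobind/stl/pair.h>
    #include <utility>
    #include <vector>

    namespace nb = nanobind;

    struct MyArray {                       // hypothetical example type
        std::vector<double> values;
    };

    NB_MODULE(my_array_ext, m) {
        nb::class_<MyArray>(m, "MyArray")
            .def(nb::init<>())
            .def("__dlpack__",
                 [](nb::pointer_and_handle<MyArray> self, nb::kwargs) {
                     // self.p is the C++ pointer; self.h is the Python
                     // 'self'. Passing self.h as the owner keeps the
                     // object alive. With no framework annotation, the
                     // returned nd-array becomes an unversioned DLPack
                     // capsule, which is what __dlpack__ must produce.
                     return nb::ndarray<double, nb::ndim<1>>(
                         self.p->values.data(), { self.p->values.size() },
                         self.h);
                 })
            .def("__dlpack_device__", [](MyArray&) {
                // kDLCPU device, device id 0
                return std::make_pair((int) nb::device::cpu::value, 0);
            });
    }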

hpkfft (Contributor, Author):

Thanks! I changed the example to use a lambda, since it's nice to show that these methods can be added in a binding without having to change the C++ class.

wjakob commented Oct 17, 2025

Is this still a draft PR?


Builtin Python ``memoryview`` for CPU-resident data.

.. cpp:class:: arrayapi
wjakob (Owner):

how about array_api?

hpkfft (Contributor, Author):

I slightly prefer arrayapi, without the underscore, copying the style of memview.
In the example code, I think

using arrayapi_t = nb::ndarray<nb::arrayapi, double>;

looks a bit nicer than

using array_api_t = nb::ndarray<nb::array_api, double>;

But I'd be happy to change it if you prefer, so don't hesitate to say so.

wjakob (Owner):

I prefer array_api since these are separate words (always written with a space in public communications). For the memview I did not put a separator because even the Python type does not use one. (Though arguably I should have written the long version "memoryview" to be 100% consistent, oh well...)

wjakob (Owner):

Ah, I probably did not do that because it was already taken.

hpkfft (Contributor, Author):

Done.

 enum class dtype_code : uint8_t {
-    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6
+    Int = 0, UInt = 1, Float = 2, Bfloat = 4, Complex = 5, Bool = 6,
+    Float8_e3m4 = 7, Float8_e4m3 = 8, Float8_e4m3b11fnuz = 9,
wjakob (Owner):

Minor: I would prefer the letters to be uppercase. e.g. Float8_E4M3.

hpkfft commented Oct 17, 2025

If I have a project using nb::ndarray, what do I need to benefit from the new interfaces? Can I opt out? What are the implications on compatibility?

Nothing. No. Only goodness.

When nanobind imports a DLPack-capable object, it first tries to call the object's __dlpack__() method with the keyword argument "max_version" set to (1, 1), indicating that nanobind can accept a versioned tensor. (The minor version is irrelevant.) The object can return either the old unversioned tensor or a versioned tensor; either way, nanobind does the import. If the object cannot accept the kwarg at all (i.e., it raises TypeError), nanobind calls __dlpack__() without any kwargs and imports the unversioned tensor. (In theory, the result could still be versioned, which would be a bug in the producer's code; in practice, an object that doesn't know about "max_version" doesn't know about versioned tensors either.)
If the object is not DLPack-capable, nanobind tries to import using the buffer protocol.
If that doesn't work, nanobind tries to call to_dlpack(obj) on the framework to get an unversioned capsule. [This is very obsolete, but the code was there, so might as well keep it.]
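
In rough terms, the fallback chain could be sketched as follows (a hedged illustration using nanobind's public API, not the literal implementation; get_dlpack_capsule is a hypothetical helper):

    #include <nanobind/nanobind.h>

    namespace nb = nanobind;

    nb::object get_dlpack_capsule(nb::handle obj) {
        nb::object dlpack = obj.attr("__dlpack__");
        try {
            // Preferred path: announce that a versioned tensor is OK.
            return dlpack(nb::arg("max_version") = nb::make_tuple(1, 1));
        } catch (nb::python_error& e) {
            if (!e.matches(PyExc_TypeError))
                throw;
            // Older producer: retry without kwargs; this yields an
            // unversioned "dltensor" capsule.
            return dlpack();
        }
    }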

In the case of a versioned capsule, a flag bit can be set to indicate that the tensor is read-only. Nanobind honors this and creates a read-only nd-array.
In the case of an unversioned capsule, nanobind assumes it's writable. As before, it would be the user's responsibility to know if that's not the case and to refrain from actually writing to it.
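
For reference, a minimal sketch of the read-only check, assuming the dlpack.h header from DLPack 1.0:

    #include <dlpack/dlpack.h>

    // Versioned tensors carry a flags word; unversioned ones do not.
    bool tensor_is_read_only(const DLManagedTensorVersioned* t) {
        return (t->flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0;
    }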

On export, it depends on the framework.

no_framework is unchanged. It continues to return an unversioned capsule for backward compatibility.

Tensorflow is unchanged. An unversioned capsule is passed to tensorflow.experimental.dlpack.from_dlpack(). Their online docs show that's the thing to do.

arrayapi is new. It returns an object of type nanobind.nb_ndarray, which supports both the buffer protocol and the DLPack methods __dlpack__() and __dlpack_device__(). The __dlpack__() method accepts and honors the keyword argument "max_version" and returns a versioned tensor if and only if the value (a tuple[int, int]) has a first component (i.e., major version) of at least 1. (If the value is None, or the keyword argument is missing, that is equivalent to passing a maximum major version of 0.)
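
A hedged sketch of that negotiation rule (want_versioned is a hypothetical helper, not the actual code):

    #include <nanobind/nanobind.h>
    #include <nanobind/stl/pair.h>

    namespace nb = nanobind;

    bool want_versioned(nb::handle max_version) {
        if (!max_version.is_valid() || max_version.is_none())
            return false;   // missing or None: max major version 0
        auto [major, minor] = nb::cast<std::pair<int, int>>(max_version);
        (void) minor;       // the minor version does not matter here
        return major >= 1;  // versioned iff requested major >= 1
    }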

NumPy is unchanged. It first makes a new nanobind.nb_ndarray and then passes it to NumPy, which imports it using the buffer protocol. I did not see a performance improvement in changing to DLPack. Also, numpy.array() supports a "copy" keyword argument, so if a copy is needed, it's done in the same call without having to subsequently call a copy() or clone() function.

memview is unchanged. It uses the buffer protocol on a new nanobind.nb_ndarray object.

PyTorch, JAX, and CuPy: nanobind creates a new nanobind.nb_ndarray object and then passes that to the framework's from_dlpack() function. That's not different per se, but these frameworks can now call our __dlpack__() with a maximum major version of 1 (and any minor version) and get a versioned tensor in return. They can also pass a maximum major version of 0 and get an unversioned tensor, as before. Or, pass max_version=None, or omit the keyword argument, and get an unversioned tensor, as before.

wjakob commented Oct 17, 2025

Beautiful, thank you for this clarification. I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct? But I suppose the impact of that should diminish over time.

wjakob commented Oct 17, 2025

One more potential optimization opportunity. Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons? (this is from the function that checks if an object is an ndarray).

    PyObject *name = nb_type_name((PyObject *) tp);
    check(name, "Could not obtain type name! (1)");

    const char *tp_name = PyUnicode_AsUTF8AndSize(name, nullptr);
    check(tp_name, "Could not obtain type name! (2)");

    bool result =
        // PyTorch
        strcmp(tp_name, "torch.Tensor") == 0 ||
        // XLA
        strcmp(tp_name, "jaxlib.xla_extension.ArrayImpl") == 0 ||
        // Tensorflow
        strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor") == 0 ||
        // Cupy
        strcmp(tp_name, "cupy.ndarray") == 0;

@hpkfft hpkfft marked this pull request as ready for review October 18, 2025 04:50
hpkfft commented Oct 18, 2025

I guess there could be a performance cost when we try to import a tensor from an older framework that doesn't support versioned capsules (due to calling dlpack multiple times), correct?

Yes, if __dlpack__(max_version=(1, 1)) fails and then __dlpack__() succeeds, we spend some time on the failed first call, a cost that current nanobind does not incur. But that's unavoidable.
The max_version kwarg was added in Python array API standard v2023.12.
Note that a framework could trivially add support for max_version by simply accepting it as a kwarg and then ignoring it.
It's not required to return a versioned tensor when the caller asks for one.
It's always OK to return an unversioned tensor.
It is prohibited to return a versioned tensor unless the max_version is (1, 0) or greater.

But I suppose the impact of that should diminish over time.

Yes.

Do you think that it would be possible to use the static object table to reduce all of these costly API calls and string comparisons to a few pointer comparisons [in ndarray_check]?

I don't think it would help.

The problem is that the pointer comparison name == something only succeeds if both name and something are the same object, which can be achieved if they have both been interned. We can intern something, but we can't intern name, which is whatever was set as the type name of the object. In other words, if the pointer comparison succeeds, then we know the strings are equal since they are the same object. But even if they are not the same object, they may still be the same UTF8 string.
In nb_ndarray_dlpack(), there is now the following code to check whether key is UTF8 string "max_version":

    if (key == static_pyobjects[pyobj_name::max_version_str] ||
        PyObject_RichCompareBool(key, static_pyobjects[pyobj_name::max_version_str], Py_EQ) == 1) {

This short-circuiting is good, since the pointer comparison is cheap and should be expected to succeed, because keyword argument names used across API boundaries ought to be interned by both sides (in order to support this optimization). [but see footnote 1]
If there are multiple kwnames, each key should be pointer compared to all supported names before doing any RichCompares. Hopefully, all keys pointer compare equal to some expected name and there's no need to do any RichCompares.
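
A hedged sketch of that two-pass strategy (match_kwname is a hypothetical helper, not nanobind's code):

    #include <Python.h>

    // Match one keyword name against a table of interned expected names:
    // cheap pointer comparisons first, full string comparisons only if
    // none of them hit.
    static int match_kwname(PyObject* key, PyObject* const* names, int n) {
        for (int i = 0; i < n; ++i)        // pass 1: pointer identity
            if (key == names[i])
                return i;
        for (int i = 0; i < n; ++i) {      // pass 2: UTF8 string equality
            int eq = PyObject_RichCompareBool(key, names[i], Py_EQ);
            if (eq < 0)
                return -2;                 // comparison raised an exception
            if (eq == 1)
                return i;
        }
        return -1;                         // not a recognized keyword
    }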

Now, consider ndarray_check.
If the result will be true, the common cases (PyObject either has attribute __dlpack__, or supports the buffer protocol, or is a PyCapsule) are all tested first, and the function returns before reaching the existing string comparisons.

If the result will be false, then the pointer compare will be false, and we'll have to do either strcmp or PyObject_RichCompareBool anyway to be sure the strings are not the same UTF8 strings (despite being different PyObjects). And the former, as the code does now, is likely faster than the latter.

The frameworks should implement __dlpack__() from Python array API standard v2021.12.
Then the test we have now,

    if (PyObject_HasAttr(o, static_pyobjects[pyobj_name::dunder_dlpack_str]) ||
        PyObject_CheckBuffer(o))
        return true;

will be fast.

[footnote 1] The current (and past) release of NumPy does not intern "dl_device", "copy", or "max_version", so nanobind does the RichCompare, which succeeds. This is fixed in the development version by numpy/numpy#29875, so nanobind will be a bit faster with the next release of NumPy.

wjakob commented Oct 18, 2025

I don't think it would help.

My assumption was that Python type construction interns type and module names, so that pointer equality would be legal.

hpkfft commented Oct 18, 2025

That doesn't seem to be the case. Using Python 3.11.2 and adding the following to ndarray_check:

    PyObject* tpfoe = 
      PyUnicode_InternFromString("tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (name == tpfoe) ? "is" : "is not");
    int sc = strcmp(tp_name, "tensorflow.python.framework.ops.EagerTensor");
    printf("name %s tpfoe\n", (sc == 0) ? "==" : "!=");

I get

name is not tpfoe
name == tpfoe

when running

python3 -m pytest --capture=tee-sys tests/test_tensorflow.py

This commit adds support for the struct ``DLManagedTensorVersioned``
as defined by DLPack version 1.  It also adds the ndarray framework
``nb::array_api``, which returns an object that provides the buffer
interface and the two DLPack methods ``__dlpack__()`` and
``__dlpack_device__()``.