12 changes: 12 additions & 0 deletions python/pyarrow/src/arrow/python/numpy_convert.cc
@@ -37,6 +37,10 @@
namespace arrow {
namespace py {

#ifndef NPY_VSTRING
# define NPY_VSTRING 2056
#endif
Comment on lines +40 to +42
Member: Do we actually need this? Can IsStringDType just return false if this constant is not defined (this is presumably when compiling with NumPy < 2)?

NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) {
PyAcquireGIL lock;
arr_ = ao;
@@ -122,6 +126,10 @@ Result<std::shared_ptr<DataType>> NumPyScalarToArrowDataType(PyObject* scalar) {
return NumPyDtypeToArrow(descr);
}

bool IsStringDType(PyArray_Descr* descr) {
return descr != nullptr && descr->type_num == NPY_VSTRING;
Member: The nullptr check seems superfluous, why would one call this function with a null pointer?

}

Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyObject* dtype) {
if (!PyObject_TypeCheck(dtype, &PyArrayDescr_Type)) {
return Status::TypeError("Did not pass numpy.dtype object");
@@ -133,6 +141,10 @@ Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyObject* dtype) {
Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyArray_Descr* descr) {
int type_num = fix_numpy_type_num(descr->type_num);

if (IsStringDType(descr)) {
return utf8();
}

switch (type_num) {
TO_ARROW_TYPE_CASE(BOOL, boolean);
TO_ARROW_TYPE_CASE(INT8, int8);
2 changes: 2 additions & 0 deletions python/pyarrow/src/arrow/python/numpy_convert.h
@@ -55,6 +55,8 @@ Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyArray_Descr* descr);
ARROW_PYTHON_EXPORT
Result<std::shared_ptr<DataType>> NumPyScalarToArrowDataType(PyObject* scalar);

ARROW_PYTHON_EXPORT bool IsStringDType(PyArray_Descr* descr);

ARROW_PYTHON_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao,
const std::vector<std::string>& dim_names,
std::shared_ptr<Tensor>* out);
145 changes: 145 additions & 0 deletions python/pyarrow/src/arrow/python/numpy_to_arrow.cc
@@ -27,6 +27,7 @@
#include <limits>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

@@ -59,6 +60,8 @@
#include "arrow/python/type_traits.h"
#include "arrow/python/vendored/pythoncapi_compat.h"

#include <numpy/arrayobject.h>

namespace arrow {

using internal::checked_cast;
@@ -74,6 +77,27 @@ using internal::NumPyTypeSize;

namespace {

#if NPY_ABI_VERSION >= 0x02000000
Member: I'm not sure why we're guarding with NPY_ABI_VERSION rather than either NPY_VERSION or NPY_FEATURE_VERSION. Can a NumPy maintainer explain how these 3 macros differ? @ngoldbaum @seberg

(also, would be nice if the NumPy docs were a bit more talkative about this)

Contributor: NPY_ABI_VERSION is the compile-time header version of NumPy; NPY_FEATURE_VERSION is the runtime version you are compiling for. I.e. by default, API that is only available in newer versions is disabled, so you can't compile to run with 1.x support but use newer API.

In this particular case (effectively using future API) #if NPY_ABI_VERSION >= 0x02000000 may tell you that a definition is already included in the header, in this case I guess PyArray_StringDTypeObject.
There is no reason to hide such a definition, so we don't (the thing to hide is mostly API table entries).

Maybe there should be a section on "how to use future API depending on the NumPy runtime version" (although for things we really expect it, we may want to add it to the npy2_compat.h header ourselves instead).

FWIW, I still think it makes most sense to wholesale copy-paste the NumPy header definitions. Then add some form of guard (and be it #if NPY_FEATURE_VERSION < NUMPY_2_0_VERSION or what it was), so that when it is exposed by the NumPy headers you stop using it.

Member: We don't seem to define NPY_FEATURE_VERSION anywhere, so does that mean the 2.0 macros are not available in the NumPy headers? If we define NPY_FEATURE_VERSION to 2.0, does that mean that the produced extension will not be compiled with NumPy 1.x?

It would be nice if these interactions were made clearer somewhere.

Contributor: > If we define NPY_FEATURE_VERSION to 2.0, does that mean that the produced extension will not be compiled with NumPy 1.x?

Yeah, except what you define would be NPY_TARGET_VERSION. Don't ask me why I used a different name, it was probably silly, but it is what we have now.
I think NPY_TARGET_VERSION is described. But this way of using future API conditionally is not (the only place is that npy_2_compat.h header in NumPy itself uses similar conditionals).

Member: (also, can we assume that NPY_FEATURE_VERSION is always <= NPY_ABI_VERSION? And what about NPY_VERSION, is it something different?)
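The distinction discussed in this thread can be sketched at compile time. This is an illustrative stand-in, not the PyArrow code: the FAKE_ macros and both function names are hypothetical, and the values mimic (but are not) real NumPy version numbers. The idea is that the ABI/header version tells you which declarations exist in the headers you compile against, while the feature (target) version gates which API table entries you may call directly.

```cpp
// Stand-ins for NPY_ABI_VERSION and NPY_FEATURE_VERSION (hypothetical values).
#define FAKE_NPY_ABI_VERSION 0x02000000      // compiled against NumPy 2.x headers
#define FAKE_NPY_FEATURE_VERSION 0x0000000D  // but targeting the 1.x runtime API

// The PR's pattern: declarations are guarded on the header (ABI) version,
// since they exist whenever the 2.x headers do...
constexpr bool headers_have_numpy2_defs() {
  return FAKE_NPY_ABI_VERSION >= 0x02000000;
}

// ...whereas calling 2.x-only API directly would require raising
// NPY_TARGET_VERSION (and hence NPY_FEATURE_VERSION) to 2.0.
constexpr bool can_call_numpy2_api_directly() {
  return FAKE_NPY_FEATURE_VERSION >= 0x02000000;
}
```

Under these assumptions, the PR's situation is exactly `headers_have_numpy2_defs() && !can_call_numpy2_api_directly()`, which is why it reaches into `PyArray_API` by index instead of using the guarded macros.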

inline npy_string_allocator* ArrowNpyString_acquire_allocator(
const PyArray_StringDTypeObject* descr) {
using Func = npy_string_allocator* (*)(const PyArray_StringDTypeObject*);
return reinterpret_cast<Func>(PyArray_API[316])(descr);
Member: Why not call NpyString_acquire_allocator directly? AFAICT it is a macro that does roughly what your code is doing:

#if NPY_FEATURE_VERSION >= NPY_2_0_API_VERSION
#define NpyString_acquire_allocator \
        (*(npy_string_allocator * (*)(const PyArray_StringDTypeObject *)) \
    PyArray_API[316])
#endif

}

inline void ArrowNpyString_release_allocator(npy_string_allocator* allocator) {
using Func = void (*)(npy_string_allocator*);
reinterpret_cast<Func>(PyArray_API[318])(allocator);
}

inline int ArrowNpyString_load(npy_string_allocator* allocator,
const npy_packed_static_string* packed,
npy_static_string* out) {
using Func =
int (*)(npy_string_allocator*, const npy_packed_static_string*, npy_static_string*);
return reinterpret_cast<Func>(PyArray_API[313])(allocator, packed, out);
}
#endif

Status AllocateNullBitmap(MemoryPool* pool, int64_t length,
std::shared_ptr<ResizableBuffer>* out) {
int64_t null_bytes = bit_util::BytesForBits(length);
@@ -233,6 +257,13 @@ class NumPyConverter {
Status Visit(const LargeStringType& type);
Status Visit(const StringViewType& type);

#if NPY_ABI_VERSION >= 0x02000000
template <typename Builder>
Status AppendStringDTypeValues(Builder* builder);

Status ConvertStringDType();
#endif

Status Visit(const StructType& type);

Status Visit(const FixedSizeBinaryType& type);
@@ -338,6 +369,16 @@ Status NumPyConverter::Convert() {
return Status::OK();
}

if (IsStringDType(dtype_)) {
#if NPY_ABI_VERSION >= 0x02000000
RETURN_NOT_OK(ConvertStringDType());
Member: It's weird and confusing to do this outside of the visitor machinery. Is it possible to straighten that out?

Member: (also, NumPyConverter::VisitString has a special case for all-NaN Pandas arrays; I'm not sure whether that can apply here @jorisvandenbossche)

return Status::OK();
#else
return Status::NotImplemented(
"NumPy StringDType requires building PyArrow with NumPy >= 2.0");
Comment on lines +377 to +378
Member: Can this actually happen? I got the impression that one cannot use NumPy 2 if PyArrow was compiled for NumPy < 2; am I mistaken @ngoldbaum @seberg?

Contributor: That is correct, the NumPy C-API import will just barf at you and refuse to run.

#endif
}

if (type_ == nullptr) {
return Status::Invalid("Must pass data type for non-object arrays");
}
Expand Down Expand Up @@ -815,6 +856,110 @@ Status NumPyConverter::Visit(const StringViewType& type) {
return Status::OK();
}

#if NPY_ABI_VERSION >= 0x02000000
template <typename Builder>
Status NumPyConverter::AppendStringDTypeValues(Builder* builder) {
auto* descr = reinterpret_cast<PyArray_StringDTypeObject*>(dtype_);

npy_string_allocator* allocator = ArrowNpyString_acquire_allocator(descr);

Comment: FYI for other reviewers: this locks a mutex internally in NumPy.

if (allocator == nullptr) {
return Status::Invalid("Failed to acquire NumPy StringDType allocator");
}

struct AllocatorGuard {
npy_string_allocator* ptr;
explicit AllocatorGuard(npy_string_allocator* p) : ptr(p) {}
~AllocatorGuard() {
if (ptr != nullptr) {
ArrowNpyString_release_allocator(ptr);
}
}
} guard(allocator);
Comment on lines +869 to +877
Member (@pitrou, Jan 8, 2026): I think this can be written more concisely using std::unique_ptr:

Suggested change
struct AllocatorGuard {
npy_string_allocator* ptr;
explicit AllocatorGuard(npy_string_allocator* p) : ptr(p) {}
~AllocatorGuard() {
if (ptr != nullptr) {
ArrowNpyString_release_allocator(ptr);
}
}
} guard(allocator);
std::unique_ptr<npy_string_allocator, void(*)(npy_string_allocator*)>
allocator_guard(allocator, &ArrowNpyString_release_allocator);
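The suggestion above can be shown working in isolation. This is a self-contained illustration, not the PyArrow code: FakeAllocator, release_allocator, and guard_releases_on_scope_exit are stand-ins for npy_string_allocator, ArrowNpyString_release_allocator, and the converter's scope, chosen so the RAII behavior is observable.

```cpp
#include <memory>

struct FakeAllocator {   // stands in for the opaque npy_string_allocator
  bool released = false;
};

// Stands in for ArrowNpyString_release_allocator.
void release_allocator(FakeAllocator* a) {
  if (a != nullptr) a->released = true;
}

bool guard_releases_on_scope_exit() {
  FakeAllocator alloc;
  {
    // std::unique_ptr with a custom deleter replaces the hand-rolled
    // AllocatorGuard struct from the diff.
    std::unique_ptr<FakeAllocator, void (*)(FakeAllocator*)> guard(
        &alloc, &release_allocator);
    // ... use the allocator while the guard is alive ...
  }  // the deleter runs here, exactly once, on every exit path
  return alloc.released;
}
```

The unique_ptr form also releases correctly on early returns from RETURN_NOT_OK-style macros, which the manual guard handles only because its destructor happens to do the same thing.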


npy_static_string value = {0, nullptr};
char* data = PyArray_BYTES(arr_);
Member: Can we make this const char*?


if (mask_ != nullptr) {
Ndarray1DIndexer<uint8_t> mask_values(mask_);
for (int64_t i = 0; i < length_; ++i) {
if (mask_values[i]) {
RETURN_NOT_OK(builder->AppendNull());
continue;
}

const auto* packed =
reinterpret_cast<const npy_packed_static_string*>(data + i * stride_);
Comment on lines +890 to +891
Member: I'm just curious, is the StringDType layout documented somewhere? I couldn't find any reference easily in the NumPy docs. @ngoldbaum @seberg

Contributor: It is described in the NEP, but npy_packed_static_string is intentionally opaque.

const int is_null = ArrowNpyString_load(allocator, packed, &value);
Member: I'm curious, why does this NumPy API need an allocator if all it does is return a view into the array contents? Especially as no deallocation seems involved afterwards... @ngoldbaum @seberg

Contributor: https://numpy.org/neps/nep-0055-string_dtype.html#memory-layout-and-managing-heap-allocations is the NEP about it.

But basically, the string can't be stored in the array data itself, because NumPy requires a fixed size per element.
So instead, the string (if not very small) is stored in a separately allocated buffer (much like in Arrow, except it's more annoying as we are N-D). And that buffer is owned by the dtype, which is why you fetch the allocator from the dtype/descr.

Member: Ok, but this is just loading an existing string. Why is an allocator needed for that? There is no corresponding deallocation call, it seems...

Contributor: The "allocator" knows where the second buffer is. The other reason is that it adds locking so that we don't point to corrupted data for StringDType.

Comment: Maybe "allocator" is the wrong name if it's confusing you. The allocator holds the reference to the per-descriptor memory pool as well as the per-descriptor lock.
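A conceptual model of this explanation, under the stated assumptions: NumPy's real layout is opaque, so SideBufferAllocator and load_view below are invented stand-ins (for npy_string_allocator and NpyString_load), not the NumPy API. The point they illustrate is that the "allocator" bundles the per-descriptor side buffer for long strings with the per-descriptor lock, so loading an element returns a borrowed view and nothing needs to be freed afterwards.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <string_view>
#include <vector>

struct SideBufferAllocator {      // stands in for npy_string_allocator
  std::mutex lock;                // per-descriptor lock
  std::vector<std::string> heap;  // per-descriptor pool for long strings
};

// Stands in for NpyString_load: yields a view into allocator-owned storage,
// taking the lock so the view is not read while the pool is being mutated.
std::string_view load_view(SideBufferAllocator& alloc, std::size_t index) {
  std::lock_guard<std::mutex> guard(alloc.lock);
  return std::string_view(alloc.heap[index]);
}
```

In this model the view stays valid only as long as the allocator (descriptor) outlives it, which matches why the converter keeps the allocator acquired for the whole conversion loop.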

if (is_null == -1) {
RETURN_IF_PYERROR();
return Status::Invalid("Failed to unpack NumPy StringDType value");
}
if (is_null) {
RETURN_NOT_OK(builder->AppendNull());
} else {
RETURN_NOT_OK(builder->Append(std::string_view{value.buf, value.size}));
}
}
return Status::OK();
}

for (int64_t i = 0; i < length_; ++i) {
const auto* packed = reinterpret_cast<const npy_packed_static_string*>(data);
const int is_null = ArrowNpyString_load(allocator, packed, &value);
if (is_null == -1) {
RETURN_IF_PYERROR();
return Status::Invalid("Failed to unpack NumPy StringDType value");
}
if (is_null) {
RETURN_NOT_OK(builder->AppendNull());
} else {
RETURN_NOT_OK(builder->Append(std::string_view{value.buf, value.size}));
}
Comment on lines +907 to +917
Member: You could easily factor this snippet out into a lambda to avoid code repetition with the loop above.
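The factoring suggested in this comment can be sketched with stand-in types (Element and convert are illustrative, not the PyArrow converter or builder API): the per-element "null or value" append logic lives in one lambda shared by both the masked and unmasked paths.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Element = std::optional<std::string>;  // nullopt models a null entry

std::vector<Element> convert(const std::vector<Element>& input,
                             const std::vector<bool>* mask) {
  std::vector<Element> out;
  // Shared per-element logic; in the real code this is where the packed
  // string would be loaded and builder->AppendNull()/Append() called.
  auto append_element = [&out](const Element& e) { out.push_back(e); };
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (mask != nullptr && (*mask)[i]) {
      append_element(std::nullopt);  // masked entries become null
    } else {
      append_element(input[i]);
    }
  }
  return out;
}
```

With the lambda in place, the masked and unmasked loops differ only in whether the mask is consulted, rather than duplicating the load-and-append body.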

data += stride_;
}

return Status::OK();
}

Status NumPyConverter::ConvertStringDType() {
util::InitializeUTF8();
Member: Note this is only useful if we validate UTF8 values, which this doesn't seem to be doing? (presumably because NumPy already advertises the data as valid UTF8?)


if (type_ == nullptr) {
type_ = utf8();
}

switch (type_->id()) {
case Type::STRING: {
arrow::internal::ChunkedStringBuilder builder(kBinaryChunksize, pool_);
RETURN_NOT_OK(builder.Reserve(length_));
RETURN_NOT_OK(AppendStringDTypeValues(&builder));

ArrayVector chunks;
RETURN_NOT_OK(builder.Finish(&chunks));
for (const auto& chunk : chunks) {
RETURN_NOT_OK(PushArray(chunk->data()));
}
return Status::OK();
}
case Type::LARGE_STRING: {
LargeStringBuilder builder(pool_);
RETURN_NOT_OK(builder.Reserve(length_));
RETURN_NOT_OK(AppendStringDTypeValues(&builder));
return PushBuilderResult(&builder);
}
case Type::STRING_VIEW: {
StringViewBuilder builder(pool_);
RETURN_NOT_OK(builder.Reserve(length_));
RETURN_NOT_OK(AppendStringDTypeValues(&builder));
return PushBuilderResult(&builder);
}
default:
return Status::TypeError(
"NumPy StringDType can only be converted to Arrow string types");
}
}
#endif

Status NumPyConverter::Visit(const StructType& type) {
std::vector<NumPyConverter> sub_converters;
std::vector<OwnedRefNoGIL> sub_arrays;
113 changes: 113 additions & 0 deletions python/pyarrow/tests/test_array.py
@@ -2758,6 +2758,119 @@ def test_array_from_numpy_unicode(string_type):
assert arrow_arr.equals(expected)


@pytest.mark.numpy
def test_array_from_numpy_string_dtype():
dtypes_mod = getattr(np, "dtypes", None)
Member: Can use something like dtypes = pytest.importorskip("numpy.dtypes")

Member: Or, even better, you can use a pytest fixture:

@pytest.fixture
def string_dtype():
    dtypes = pytest.importorskip("numpy.dtypes")
    dtype_class = getattr(dtypes, "StringDType", None)
    if dtype_class is None:
        pytest.skip("NumPy StringDType not available (NumPy > 2 needed)")
    return dtype_class

and then simply:

@pytest.mark.numpy
def test_array_from_numpy_string_dtype(string_dtype):
    arr = np.array(["some", "strings"], dtype=string_dtype())
    # etc.

if dtypes_mod is None:
pytest.skip("NumPy dtypes module not available")

StringDType = getattr(dtypes_mod, "StringDType", None)
if StringDType is None:
pytest.skip("NumPy StringDType not available")

dtype = StringDType()

arr = np.array(["some", "strings"], dtype=dtype)

arrow_arr = pa.array(arr)

assert arrow_arr.type == pa.utf8()
assert arrow_arr.to_pylist() == ["some", "strings"]

arrow_arr = pa.array(arr, type=pa.string())
Member: Note that pa.string() and pa.utf8() are the same thing; it's a bit confusing to use both spellings here.

assert arrow_arr.type == pa.string()
Member: Can you also call arrow_arr.validate(full=True)? (also in other places)

assert arrow_arr.to_pylist() == ["some", "strings"]
Member: Can also be written assert arrow_arr.to_pylist() == arr.tolist()


arrow_arr = pa.array(arr, type=pa.large_string())
assert arrow_arr.type == pa.large_string()
assert arrow_arr.to_pylist() == ["some", "strings"]

arrow_arr = pa.array(arr, type=pa.string_view())
assert arrow_arr.type == pa.string_view()
assert arrow_arr.to_pylist() == ["some", "strings"]

arr_full = np.array(["a", "b", "c", "d", "e"], dtype=dtype)
arr = arr_full[::2]
arrow_arr = pa.array(arr)
assert arrow_arr.type == pa.utf8()
assert arrow_arr.to_pylist() == ["a", "c", "e"]


@pytest.mark.numpy
def test_numpy_stringdtype_thresholds_and_unicode():
dtypes_mod = getattr(np, "dtypes", None)
if dtypes_mod is None:
pytest.skip("NumPy dtypes module not available")

StringDType = getattr(dtypes_mod, "StringDType", None)
if StringDType is None:
pytest.skip("NumPy StringDType not available")

dtype = StringDType()

short = "hello"
medium = "a" * 100
long_ = "b" * 300
unicode_ = "árvíztűrő tükörfúrógép 🥐 你好"
long_unicode = "🥐" * 200

arr = np.array([short, medium, long_, unicode_, long_unicode], dtype=dtype)
assert pa.array(arr).to_pylist() == [short, medium, long_, unicode_, long_unicode]


@pytest.mark.numpy
def test_array_from_numpy_string_dtype_nulls_and_mask():
dtypes_mod = getattr(np, "dtypes", None)
if dtypes_mod is None:
pytest.skip("NumPy dtypes module not available")

StringDType = getattr(dtypes_mod, "StringDType", None)
if StringDType is None:
pytest.skip("NumPy StringDType not available")

# Real StringDType, use its NA sentinel
dtype = StringDType(na_object=None)
arr = np.array(["this array has", None, "as an entry"], dtype=dtype)

arrow_arr = pa.array(arr)
assert arrow_arr.type == pa.utf8()
assert arrow_arr.to_pylist() == ["this array has", None, "as an entry"]

# Test interplay of NA sentinel and an explicit mask:
# - index 1 is null because of na_object / Python None
# - index 2 is forced null by the mask
mask = np.array([False, False, True], dtype=bool)
arrow_arr = pa.array(arr, mask=mask)
assert arrow_arr.type == pa.utf8()
assert arrow_arr.null_count == 2
Member: Just validate arrow_arr and it will check null_count consistency.

assert arrow_arr.to_pylist() == ["this array has", None, None]

mask = np.array([True, False, True], dtype=bool)
assert pa.array(arr, mask=mask).to_pylist() == [None, None, None]


@pytest.mark.numpy
def test_array_from_numpy_string_dtype_string_sentinel_and_mask():
dtypes_mod = getattr(np, "dtypes", None)
if dtypes_mod is None:
pytest.skip("NumPy dtypes module not available")

StringDType = getattr(dtypes_mod, "StringDType", None)
if StringDType is None:
pytest.skip("NumPy StringDType not available")

sentinel = "__placeholder__"
dtype = StringDType(na_object=sentinel)
arr = np.array(["this array has", sentinel, "as an entry"], dtype=dtype)

arrow_arr = pa.array(arr)
assert arrow_arr.type == pa.utf8()
assert arrow_arr.to_pylist() == ["this array has", None, "as an entry"]

mask = np.array([False, False, True], dtype=bool)
assert pa.array(arr, mask=mask).to_pylist() == ["this array has", None, None]


@pytest.mark.numpy
def test_array_string_from_non_string():
# ARROW-5682 - when converting to string raise on non string-like dtype
Expand Down