Commit dc0f5a9
iabhi4 and pitrou authored
GH-46589: [C++] Fix utf8_is_digit to support full Unicode digit range (#46590)
### Rationale for this change

`pyarrow.compute.utf8_is_digit` did not recognize valid Unicode digit characters (e.g., superscripts like `'³'`), diverging from the behavior of Python's built-in `str.isdigit()`. This caused inconsistencies in downstream libraries like pandas when using PyArrow-backed StringDtype.

### What changes are included in this PR?

Updated the `IsDigitCharacterUnicode` implementation to cover a broader range of Unicode digits by replacing the category check with one that aligns with Python's `str.isdigit()` semantics. Added tests in `scalar_string_test.cc` to validate correct digit detection across diverse Unicode digit inputs.

### Are these changes tested?

Yes. New unit tests were added and pass successfully, verifying behavior on various Unicode digit characters.

### Are there any user-facing changes?

Yes, users relying on `pc.utf8_is_digit()` will now get correct results for a wider range of Unicode digit characters, improving correctness and parity with Python semantics.

* GitHub Issue: #46589

Lead-authored-by: iabhi4 <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
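The mismatch described in the rationale can be reproduced without Arrow. The following is a minimal sketch using only Python's standard `unicodedata` module; `nd_only` is a hypothetical helper that models the previous check (every code point in general category `Nd`, empty strings rejected), not Arrow's actual code.

```python
import unicodedata

def nd_only(s: str) -> bool:
    """Hypothetical model of the old check: all code points must be category Nd."""
    return len(s) > 0 and all(unicodedata.category(ch) == "Nd" for ch in s)

# ASCII digit, Arabic-Indic digit, superscript, circled digit, Roman numeral
samples = ["3", "٣", "³", "①", "Ⅷ"]

for s in samples:
    # Columns: input, general category, Python's str.isdigit(), Nd-only model
    print(repr(s), unicodedata.category(s), s.isdigit(), nd_only(s))
```

For `'³'` and `'①'` the two right-hand columns disagree: `str.isdigit()` accepts them, while an `Nd`-only check rejects them because their general category is `No`.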
1 parent dfac0cc commit dc0f5a9

2 files changed: +15 -7 lines changed
cpp/src/arrow/compute/kernels/scalar_string_test.cc

Lines changed: 9 additions & 4 deletions
@@ -1384,10 +1384,15 @@ TYPED_TEST(TestStringKernels, IsDecimalUnicode) {
 }
 
 TYPED_TEST(TestStringKernels, IsDigitUnicode) {
-  // These are digits according to Python, but we don't have the information in
-  // utf8proc for this
-  // this->CheckUnary("utf8_is_digit", "[\"²\", \"①\"]", boolean(), "[true,
-  // true]");
+  // Tests for digits across various Unicode scripts.
+  // ٤: Arabic 4, ³: Superscript 3, ५: Devanagari 5, Ⅷ: Roman 8 (not digit),
+  // １２３: Fullwidth 123.
+  // '¾' (vulgar fraction) is treated as a digit by utf8proc
+  this->CheckUnary(
+      "utf8_is_digit",
+      R"(["0", "٤", "۵", "३", "१२३", "٣٣", "²", "１２３", "٣٢", "٩", "①", "Ⅷ", "abc" , "⻁", ""])",
+      boolean(),
+      R"([true, true, true, true, true, true, true, true, true, true, true, false, false, false, false])");
 }
 
 TYPED_TEST(TestStringKernels, IsNumericUnicode) {
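
The expectations in this test can be cross-checked outside Arrow. The sketch below is an assumption-laden model rather than the kernel itself: `is_digit_nd_no` is a hypothetical helper that accepts a string when it is non-empty and every code point has general category `Nd` or `No`, using Python's `unicodedata` instead of utf8proc.

```python
import unicodedata

def is_digit_nd_no(s: str) -> bool:
    """Hypothetical model of the new check: non-empty, all code points Nd or No."""
    return bool(s) and all(unicodedata.category(ch) in ("Nd", "No") for ch in s)

# Inputs mirror the test vector; "１２３" is the fullwidth form named in the test comment.
inputs = ["0", "٤", "۵", "३", "१२३", "٣٣", "²", "１２３",
          "٣٢", "٩", "①", "Ⅷ", "abc", "⻁", ""]
# Expected values copied from the test: eleven true, then four false.
expected = [True] * 11 + [False] * 4

results = [is_digit_nd_no(s) for s in inputs]
for s, got, want in zip(inputs, results, expected):
    print(repr(s), got, "ok" if got == want else "MISMATCH")
```

The four rejected inputs fail for different reasons: `"Ⅷ"` is in category `Nl`, `"abc"` consists of letters, `"⻁"` is a symbol, and the empty string contains no characters at all.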

cpp/src/arrow/compute/kernels/scalar_string_utf8.cc

Lines changed: 6 additions & 3 deletions
@@ -138,9 +138,12 @@ static inline bool IsDecimalCharacterUnicode(uint32_t codepoint) {
 }
 
 static inline bool IsDigitCharacterUnicode(uint32_t codepoint) {
-  // Python defines this as Numeric_Type=Digit or Numeric_Type=Decimal.
-  // utf8proc has no support for this, this is the best we can do:
-  return HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_ND);
+  // Approximates Python's str.isnumeric():
+  // returns true for Nd and No (e.g., '٣', '³'), but excludes Nl like Roman numerals
+  // ('Ⅷ') due to utf8proc limits.
+  // '¾' (vulgar fraction) is treated as a digit by utf8proc 'No'
+  return HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_ND,
+                                      UTF8PROC_CATEGORY_NO);
 }
 
 static inline bool IsNumericCharacterUnicode(uint32_t codepoint) {
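
Because the check is based on general categories and not on the Unicode `Numeric_Type` property, it is an approximation. The sketch below compares a category test with Python's own predicates for a few characters. `describe` is a hypothetical helper, and the category lookup uses Python's `unicodedata` as a stand-in for utf8proc.

```python
import unicodedata

def describe(ch: str):
    """Return (category, Nd-or-No check, str.isdigit(), str.isnumeric()) for one character."""
    cat = unicodedata.category(ch)
    return (cat, cat in ("Nd", "No"), ch.isdigit(), ch.isnumeric())

# Arabic-Indic digit, superscript three, vulgar fraction, Roman numeral eight
for ch in ["٣", "³", "¾", "Ⅷ"]:
    print(repr(ch), *describe(ch))
```

Under these assumptions the `Nd`/`No` check sits between the two Python predicates: it accepts `'¾'`, which `str.isdigit()` rejects, and it rejects `'Ⅷ'`, which `str.isnumeric()` accepts.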
