GH-46589: [C++] Fix utf8_is_digit to support full Unicode digit range (#46590)

iabhi4 · pitrou · web-flow · commit dc0f5a9415aa · 2025-06-02T14:33:33.000+02:00
### Rationale for this change

`pyarrow.compute.utf8_is_digit` did not recognize valid Unicode digit characters (e.g., superscripts like `'³'`), diverging from the behavior of Python's built-in `str.isdigit()`. This caused inconsistencies in downstream libraries like pandas when using PyArrow-backed StringDtype.

### What changes are included in this PR?

Updated the `IsDigitCharacterUnicode` implementation to cover a broader range of Unicode digits by replacing the category check with one that aligns with Python’s `str.isdigit()` semantics. Added tests in `scalar_string_test.cc` to validate correct digit detection across diverse Unicode digit inputs.

### Are these changes tested?

Yes. New unit tests were added and pass successfully, verifying behavior on various Unicode digit characters.

### Are there any user-facing changes?

Yes. Users relying on `pc.utf8_is_digit()` will now get correct results for a wider range of Unicode digit characters, improving correctness and parity with Python semantics.

* GitHub Issue: #46589

Lead-authored-by: iabhi4 <iamonecool@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_test.cc b/cpp/src/arrow/compute/kernels/scalar_string_test.cc
@@ -1384,10 +1384,15 @@ TYPED_TEST(TestStringKernels, IsDecimalUnicode) {
 }
 
 TYPED_TEST(TestStringKernels, IsDigitUnicode) {
-  // These are digits according to Python, but we don't have the information in
-  // utf8proc for this
-  // this->CheckUnary("utf8_is_digit", "[\"²\", \"①\"]", boolean(), "[true,
-  // true]");
+  // Tests for digits across various Unicode scripts.
+  // ٤: Arabic 4, ³: Superscript 3, ५: Devanagari 5, Ⅷ: Roman 8 (not digit),
+  // １２３: Fullwidth 123.
+  // '¾' (vulgar fraction) is treated as a digit by utf8proc
+  this->CheckUnary(
+      "utf8_is_digit",
+      R"(["0", "٤", "۵", "३", "१२३", "٣٣", "²", "１２３", "٣٢", "٩", "①", "Ⅷ", "abc" , "⻁", ""])",
+      boolean(),
+      R"([true, true, true, true, true, true, true, true, true, true, true, false, false, false, false])");
 }
 
 TYPED_TEST(TestStringKernels, IsNumericUnicode) {
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc b/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc
@@ -138,9 +138,12 @@ static inline bool IsDecimalCharacterUnicode(uint32_t codepoint) {
 }
 
 static inline bool IsDigitCharacterUnicode(uint32_t codepoint) {
-  // Python defines this as Numeric_Type=Digit or Numeric_Type=Decimal.
-  // utf8proc has no support for this, this is the best we can do:
-  return HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_ND);
+  // Approximates Python's str.isdigit():
+  // returns true for Nd and No (e.g., '٣', '³'), but excludes Nl like Roman numerals
+  // ('Ⅷ') due to utf8proc limits.
+  // '¾' (vulgar fraction) is treated as a digit by utf8proc 'No'
+  return HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_ND,
+                                      UTF8PROC_CATEGORY_NO);
 }
 
 static inline bool IsNumericCharacterUnicode(uint32_t codepoint) {
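
The category-based check in this diff can be compared against Python's own `str.isdigit()` with a small standard-library sketch. The helper `is_digit_nd_no` below is a hypothetical stand-in written for illustration, not Arrow code: it assumes the kernel returns true only for non-empty strings whose code points all fall in general category `Nd` or `No`, which matches the expectations in the new test. The general categories come from Python's `unicodedata` module rather than utf8proc, so results may differ slightly between Unicode database versions.

```python
import unicodedata


def is_digit_nd_no(text: str) -> bool:
    """Hypothetical stand-in for the patched check (an assumption, not Arrow code).

    True only if the string is non-empty and every code point has
    general category Nd (decimal digit) or No (other number).
    """
    return bool(text) and all(
        unicodedata.category(ch) in ("Nd", "No") for ch in text
    )


# A few inputs from the new test, plus the vulgar fraction mentioned in the comments.
samples = ["0", "٤", "²", "①", "１２３", "Ⅷ", "abc", "", "¾"]

for s in samples:
    # Columns: input, category-based approximation, Python's str.isdigit()
    print(repr(s), is_digit_nd_no(s), s.isdigit())
```

In this sketch the two columns agree on every sample except `'¾'`: it has category `No`, so the category check accepts it, while `str.isdigit()` rejects it because its Numeric_Type is Numeric rather than Digit. Roman numerals such as `'Ⅷ'` have category `Nl` and are rejected by both.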
