Commit dc0f5a9
### Rationale for this change
`pyarrow.compute.utf8_is_digit` did not recognize valid Unicode digit characters (e.g., superscripts like `'³'`), diverging from the behavior of Python's built-in `str.isdigit()`.
This caused inconsistencies in downstream libraries like pandas when using PyArrow-backed StringDtype.
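The Python side of this divergence can be checked with the standard library alone. The snippet below is an illustrative sketch and not code from this PR; the helper name `describe` is hypothetical. It shows that `'³'` is a digit according to `str.isdigit()` even though its general category is `No` rather than `Nd`.

```python
import unicodedata


def describe(ch):
    """Return (general category, str.isdecimal(), str.isdigit()) for one character."""
    return unicodedata.category(ch), ch.isdecimal(), ch.isdigit()


# ASCII digit, superscript three, Arabic-Indic digit three
for ch in ["3", "³", "٣"]:
    print(repr(ch), describe(ch))
```

A check tied only to decimal digits would accept `'3'` and `'٣'` but reject `'³'`, which is the kind of mismatch described above.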
### What changes are included in this PR?
Updated the `IsDigitCharacterUnicode` implementation to cover a broader range of Unicode digits by replacing the category check with one that aligns with Python's `str.isdigit()` semantics.
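As a rough Python analogue of the two approaches (the actual change is in C++ and is not reproduced here), the sketch below contrasts a category-only check with one based on the Unicode digit value. Both function names are hypothetical, and the assumption that the old check matched only the `Nd` category is mine; the description says only that a "category check" was replaced.

```python
import unicodedata


def is_digit_by_category(ch):
    # Hypothetical stand-in for a category-only check (assumes Nd, decimal digits).
    return unicodedata.category(ch) == "Nd"


def is_digit_like_python(ch):
    # A character counts as a digit when Unicode assigns it a digit value,
    # which is the property str.isdigit() tests per character.
    return unicodedata.digit(ch, None) is not None


for ch in ["3", "³", "①", "½", "a"]:
    print(repr(ch), is_digit_by_category(ch), is_digit_like_python(ch), ch.isdigit())
```

Characters such as `'³'` and `'①'` carry a digit value without being decimal digits, so only the second predicate agrees with `str.isdigit()` for them; `'½'` has a numeric value but no digit value and is rejected by both.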
Added tests in `scalar_string_test.cc` to validate correct digit detection across diverse Unicode digit inputs.
### Are these changes tested?
Yes. New unit tests were added and pass successfully, verifying behavior on various Unicode digit characters.
### Are there any user-facing changes?
Yes. Users relying on `pc.utf8_is_digit()` will now get correct results for a wider range of Unicode digit characters, improving correctness and parity with Python semantics.
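One way to check parity from Python is to compare `pc.utf8_is_digit()` against `str.isdigit()` on the same inputs. This is a sketch only: the helper names are hypothetical, it assumes `pyarrow` is installed, and the Arrow result depends on whether the installed build includes this change.

```python
def python_isdigit(values):
    """Reference results from Python's built-in str.isdigit()."""
    return [v.isdigit() for v in values]


def arrow_isdigit(values):
    """Results from PyArrow's utf8_is_digit kernel (imports pyarrow lazily)."""
    import pyarrow as pa
    import pyarrow.compute as pc

    return pc.utf8_is_digit(pa.array(values, type=pa.string())).to_pylist()


def mismatches(values):
    """Return the inputs on which PyArrow and Python disagree."""
    expected = python_isdigit(values)
    actual = arrow_isdigit(values)
    return [v for v, e, a in zip(values, expected, actual) if e != a]
```

Calling `mismatches(["123", "³", "abc"])` on a build without the fix would be expected to report `'³'`; with the fix, the list should be empty for these inputs.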
* GitHub Issue: #46589
Lead-authored-by: iabhi4 <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
1 parent dfac0cc commit dc0f5a9
File tree: 2 files changed in cpp/src/arrow/compute/kernels (+15, -7 lines)
First file: original lines 1387-1390 replaced by new lines 1387-1395 (4 removed, 9 added).
Second file: original lines 141-143 replaced by new lines 141-146 (3 removed, 6 added).