BUG: fix bug in str.fullmatch for Arrow backend with optional groups #61073

ptth222 · 2025-03-07T00:25:43Z

Trying to address #61072 since I created the issue. I've never contributed to pandas, but I have tried to follow what is in the guide.

Closes #61072

…nation in PyArrow strings

Fixes an issue where regex patterns with alternation (|) produce different results between the str dtype and the string[pyarrow] dtype. With a pattern like "(as)|(as)", the PyArrow implementation would incorrectly match "asdf", while Python's implementation correctly rejects it. The fix adds special handling so that alternation patterns are properly parenthesized when using PyArrow-backed strings.
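To make the precedence problem concrete, here is a minimal sketch using only Python's standard `re` module. It does not call pandas or PyArrow; it assumes, for illustration only, that the Arrow code path emulated a full match by appending `$` to the user's pattern, and it shows why that goes wrong once the pattern contains a top-level `|`.

```python
import re

pat = "(as)|(as)"
text = "asdf"

# Reference behaviour: the whole string must match, so this is None.
reference = re.fullmatch(pat, text)

# Appending "$" without grouping anchors only the last alternative:
# the pattern now reads `(as)` OR `(as)$`, and the first branch
# happily matches the leading "as" of "asdf".
unguarded = re.match(pat + "$", text)

# Wrapping the pattern in a non-capturing group makes "$" apply to
# every alternative, which restores the full-match semantics.
grouped = re.match("(?:" + pat + ")$", text)

print(reference, unguarded, grouped)
```

Alternation has the lowest precedence in regular expressions, so any anchor added by string concatenation has to be applied to a grouped pattern rather than to the raw one.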

mroeschke · 2025-03-24T16:44:53Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

jorisvandenbossche · 2025-09-12T16:12:34Z

@ptth222 sorry you didn't get any feedback on the PR initially, but thanks for opening the issue and this PR! It looks like a good start to me, but will need to address some failing tests (and also add a new test case for what it is fixing)

ptth222 · 2025-09-17T03:18:03Z

I added a test that tests the arrow fullmatch against the python fullmatch. I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the match method and not the fullmatch method, so I changed it to use fullmatch. I also had to change some of the expected values to match what should be expected from fullmatch.

I can't for the life of me figure out what the problem is with the new test I added, `test_str_fullmatch_against_python_fullmatch`. It fails with `AssertionError: Series NA mask are different`, but both series should have exactly the same values and types. I have tried using both `ArrowDtype(pa.string())` and `result.dtype` in the `astype` method, but still get the same error. I can't reproduce this on my own machine. If someone has any insight into what the issue could be, I would love to hear it.

I didn't add any more tests except for the one for fullmatch, but it might be a good idea to add the same kind of test for match, search, etc., assuming we can get the issues straightened out.

jorisvandenbossche · 2025-09-21T12:42:11Z

> I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the match method and not the fullmatch method, so I changed it to use fullmatch. I also had to change some of the expected values to match what should be expected from fullmatch.

Good catch! Thanks for updating that

> I can't for the life of me figure out what the problem is with the new test I added, test_str_fullmatch_against_python_fullmatch. It is producing AssertionError: Series NA mask are different, but both series should be the exact same values and types.

The reason is that ArrowDtype handles NA in the input Series differently: it propagates NA as missing to the result, while the str dtype propagates it as False. Initially I pushed a fix for that by removing the missing value from the test case. But eventually I moved the entire test to test_find_replace.py: there, it is automatically run for the various string dtypes, including the ones backed by Python strings, and thus the expected result gets verified against Python that way.

jorisvandenbossche · 2025-09-21T12:46:22Z

> I didn't add any more tests except for the one for fullmatch

I think we still need to add a test for the actual issue you reported. Because right now, if I undo the actual code fix in _str_fullmatch, then all updated tests still pass, indicating the current tests don't cover the bug you are trying to fix.
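The point about coverage can be sketched in plain Python. The two helpers below are hypothetical stand-ins (not pandas functions) for the behaviour before and after the fix, assuming the fix amounts to grouping the pattern before anchoring it. The check compares each against `re.fullmatch` over a list of cases.

```python
import re


def naive_anchor(pat):
    # Hypothetical pre-fix behaviour: anchor without grouping.
    return "^" + pat + "$"


def grouped_anchor(pat):
    # Hypothetical post-fix behaviour: group first, then anchor.
    return "^(?:" + pat + ")$"


def disagreements(anchor, cases):
    """Return the (pattern, text) pairs where the anchored search
    gives a different answer than re.fullmatch."""
    bad = []
    for pat, text in cases:
        expected = re.fullmatch(pat, text) is not None
        actual = re.search(anchor(pat), text) is not None
        if expected != actual:
            bad.append((pat, text))
    return bad


# Cases without alternation cannot tell the two helpers apart.
simple_cases = [("ab", "ab"), ("ab", "abc"), ("a[0-9]", "a3"), ("a[0-9]", "a33")]

# Adding the pattern from the issue exposes the unfixed behaviour.
with_alternation = simple_cases + [("(as)|(as)", "asdf")]

print(disagreements(naive_anchor, simple_cases))
print(disagreements(naive_anchor, with_alternation))
print(disagreements(grouped_anchor, with_alternation))
```

With only the simple cases, the unfixed helper reports no disagreements, which mirrors a test suite that still passes after the fix is reverted. A regression test needs at least one input on which the old and new code differ.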

jorisvandenbossche · 2025-09-21T13:12:45Z

@ptth222 I pushed a small update to the parametrized test to include one case with the optional groups from the issue (I don't know what the exact regex terminology is for this), and added a whatsnew note.
Apologies for quickly pushing this myself, but I want to include this fix in the upcoming 2.3.3 release that I am preparing right now.

ptth222 · 2025-09-21T13:58:18Z

Sounds good to me. Thank you for the assistance.

jorisvandenbossche · 2025-09-21T14:02:35Z

Thanks @ptth222!

Update _arrow_string_mixins.py

519ea79

Address pandas-dev#61072.

ptth222 mentioned this pull request Mar 7, 2025

BUG: str.fullmatch behavior is not the same for object dtype and string[pyarrow] dtype #61072

Closed

3 tasks

mroeschke closed this Mar 24, 2025

jorisvandenbossche reopened this Sep 12, 2025

jorisvandenbossche added Strings String extension data type and string data Arrow pyarrow functionality labels Sep 12, 2025

jorisvandenbossche added this to the 2.3.3 milestone Sep 12, 2025

jorisvandenbossche changed the title ~~BUG: Addressing #61072~~ BUG: fix bug in str.fullmatch for Arrow backend with optional groups Sep 12, 2025

jorisvandenbossche reviewed Sep 12, 2025

pandas/core/arrays/_arrow_string_mixins.py (outdated)

ptth222 added 6 commits September 15, 2025 13:35

Updated _arrow_string_mixins.py

93ad579

Made the changes suggested by @jorisvandenbossche.

Update test_arrow.py

f99bcd7

Added a test to confirm that the arrow implementation gives the same result as the python one. Also changed test_str_fullmatch to use the fullmatch method instead of the match method.

Update test_arrow.py

46cb440

There were errors, so I changed the tests to try and make them pass.

Update test_arrow.py

cd1ebce

Had to change expected result again because I missed one previously. Also had to change test structure to reuse parametrize.

Update test_arrow.py

c8d6048

There were errors about the series types not being the same, so I tried to address that.

Update test_arrow.py

0796834

Trying again to get the result types correct so that the equal assertion works.

jorisvandenbossche added 3 commits September 21, 2025 13:52

Merge remote-tracking branch 'upstream/main' into patch-1

26f64dd

remove missing value in extra test

70e82b1

move test comparing with python to test_find_replace.py

c424093

jorisvandenbossche added 2 commits September 21, 2025 15:08

add test case with optional groups

cb6a9d2

add whatsnew

6fd088d

jorisvandenbossche approved these changes Sep 21, 2025

jorisvandenbossche merged commit 08d21d7 into pandas-dev:main Sep 21, 2025
42 checks passed

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Sep 21, 2025

Backport PR pandas-dev#61073: BUG: fix bug in str.fullmatch for Arrow…

b93c451

… backend with optional groups

meeseeksmachine mentioned this pull request Sep 21, 2025

Backport PR #61073 on branch 2.3.x (BUG: fix bug in str.fullmatch for Arrow backend with optional groups) #62401

Merged

jorisvandenbossche pushed a commit that referenced this pull request Sep 21, 2025

Backport PR #61073 on branch 2.3.x (BUG: fix bug in str.fullmatch for…

fd40f9a

… Arrow backend with optional groups) (#62401) Co-authored-by: ptth222 <[email protected]>

jorisvandenbossche mentioned this pull request Sep 22, 2025

BUG: fix bug in str.match for Arrow backend with optional groups #62410

Merged

4 tasks

jzwick pushed a commit to jzwick/pandas that referenced this pull request Oct 1, 2025

BUG: fix bug in str.fullmatch for Arrow backend with optional groups (p…

e02a24d

…andas-dev#61073) Co-authored-by: Joris Van den Bossche <[email protected]>

eicchen pushed a commit to eicchen/pandas that referenced this pull request Oct 18, 2025

BUG: fix bug in str.fullmatch for Arrow backend with optional groups (p…

cafc03b

…andas-dev#61073) Co-authored-by: Joris Van den Bossche <[email protected]>
