Skip to content

Conversation

ptth222
Copy link
Contributor

@ptth222 ptth222 commented Mar 7, 2025

Trying to address #61072 since I created the issue. I've never contributed to pandas, but I have tried to follow what is in the guide.

Closes #61072

Pedro-Santos04 added a commit to Pedro-Santos04/pandas that referenced this pull request Mar 23, 2025
…nation in PyArrow strings

Fixes an issue where regex patterns with alternation (|) produce different results between str dtype and string[pyarrow] dtype. When using patterns like "(as)|(as)", PyArrow implementation would incorrectly match "asdf" while Python's implementation correctly rejects it. The fix adds special handling to ensure alternation patterns are properly parenthesized when using PyArrow-backed strings.
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@jorisvandenbossche
Copy link
Member

@ptth222 sorry you didn't get any feedback on the PR initially, but thanks for opening the issue and this PR! It looks like a good start to me, but will need to address some failing tests (and also add a new test case for what it is fixing)

@jorisvandenbossche jorisvandenbossche added Strings String extension data type and string data Arrow pyarrow functionality labels Sep 12, 2025
@jorisvandenbossche jorisvandenbossche added this to the 2.3.3 milestone Sep 12, 2025
@jorisvandenbossche jorisvandenbossche changed the title BUG: Addressing #61072 BUG: fix bug in str.fullmatch for Arrow backend with optional groups Sep 12, 2025
Added a test to confirm that the arrow implementation gives the same result as the python one. Also changed test_str_fullmatch to use the fullmatch method instead of the match method.
There were errors, so I changed the tests to try and make them pass.
Had to change expected result again because I missed one previously. Also had to change test structure to reuse parametrize.
There were arrows about the series types not being the same, so I tried to address that.
Trying again to get the result types correct so that the equal assertion works.
@ptth222
Copy link
Contributor Author

ptth222 commented Sep 17, 2025

I added a test that tests the arrow fullmatch against the python fullmatch. I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the match method and not the fullmatch method, so I changed it to use fullmatch. I also had to change some of the expected values to match what should be expected from fullmatch.

I can't for the life of me figure out what the problem is with the new test I added, test_str_fullmatch_against_python_fullmatch. It is producing AssertionError: Series NA mask are different, but both series should be the exact same values and types. I have tried using both ArrowDtype(pa.string()) and result.dtype in the astype method, but still get this same error. I can't reproduce this on my own machine. If someone has any insights into what the issue could be, I would love to hear it.

I didn't add any more tests except for the one for fullmatch, but it might be a good idea to add the same kind of test for match, search, etc. Assuming we can get the issues straightened out.

@jorisvandenbossche
Copy link
Member

I also might have found a mistake in the fullmatch test. "test_str_fullmatch" in tests/extension/test_arrow.py was calling the match method and not the fullmatch method, so I changed it to use fullmatch. I also had to change some of the expected values to match what should be expected from fullmatch.

Good catch! Thanks for updating that

I can't for the life of me figure out what the problem is with the new test I added, test_str_fullmatch_against_python_fullmatch. It is producing AssertionError: Series NA mask are different, but both series should be the exact same values and types.

The reason for this is because the ArrowDtype handles NA in the input Series differently, i.e. it propagates as missing to the result while the str dtype will propagate it as False. Initially I pushed a fix for that by removing the missing value from the test case. But eventually I moved the entire test to test_find_replace.py: there, it will automatically be run for the various string dtypes, including the ones backed by Python strings, and thus the expected result gets verified against Python that way.

@jorisvandenbossche
Copy link
Member

I didn't add any more tests except for the one for fullmatch

I think we still need to add a test for the actual issue you reported. Because right now, if I undo the actual code fix in _str_fullmatch, then all updated tests still pass, indicating the current tests don't cover the bug you are trying to fix.

@jorisvandenbossche
Copy link
Member

@ptth222 I pushed a small update to the parametrized test to include one case with the optional groups (or I don't know what the exact regex terminology is for this) from the issue, and add a whatsnew note.
Apologies for quickly pushing myself, but I want to include this fix in the upcoming 2.3.3. release that I am preparing right now.

@ptth222
Copy link
Contributor Author

ptth222 commented Sep 21, 2025

Sounds good to me. Thank you for the assistance.

@jorisvandenbossche jorisvandenbossche merged commit 08d21d7 into pandas-dev:main Sep 21, 2025
42 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @ptth222!

jorisvandenbossche pushed a commit that referenced this pull request Sep 21, 2025
… Arrow backend with optional groups) (#62401)

Co-authored-by: ptth222 <[email protected]>
jzwick pushed a commit to jzwick/pandas that referenced this pull request Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Arrow pyarrow functionality Strings String extension data type and string data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: str.fullmatch behavior is not the same for object dtype and string[pyarrow] dtype

3 participants