-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
test_regex_groups_warning.py
import pandas as pd
fruits = pd.Series(["apple", "banana", "apple and banana"])
def test_user_warning_in_series():
print("fruits:\n", fruits)
print("UNNAMED REGEX") # less accessible
nand_query = r"^(?!(?=.*apple)(?=.*banana)).*$"
print(f"'{nand_query}'")
filtered = fruits[fruits.str.contains(nand_query)]
print("filtered:\n", filtered)
nand_condition = ~(
filtered.str.contains("head", case=False)
& filtered.str.contains("neck", case=False)
)
print("nand_condition:\n", nand_condition)
if not nand_condition.all():
failures = filtered[~nand_condition]
print("failures:\n", failures)
assert nand_condition.all()
assert 0
def test_user_warning_with_names():
print("fruits:\n", fruits)
print("NAMED REGEX") # more clear about what is matched and why:
named_nand_query = r"(?P<nand>^(?P<not_>(?!(?P<and_>(?=.*apple)(?=.*banana)))).*$)"
print(f"'{named_nand_query}'")
filtered_named = fruits[fruits.str.contains(named_nand_query)]
print("filtered_named:\n", filtered_named)
named_nand_condition = ~(
filtered_named.str.contains("head", case=False)
& filtered_named.str.contains("neck", case=False)
)
print("named_nand_condition:\n", named_nand_condition)
if not named_nand_condition.all():
named_failures = filtered_named[~named_nand_condition]
print("named_failures:\n", named_failures)
assert named_nand_condition.all()
assert 0Issue Description
@phofl wrote re: PR 56763 to remove this warning
Please open issues to facilitate a discussion before you open a PR
Sorry, my bad, here is an issue to discuss the warning I proposed to remove.
PRE-TL;DR:
- Named groups aid regex understanding and maintenance
- The warning logic also slows down pandas regex perf
result of running reproducible example:
This warning is annoying and serves to discourage use of named capturing groups (which are more maintainable) because users must either switch to extracting the groups (not always necessary) or replace their named groups with "(?:" (noncapturing groups are harder to maintain because it's less clear what is their objective within a greater regex pattern).
If users need to specialize their regex patterns to each command, then they need to maintain multiple copies, some with non-captured groups, some without, just to silence some warning, also, if they remove the groups, then later on when they want to use them, they might have to figure out how to replace groups they removed just to silence a warning, and be frustrated.
The logical condition for the warning also compiles the pattern but then the compiled pattern is discarded, so this warning slows down every "contains" in pandas just to check if we should annoy people who probably know what they're doing.
If we remove this unnecessary warning, then we no longer discourage users who use named capturing groups, thus facilitating readability of the patterns, and portability to other contexts, such as debuggers or the "extract" method mentioned in the removed warning.
TL;DR: This warning doesn't need to exist, discourages best practices, and slows performance of every string contains query in pandas (a super "hot path!"), so I suggest we remove it.
here is a permalink to the line of code to check for the warning condition
dtype: bool
"""
- if regex and re.compile(pat).groups: # compiled and discarded the pattern
- warnings.warn(
- "This pattern is interpreted as a regular expression, and has "
- "match groups. To actually get the groups, use str.extract.",
- UserWarning,
- stacklevel=find_stack_level(),
- )
result = self._data.array._str_contains(pat, case, flags, na, regex)
return self._wrap_result(result, fill_value=na, returns_string=False)just compile the pattern and use the compiled pattern:
dtype: bool
"""
result = self._data.array._str_contains(pat, case, flags, na, regex)
return self._wrap_result(result, fill_value=na, returns_string=False)
``
optionally (may cause issues depending on _str_contains logic:
```diff
- result = self._data.array._str_contains(pat, case, flags, na, regex)
+ compiled_pattern = re.compile(pat, flags)
+ result = self._data.array._str_contains(compiled_pattern, case, flags, na, regex)Expected Behavior
no warning for containment queries using groups
Installed Versions
python
Python 3.10.9 (main, Mar 1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.import pandas as pd
pd.show_versions()
/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
Traceback (most recent call last):
File "", line 1, in
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 141, in show_versions
deps = _get_dependency_info()
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py", line 98, in _get_dependency_info
mod = import_optional_dependency(modname, errors="ignore")
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py", line 132, in import_optional_dependency
module = importlib.import_module(name)
File "/home/bion/miniconda3/envs/py310/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/init.py", line 42, in
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/init.py", line 3, in
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/numba/np/ufunc/decorators.py", line 3, in
from numba.np.ufunc import _internal
SystemError: initialization of _internal failed without raising an exception
Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.
Copyright (c) 2011-2023, Open source contributors.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Location: /home/bion/miniconda3/envs/py310/lib/python3.10/site-packages
Requires: numpy, python-dateutil, pytz, tzdata


