Labels: Enhancement, Strings
Description
Code Sample, a copy-pastable example if possible
import re
import pandas as pd
import regex
df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "1", "2"]})
pattern = r"\d"
df.b.str.match(pattern)
df.b.str.match(re.compile(pattern))
df.b.str.match(regex.compile(pattern))  # raises TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-eec2b9ae9613> in <module>()
9 df.b.str.match(pattern)
10 df.b.str.match(re.compile(pattern))
---> 11 df.b.str.match(regex.compile(pattern))
~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in match(self, pat, case, flags, na, as_indexer)
2421 def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
2422 result = str_match(self._data, pat, case=case, flags=flags, na=na,
-> 2423 as_indexer=as_indexer)
2424 return self._wrap_result(result)
2425
~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in str_match(arr, pat, case, flags, na, as_indexer)
736 flags |= re.IGNORECASE
737
--> 738 regex = re.compile(pat, flags=flags)
739
740 if (as_indexer is False) and (regex.groups > 0):
~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():
~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
298 return pattern
299 if not sre_compile.isstring(pattern):
--> 300 raise TypeError("first argument must be string or compiled pattern")
301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
TypeError: first argument must be string or compiled pattern
A simpler way to demonstrate the problem is:
re.compile(regex.compile(pattern))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-64-38578ab20aeb> in <module>()
----> 1 re.compile(regex.compile(pattern))
~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
231 def compile(pattern, flags=0):
232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
234
235 def purge():
~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
298 return pattern
299 if not sre_compile.isstring(pattern):
--> 300 raise TypeError("first argument must be string or compiled pattern")
301 p = sre_compile.compile(pattern, flags)
302 if not (flags & DEBUG):
TypeError: first argument must be string or compiled pattern
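Until such patterns are accepted, one possible workaround (not part of the original report) is to skip `.str.match` and call the compiled pattern's own `match` method on each value. The sketch below uses a hypothetical helper, `match_mask`, and only the standard library so that it runs without `regex` installed. It relies on nothing but a `.match` method, so it would be expected to behave the same way with a pattern from `regex.compile`.

```python
import re


def match_mask(values, compiled):
    """Return a list of booleans, True where compiled.match(value) succeeds.

    Hypothetical helper for illustration only. Non-string values (for
    example None) are reported as False here, whereas pandas' str.match
    would normally propagate a missing value instead.
    """
    return [isinstance(v, str) and compiled.match(v) is not None for v in values]


# Same data as column "b" in the example above.
mask = match_mask(["a", "1", "2"], re.compile(r"\d"))
print(mask)  # [False, True, True]
```

With a real Series, something like `df.b.map(lambda v: compiled.match(v) is not None)` should give an equivalent boolean mask, assuming the column contains only strings.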
Problem description
The regex library does not seem to be supported by pandas. I'm not sure whether you want to add support for it, but I had a quick look and it seems relatively straightforward to add (and it would make maintenance easier for projects that have already opted for regex).
How to fix
I think the required steps are:
- `pandas.core.dtypes.inference.is_re` should return True for `regex` compiled patterns too (assuming that `regex` is installed, of course).
- Make sure that `is_re` is called before `re.compile()` (as is being done e.g. here):
if not is_re(pat):
    pat = re.compile(pat, flags)
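The two steps above could be sketched roughly as follows. This is not the pandas implementation: `is_re_extended` and `ensure_compiled` are hypothetical names chosen for illustration, and the `regex` import is treated as optional so the sketch still runs when the library is not installed.

```python
import re

# Collect the compiled-pattern types that should be recognised.
_pattern_types = [type(re.compile(""))]
try:
    import regex  # optional third-party dependency
except ImportError:
    regex = None
else:
    # Assumption: take the type from a compiled pattern rather than
    # relying on a particular attribute name in the regex module.
    _pattern_types.append(type(regex.compile("")))
_PATTERN_TYPES = tuple(_pattern_types)


def is_re_extended(obj):
    """Hypothetical variant of is_re that also accepts regex patterns."""
    return isinstance(obj, _PATTERN_TYPES)


def ensure_compiled(pat, flags=0):
    """Compile pat with re only if it is not already a compiled pattern."""
    if not is_re_extended(pat):
        pat = re.compile(pat, flags)
    return pat
```

One consequence of this design is that `flags` is silently ignored when `pat` is already compiled, so a caller passing `case=False` together with a compiled pattern would need separate handling.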
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.17.5-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.3
pytest: 3.7.1
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None