Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
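import numpy as np
import pandas as pd
import pyarrow as pa

#%%
# pyarrow-backed floats: 0/0 in row 0 produces NaN, which isna() does not flag
df = pd.DataFrame({"A": [0, 1, 2, pd.NA, 1, 2], "B": [0, pd.NA, 2, pd.NA, 2, 4]}, dtype="float[pyarrow]")
df["rate"] = df["A"] / df["B"]
df.info()
df["rate"].isna()
#%%
# NumPy-backed floats: np.nan is flagged as missing
df = pd.DataFrame([1, 2, np.nan])
df.info()
df.isna()
#%%
# A pyarrow null created directly is flagged as missing
pd.Series([pa.NA], dtype="float[pyarrow]").isna()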
Issue Description
I deal with tables with a lot of missing data, and using the pyarrow backend is a must. However, I recently noticed that some intermediate results are not processed as they should be: applying isna to the rate column above results in the pyarrow NaN being seen as a regular number, whereas numpy.nan is processed as expected. Creating a DataFrame from scratch, though, behaves as it should.
Expected Behavior
The documentation says the following:
pandas/pandas/core/dtypes/missing.py
Lines 101 to 178 in 0691c5c
def isna(obj: object) -> bool | npt.NDArray[np.bool_] | NDFrame:
    """
    Detect missing values for an array-like object.

    This function takes a scalar or array-like object and indicates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).

    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.

    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.

    See Also
    --------
    notna : Boolean inverse of pandas.isna.
    Series.isna : Detect missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.

    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.

    >>> pd.isna('dog')
    False

    >>> pd.isna(pd.NA)
    True

    >>> pd.isna(np.nan)
    True

    ndarrays result in an ndarray of booleans.

    >>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
    >>> array
    array([[ 1., nan,  3.],
           [ 4.,  5., nan]])
    >>> pd.isna(array)
    array([[False,  True, False],
           [False, False,  True]])

    For indexes, an ndarray of booleans is returned.

    >>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
    ...                           "2017-07-08"])
    >>> index
    DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
                  dtype='datetime64[ns]', freq=None)
    >>> pd.isna(index)
    array([False, False,  True, False])

    For Series and DataFrame, the same type is returned, containing booleans.

    >>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
    >>> df
         0     1    2
    0  ant   bee  cat
    1  dog  None  fly
    >>> pd.isna(df)
           0      1      2
    0  False  False  False
    1  False   True  False

    >>> pd.isna(df[1])
    0    False
    1     True
    Name: 1, dtype: bool
    """
    return _isna(obj)
and I'd expect NaN to be seen the same regardless of the actual dtype; this is a well-defined IEEE 754 object. There is a hint as to what might be causing it here: apache/arrow#35535 (comment)
That's because pyarrow does not set that the NaN is the missing value indicator, and thus NaNs in the input are preserved.
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.6
python-bits : 64
OS : Linux
OS-release : 6.10.10-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Thu, 12 Sep 2024 17:21:02 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.1.2
Cython : None
sphinx : None
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 17.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None