Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# Create CSV with missing values
with open('missing_values.csv', 'w') as f:
    f.write('''
int_col_a,int_col_b,float_col,string_col,all_missing_col
1,1,1.1,'one',
2,,,,
3,3,3.3,'three',''')
# Load CSV
df = pd.read_csv('missing_values.csv')
# Notice that all missing values are NaN
df
#    int_col_a  int_col_b  float_col string_col  all_missing_col
# 0          1        1.0        1.1      'one'              NaN
# 1          2        NaN        NaN        NaN              NaN
# 2          3        3.0        3.3    'three'              NaN
# Write to Parquet, then read it back
df.to_parquet('missing_values.parquet', engine='pyarrow')
df2 = pd.read_parquet('missing_values.parquet')
# Notice that there's a None in string_col
df2
#    int_col_a  int_col_b  float_col string_col  all_missing_col
# 0          1        1.0        1.1      'one'              NaN
# 1          2        NaN        NaN       None              NaN
# 2          3        3.0        3.3    'three'              NaN
Issue Description
pd.read_parquet() loads missing values in string columns as None instead of NaN. This can cause problems in downstream ML models.
Missing values in non-string columns are correctly loaded as NaN.
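To illustrate the difference without needing a Parquet file, here is a minimal sketch. It assumes the column comes back as object dtype with None in the missing slot, as in the output above; the Series below is a hypothetical stand-in for df2['string_col'], and the map-based normalization is one possible workaround, not a fix in pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2['string_col'] as returned by pd.read_parquet():
# an object-dtype column whose missing entry is None rather than NaN.
string_col = pd.Series(["'one'", None, "'three'"], dtype=object, name="string_col")

# pandas' own missing-value check treats None and NaN alike.
missing_mask = string_col.isna()

# Downstream code that expects a float NaN sentinel can be given one by
# replacing None explicitly (workaround sketch, assumed for illustration).
normalized = string_col.map(lambda value: np.nan if value is None else value)

print(missing_mask.tolist())
print(repr(string_col.iloc[1]), repr(normalized.iloc[1]))
```

Code that relies on isna() is unaffected by the None; the problem only shows up in code that compares against NaN directly or passes values to functions that expect floats.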
Expected Behavior
Missing values in string columns should also be NaN.
Installed Versions
INSTALLED VERSIONS
commit : a60ad39
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Wed Jul 5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.2
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 58.0.4
pip : 21.2.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None