Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd; print(pd.__version__)
import numpy as np; print(np.__version__)
N = 70_000_000
df = pd.DataFrame({'A': np.random.normal(4,1,N).astype(np.float32)})
print(np.mean(df['A'].values)) # Return 4.0000944 <-- Correct
print(np.mean(df['A'])) # Return 1.917656660079956 <-- Wrong !
print(df['A'].mean()) # Return 1.917656660079956 <-- written like this, it looks like a pandas-related bug
Issue Description
Hi,
It seems that when using float32, pandas messes up the mean() and var() functions once a column has more than about 34 million rows.
I suspected rounding errors at first, but it seems to be something far more fundamental than that.
Please note that this bug:
- is especially nasty since it neither produces a warning nor raises an Exception, yet returns a statistic that is completely wrong. The consequences for data pipelines and companies can be really big.
- mathematically, behaves as if all the elements after a certain index (sometimes 2**24, 2**25...) were treated as 0 for np.float32 (or NaN for other dtypes).
- happens at least for np.mean() and np.var(), but probably for other functions as well.
- may, in fact, be related to NumPy (or another library) and not pandas.
In terms of datatype, I managed to reproduce the bug for np.float32 and np.float16:
- float64: works OK at least up to (2**28)
- float32: OK up to 1.99 * (2**23), starts bugging at (2**24) (treats the last elements as 0)
- float16: OK up to 1.99 * (2**15), starts bugging at (2**16) (treats the last elements as NaN)
- np.int8, np.int16, np.int32, np.int64: work OK at least up to (2**28)
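
For context on why 2**24 is a natural threshold for float32, here is a minimal sketch that uses only the Python standard library, so neither pandas nor NumPy is involved. It shows one mechanism that can make late elements look as if they were 0: when a running total is itself stored as float32, adding a small value stops changing it once the total is large enough. Whether this is what actually happens inside mean() is an assumption, not something established in this report. The helpers `to_f32` and `naive_sum_f32` are hypothetical names introduced for illustration only.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values, start=0.0):
    """Left-to-right sum whose running total is rounded to float32 at every step.

    Hypothetical helper: it mimics an accumulator stored as float32.
    """
    total = to_f32(start)
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


# From 2**24 upward, neighbouring float32 values are 2 apart,
# so adding 1.0 to 2**24 rounds straight back to 2**24.
big = float(2**24)
print(to_f32(big + 1.0) == big)

# Start just below the threshold and add 1.0 ten times.
# The total climbs to 2**24 and then stops moving.
stuck = naive_sum_f32([1.0] * 10, start=float(2**24 - 5))
print(stuck, "vs exact", 2**24 + 5)
```

With a float64 accumulator the same ten additions are exact, which would be consistent with float64 working in the table above.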
Expected Behavior
In the above example, np.mean(df['A']) should return something close to 4.0.
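
Until the root cause is identified, a possible workaround is to request a float64 accumulator explicitly. The sketch below assumes that accumulation precision is the problem, which this report does not confirm. It uses a much smaller N than the reproducible example to stay cheap, so it only illustrates the calls and does not reproduce the wrong result.

```python
import numpy as np
import pandas as pd

# Much smaller than the 70_000_000 rows in the report (illustrative size only).
N = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.normal(4, 1, N).astype(np.float32)})

# Option 1: reduce the raw NumPy array and accumulate in float64.
mean_via_numpy = np.mean(df["A"].to_numpy(), dtype=np.float64)

# Option 2: upcast the column before calling the pandas reduction.
# This temporarily doubles the memory used by the column.
mean_via_cast = df["A"].astype(np.float64).mean()

print(mean_via_numpy, mean_via_cast)
```

Option 1 avoids the temporary float64 copy of the column, which matters at the sizes discussed in this issue.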
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-18-cloud-amd64
Version : #1 SMP Debian 4.19.208-1 (2021-09-29)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.6
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.2.0
Cython : 0.29.30
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.10.0
fastparquet : 0.8.1
gcsfs : 2021.10.0
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.17.4
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None