Description
I've been using the new Arrow-backed dtypes, and I'm a bit confused about how the backend is chosen. One example:
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
... pandas.Series([1, 2, 3, 4])
...
0 1
1 2
2 3
3 4
dtype: int64
Why is setting the dtype_backend to pyarrow not enough to use Arrow in the Series constructor when no dtype is specified?
Also, when using for example read_csv:
>>> import pandas
>>> pandas.read_csv('test.csv').dtypes
name object
age int64
dtype: object
>>> pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
name string[python]
age Int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
... pandas.read_csv('test.csv').dtypes
...
name object
age int64
dtype: object
>>> with pandas.option_context("mode.dtype_backend", "pyarrow"):
... pandas.read_csv('test.csv', use_nullable_dtypes=True).dtypes
...
name string[pyarrow]
age int64[pyarrow]
dtype: object
Again, why is it not enough for the user to set the backend to pyarrow to get Arrow dtypes? Why do they also need to pass use_nullable_dtypes=True? This is what we currently return, which doesn't make sense to me:
| | dtype_backend=None | dtype_backend=pyarrow |
|---|---|---|
| use_nullable_dtypes=False | NumPy | NumPy ??? |
| use_nullable_dtypes=True | Arrow+NumPy nullables | Arrow |
What I would expect:
| | dtype_backend=None | dtype_backend=pyarrow |
|---|---|---|
| use_nullable_dtypes=False | NumPy | Arrow |
| use_nullable_dtypes=True | Arrow eventually, Arrow+NumPy nullables for now | Arrow |
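To make the difference between the two tables explicit, here is a minimal sketch in plain Python. It makes no pandas calls and only transcribes the cells above; `OBSERVED`, `EXPECTED` and `differing_cells` are hypothetical names used for illustration, not anything in pandas.

```python
# Keys are (use_nullable_dtypes, dtype_backend); values are the backend
# named in the corresponding table cell of this issue.

# Transcribed from the "what we return" table.
OBSERVED = {
    (False, None): "NumPy",
    (False, "pyarrow"): "NumPy",
    (True, None): "Arrow+NumPy nullables",
    (True, "pyarrow"): "Arrow",
}

# Transcribed from the "what I would expect" table
# (the "Arrow eventually" cell is recorded with its "for now" value).
EXPECTED = {
    (False, None): "NumPy",
    (False, "pyarrow"): "Arrow",
    (True, None): "Arrow+NumPy nullables",
    (True, "pyarrow"): "Arrow",
}


def differing_cells():
    """Return the option combinations where the two tables disagree."""
    return [key for key in OBSERVED if OBSERVED[key] != EXPECTED[key]]


print(differing_cells())  # → [(False, 'pyarrow')]
```

Only one cell differs: the user has set the backend to pyarrow but has not passed use_nullable_dtypes=True.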
Sorry if I missed the discussion; maybe I'm just missing something. But I don't see the use case for a user explicitly asking for Arrow types with the option and still being given NumPy-backed series and dataframes. Was this behavior agreed on, or did we just not make the changes needed for a more intuitive behavior?
CC: @mroeschke