BUG: read_csv with engine=pyarrow and numpy-nullable dtype #62053
Conversation
From the original issue, do you know where we are introducing the float conversion that loses precision when the desired result type is int?
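For context on the question above, here is a minimal sketch of where the lossy step can occur (illustrative only; assumes pyarrow and pandas are installed, and the column name and values are made up). An int64 Arrow column containing a null has no NumPy integer representation, so `to_pandas()` without a types mapper falls back to float64, and a later cast to `Int64` cannot recover the lost precision:

```python
import pyarrow as pa

# int64 column with a null: the default NumPy-backed conversion has no
# integer NA, so the column comes back as float64.
table = pa.table({"a": pa.array([1, None, 2**53 + 1], type=pa.int64())})

df = table.to_pandas()    # "a" is float64; 2**53 + 1 is no longer exact
out = df.astype("Int64")  # too late: precision was lost in the float step
print(out["a"].iloc[2])   # 9007199254740992 instead of 9007199254740993
```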
In …
OK I see, it's … What if in …
That’s basically what this is currently doing, just not in that function, since it is also called from other places. I’m out of town for a few days. If you feel strongly that this logic should live inside that function, I’ll move it when I get back.
Looking at this again, I'm skeptical of moving the logic into arrow_table_to_pandas. The trouble is that between the table.to_pandas() and the .astype conversions, we have to do a bunch of other csv-keyword-specific stuff like set_index and column renaming. (Just opened #62087 to clean that up a bit.) Shoehorning all of that into arrow_table_to_pandas would make it a really big function in a way that I think is a net negative.
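To illustrate the ordering being described, here is a simplified, hypothetical outline; the function name, parameters, and structure are illustrative, not the actual parser-wrapper code:

```python
def finalize_pyarrow_result(table, rename=None, index_col=None, dtype=None):
    # Hypothetical outline: keyword-specific handling sits between the
    # Arrow -> pandas conversion (where float64 can appear) and the final astype.
    frame = table.to_pandas()               # Arrow -> pandas conversion
    if rename is not None:
        frame.columns = rename              # csv-keyword-specific column renaming
    if index_col is not None:
        frame = frame.set_index(index_col)  # csv-keyword-specific index handling
    if dtype is not None:
        frame = frame.astype(dtype)         # user-requested dtype applied last
    return frame
```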
Sorry, in #62053 (comment), I meant for:

```diff
diff --git a/pandas/io/_util.py b/pandas/io/_util.py
index 6827fbe9c9..2e15bd3749 100644
--- a/pandas/io/_util.py
+++ b/pandas/io/_util.py
@@ -85,7 +85,14 @@ def arrow_table_to_pandas(
         else:
             types_mapper = None
     elif dtype_backend is lib.no_default or dtype_backend == "numpy":
-        types_mapper = None
+        # Avoid lossy conversion to float64
+        # Caller is responsible for converting to numpy type if needed
+        types_mapper = {
+            pa.int8(): pd.Int8Dtype(),
+            pa.int16(): pd.Int16Dtype(),
+            pa.int32(): pd.Int32Dtype(),
+            pa.int64(): pd.Int64Dtype(),
+        }
     else:
         raise NotImplementedError
```

And then each IO parser is responsible for manipulating this result based on the IO arguments.
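A rough sketch of what that per-parser responsibility could look like (hypothetical; the variable names and the downcast check here are illustrative, not the PR's actual code):

```python
import pandas as pd

# With the diff above, arrow_table_to_pandas would hand back nullable Int* columns.
# A parser asked for the default numpy backend could then downcast columns that
# contain no missing values, keeping nullable dtypes only where NAs are present.
frame = pd.DataFrame(
    {
        "a": pd.array([1, 2, 3], dtype="Int64"),
        "b": pd.array([1, None, 3], dtype="Int64"),
    }
)

for col in frame.columns:
    if isinstance(frame[col].dtype, pd.Int64Dtype) and not frame[col].isna().any():
        frame[col] = frame[col].astype("int64")

print(frame.dtypes)  # a -> int64, b -> Int64 (kept because of the missing value)
```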
That would mean adding that logic to each of the 7 places where arrow_table_to_pandas is called, so we would almost certainly be better off having it centralized. If we get #62087 in, then moving all the logic into arrow_table_to_pandas at least becomes a little less bulky, so I can give it a try.
closes #56136 (engine='pyarrow' and dtype Int64)
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Also makes this code path robust to always-distinguish behavior in #62040.
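As a hedged illustration of the user-facing behavior being fixed (the CSV contents are made up to show the precision issue; exact output depends on the installed pandas/pyarrow versions):

```python
import io
import pandas as pd

data = "a\n9007199254740993\n"  # 2**53 + 1: not exactly representable in float64

df = pd.read_csv(io.StringIO(data), engine="pyarrow", dtype="Int64")

# Before the fix the value round-tripped through float64 and came back as
# 9007199254740992; with the fix it should survive the conversion exactly.
print(df["a"].iloc[0])
```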