-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
PARTIAL FIX: Improve leading zeros preservation with dtype=str for dict-based dtypes #62242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Looks like AI |
I did use AI to help draft this. I tried setting up a pandas development environment (both via pip and the Docker image) to create a reproducible test case, but running pytest from the CLI kept failing with a message about pandas._libs. There seems to be a significant issue with the pyarrow implementation. Specifically, pyarrow does not enforce dtypes during load - it applies them afterward. As a result, integer-to-string conversions lose leading zeros. I wanted to at least contribute a working test that highlights this problem. |
@jbrockmendel test is passing now. I have it marked as |
We discourage AI-generated PRs since they take more time and effort to review than they do to write. I'll take a look since you took the time to get the CI passing, but in the future please avoid it. |
big picture adding a test isn't wrong, but not a good use of time. if you'd like to actually fix the issue, i think there are some good comments in the original thread |
I agree, but the fix doesn't appear to be very straightforward. Happy to work on it if there is some guidance on where to find the relevant pieces. It requires proper mapping of pandas dtypes to pyarrow types, and also handling other logic that pandas supports but pyarrow doesn't (e.g., col index-based dtypes, global dtypes, etc.). |
@jbrockmendel I dug into the pandas and PyArrow APIs and landed on a more general fix for the issue. I wasn't sure how to run This patch improves dtype handling during the pyarrow write path by converting supported column-specific dtypes into pyarrow types and passing them via convert_options["column_types"]. However, a few limitations remain: Known Issues / Remaining Work
|
any additional comments after review? |
Only that it looks like you're still using AI in which case a careful review isn't a good use of my time. |
@jbrockmendel I've put significant effort into understanding this bug and implementing the fix. While I used various resources for reference (as most developers do), this is my work and I stand behind it. Happy to discuss any specific concerns about the implementation. |
OK I'll take another look. |
PR for Arrow here, which addresses this on the arrow side: apache/arrow#47663 If approved, we can integrate global dtype via |
Eventually. We support older versions of pyarrow for a year (i think) |
if target_dtype: | ||
column_types[col] = target_dtype | ||
|
||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats an example where this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I seem to remember it failed some test. I can look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the try/except block to test the theory, but I'm getting some failures. Not sure if it's the test suite or the change itself. I was getting some recent failures in the test suite anyway... they just don't seem related.
If the test suite will pass, I'm fine leaving it out. I think there was some historical reason for including it, during some of my earlier attempts at making this work.
else: | ||
# TODO: Global dtypes not supported - may cause inconsistent behavior | ||
# between engines, especially for leading zero preservation | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they pass a singleton, can we do something like
convert_options["column_types"] = defaultdict(user_passed_dtype)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good thought, but it doesn't work. One of the first things I tried actually :) I documented a larger analysis on pyarrow here: apache/arrow#47502
You can see the relevant portion of pyarrow code here. Everything is mapped back to C++, and if the column name is not found, it uses the default (inferred) option.
Sure. In the meantime, we could still supply it as an argument. It would be ignored by pyarrow for versions with no support and start working once the new pyarrow is released. For example: def _resolve_pyarrow_type(dtype):
"""Try converting a pandas dtype to a pyarrow type, return None if unsupported."""
source_dtype = pandas_dtype(dtype)
try:
return to_pyarrow_type(source_dtype.type)
except TypeError:
# TODO: Unsupported dtypes silently ignored - may cause
# unexpected behavior when pyarrow applies default inference
# instead of user's dtype
return None
if self.dtype is not None:
if isinstance(self.dtype, dict):
column_types = {
col: pa_type
for col, col_dtype in self.dtype.items()
if (pa_type := _resolve_pyarrow_type(col_dtype)) is not None
}
if column_types:
self.convert_options["column_types"] = column_types
else:
if (default_column_type := _resolve_pyarrow_type(self.dtype)) is not None:
self.convert_options["default_column_type"] = default_column_type |
Summary
This PR partially addresses issue #57666 by improving leading zeros preservation when
dtype=str
is used with dictionary-based dtype specifications. While the globaldtype=str
issue with pyarrow engine remains unfixed, this PR resolves the problem for more targeted dtype specifications.Problem
Issue #57666 reports that the pyarrow engine does not preserve leading zeros in numeric-looking strings when
dtype=str
is specified, while other engines correctly preserve them.Solution
dtype={'col': str}
) now properly preserve leading zeros across all enginesdtype=str
still fails with pyarrow engine (marked with xfail for now)What's Fixed vs Still Broken
✅ Now Working:
Next Steps
This PR provides a foundation for the complete fix. The remaining work involves:
dtype
handlingChecklist
doc/source/whatsnew/v3.0.0.rst
Files Changed
pandas/io/parsers/arrow_parser_wrapper.py
- Fix for dict-based dtypespandas/tests/io/parser/test_preserve_leading_zeros.py
- Comprehensive test suiteTest Output
dtype=str
marked as xfail (temporary)