-
Notifications
You must be signed in to change notification settings - Fork 187
FEAT - Adding decimal as parameter for ToFloat32
#1772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FEAT - Adding decimal as parameter for ToFloat32
#1772
Conversation
decimal as parameter for ToFloat32
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
skrub/tests/test_to_float.py
Outdated
| (",56", 0.56, ","), | ||
| ], | ||
| ) | ||
| def test_number_parsing(input_str, expected_float, decimal, df_module): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be worth adding tests for the code's behaviour in case of an invalid entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we should check a few weird cases and make sure they fail as expected
|
After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release. The current implementation is removing all thousands separators other than what is specified as the "decimal" separator, which is quite risky and may leads to problems. It's better to follow what pandas is doing, i.e., have both If there is some kind of weird string like Another check that may be considered is counting the number of Some additional comments:
I'll convert this back to draft and keep an eye on this for the next PR. |
|
When talking about the tests, it was mentioned to include 3 tests:
|
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
| During ``fit``, |ToFloat| attempts to convert all values in the column to | ||
| numeric values after automatically removing other possible thousands separators | ||
| (``,``, ``.``, space, apostrophe). If any value cannot be converted, the column | ||
| is rejected with a ``RejectColumn`` exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.
There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
doc/modules/column_level_featurizing/feature_engineering_numerical.rst
Outdated
Show resolved
Hide resolved
| ("1,,234", ".", ","), | ||
| ("1.23,45", ".", ","), | ||
| # decimal == thousand | ||
| ("123,456,789", ",", ","), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
|
Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale. My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in a good shape, it's just a matter of polishing at this point. |
…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>
…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>
…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
| if self.thousand is None: | ||
| self.thousand = "" # No thousand separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to the init, parameters should not be modified in the fit
This issue addresses #1728
The ToFloat transformer now includes a
decimalparameter that lets the user specify the decimal separator to use for the given column. Then, all the possible thousands separators are removed, and the decimal separator is converted to a.before the column is passed toto_float32.