Conversation

@gabrielapgomezji
Contributor

@gabrielapgomezji gabrielapgomezji commented Nov 24, 2025

This PR addresses #1728

The ToFloat transformer now includes a decimal parameter that lets the user specify the decimal separator used in a given column. All other possible thousands separators are then removed, and the decimal separator is converted to . before the column is passed to to_float32.
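
As a rough illustration of the normalization described above, here is a minimal sketch in plain Python. The helper name `normalize_number` and the constant `CANDIDATE_SEPARATORS` are hypothetical and not taken from the PR; the candidate set (comma, period, space, apostrophe) follows the list given in the docstring excerpt quoted later in this thread.

```python
# Hypothetical helper, not the PR's actual code.
# Assumed candidate thousands separators: comma, period, space, apostrophe.
CANDIDATE_SEPARATORS = (",", ".", " ", "'")


def normalize_number(text: str, decimal: str = ".") -> str:
    """Drop every candidate separator except `decimal`, then map `decimal` to '.'."""
    for sep in CANDIDATE_SEPARATORS:
        if sep != decimal:
            text = text.replace(sep, "")
    return text.replace(decimal, ".")


print(float(normalize_number("1.234,56", decimal=",")))  # 1234.56
print(float(normalize_number(",56", decimal=",")))  # 0.56
```

Note that under this approach a malformed string such as `1,2.3,4` with `decimal="."` is silently normalized to `12.34` rather than rejected.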

@rcap107 rcap107 changed the title from "WIP: Adding decimal conversion and tests" to "FEAT - Adding decimal as parameter for ToFloat32" Nov 24, 2025
@gabrielapgomezji gabrielapgomezji marked this pull request as ready for review December 1, 2025 14:03
        (",56", 0.56, ","),
    ],
)
def test_number_parsing(input_str, expected_float, decimal, df_module):
Contributor

Might it be worth adding tests for the code's behaviour in case of an invalid entry?

Member

yes, we should check a few weird cases and make sure they fail as expected

@rcap107
Member

rcap107 commented Dec 2, 2025

After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release.

The current implementation removes every possible thousands separator other than the one specified as the "decimal" separator, which is quite risky and may lead to problems. It's better to follow what pandas does, i.e., have both decimal and thousands as parameters. By default, the thousands separator should be None (so no replacement).

If there is some kind of weird string like 1,2.3,4, it should not be parsed as a number. I am not sure how far we should go to handle something like 1,2.34 with decimal . and thousands ,: it's not a format I recognize, but it would still be recognized as 12.34 rather than being rejected.

Another check that may be considered is counting the number of decimal separators and rejecting any value that contains more than one.
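
To make the behavior discussed above concrete, here is a minimal sketch of a parser with separate decimal and thousands arguments. The function `parse_number` is hypothetical and only illustrates the checks mentioned in this comment (thousands defaulting to None, at most one decimal separator, no thousands separator after the decimal); it assumes both separators are non-empty strings and is not the PR's implementation.

```python
def parse_number(text, decimal=".", thousands=None):
    """Return a float, or None if `text` is not a well-formed number.

    Hypothetical sketch: `thousands=None` means no replacement is done.
    """
    # Reject values with more than one decimal separator.
    if text.count(decimal) > 1:
        return None
    integer_part, _, fraction_part = text.partition(decimal)
    # Thousands separators are only removed from the integer part.
    if thousands is not None:
        integer_part = integer_part.replace(thousands, "")
    sign = ""
    if integer_part[:1] in ("+", "-"):
        sign, integer_part = integer_part[0], integer_part[1:]
    digits = integer_part + fraction_part
    # Anything left besides ASCII digits makes the value invalid.
    if not digits or not (digits.isascii() and digits.isdigit()):
        return None
    return float(f"{sign}{integer_part or '0'}.{fraction_part or '0'}")


print(parse_number("1.234,56", decimal=",", thousands="."))  # 1234.56
print(parse_number("1,2.3,4", decimal=".", thousands=","))  # None
print(parse_number("1,2.34", decimal=".", thousands=","))  # 12.34
print(parse_number("1,234"))  # None, since thousands defaults to None
```

As noted above, this simple version still accepts `1,2.34` as 12.34 because it does not check how the thousands separators are grouped.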

Some additional comments:

  • While it's impossible to test all possible scenarios, tests should also include as many weird edge cases as we can come up with to see what could be the result.
  • The ToFloat docstring needs some more work to explain in more detail the behavior when decimal and thousands are set.

I'll convert this back to draft and keep an eye on this for the next PR.

@rcap107 rcap107 marked this pull request as draft December 2, 2025 13:54
@gabrielapgomezji
Contributor Author

When we discussed the tests, three were mentioned:

  • A test for good inputs
  • A test for bad inputs
  • A test for bad parameters

I merged the last two, so the bad-inputs test also covers bad parameters. If it's better to have three separate tests instead of two, I will modify it.

@rcap107 rcap107 marked this pull request as ready for review December 17, 2025 13:00
Comment on lines +65 to +68
During ``fit``, |ToFloat| attempts to convert all values in the column to
numeric values after automatically removing other possible thousands separators
(``,``, ``.``, space, apostrophe). If any value cannot be converted, the column
is rejected with a ``RejectColumn`` exception.
Member

I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.

There should also be an explanation of how the check is done (checking whether there are parentheses, checking whether thousands separators split the digits into groups of 3, handling scientific notation).
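
For readers following along, the kind of check described here could look roughly like the sketch below. It is not the regex used in the PR: `build_number_pattern` is a hypothetical helper, and treating a parenthesized value as a valid (accounting-style) number is an assumption about what the parentheses check is for.

```python
import re


def build_number_pattern(decimal=".", thousands=None):
    """Hypothetical validation pattern; the PR's actual regex may differ."""
    d = re.escape(decimal)
    if thousands:
        t = re.escape(thousands)
        # Either plain digits, or 1-3 digits followed by groups of exactly 3.
        integer = rf"(?:\d{{1,3}}(?:{t}\d{{3}})+|\d+)"
    else:
        integer = r"\d+"
    # Optional fractional part; a leading decimal separator (",56") is allowed.
    mantissa = rf"(?:{integer}(?:{d}\d*)?|{d}\d+)"
    # Optional scientific notation, e.g. 1.5e3.
    exponent = r"(?:[eE][+-]?\d+)?"
    unsigned = rf"{mantissa}{exponent}"
    # Assumption: parentheses wrap an unsigned value, as in accounting formats.
    return re.compile(rf"(?:[+-]?{unsigned}|\({unsigned}\))")


pattern = build_number_pattern(decimal=".", thousands=",")
for value in ["1,234.56", "(1,234)", "1.5e3", "1,,234", "1.23,45", "12,34"]:
    print(value, bool(pattern.fullmatch(value)))
```

Using `fullmatch` means any character other than digits, the two separators, a sign, parentheses, or an exponent marker causes the value to be rejected.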

("1,,234", ".", ","),
("1.23,45", ".", ","),
# decimal == thousand
("123,456,789", ",", ","),
Member

Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
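
A minimal sketch of what such a parameter test could look like, written without pytest so that it is self-contained. The helper `check_separators` and its error messages are hypothetical stand-ins for whatever validation ToFloat performs; only the two failure cases (decimal is None, decimal equal to the thousands separator) come from this comment.

```python
def check_separators(decimal, thousand=None):
    """Hypothetical validation helper; the name and messages are illustrative."""
    if decimal is None:
        raise ValueError("decimal must be a string, got None")
    if decimal == thousand:
        raise ValueError("decimal and thousand separators must differ")


def raises_value_error(decimal, thousand):
    """Return True if the parameter combination is rejected with ValueError."""
    try:
        check_separators(decimal, thousand)
    except ValueError:
        return True
    return False


# Bad parameters should raise ValueError.
assert raises_value_error(None, ",")
assert raises_value_error(",", ",")
# A valid combination should pass.
assert not raises_value_error(",", ".")
```

Keeping this separate from the bad-input test makes it clear which failures come from the data (RejectColumn) and which come from the configuration (ValueError).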

@rcap107
Member

rcap107 commented Dec 17, 2025

Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale.

My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in good shape; it's just a matter of polishing at this point.

Comment on lines +279 to +280
if self.thousand is None:
    self.thousand = ""  # No thousand separator
Member

This should be moved to the init; parameters should not be modified in fit.
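
One possible way to avoid modifying the parameter during fit is sketched below. Note that this is an alternative to resolving the default in the init, not the change requested above: the fallback is kept in a local variable so that `self.thousand` stays exactly as the user passed it. The class `ToFloatSketch` and the attribute `cleaned_` are hypothetical and exist only for this illustration.

```python
class ToFloatSketch:
    """Hypothetical stand-in for the transformer; not the real ToFloat."""

    def __init__(self, decimal=".", thousand=None):
        # Parameters are stored as given.
        self.decimal = decimal
        self.thousand = thousand

    def fit(self, values):
        # Resolve the default locally instead of overwriting self.thousand.
        thousand = "" if self.thousand is None else self.thousand
        self.cleaned_ = [
            (v.replace(thousand, "") if thousand else v).replace(self.decimal, ".")
            for v in values
        ]
        return self


sketch = ToFloatSketch(decimal=",").fit(["1,5", "2,25"])
print(sketch.cleaned_)  # ['1.5', '2.25']
print(sketch.thousand)  # None, unchanged by fit
```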

Development

Successfully merging this pull request may close these issues.

ToFloat fails when trying to parse numbers with "," decimal separators

4 participants