FEAT - Adding `decimal` as parameter for `ToFloat32` #1772

gabrielapgomezji · 2025-11-24T16:19:57Z

This PR addresses #1728

The `ToFloat` transformer now includes a `decimal` parameter that lets the user specify the decimal separator to use for the given column. All other possible thousands separators are then removed, and the decimal separator is converted to a `.` before the column is passed to `to_float32`.
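A minimal sketch of the behaviour described above, written on plain strings rather than dataframe columns. The function name `normalize_number` is hypothetical, and the list of candidate separators (comma, dot, space, apostrophe) is taken from the user-guide text in this PR; this is an illustration, not the code in `skrub/_to_float.py`.

```python
def normalize_number(text, decimal=","):
    """Strip every candidate separator except `decimal`, then parse the result."""
    # Candidate thousands separators: comma, dot, space, apostrophe.
    for sep in (",", ".", " ", "'"):
        if sep != decimal:
            text = text.replace(sep, "")
    # Python's float() expects "." as the decimal separator.
    return float(text.replace(decimal, "."))


print(normalize_number("1.234,56", decimal=","))  # 1234.56
print(normalize_number(",56", decimal=","))       # 0.56
print(normalize_number("1.5", decimal=","))       # 15.0
```

The last call shows the cost of stripping separators unconditionally: with `decimal=","`, the string `1.5` is silently read as `15.0`.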

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

skrub/_to_float.py

emassoulie · 2025-12-02T09:31:54Z

skrub/tests/test_to_float.py

+        (",56", 0.56, ","),
+    ],
+)
+def test_number_parsing(input_str, expected_float, decimal, df_module):


Might it be worth adding tests for the code's behaviour in case of an invalid entry?

yes, we should check a few weird cases and make sure they fail as expected

rcap107 · 2025-12-02T13:54:35Z

After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release.

The current implementation removes every possible thousands separator other than the one specified as the "decimal" separator, which is quite risky and may lead to problems. It's better to follow what pandas does, i.e., have both `decimal` and `thousands` as separator parameters. By default, the thousands separator should be None (so no replacement).

A weird string like `1,2.3,4` should not be parsed as a number. I am not sure how far we should go to parse something like `1,2.34` with decimal `.` and thousands `,`: it's not a format I recognize, but it would still be recognized as 12.34 rather than being rejected.

Another check that may be considered is counting the number of decimal separators and rejecting any value that has more than one.
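The proposal in this comment can be sketched as follows, again on plain strings. The function `to_float` and its error messages are hypothetical. The parameter names `decimal` and `thousands` follow the pandas convention mentioned above, and the default `thousands=None` means nothing is stripped.

```python
def to_float(text, decimal=".", thousands=None):
    """Parse `text` with explicit separators; raise ValueError otherwise."""
    if thousands is not None:
        text = text.replace(thousands, "")
    # Reject values with more than one decimal separator.
    if text.count(decimal) > 1:
        raise ValueError(f"more than one decimal separator in {text!r}")
    # A "." that is neither the decimal nor the thousands separator is invalid.
    if decimal != "." and "." in text:
        raise ValueError(f"unexpected '.' in {text!r}")
    return float(text.replace(decimal, "."))


print(to_float("1.234,56", decimal=",", thousands="."))  # 1234.56
print(to_float("1,2.34", decimal=".", thousands=","))    # 12.34
```

As the second call shows, this lenient version still accepts `1,2.34`, and it would also accept `1,2.3,4` with the same settings. Rejecting those requires checking where the thousands separators appear, not only removing them.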

Some additional comments:

While it's impossible to test all possible scenarios, tests should also include as many weird edge cases as we can come up with to see what could be the result.
The ToFloat docstring needs some more work to explain in more detail the behavior when decimal and thousands are set.

I'll convert this back to draft and keep an eye on this for the next PR.

gabrielapgomezji · 2025-12-17T11:24:40Z

When talking about the tests, it was mentioned to include 3 tests:

A test for Good inputs
A test for Bad Inputs
A test for bad parameters
I merged the last two, so the bad-inputs test also covers bad parameters. If it's better to have 3 separate tests instead of 2, I will modify it.

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

rcap107 · 2025-12-17T13:13:05Z

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

+During ``fit``, |ToFloat| attempts to convert all values in the column to
+numeric values after automatically removing other possible thousands separators
+(``,``, ``.``, space, apostrophe). If any value cannot be converted, the column
+is rejected with a ``RejectColumn`` exception.


I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.

There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
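One way the checks listed here could be expressed as a single regular expression is sketched below. This is not the pattern used in the PR. The names `build_number_pattern` and `is_number` are hypothetical, and the sketch assumes that parentheses wrap a whole number (as in accounting notation), which the discussion does not spell out.

```python
import re


def build_number_pattern(decimal=".", thousands=None):
    """Compile a strict pattern for numbers written with the given separators."""
    d = re.escape(decimal)
    if thousands:
        t = re.escape(thousands)
        # Either 1-3 leading digits followed by groups of exactly 3, or plain digits.
        integer = rf"(?:\d{{1,3}}(?:{t}\d{{3}})+|\d+)"
    else:
        integer = r"\d+"
    # Integer part with optional fraction, or a bare fraction such as ",56".
    mantissa = rf"(?:{integer}(?:{d}\d*)?|{d}\d+)"
    exponent = r"(?:[eE][+-]?\d+)?"  # optional scientific notation
    number = rf"{mantissa}{exponent}"
    # Optional sign, or the whole number wrapped in parentheses (assumption).
    return re.compile(rf"(?:[+-]?{number}|\({number}\))")


def is_number(text, decimal=".", thousands=None):
    return build_number_pattern(decimal, thousands).fullmatch(text) is not None


print(is_number("1,234.5", thousands=","))   # True
print(is_number("1,,234", thousands=","))    # False
print(is_number("1.23,45", thousands=","))   # False
print(is_number("1,2.34", thousands=","))    # False
print(is_number("(1,234)", thousands=","))   # True
print(is_number("1.5e3"))                    # True
```

Because the whole string must match, any character other than digits, the two separators, a sign, parentheses, or an exponent marker causes the value to be rejected.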

doc/modules/column_level_featurizing/feature_engineering_numerical.rst

skrub/_to_float.py

rcap107 · 2025-12-17T14:16:07Z

skrub/tests/test_to_float.py

+        ("1,,234", ".", ","),
+        ("1.23,45", ".", ","),
+        # decimal == thousand
+        ("123,456,789", ",", ","),


Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
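The separate parameter test requested here might look like the sketch below. The validator `check_separators` is hypothetical, and raising `ValueError` when `decimal` equals `thousand` is an assumption (the comment only names `ValueError` for `decimal=None`). The helper uses plain `try`/`except` so the example has no dependencies; in the project's test suite, `pytest.raises` would be the natural choice.

```python
def check_separators(decimal, thousand):
    """Hypothetical validator; the exception type for equal separators is assumed."""
    if decimal is None:
        raise ValueError("decimal must be a string, not None")
    if decimal == thousand:
        raise ValueError("decimal and thousand separators must differ")


def raises_value_error(decimal, thousand):
    """Return True if the parameter combination is rejected with ValueError."""
    try:
        check_separators(decimal, thousand)
    except ValueError:
        return True
    return False


# Bad parameters are rejected before any value is parsed.
assert raises_value_error(None, ",")
assert raises_value_error(",", ",")
# Valid combinations pass, including no thousands separator at all.
assert not raises_value_error(".", ",")
assert not raises_value_error(",", None)
```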

rcap107 · 2025-12-17T14:18:28Z

Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale.

My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in good shape; it's just a matter of polishing at this point.

rcap107 · 2025-12-18T10:48:44Z

skrub/_to_float.py

+        if self.thousand is None:
+            self.thousand = ""  # No thousand separator


This should be moved to `__init__`; parameters should not be modified in `fit`.
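The comment asks for the default to be handled in `__init__`. A related pattern, common in scikit-learn-style estimators, leaves the constructor argument untouched and resolves it into a local variable when fitting. The class below is a hypothetical stand-in working on lists of strings, not the real `ToFloat`.

```python
class ToFloatSketch:
    """Hypothetical stand-in for a transformer with separator parameters."""

    def __init__(self, decimal=".", thousand=None):
        # Store the arguments exactly as given.
        self.decimal = decimal
        self.thousand = thousand

    def fit_transform(self, values):
        # Resolve the default locally; self.thousand is never reassigned.
        thousand = "" if self.thousand is None else self.thousand
        out = []
        for value in values:
            if thousand:
                value = value.replace(thousand, "")
            out.append(float(value.replace(self.decimal, ".")))
        return out


converter = ToFloatSketch(decimal=",")
print(converter.fit_transform(["1,5", "2,25"]))  # [1.5, 2.25]
print(converter.thousand)                        # None
```

Either way, the value the user passed is still what they see on the estimator after fitting.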

WIP: Adding decimal conversion and tests

228c068

rcap107 changed the title ~~WIP: Adding decimal conversion and tests~~ FEAT - Adding decimal as parameter for ToFloat32 Nov 24, 2025

Added tests and examples

b06af7e

rcap107 mentioned this pull request Nov 24, 2025

FEAT - adding a heuristic for parsing units in string columns #1726

Draft

Added doctest skip

7622e7a

gabrielapgomezji marked this pull request as ready for review December 1, 2025 14:03

ggomezji added 4 commits December 1, 2025 15:05

Merge remote-tracking branch 'upstream/main' into 1728-ToFloat_improv…

fdefc75

…ement

Added documentation

aa28499

Added elipsis on doctests

41e0817

Fixed example doc

1cbec41

rcap107 reviewed Dec 1, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

ggomezji added 2 commits December 1, 2025 16:37

Improved users guide

1b74ee3

Fixed tests

a72c182

emassoulie reviewed Dec 2, 2025

rcap107 marked this pull request as draft December 2, 2025 13:54

ggomezji added 4 commits December 15, 2025 16:36

WIP: Improved column verification

441df5e

WIP: Removed pattern and include thousand separator

6b4339e

WIP: Regex modification for polars

806d7ea

Improved tests

489079d

ggomezji and others added 3 commits December 17, 2025 13:14

Improving the docstrings and documentation

2425963

Improving documentation

424376d

Merge branch 'main' into 1728-ToFloat_improvement

015d561

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 marked this pull request as ready for review December 17, 2025 13:00

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 reviewed Dec 17, 2025

doc/modules/column_level_featurizing/feature_engineering_numerical.rst (outdated)

rcap107 requested changes Dec 17, 2025

gabrielapgomezji and others added 7 commits December 17, 2025 16:19

Update doc/modules/column_level_featurizing/feature_engineering_numer…

24b0078

…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update doc/modules/column_level_featurizing/feature_engineering_numer…

e66309c

…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update doc/modules/column_level_featurizing/feature_engineering_numer…

acd44c2

…ical.rst Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_to_float.py

3436572

Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_to_float.py

8091bd6

Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_to_float.py

491f448

Co-authored-by: Riccardo Cappuzzo <[email protected]>

Update skrub/_to_float.py

3842b22

Co-authored-by: Riccardo Cappuzzo <[email protected]>

rcap107 reviewed Dec 18, 2025

rcap107 linked an issue Dec 18, 2025 that may be closed by this pull request

ToFloat fails when trying to parse numbers with "," decimal separators #1728

Open

4 participants
