Changes from 7 commits
59 commits
228c068
WIP: Adding decimal conversion and tests
ggomezji Nov 24, 2025
b06af7e
Added tests and examples
ggomezji Nov 24, 2025
7622e7a
Added doctest skip
ggomezji Dec 1, 2025
fdefc75
Merge remote-tracking branch 'upstream/main' into 1728-ToFloat_improv…
ggomezji Dec 1, 2025
aa28499
Added documentation
ggomezji Dec 1, 2025
41e0817
Added elipsis on doctests
ggomezji Dec 1, 2025
1cbec41
Fixed example doc
ggomezji Dec 1, 2025
1b74ee3
Improved users guide
ggomezji Dec 1, 2025
a72c182
Fixed tests
ggomezji Dec 1, 2025
441df5e
WIP: Improved column verification
ggomezji Dec 15, 2025
6b4339e
WIP: Removed pattern and include thousand separator
ggomezji Dec 15, 2025
806d7ea
WIP: Regex modification for polars
ggomezji Dec 15, 2025
489079d
Improved tests
ggomezji Dec 17, 2025
2425963
Improving the docstrings and documentation
ggomezji Dec 17, 2025
424376d
Improving documentation
ggomezji Dec 17, 2025
015d561
Merge branch 'main' into 1728-ToFloat_improvement
rcap107 Dec 17, 2025
24b0078
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
e66309c
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
acd44c2
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
3436572
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
8091bd6
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
491f448
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
3842b22
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
e5fe91f
WIP: Adding decimal conversion and tests
ggomezji Nov 24, 2025
c186201
Added tests and examples
ggomezji Nov 24, 2025
67f00c7
Added doctest skip
ggomezji Dec 1, 2025
47cc97d
Added documentation
ggomezji Dec 1, 2025
daa9557
Added elipsis on doctests
ggomezji Dec 1, 2025
292a5c1
Fixed example doc
ggomezji Dec 1, 2025
6821b32
Improved users guide
ggomezji Dec 1, 2025
9df1ba2
Fixed tests
ggomezji Dec 1, 2025
620bd12
WIP: Improved column verification
ggomezji Dec 15, 2025
0be30f3
WIP: Removed pattern and include thousand separator
ggomezji Dec 15, 2025
ec7d687
WIP: Regex modification for polars
ggomezji Dec 15, 2025
3e6dea1
Improved tests
ggomezji Dec 17, 2025
f8e63a6
Improving the docstrings and documentation
ggomezji Dec 17, 2025
50d9b47
Improving documentation
ggomezji Dec 17, 2025
36d1d8f
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
0f149f1
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
415aec1
Update doc/modules/column_level_featurizing/feature_engineering_numer…
gabrielapgomezji Dec 17, 2025
a19e149
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
1754d07
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
db44c3e
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
0b71c26
Update skrub/_to_float.py
gabrielapgomezji Dec 17, 2025
94435da
WIP
ggomezji Jan 19, 2026
90d00db
Merge branch '1728-ToFloat_improvement' of github.com:gabrielapgomezj…
ggomezji Jan 19, 2026
ce09e6a
Fix doctest and move default thousand value
ggomezji Jan 19, 2026
3a082ee
Fix bug in doctest properly
ggomezji Jan 19, 2026
cbed6fa
more docstring fix
rcap107 Jan 19, 2026
5e1dd9b
fixing more docstrings
rcap107 Jan 19, 2026
095f403
Reverting changes and cleaning up history
rcap107 Jan 20, 2026
8765b84
Merge
ggomezji Mar 30, 2026
de19268
New version of the ToFloat
ggomezji Mar 30, 2026
d4a6aef
Merge branch '1728-ToFloat_improvement' of github.com:gabrielapgomezj…
ggomezji Mar 30, 2026
ea84a6e
Improving To Float
ggomezji Mar 30, 2026
2398cd3
Fix doctest
ggomezji Mar 30, 2026
7d88a1b
Fix doc
ggomezji Mar 30, 2026
910efeb
Added modification
ggomezji Mar 31, 2026
888ea67
Included suggested comments
ggomezji Mar 31, 2026
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -34,6 +34,9 @@ New features
user select which tab should be opened when the ``TableReport`` is
rendered. :pr:`1737` by :user:`Riccardo Cappuzzo<rcap107>`.

- :class:`ToFloat` has a new ``decimal`` parameter that lets the user specify
  whether ``','`` or ``'.'`` is the decimal separator. It also handles negative
  numbers written in parentheses.
  :pr:`1772` by :user:`Gabriela Gómez Jiménez <gabrielapgomezji>`.

Changes
-------
112 changes: 112 additions & 0 deletions doc/modules/column_level_featurizing/feature_engineering_numerical.rst
@@ -0,0 +1,112 @@
.. |ToFloat| replace:: :class:`~skrub.ToFloat`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |Cleaner| replace:: :class:`~skrub.Cleaner`

.. _user_guide_feature_engineering_numeric_to_float:

Converting heterogeneous numeric values to uniform float32
==========================================================

Many tabular datasets contain numeric information stored as strings, mixed
representations, locale-specific formats, or other non-standard encodings.
Common issues include:

- Thousands separators (``1,234.56`` or ``1 234,56``)
- Use of apostrophes as separators (``4'567.89``)
- Negative numbers encoded inside parentheses (``(1,234.56)``)
- String columns that contain mostly numeric values, but with occasional invalid entries

To provide consistent numeric behavior, skrub includes the |ToFloat| transformer,
which standardizes all numeric-like columns to ``float32`` and handles a wide
range of real-world formatting issues.

The |ToFloat| transformer is used internally by both the |Cleaner| class and the
|TableVectorizer| to guarantee that downstream estimators receive clean and
uniform numeric data.

What |ToFloat| does
-------------------

The |ToFloat| transformer provides:

- **Automatic conversion to 32-bit floating-point values** (``float32``).
  This dtype is lightweight and fully supported by scikit-learn estimators.

- **Configurable parsing of decimal separators** through the ``decimal`` parameter:

  - either ``.`` (the default) or ``,`` can be declared as the decimal separator
  - the remaining candidate separators (``.``, ``,``, space, apostrophe) are
    treated as thousands separators and removed

- **Parentheses interpreted as negative numbers**, a common format in financial datasets:

  - ``(1,234.56)`` → ``-1234.56``

- **Scientific notation parsing** (e.g. ``1.23e+4``)

- **Handling of invalid or non-numeric values**:

  - During ``fit``: a column with non-convertible values is rejected with a
    ``RejectColumn`` exception
  - During ``transform``: invalid entries become ``NaN`` instead of failing

- **Rejection of categorical and datetime columns**, which should not be cast to numeric.

As with all skrub transformers, |ToFloat| behaves like a standard
scikit-learn transformer and is fully compatible with pipelines.
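
The string clean-up described above can be sketched on a single Python string. This is only an illustration that mirrors the three replacement steps added to ``skrub/_to_float.py`` in this PR; ``normalize_number_string`` is a hypothetical helper, not part of skrub, and the real transformer works on whole pandas or polars columns.

```python
import re

# Same candidate separators as POSSIBLE_SEPARATORS in this PR.
POSSIBLE_SEPARATORS = [".", ",", "'", " "]


def normalize_number_string(text, decimal="."):
    """Hypothetical helper: rewrite one string so that float() can parse it."""
    # Step 1: parentheses mark a negative number, "(1,234.56)" -> "-1,234.56".
    text = re.sub(r"^\((.*)\)$", r"-\1", text)
    # Step 2: drop every candidate separator that is not the decimal separator.
    others = [sep for sep in POSSIBLE_SEPARATORS if sep != decimal]
    text = "".join(ch for ch in text if ch not in others)
    # Step 3: normalize the decimal separator to ".".
    return text.replace(decimal, ".")


print(normalize_number_string("(1,234.56)"))             # -1234.56
print(normalize_number_string("4'567,89", decimal=","))  # 4567.89
print(normalize_number_string("1,234", decimal="."))     # 1234
print(normalize_number_string("1,234", decimal=","))     # 1.234
```

The last two calls show why the separator has to be declared rather than guessed: the same text ``1,234`` is a different number depending on ``decimal``.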

Examples
--------

Parsing numeric-formatted strings:

>>> import pandas as pd
>>> from skrub import ToFloat
>>> s = pd.Series(['1.1', None, '3.3'], name='x')
>>> ToFloat().fit_transform(s)
0    1.1
1    NaN
2    3.3
Name: x, dtype: float32

Comma as decimal separator, with a space or an apostrophe as thousands separator:

>>> s = pd.Series(["4 567,89", "4'567,89"], name="x")
>>> ToFloat(decimal=",").fit_transform(s) # doctest: +SKIP
0    4567.89
1    4567.89
Name: x, dtype: float32

Parentheses interpreted as negative numbers:

>>> s = pd.Series(["-1,234.56", "(1,234.56)"], name="neg")
>>> ToFloat().fit_transform(s) # doctest: +SKIP
0   -1234.56
1   -1234.56
Name: neg, dtype: float32

Scientific notation:

>>> s = pd.Series(["1.23e+4", "1.23E+4"])
>>> ToFloat(decimal=".").fit_transform(s)
0    12300.0
1    12300.0
dtype: float32

Columns that cannot be converted are rejected during ``fit``:

>>> s = pd.Series(['1.1', 'hello'], name='x')
>>> ToFloat(decimal=".").fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not convert column 'x' to numbers.
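
The difference between ``fit`` and ``transform`` can be modelled with a toy class. This is a sketch of the behaviour described in this guide, not skrub's implementation: ``TinyToFloat`` is a made-up name, it works on plain Python lists, and it raises ``ValueError`` where skrub raises ``RejectColumn``.

```python
import math


class TinyToFloat:
    """Toy stand-in for ToFloat: strict while fitting, lenient afterwards."""

    def fit_transform(self, values):
        out = []
        for value in values:
            if value is None:
                # Missing values are allowed and become NaN.
                out.append(math.nan)
                continue
            try:
                out.append(float(value))
            except ValueError as exc:
                # One bad entry rejects the whole column at fit time.
                raise ValueError(f"Could not convert {value!r} to a number.") from exc
        return out

    def transform(self, values):
        out = []
        for value in values:
            try:
                out.append(float(value))
            except (TypeError, ValueError):
                # After fitting, bad or missing entries are replaced by NaN.
                out.append(math.nan)
        return out


converter = TinyToFloat()
print(converter.fit_transform(["1.1", None, "3.3"]))  # [1.1, nan, 3.3]
print(converter.transform(["2.5", "hello"]))          # [2.5, nan]
```

Being strict at fit time lets a caller such as a vectorizer decide that a column is not numeric at all, while being lenient at transform time keeps a fitted pipeline from failing on a single bad value in new data.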

How |ToFloat| is used in skrub
------------------------------

The |ToFloat| transformer is used internally in:

- the |Cleaner|, to normalize all numeric-like columns before modeling
- the |TableVectorizer|, to ensure a consistent numeric dtype across all numeric features

This makes |ToFloat| a core building block of skrub’s handling of heterogeneous
tabular data.

``ToFloat`` ensures that downstream machine-learning models receive numeric data
that is clean, consistent, lightweight, and free of locale-specific quirks or
string-encoded values.
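
The trade-off behind "lightweight" can be seen with the standard library alone. The sketch below rounds a Python float (64-bit) to 32 bits with ``struct``; ``to_float32`` is a helper defined only for this illustration and is unrelated to skrub's internal conversion function of a similar name.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = to_float32(1234.56)
print(value)                                        # 1234.56005859375
print(struct.calcsize("f"), struct.calcsize("d"))   # 4 8
```

A ``float32`` takes half the memory of a ``float64`` but keeps only about seven significant decimal digits, so a value such as ``1234.56`` is stored as its nearest representable neighbour. This is worth keeping in mind when comparing converted values with exact equality.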
59 changes: 59 additions & 0 deletions skrub/_to_float.py
@@ -1,8 +1,30 @@
from . import _dataframe as sbd
from ._apply_to_cols import RejectColumn, SingleColumnTransformer
from ._dispatch import dispatch, raise_dispatch_unregistered_type

__all__ = ["ToFloat"]

POSSIBLE_SEPARATORS = [".", ",", "'", " "]


@dispatch
def _str_replace(col, pattern, decimal):
    raise_dispatch_unregistered_type(col, kind="Series")


@_str_replace.specialize("pandas", argument_type="Column")
def _str_replace_pandas(col, pattern, decimal):
    # "(1,234.56)" -> "-1,234.56": parentheses mark a negative number.
    col = col.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    # Drop every candidate separator that is not the decimal separator.
    col = col.str.replace("[" + "".join(pattern) + "]", "", regex=True)
    # Normalize the decimal separator to "." so the float cast can parse it.
    return col.str.replace(decimal, ".", regex=False)


@_str_replace.specialize("polars", argument_type="Column")
def _str_replace_polars(col, pattern, decimal):
    # Same three steps as the pandas version, with polars regex syntax.
    col = col.str.replace_all(r"^\((.*)\)$", r"-$1")
    col = col.str.replace_all("[" + "".join(pattern) + "]", "")
    return col.str.replace_all(decimal, ".", literal=True)


class ToFloat(SingleColumnTransformer):
"""
@@ -22,6 +44,13 @@ class ToFloat(SingleColumnTransformer):
During ``transform``, entries for which conversion fails are replaced by
null values.

Parameters
----------
decimal : str, default='.'
Character to recognize as the decimal separator when converting from
strings to floats. Other possible decimal separators are removed from
the strings before conversion.
Comment on lines +189 to +190
Member

not the case anymore

Suggested change
- strings to floats. Other possible decimal separators are removed from
- the strings before conversion.
+ strings to floats.


Examples
--------
>>> import pandas as pd
@@ -165,8 +194,34 @@ class ToFloat(SingleColumnTransformer):
>>> s = pd.Series([1.1, None], dtype='float32')
>>> to_float.fit_transform(s) is s
True

Handling parentheses around negative numbers
>>> s = pd.Series(["-1,234.56", "1,234.56", "(1,234.56)"], name='parens')
>>> to_float.fit_transform(s)
0 -1234.5...
1 1234.5...
2 -1234.5...
Name: parens, dtype: float32

Scientific notation
>>> s = pd.Series(["1.23e+4", "1.23E+4"], name="x")
>>> ToFloat(decimal=".").fit_transform(s)
0 12300.0
1 12300.0
Name: x, dtype: float32

Space or apostrophe as thousand separator
>>> s = pd.Series(["4 567,89", "4'567,89"], name="x")
>>> ToFloat(decimal=",").fit_transform(s)
0 4567.8...
1 4567.8...
Name: x, dtype: float32
""" # noqa: E501

    def __init__(self, decimal="."):
        super().__init__()
        self.decimal = decimal

def fit_transform(self, column, y=None):
"""Fit the encoder and transform a column.

@@ -191,6 +246,10 @@ def fit_transform(self, column, y=None):
f"with dtype '{sbd.dtype(column)}' to numbers."
)
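        try:
            if sbd.is_string(column):
                # Every candidate separator except the decimal one is treated
                # as a thousands separator and removed before the cast.
                p = [sep for sep in POSSIBLE_SEPARATORS if sep != self.decimal]
                column = _str_replace(column, pattern=p, decimal=self.decimal)
            numeric = sbd.to_float32(column, strict=True)
            return numeric
        except Exception as e: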
33 changes: 33 additions & 0 deletions skrub/tests/test_to_float.py
@@ -43,3 +43,36 @@ def test_rejected_columns(df_module):
ToFloat().fit_transform(col)
to_float = ToFloat().fit(df_module.make_column("c", [1.1]))
assert is_float32(df_module, to_float.transform(col))


@pytest.mark.parametrize(
    "input_str, expected_float, decimal",
    [
        ("1,234.56", 1234.56, "."),
        ("1.234,56", 1234.56, ","),
        ("1 234,56", 1234.56, ","),
        ("1234.56", 1234.56, "."),
        ("1234,56", 1234.56, ","),
        ("1,234,567.89", 1234567.89, "."),
        ("1.234.567,89", 1234567.89, ","),
        ("1 234 567,89", 1234567.89, ","),
        ("1'234'567.89", 1234567.89, "."),
        ("1.23e+4", 12300.0, "."),
        ("1.23E+4", 12300.0, "."),
        ("1,23e+4", 12300.0, ","),
        ("1,23E+4", 12300.0, ","),
        ("-1,234.56", -1234.56, "."),
        ("-1.234,56", -1234.56, ","),
        ("(1,234.56)", -1234.56, "."),
        ("(1.234,56)", -1234.56, ","),
        ("1,23,456.78", 123456.78, "."),
        ("12,3456.78", 123456.78, "."),
        (".56", 0.56, "."),
        (",56", 0.56, ","),
    ],
)
def test_number_parsing(input_str, expected_float, decimal, df_module):
Contributor

Might it be worth adding tests for the code's behaviour in case of an invalid entry?

Member

yes, we should check a few weird cases and make sure they fail as expected

    column = df_module.make_column("col", [input_str])
    result = ToFloat(decimal=decimal).fit_transform(column)

    # Without ``assert`` the comparison result is discarded and the test always passes.
    assert np.allclose(result[0], expected_float)
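
On the review suggestion about invalid entries, here is a rough sketch of which inputs the replacement rules in this diff would reject. It is a stand-in rather than a skrub test: ``parse_number`` and ``is_rejected`` are hypothetical helpers that apply the same three replacement steps to one string and then call Python's ``float()``, whose parsing may differ in corner cases from the pandas and polars casts used by ``sbd.to_float32``.

```python
import re


def parse_number(text, decimal="."):
    """Stand-in for the string path of ToFloat in this diff (assumption)."""
    # "(1,234.56)" -> "-1,234.56"
    text = re.sub(r"^\((.*)\)$", r"-\1", text)
    # Remove the candidate separators that are not the decimal separator.
    for sep in [".", ",", "'", " "]:
        if sep != decimal:
            text = text.replace(sep, "")
    return float(text.replace(decimal, "."))


def is_rejected(text, decimal="."):
    """Return True when the cleaned string still cannot be parsed."""
    try:
        parse_number(text, decimal)
    except ValueError:
        return True
    return False


# Entries that should fail.
assert is_rejected("hello")
assert is_rejected("1.2.3")        # two decimal points
assert is_rejected("(1,234.56")    # unbalanced parenthesis
assert is_rejected("(-1)")         # becomes "--1"
assert is_rejected("")

# Accepted because separators are removed wherever they appear.
assert not is_rejected("1,2,3")    # parsed as 123
```

The last case may be worth an explicit decision in the real test suite: misplaced thousands separators are silently accepted, which matches the ``"12,3456.78"`` case in the parametrized test above.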