3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -11,6 +11,9 @@ Ongoing Development

New features
------------
- :class:`ToFloat` has the parameter ``decimal`` to let the user specify whether ``','`` or ``'.'``
is used as the decimal separator, and it also handles negative numbers indicated with parentheses.
:pr:`1772` by :user:`Gabriela Gómez Jiménez <gabrielapgomezji>`.

Changes
-------
133 changes: 133 additions & 0 deletions doc/modules/column_level_featurizing/feature_engineering_numerical.rst
@@ -0,0 +1,133 @@
.. |ToFloat| replace:: :class:`~skrub.ToFloat`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |Cleaner| replace:: :class:`~skrub.Cleaner`

.. _user_guide_feature_engineering_numeric_to_float:

Converting heterogeneous numeric values to uniform float32
==========================================================

Many tabular datasets contain numeric information stored as strings, mixed
representations, locale-specific formats, or other non-standard encodings.
Common issues include:

- Thousands separators (``1,234.56`` or ``1 234,56``)
- Use of apostrophes as separators (``4'567.89``)
- Negative numbers encoded inside parentheses (``(1,234.56)``)
- String columns that contain mostly numeric values, but with occasional invalid entries

To provide consistent numeric behavior, skrub includes the |ToFloat| transformer,
which standardizes all numeric-like columns to ``float32`` and handles a wide
range of real-world formatting issues automatically.

The |ToFloat| transformer is used internally by both the |Cleaner| and the
|TableVectorizer| to guarantee that downstream estimators receive clean and
uniform numeric data.
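
For instance (a minimal sketch, assuming the default |Cleaner| configuration),
a dataframe column holding numbers as strings comes out as ``float32``:

>>> import pandas as pd
>>> from skrub import Cleaner
>>> df = pd.DataFrame({"amount": ["1.5", "2.5", None]})
>>> Cleaner().fit_transform(df)["amount"].dtype
dtype('float32')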

What |ToFloat| does
-------------------

The |ToFloat| transformer provides:

- **Automatic conversion to 32-bit floating-point values (``float32``).**
This dtype is lightweight and fully supported by scikit-learn estimators.

- **Parsing of decimal and thousands separators** across common locale conventions:
- The decimal separator must be specified explicitly and can be either ``.`` or ``,``
- The thousands separator can be one of ``.``, ``,``, space (``" "``), apostrophe (``'``),
or ``None`` (no thousands separator)
- The transformer supports integers, decimals (including leading-decimal forms such as ``.56`` or ``,56``), scientific notation,
and negative numbers
- Numbers in parentheses are interpreted as negative numbers (``(1,234.56)`` → ``-1234.56``). This format is common in financial datasets.
- Decimal and thousands separators must be different characters

- **Scientific notation parsing** (e.g. ``1.23e+4``)

- **Graceful handling of invalid or non-numeric values**:
- During ``fit``: non-convertible values raise a ``RejectColumn`` exception
- During ``transform``: invalid entries become ``NaN`` instead of failing

- **Rejection of categorical and datetime columns**, which should not be cast to numeric.

As with all skrub transformers, |ToFloat| behaves like a standard
scikit-learn transformer and is fully compatible with pipelines.
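
For instance, columns that are not numeric-like are refused rather than silently
cast; a minimal sketch, based on the rejection message used in the implementation:

>>> import pandas as pd
>>> from skrub import ToFloat
>>> s = pd.to_datetime(pd.Series(["2024-01-01", "2024-01-02"], name="when"))
>>> ToFloat().fit_transform(s)
Traceback (most recent call last):
...
skrub._single_column_transformer.RejectColumn: Refusing to cast column 'when' with dtype 'datetime64[ns]' to numbers.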

How to use |ToFloat|
--------------------

The |ToFloat| transformer is applied to individual columns, and it behaves
like a standard scikit-learn transformer.
|ToFloat| takes a ``decimal`` and a ``thousand`` separator, which are ``'.'`` and
``None`` (no thousands separator) by default.
Each column is expected to use a single separator for decimals and one for thousands:
if any characters other than the provided separators are encountered in a column, that
column will not be converted.
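
Passing the same character for both separators is rejected; a minimal sketch,
based on the validation performed in ``fit_transform``:

>>> import pandas as pd
>>> from skrub import ToFloat
>>> s = pd.Series(["1,5", "2,5"], name="x")
>>> ToFloat(decimal=",", thousand=",").fit_transform(s)
Traceback (most recent call last):
...
ValueError: The thousand and decimal separators must differ.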

During ``fit``, |ToFloat| attempts to convert all values in the column to
numeric values after automatically removing other possible thousands separators
(``,``, ``.``, space, apostrophe). If any value cannot be converted, the column
is rejected with a ``RejectColumn`` exception.
Comment on lines +65 to +68:

I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.

There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)


During ``transform``, invalid or non-convertible values are replaced by ``NaN``
instead of raising an error.

Examples
--------

Parsing numeric-formatted strings:

>>> import pandas as pd
>>> from skrub import ToFloat
>>> s = pd.Series(['1.1', None, '3.3'], name='x')
>>> ToFloat().fit_transform(s)
0 1.1
1 NaN
2 3.3
Name: x, dtype: float32

Locale-dependent decimal separators can be handled by specifying the
``decimal`` and ``thousand`` parameters. Here we use a comma as the decimal separator and
a space as the thousands separator:

>>> s = pd.Series(["4 567,89", "12 567,89"], name="x")
>>> ToFloat(decimal=",", thousand=" ").fit_transform(s)
0 4567.8...
1 12567.8...
Name: x, dtype: float32

Parentheses interpreted as negative numbers:

>>> s = pd.Series(["-1,234.56", "(1,234.56)"], name="neg")
>>> ToFloat(thousand=",").fit_transform(s)
0 -1234.5...
1 -1234.5...
Name: neg, dtype: float32

Scientific notation:

>>> s = pd.Series(["1.23e+4", "1.23E+4"])
>>> ToFloat(decimal=".").fit_transform(s)
0 12300.0
1 12300.0
dtype: float32

Columns that cannot be converted are rejected during ``fit``:

>>> s = pd.Series(['1.1', 'hello'], name='x')
>>> ToFloat(decimal=".").fit_transform(s)
Traceback (most recent call last):
...
skrub._single_column_transformer.RejectColumn: Could not convert column 'x' to numbers.


During ``transform``, invalid entries become ``NaN`` instead of raising an error:

>>> s = pd.Series(['1.1', '2.2'], name='x')
>>> to_float = ToFloat(decimal=".")
>>> to_float.fit_transform(s)
0 1.1
1 2.2
Name: x, dtype: float32

>>> to_float.transform(pd.Series(['3.3', 'invalid'], name='x'))
0 3.3
1 NaN
Name: x, dtype: float32
4 changes: 2 additions & 2 deletions skrub/_table_vectorizer.py
@@ -693,7 +693,7 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
We can inspect all the processing steps that were applied to a given column:

>>> vectorizer.all_processing_steps_['B']
[CleanNullStrings(), DropUninformative(), ToDatetime(), DatetimeEncoder(), {'B_day': ToFloat(), 'B_month': ToFloat(), ...}]
[CleanNullStrings(), DropUninformative(), ToDatetime(), DatetimeEncoder(), {'B_day': ToFloat(thousand=''), 'B_month': ToFloat(thousand=''), ...}]

Note that as the encoder (``DatetimeEncoder()`` above) produces multiple
columns, the last processing step is not described by a single transformer
@@ -768,7 +768,7 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
``ToDatetime()``:

>>> vectorizer.all_processing_steps_
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), DropUninformative(), ToFloat(), PassThrough(), {'C': ToFloat()}]}
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), DropUninformative(), ToFloat(thousand=''), PassThrough(), {'C': ToFloat(thousand='')}]}

Specifying several ``specific_transformers`` for the same column is not allowed.

143 changes: 143 additions & 0 deletions skrub/_to_float.py
@@ -1,9 +1,99 @@
import re

from . import _dataframe as sbd
from ._dispatch import dispatch, raise_dispatch_unregistered_type
from ._single_column_transformer import RejectColumn, SingleColumnTransformer

__all__ = ["ToFloat"]


def _build_number_regex(decimal, thousand):
# Escape decimal and thousand separators to use in regex
d = re.escape(decimal) # e.g., '.' → '\.'
t = re.escape(thousand) # e.g., '.' → '\.'

# Matches integer parts:
# Either:
# - one or more digits without thousand separators: \d+
# - or digits grouped by thousand separators: \d{1,3}(?:{t}\d{3})+
# e.g., '1,234' or '12,345,678'
integer = rf"(?:\d+|\d{{1,3}}(?:{t}\d{{3}})+)"

# Matches decimal part after the decimal separator
# e.g., '.456' or ',456' depending on locale
decimal_part = rf"{d}\d+"

# Matches optional scientific notation
# e.g., 'e10', 'E-5', 'e+3'
scientific = r"(?:[eE][+-]?\d+)?"

# Full number can be:
# - integer with optional decimal part
# - or only decimal part (like '.5')
number = rf"(?:{integer}(?:{decimal_part})?|{decimal_part})"

# Final regex:
# - optional parentheses around the number: \( ... \)?
# - optional leading + or - sign: [+-]?
# - optional scientific notation appended after the number
# Anchored to start (^) and end ($) of string
return rf"^\(?[+-]?(?:{number}{scientific})?\)?$"
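
# Illustrative sketch (not part of the module): with decimal="." and
# thousand=",", the compiled pattern accepts strings such as "1,234.56",
# "(1,234.56)", ".5" or "1.23e+4", and rejects "12,34" because thousands
# groups must contain exactly three digits. For example:
#
#   pattern = re.compile(_build_number_regex(".", ","), re.VERBOSE)
#   assert pattern.match("1,234.56")
#   assert pattern.match("(1,234.56)")
#   assert pattern.match("1.23e+4")
#   assert not pattern.match("12,34")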


@dispatch
def _str_is_valid_number(col, number_re):
raise_dispatch_unregistered_type(col, kind="Series")


@_str_is_valid_number.specialize("pandas", argument_type="Column")
def _str_is_valid_number_pandas(col, number_re):
# Check if all values in the column match the number regex.
# - Fill NaN values with empty string to avoid match errors.
# - Use `str.match` with `na=False` to treat empty/missing values as non-matching.
# - If any value does not match, raise RejectColumn with a descriptive message.
if not col.fillna("").str.match(number_re, na=False).all():
raise RejectColumn(f"Could not convert column {sbd.name(col)!r} to numbers.")
return True


@_str_is_valid_number.specialize("polars", argument_type="Column")
def _str_is_valid_number_polars(col, number_re):
# Check if all values in the column match the number regex.
# - Fill NaN values with empty string to avoid match errors.
# - Use `str.match` with `na=False` to treat empty/missing values as non-matching.
Suggested change
# - Use `str.match` with `na=False` to treat empty/missing values as non-matching.
# - Use `str.contains` with `literal=False` to treat empty/missing values as non-matching.

# - If any value does not match, raise RejectColumn with a descriptive message.
if not col.fill_null("").str.contains(number_re.pattern, literal=False).all():
raise RejectColumn(f"The pattern could not match the column {sbd.name(col)!r}.")
return True


@dispatch
def _str_replace(col, decimal, thousand):
raise_dispatch_unregistered_type(col, kind="Series")


@_str_replace.specialize("pandas", argument_type="Column")
def _str_replace_pandas(col, decimal, thousand):
# Replace parentheses around numbers with a leading minus sign
# e.g., "(123.45)" → "-123.45"
col = col.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
# Remove thousand separators
col = col.str.replace(thousand, "", regex=False)
# Replace decimal separator with '.'
return col.str.replace(decimal, ".", regex=False)


@_str_replace.specialize("polars", argument_type="Column")
def _str_replace_polars(col, decimal, thousand):
# Replace parentheses around numbers with a leading minus sign
# e.g., "(123.45)" → "-123.45"
col = col.str.replace_all(r"^\((.*)\)$", r"-$1")
# Remove thousand separators
col = col.str.replace_all(thousand, "", literal=True)
# Replace decimal separator with '.'
return col.str.replace_all(f"[{decimal}]", ".")
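
# Illustrative sketch (not part of the module): with decimal="," and
# thousand=" ", both backends rewrite "(1 234,56)" in three steps:
#   "(1 234,56)" -> "-1 234,56"   (surrounding parentheses become a leading minus)
#   "-1 234,56"  -> "-1234,56"    (thousand separators removed)
#   "-1234,56"   -> "-1234.56"    (decimal separator normalized to ".")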


class ToFloat(SingleColumnTransformer):
"""
Convert a column to 32-bit floating-point numbers.
@@ -22,6 +112,17 @@ class ToFloat(SingleColumnTransformer):
During ``transform``, entries for which conversion fails are replaced by
null values.

Parameters
----------
decimal : str, default='.'
Character to recognize as the decimal separator when converting from
strings to floats. Other possible decimal separators are removed from
the strings before conversion.
Comment on lines +119 to +120:

not the case anymore

Suggested change
strings to floats. Other possible decimal separators are removed from
the strings before conversion.
strings to floats.

thousand : str or None, default=None
Character used as thousands separator. Supported values are ``"."``,
``","``, space (``" "``), apostrophe (``"'"``), or ``None`` (no thousands
separator). The decimal and thousands separators must differ.

Examples
--------
>>> import pandas as pd
@@ -165,8 +266,36 @@ class ToFloat(SingleColumnTransformer):
>>> s = pd.Series([1.1, None], dtype='float32')
>>> to_float.fit_transform(s) is s
True

Negative numbers represented using parentheses are converted
so they use ``-`` instead.

>>> s = pd.Series(["-1,234.56", "1,234.56", "(1,234.56)"], name='parens')
>>> ToFloat(decimal=".", thousand=",").fit_transform(s)
0 -1234.5...
1 1234.5...
2 -1234.5...
Name: parens, dtype: float32

Numbers that use scientific notation are converted:

>>> s = pd.Series(["1.23e+4", "1.23E+4"], name="x")
>>> ToFloat(decimal=".").fit_transform(s)
0 12300.0
1 12300.0
Name: x, dtype: float32

It is possible to specify the thousands separator, e.g., to use " "
Suggested change
It is possible to specify the thousands separator, e.g., to use " "
It is possible to specify the thousands separator, e.g., to use ``" "``

>>> s = pd.Series(["4 567,89", "12 567,89"], name="x")
>>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS
ELLIPSIS is enabled by default

Suggested change
>>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS
>>> ToFloat(decimal=",", thousand=" ").fit_transform(s)

0 4567.8...
1 12567.8...
Name: x, dtype: float32
""" # noqa: E501

def __init__(self, decimal=".", thousand=None):
super().__init__()
self.decimal = decimal
self.thousand = "" if thousand is None else thousand

def fit_transform(self, column, y=None):
"""Fit the encoder and transform a column.

Expand All @@ -185,12 +314,26 @@ def fit_transform(self, column, y=None):
"""
del y
self.all_outputs_ = [sbd.name(column)]
if self.decimal is None:
raise ValueError("The decimal separator cannot be None.")
if self.thousand == self.decimal:
raise ValueError("The thousand and decimal separators must differ.")

if sbd.is_any_date(column) or sbd.is_categorical(column):
raise RejectColumn(
f"Refusing to cast column {sbd.name(column)!r} "
f"with dtype '{sbd.dtype(column)}' to numbers."
)
try:
if sbd.is_string(column):
self._number_re_ = re.compile(
_build_number_regex(self.decimal, self.thousand),
re.VERBOSE,
)
_str_is_valid_number(column, self._number_re_)
column = _str_replace(
column, decimal=self.decimal, thousand=self.thousand
)
numeric = sbd.to_float32(column, strict=True)
return numeric
except Exception as e: