FEAT - Adding decimal as parameter for ToFloat32
#1772
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| .. |ToFloat| replace:: :class:`~skrub.ToFloat` | ||
| .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` | ||
| .. |Cleaner| replace:: :class:`~skrub.Cleaner` | ||
|
|
||
| .. _user_guide_feature_engineering_numeric_to_float: | ||
|
|
||
| Converting heterogeneous numeric values to uniform float32 | ||
| ========================================================== | ||
|
|
||
| Many tabular datasets contain numeric information stored as strings, mixed | ||
| representations, locale-specific formats, or other non-standard encodings. | ||
| Common issues include: | ||
|
|
||
| - Thousands separators (``1,234.56`` or ``1 234,56``) | ||
| - Use of apostrophes as separators (``4'567.89``) | ||
| - Negative numbers encoded inside parentheses (``(1,234.56)``) | ||
| - String columns that contain mostly numeric values, but with occasional invalid entries | ||
|
|
||
| To provide consistent numeric behavior, skrub includes the |ToFloat| transformer, | ||
| which standardizes all numeric-like columns to ``float32`` and handles a wide | ||
| range of real-world formatting issues automatically. | ||
|
|
||
| The |ToFloat| transformer is used internally by both the |Cleaner| and the | ||
| |TableVectorizer| to guarantee that downstream estimators receive clean and | ||
| uniform numeric data. | ||
|
|
||
| What |ToFloat| does | ||
| ------------------- | ||
|
|
||
| The |ToFloat| transformer provides: | ||
|
|
||
| - **Automatic conversion to 32-bit floating-point values (``float32``).** | ||
| This dtype is lightweight and fully supported by scikit-learn estimators. | ||
|
|
||
| - **Parsing of decimal and thousands separators** used in different locales: | ||
| - The decimal separator must be specified explicitly and can be either ``.`` or ``,`` | ||
| - The thousands separator can be one of ``.``, ``,``, space (``" "``), apostrophe (``'``), | ||
| or ``None`` (no thousands separator) | ||
| - The transformer supports integers, decimals (including leading-decimal forms such as ``.56`` or ``,56``), scientific notation | ||
| and negative numbers | ||
| - Numbers in parentheses are interpreted as negative numbers (``(1,234.56)`` → ``-1234.56``). This format is common in financial datasets. | ||
| - Decimal and thousands separators must be different characters | ||
|
|
||
| - **Scientific notation parsing** (e.g. ``1.23e+4``) | ||
|
|
||
| - **Handling of invalid or non-numeric values**: | ||
| - During ``fit``: non-convertible values raise a ``RejectColumn`` exception | ||
| - During ``transform``: invalid entries become ``NaN`` instead of failing | ||
|
|
||
| - **Rejection of categorical and datetime columns**, which should not be cast to numeric. | ||
|
|
||
| As with all skrub transformers, |ToFloat| behaves like a standard | ||
| scikit-learn transformer and is fully compatible with pipelines. | ||
|
|
||
| How to use |ToFloat| | ||
| -------------------- | ||
| The |ToFloat| transformer must be applied to individual columns, and it behaves | ||
| like a standard scikit-learn transformer. | ||
| |ToFloat| takes a ``decimal`` and a ``thousand`` separator, which are ``'.'`` and | ||
| ``None`` (no thousands separator) by default. | ||
| Each column is expected to use a single separator for decimals, and one for thousands: | ||
| if any separator other than the provided ones is encountered in the column, it will not | ||
| be converted. | ||
|
|
||
| During ``fit``, |ToFloat| checks every string value against a pattern built from the | ||
| ``decimal`` and ``thousand`` separators: an optional sign or enclosing parentheses, | ||
| digits optionally grouped in threes by the thousands separator, an optional decimal | ||
| part, and optional scientific notation. If any value cannot be converted, the column | ||
| is rejected with a ``RejectColumn`` exception. | ||
|
|
||
| During ``transform``, invalid or non-convertible values are replaced by ``NaN`` | ||
| instead of raising an error. | ||
|
|
||
| Examples | ||
| -------- | ||
|
|
||
| Parsing numeric-formatted strings: | ||
|
|
||
| >>> import pandas as pd | ||
| >>> from skrub import ToFloat | ||
| >>> s = pd.Series(['1.1', None, '3.3'], name='x') | ||
| >>> ToFloat().fit_transform(s) | ||
| 0 1.1 | ||
| 1 NaN | ||
| 2 3.3 | ||
| Name: x, dtype: float32 | ||
|
|
||
| Locale-dependent decimal separators can be handled by specifying the | ||
| ``decimal`` and ``thousand`` parameters. Here we use a comma as the decimal separator, and | ||
| a space as the thousands separator: | ||
|
|
||
| >>> s = pd.Series(["4 567,89", "12 567,89"], name="x") | ||
| >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) | ||
| 0 4567.8... | ||
| 1 12567.8... | ||
| Name: x, dtype: float32 | ||
|
|
||
| Parentheses interpreted as negative numbers: | ||
|
|
||
| >>> s = pd.Series(["-1,234.56", "(1,234.56)"], name="neg") | ||
| >>> ToFloat(thousand=",").fit_transform(s) | ||
| 0 -1234.5... | ||
| 1 -1234.5... | ||
| Name: neg, dtype: float32 | ||
|
|
||
| Scientific notation: | ||
|
|
||
| >>> s = pd.Series(["1.23e+4", "1.23E+4"]) | ||
| >>> ToFloat(decimal=".").fit_transform(s) | ||
| 0 12300.0 | ||
| 1 12300.0 | ||
| dtype: float32 | ||
|
|
||
| Columns that cannot be converted are rejected during ``fit``: | ||
|
|
||
| >>> s = pd.Series(['1.1', 'hello'], name='x') | ||
| >>> ToFloat(decimal=".").fit_transform(s) | ||
| Traceback (most recent call last): | ||
| ... | ||
| skrub._single_column_transformer.RejectColumn: Could not convert column 'x' to numbers. | ||
|
|
||
|
|
||
| During ``transform``, invalid entries become ``NaN`` instead of raising an error: | ||
|
|
||
| >>> s = pd.Series(['1.1', '2.2'], name='x') | ||
| >>> to_float = ToFloat(decimal=".") | ||
| >>> to_float.fit_transform(s) | ||
| 0 1.1 | ||
| 1 2.2 | ||
| Name: x, dtype: float32 | ||
|
|
||
| >>> to_float.transform(pd.Series(['3.3', 'invalid'], name='x')) | ||
| 0 3.3 | ||
| 1 NaN | ||
| Name: x, dtype: float32 | ||
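
The string handling described in the documentation above can be approximated in plain Python. This is only a sketch under stated assumptions, not skrub's implementation: `normalize_number` and `to_float_or_nan` are hypothetical helpers, they work on single strings rather than on columns, and they return ordinary Python floats (64-bit) rather than `float32` values.

```python
import math
import re


def normalize_number(text, decimal=".", thousand=None):
    """Hypothetical helper: rewrite a formatted number so float() can parse it."""
    # "(123.45)" -> "-123.45": enclosing parentheses denote a negative number.
    text = re.sub(r"^\((.*)\)$", r"-\1", text)
    # Drop the thousands separator, if one was given.
    if thousand:
        text = text.replace(thousand, "")
    # float() expects "." as the decimal separator.
    return text.replace(decimal, ".")


def to_float_or_nan(text, decimal=".", thousand=None):
    """Hypothetical helper: mimic the transform-time behavior (invalid -> NaN)."""
    try:
        return float(normalize_number(text, decimal, thousand))
    except ValueError:
        return math.nan


for raw, kwargs in [
    ("4 567,89", {"decimal": ",", "thousand": " "}),
    ("(1,234.56)", {"thousand": ","}),
    ("1.23e+4", {}),
    ("invalid", {}),
]:
    print(repr(raw), "->", to_float_or_nan(raw, **kwargs))
```

The sketch replaces failures with NaN unconditionally, which corresponds to what the documentation describes for ``transform``; rejecting the whole column during ``fit`` is a separate step that it does not attempt to reproduce.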
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -1,9 +1,99 @@ | ||||||||
| import re | ||||||||
|
|
||||||||
| from . import _dataframe as sbd | ||||||||
| from ._dispatch import dispatch, raise_dispatch_unregistered_type | ||||||||
| from ._single_column_transformer import RejectColumn, SingleColumnTransformer | ||||||||
|
|
||||||||
| __all__ = ["ToFloat"] | ||||||||
|
|
||||||||
|
|
||||||||
| def _build_number_regex(decimal, thousand): | ||||||||
|
||||||||
| # Escape decimal and thousand separators to use in regex | ||||||||
| d = re.escape(decimal) # e.g., '.' → '\.'; ',' is returned unchanged | ||||||||
| t = re.escape(thousand) # e.g., '.' → '\.', ' ' → '\ '; '' stays '' | ||||||||
|
|
||||||||
| # Matches integer parts: | ||||||||
| # Either: | ||||||||
| # - one or more digits without thousand separators: \d+ | ||||||||
| # - or digits grouped by thousand separators: \d{1,3}(?:{t}\d{3})+ | ||||||||
| # e.g., '1,234' or '12,345,678' | ||||||||
| integer = rf"(?:\d+|\d{{1,3}}(?:{t}\d{{3}})+)" | ||||||||
|
|
||||||||
| # Matches decimal part after the decimal separator | ||||||||
| # e.g., '.456' or ',456' depending on locale | ||||||||
| decimal_part = rf"{d}\d+" | ||||||||
|
|
||||||||
| # Matches optional scientific notation | ||||||||
| # e.g., 'e10', 'E-5', 'e+3' | ||||||||
| scientific = r"(?:[eE][+-]?\d+)?" | ||||||||
|
|
||||||||
| # Full number can be: | ||||||||
| # - integer with optional decimal part | ||||||||
| # - or only decimal part (like '.5') | ||||||||
| number = rf"(?:{integer}(?:{decimal_part})?|{decimal_part})" | ||||||||
|
|
||||||||
| # Final regex: | ||||||||
| # - optional opening and closing parentheses: \(? ... \)? | ||||||||
| # - optional leading + or - sign: [+-]? | ||||||||
| # - optional scientific notation appended after `number` | ||||||||
| # - the number itself is optional, so an empty string also matches | ||||||||
| # Anchored to start (^) and end ($) of string | ||||||||
| return rf"^\(?[+-]?(?:{number}{scientific})?\)?$" | ||||||||
|
|
||||||||
|
|
||||||||
| @dispatch | ||||||||
| def _str_is_valid_number(col, number_re): | ||||||||
| raise_dispatch_unregistered_type(col, kind="Series") | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_is_valid_number.specialize("pandas", argument_type="Column") | ||||||||
| def _str_is_valid_number_pandas(col, number_re): | ||||||||
| # Check that all values in the column match the number regex. | ||||||||
| # - Fill missing values with an empty string, which the regex accepts | ||||||||
| # because the number itself is optional. | ||||||||
| # - `na=False` makes any remaining missing value count as non-matching. | ||||||||
| # - If any value does not match, raise RejectColumn with a descriptive message. | ||||||||
| if not col.fillna("").str.match(number_re, na=False).all(): | ||||||||
| raise RejectColumn(f"Could not convert column {sbd.name(col)!r} to numbers.") | ||||||||
| return True | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_is_valid_number.specialize("polars", argument_type="Column") | ||||||||
| def _str_is_valid_number_polars(col, number_re): | ||||||||
| # Check that all values in the column match the number regex. | ||||||||
| # - Fill null values with an empty string, which the regex accepts. | ||||||||
| # - Use `str.contains` with `literal=False`; the pattern is anchored with ^ and $, | ||||||||
| # so the whole string has to match. | ||||||||
|
|
||||||||
| # - If any value does not match, raise RejectColumn with a descriptive message. | ||||||||
| if not col.fill_null("").str.contains(number_re.pattern, literal=False).all(): | ||||||||
| raise RejectColumn(f"The pattern could not match the column {sbd.name(col)!r}.") | ||||||||
| return True | ||||||||
|
|
||||||||
|
|
||||||||
| @dispatch | ||||||||
| def _str_replace(col, decimal, thousand): | ||||||||
| raise_dispatch_unregistered_type(col, kind="Series") | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_replace.specialize("pandas", argument_type="Column") | ||||||||
| def _str_replace_pandas(col, decimal, thousand): | ||||||||
| # Replace parentheses around numbers with a leading minus sign | ||||||||
| # e.g., "(123.45)" → "-123.45" | ||||||||
| col = col.str.replace(r"^\((.*)\)$", r"-\1", regex=True) | ||||||||
|
||||||||
| # Remove thousand separators | ||||||||
| col = col.str.replace(thousand, "", regex=False) | ||||||||
| # Replace decimal separator with '.' | ||||||||
| return col.str.replace(decimal, ".", regex=False) | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_replace.specialize("polars", argument_type="Column") | ||||||||
| def _str_replace_polars(col, decimal, thousand): | ||||||||
| # Replace parentheses around numbers with a leading minus sign | ||||||||
| # e.g., "(123.45)" → "-123.45" | ||||||||
| col = col.str.replace_all(r"^\((.*)\)$", r"-$1") | ||||||||
| # Remove thousand separators | ||||||||
| col = col.str.replace_all(thousand, "", literal=True) | ||||||||
| # Replace decimal separator with '.' | ||||||||
| return col.str.replace_all(decimal, ".", literal=True) | ||||||||
|
|
||||||||
|
|
||||||||
| class ToFloat(SingleColumnTransformer): | ||||||||
| """ | ||||||||
| Convert a column to 32-bit floating-point numbers. | ||||||||
|
|
@@ -22,6 +112,17 @@ class ToFloat(SingleColumnTransformer): | |||||||
| During ``transform``, entries for which conversion fails are replaced by | ||||||||
| null values. | ||||||||
|
|
||||||||
| Parameters | ||||||||
| ---------- | ||||||||
| decimal : str, default='.' | ||||||||
| Character to recognize as the decimal separator when converting from | ||||||||
| strings to floats. Other possible decimal separators are removed from | ||||||||
| the strings before conversion. | ||||||||
|
Comment on lines +119 to +120
Member
not the case anymore
|
||||||||
| thousand : str or None, default=None | ||||||||
| Character used as thousands separator. Supported values are ``"."``, | ||||||||
| ``","``, space (``" "``), apostrophe (``"'"``), or ``None`` (no thousands | ||||||||
| separator). The decimal and thousands separators must differ. | ||||||||
|
|
||||||||
| Examples | ||||||||
| -------- | ||||||||
| >>> import pandas as pd | ||||||||
|
|
@@ -165,8 +266,36 @@ class ToFloat(SingleColumnTransformer): | |||||||
| >>> s = pd.Series([1.1, None], dtype='float32') | ||||||||
| >>> to_float.fit_transform(s) is s | ||||||||
| True | ||||||||
|
|
||||||||
| Negative numbers represented using parentheses are converted | ||||||||
| so they use "-" instead. | ||||||||
| >>> s = pd.Series(["-1,234.56", "1,234.56", "(1,234.56)"], name='parens') | ||||||||
| >>> ToFloat(decimal=".", thousand=",").fit_transform(s) | ||||||||
| 0 -1234.5... | ||||||||
| 1 1234.5... | ||||||||
| 2 -1234.5... | ||||||||
| Name: parens, dtype: float32 | ||||||||
|
|
||||||||
| Numbers that use scientific notation are converted: | ||||||||
| >>> s = pd.Series(["1.23e+4", "1.23E+4"], name="x") | ||||||||
| >>> ToFloat(decimal=".").fit_transform(s) | ||||||||
| 0 12300.0 | ||||||||
| 1 12300.0 | ||||||||
| Name: x, dtype: float32 | ||||||||
|
|
||||||||
| It is possible to specify the decimal and thousands separators, e.g., to use "," and " ": | ||||||||
|
|
||||||||
| >>> s = pd.Series(["4 567,89", "12 567,89"], name="x") | ||||||||
| >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS | ||||||||
|
Member
ELLIPSIS is enabled by default
|
||||||||
| 0 4567.8... | ||||||||
| 1 12567.8... | ||||||||
| Name: x, dtype: float32 | ||||||||
| """ # noqa: E501 | ||||||||
|
|
||||||||
| def __init__(self, decimal=".", thousand=None): | ||||||||
| super().__init__() | ||||||||
| self.decimal = decimal | ||||||||
| self.thousand = "" if thousand is None else thousand | ||||||||
|
|
||||||||
| def fit_transform(self, column, y=None): | ||||||||
| """Fit the encoder and transform a column. | ||||||||
|
|
||||||||
|
|
@@ -185,12 +314,26 @@ def fit_transform(self, column, y=None): | |||||||
| """ | ||||||||
| del y | ||||||||
| self.all_outputs_ = [sbd.name(column)] | ||||||||
| if self.decimal is None: | ||||||||
| raise ValueError("The decimal separator cannot be None.") | ||||||||
| if self.thousand == self.decimal: | ||||||||
| raise ValueError("The thousand and decimal separators must differ.") | ||||||||
|
|
||||||||
| if sbd.is_any_date(column) or sbd.is_categorical(column): | ||||||||
| raise RejectColumn( | ||||||||
| f"Refusing to cast column {sbd.name(column)!r} " | ||||||||
| f"with dtype '{sbd.dtype(column)}' to numbers." | ||||||||
| ) | ||||||||
| try: | ||||||||
| if sbd.is_string(column): | ||||||||
| self._number_re_ = re.compile( | ||||||||
| _build_number_regex(self.decimal, self.thousand), | ||||||||
| re.VERBOSE, | ||||||||
| ) | ||||||||
| _str_is_valid_number(column, self._number_re_) | ||||||||
| column = _str_replace( | ||||||||
| column, decimal=self.decimal, thousand=self.thousand | ||||||||
| ) | ||||||||
| numeric = sbd.to_float32(column, strict=True) | ||||||||
| return numeric | ||||||||
| except Exception as e: | ||||||||
|
|
||||||||
I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.
There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
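
The check mentioned in the comment above (parentheses, thousands in groups of three digits, scientific notation) can be illustrated with the standard library alone. The sketch below mirrors the structure of `_build_number_regex` from the diff, but `build_number_regex` and `is_valid` are hypothetical names used only for illustration, and the pattern is compiled without `re.VERBOSE`, which is an assumption of this sketch rather than what the PR does.

```python
import re


def build_number_regex(decimal=".", thousand=None):
    """Hypothetical stand-in for the PR's regex builder."""
    d = re.escape(decimal)
    t = re.escape(thousand or "")
    # Plain digits, or 1-3 digits followed by groups of exactly 3 digits,
    # each group preceded by the thousands separator.
    integer = rf"(?:\d+|\d{{1,3}}(?:{t}\d{{3}})+)"
    # Decimal separator followed by at least one digit.
    decimal_part = rf"{d}\d+"
    # Optional exponent such as e10, E-5, e+3.
    scientific = r"(?:[eE][+-]?\d+)?"
    # Integer with optional decimals, or decimals alone (".5").
    number = rf"(?:{integer}(?:{decimal_part})?|{decimal_part})"
    # Optional parentheses and sign around an optional number, fully anchored.
    return re.compile(rf"^\(?[+-]?(?:{number}{scientific})?\)?$")


def is_valid(text, pattern):
    """Return True if the whole string matches the number pattern."""
    return pattern.match(text) is not None


pattern = build_number_regex(decimal=".", thousand=",")
for value in ["1,234.56", "(1,234.56)", "1.23e+4", ".5", "12,34.5", "1.234,56", "hello"]:
    print(repr(value), is_valid(value, pattern))
```

Two properties of this pattern are worth keeping in mind when documenting it: the empty string matches because the number itself is optional, and the opening and closing parentheses are optional independently of each other, so a string such as `"(1"` passes the pattern and would have to be caught by the later float conversion.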