FEAT - Adding decimal as parameter for ToFloat32
#1772
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| .. |ToFloat| replace:: :class:`~skrub.ToFloat` | ||
| .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` | ||
| .. |Cleaner| replace:: :class:`~skrub.Cleaner` | ||
|
|
||
| .. _user_guide_feature_engineering_numeric_to_float: | ||
|
|
||
| Converting heterogeneous numeric values to uniform float32 | ||
| ========================================================== | ||
|
|
||
| Many tabular datasets contain numeric information stored as strings, mixed | ||
| representations, locale-specific formats, or other non-standard encodings. | ||
| Common issues include: | ||
|
|
||
| - Thousands separators (``1,234.56`` or ``1 234,56``) | ||
| - Use of apostrophes as separators (``4'567.89``) | ||
| - Negative numbers encoded inside parentheses (``(1,234.56)``) | ||
| - String columns that contain mostly numeric values, but with occasional invalid entries | ||
|
|
||
| To provide consistent numeric behavior, skrub includes the |ToFloat| transformer, | ||
| which standardizes all numeric-like columns to ``float32`` and handles a wide | ||
| range of real-world formatting issues automatically. | ||
|
|
||
| The |ToFloat| transformer is used internally by both the |Cleaner| class and the | ||
| |TableVectorizer| to guarantee that downstream estimators receive clean and | ||
| uniform numeric data. | ||
|
|
||
| What |ToFloat| does | ||
| ------------------- | ||
|
|
||
| The |ToFloat| transformer provides: | ||
|
|
||
| - **Automatic conversion to 32-bit floating-point values** (``float32``). | ||
| This dtype is lightweight and fully supported by scikit-learn estimators. | ||
|
|
||
| - **Parsing of decimal separators**, selected with the ``decimal`` parameter: | ||
| - ``.`` (the default) or ``,`` can be used as decimal point | ||
| - the remaining separators (``.``, ``,``, space, apostrophe) are treated as thousands separators and removed | ||
|
|
||
| - **Parentheses interpreted as negative numbers**, a common format in financial datasets: | ||
| - ``(1,234.56)`` → ``-1234.56`` | ||
|
|
||
| - **Scientific notation parsing** (e.g. ``1.23e+4``) | ||
|
|
||
| - **Graceful handling of invalid or non-numeric values**: | ||
| - During ``fit``: non-convertible values raise a ``RejectColumn`` exception | ||
| - During ``transform``: invalid entries become ``NaN`` instead of failing | ||
|
|
||
| - **Rejection of categorical and datetime columns**, which should not be cast to numeric. | ||
|
|
||
| As with all skrub transformers, |ToFloat| behaves like a standard | ||
| scikit-learn transformer and is fully compatible with pipelines. | ||
|
|
||
| Examples | ||
| -------- | ||
|
|
||
| Parsing numeric-formatted strings: | ||
|
|
||
| >>> import pandas as pd | ||
| >>> from skrub import ToFloat | ||
| >>> s = pd.Series(['1.1', None, '3.3'], name='x') | ||
| >>> ToFloat().fit_transform(s) | ||
| 0 1.1 | ||
| 1 NaN | ||
| 2 3.3 | ||
| Name: x, dtype: float32 | ||
|
|
||
| Handling of locale-dependent decimal separators with the ``decimal`` parameter: | ||
|
|
||
| >>> s = pd.Series(["4 567,89", "4'567,89"], name="x") | ||
| >>> ToFloat(decimal=",").fit_transform(s) # doctest: +SKIP | ||
| 0 4567.89 | ||
| 1 4567.89 | ||
| Name: x, dtype: float32 | ||
|
|
||
| Parentheses interpreted as negative numbers: | ||
|
|
||
| >>> s = pd.Series(["-1,234.56", "(1,234.56)"], name="neg") | ||
| >>> ToFloat().fit_transform(s) # doctest: +SKIP | ||
| 0 -1234.56 | ||
| 1 -1234.56 | ||
| Name: neg, dtype: float32 | ||
|
|
||
| Scientific notation: | ||
|
|
||
| >>> s = pd.Series(["1.23e+4", "1.23E+4"]) | ||
| >>> ToFloat(decimal=".").fit_transform(s) | ||
| 0 12300.0 | ||
| 1 12300.0 | ||
| dtype: float32 | ||
|
|
||
| Columns that cannot be converted are rejected during ``fit``: | ||
|
|
||
| >>> s = pd.Series(['1.1', 'hello'], name='x') | ||
| >>> ToFloat(decimal=".").fit_transform(s) | ||
| Traceback (most recent call last): | ||
| ... | ||
| skrub._apply_to_cols.RejectColumn: Could not convert column 'x' to numbers. | ||
|
|
||
| How |ToFloat| is used in skrub | ||
| ------------------------------ | ||
|
|
||
| The |ToFloat| transformer is used internally in: | ||
|
|
||
| - the |Cleaner|, to normalize all numeric-like columns before modeling | ||
| - the |TableVectorizer|, to ensure a consistent numeric dtype across all numeric features | ||
|
|
||
| This makes |ToFloat| a core building block of skrub’s handling of heterogeneous | ||
| tabular data. | ||
|
|
||
| |ToFloat| ensures that downstream machine-learning models receive numeric data | ||
| that is clean, consistent, lightweight, and free of locale-specific quirks or | ||
| string-encoded values. | ||
|
||
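The string clean-up described in the documentation above can be sketched without a dataframe library. This is a minimal illustration in plain Python, assuming the same three steps as the `_str_replace_pandas` helper in the diff that follows; `normalize_number_string` is a hypothetical name and is not part of skrub.

```python
import re

# Same candidate separators as POSSIBLE_SEPARATORS in the diff.
POSSIBLE_SEPARATORS = [".", ",", "'", " "]


def normalize_number_string(text, decimal="."):
    """Hypothetical helper: rewrite a formatted number so float() can parse it."""
    # 1. Accounting-style negatives: "(1,234.56)" becomes "-1,234.56".
    text = re.sub(r"^\((.*)\)$", r"-\1", text)
    # 2. Drop every candidate separator except the chosen decimal one.
    to_strip = [sep for sep in POSSIBLE_SEPARATORS if sep != decimal]
    text = re.sub("[" + re.escape("".join(to_strip)) + "]", "", text)
    # 3. Replace the chosen decimal separator with ".".
    return text.replace(decimal, ".")


print(normalize_number_string("(1.234,56)", decimal=","))  # -1234.56
print(float(normalize_number_string("4'567.89")))  # 4567.89
```

The order of the steps matters: the thousands separators have to be removed before the decimal separator is rewritten to `.`, otherwise a `.` used as thousands separator could no longer be told apart from the decimal point.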
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -1,8 +1,30 @@ | ||||||||
| from . import _dataframe as sbd | ||||||||
| from ._apply_to_cols import RejectColumn, SingleColumnTransformer | ||||||||
| from ._dispatch import dispatch, raise_dispatch_unregistered_type | ||||||||
|
|
||||||||
| __all__ = ["ToFloat"] | ||||||||
|
|
||||||||
| POSSIBLE_SEPARATORS = [".", ",", "'", " "] | ||||||||
|
|
||||||||
|
|
||||||||
| @dispatch | ||||||||
| def _str_replace(col, pattern, decimal): | ||||||||
| raise_dispatch_unregistered_type(col, kind="Series") | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_replace.specialize("pandas", argument_type="Column") | ||||||||
| def _str_replace_pandas(col, pattern, decimal): | ||||||||
| col = col.str.replace(r"^\((.*)\)$", r"-\1", regex=True) | ||||||||
|
||||||||
| col = col.str.replace("[" + "".join(pattern) + "]", "", regex=True) | ||||||||
| return col.str.replace(decimal, ".", regex=False) | ||||||||
|
|
||||||||
|
|
||||||||
| @_str_replace.specialize("polars", argument_type="Column") | ||||||||
| def _str_replace_polars(col, pattern, decimal): | ||||||||
| col = col.str.replace_all(r"^\((.*)\)$", r"-$1") | ||||||||
| col = col.str.replace_all("[" + "".join(pattern) + "]", "") | ||||||||
| return col.str.replace_all(decimal, ".", literal=True) | ||||||||
|
|
||||||||
|
|
||||||||
| class ToFloat(SingleColumnTransformer): | ||||||||
| """ | ||||||||
|
|
@@ -22,6 +44,13 @@ class ToFloat(SingleColumnTransformer): | |||||||
| During ``transform``, entries for which conversion fails are replaced by | ||||||||
| null values. | ||||||||
|
|
||||||||
| Parameters | ||||||||
| ---------- | ||||||||
| decimal : str, default='.' | ||||||||
| Character to recognize as the decimal separator when converting from | ||||||||
| strings to floats. Other possible decimal separators are removed from | ||||||||
| the strings before conversion. | ||||||||
|
Comment on lines +189 to +190
Member: not the case anymore
Suggested change
|
||||||||
|
|
||||||||
| Examples | ||||||||
| -------- | ||||||||
| >>> import pandas as pd | ||||||||
|
|
@@ -165,8 +194,34 @@ class ToFloat(SingleColumnTransformer): | |||||||
| >>> s = pd.Series([1.1, None], dtype='float32') | ||||||||
| >>> to_float.fit_transform(s) is s | ||||||||
| True | ||||||||
|
|
||||||||
| Handling parentheses around negative numbers | ||||||||
|
||||||||
| >>> s = pd.Series(["-1,234.56", "1,234.56", "(1,234.56)"], name='parens') | ||||||||
| >>> to_float.fit_transform(s) | ||||||||
| 0 -1234.5... | ||||||||
| 1 1234.5... | ||||||||
| 2 -1234.5... | ||||||||
| Name: parens, dtype: float32 | ||||||||
|
|
||||||||
| Scientific notation | ||||||||
|
||||||||
| >>> s = pd.Series(["1.23e+4", "1.23E+4"], name="x") | ||||||||
| >>> ToFloat(decimal=".").fit_transform(s) | ||||||||
| 0 12300.0 | ||||||||
| 1 12300.0 | ||||||||
| Name: x, dtype: float32 | ||||||||
|
|
||||||||
| Space or apostrophe as thousands separator | ||||||||
| >>> s = pd.Series(["4 567,89", "4'567,89"], name="x") | ||||||||
| >>> ToFloat(decimal=",").fit_transform(s) | ||||||||
| 0 4567.8... | ||||||||
| 1 4567.8... | ||||||||
| Name: x, dtype: float32 | ||||||||
| """ # noqa: E501 | ||||||||
|
|
||||||||
| def __init__(self, decimal="."): | ||||||||
| super().__init__() | ||||||||
| self.decimal = decimal | ||||||||
|
|
||||||||
| def fit_transform(self, column, y=None): | ||||||||
| """Fit the encoder and transform a column. | ||||||||
|
|
||||||||
|
|
@@ -191,6 +246,10 @@ def fit_transform(self, column, y=None): | |||||||
| f"with dtype '{sbd.dtype(column)}' to numbers." | ||||||||
| ) | ||||||||
| try: | ||||||||
| if sbd.is_string(column): | ||||||||
| p = [sep for sep in POSSIBLE_SEPARATORS if sep != self.decimal] | ||||||||
| column = _str_replace(column, pattern=p, decimal=self.decimal) | ||||||||
| numeric = sbd.to_float32(column, strict=True) | ||||||||
| return numeric | ||||||||
| except Exception as e: | ||||||||
|
|
||||||||
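The strict-at-fit, lenient-at-transform behavior described in the docstring can be illustrated with a toy stand-in. The sketch below is not skrub's implementation: `TinyToFloat` and the local `RejectColumn` class are hypothetical names, and plain Python lists take the place of dataframe columns.

```python
import math


class RejectColumn(ValueError):
    """Local stand-in for skrub's RejectColumn exception (illustration only)."""


class TinyToFloat:
    """Hypothetical toy converter working on lists of strings or None."""

    def fit_transform(self, values):
        # Strict mode: a single unparsable entry rejects the whole column.
        try:
            return [self._parse(v) for v in values]
        except ValueError as e:
            raise RejectColumn("Could not convert column to numbers.") from e

    def transform(self, values):
        # Lenient mode: unparsable entries become NaN instead of failing.
        out = []
        for v in values:
            try:
                out.append(self._parse(v))
            except ValueError:
                out.append(math.nan)
        return out

    @staticmethod
    def _parse(value):
        # Missing values stay missing in both modes.
        return math.nan if value is None else float(value)


converter = TinyToFloat()
print(converter.fit_transform(["1.1", None, "3.3"]))  # [1.1, nan, 3.3]
print(converter.transform(["1.1", "hello"]))  # [1.1, nan]
```

Calling `converter.fit_transform(["1.1", "hello"])` raises the stand-in `RejectColumn`, which mirrors the rejection shown in the documentation example.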
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -43,3 +43,36 @@ def test_rejected_columns(df_module): | |
| ToFloat().fit_transform(col) | ||
| to_float = ToFloat().fit(df_module.make_column("c", [1.1])) | ||
| assert is_float32(df_module, to_float.transform(col)) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( | ||
| "input_str, expected_float, decimal", | ||
| [ | ||
| ("1,234.56", 1234.56, "."), | ||
| ("1.234,56", 1234.56, ","), | ||
| ("1 234,56", 1234.56, ","), | ||
| ("1234.56", 1234.56, "."), | ||
| ("1234,56", 1234.56, ","), | ||
| ("1,234,567.89", 1234567.89, "."), | ||
| ("1.234.567,89", 1234567.89, ","), | ||
| ("1 234 567,89", 1234567.89, ","), | ||
| ("1'234'567.89", 1234567.89, "."), | ||
| ("1.23e+4", 12300.0, "."), | ||
| ("1.23E+4", 12300.0, "."), | ||
| ("1,23e+4", 12300.0, ","), | ||
| ("1,23E+4", 12300.0, ","), | ||
| ("-1,234.56", -1234.56, "."), | ||
| ("-1.234,56", -1234.56, ","), | ||
| ("(1,234.56)", -1234.56, "."), | ||
| ("(1.234,56)", -1234.56, ","), | ||
| ("1,23,456.78", 123456.78, "."), | ||
| ("12,3456.78", 123456.78, "."), | ||
| (".56", 0.56, "."), | ||
| (",56", 0.56, ","), | ||
| ], | ||
| ) | ||
| def test_number_parsing(input_str, expected_float, decimal, df_module): | ||
|
||
| column = df_module.make_column("col", [input_str]) | ||
| result = ToFloat(decimal=decimal).fit_transform(column) | ||
|
|
||
| assert np.allclose(result[0], expected_float) | ||
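Two details matter for the comparison in this test. A bare `np.allclose(...)` call only returns a boolean, so it needs an `assert` to make the test fail. In addition, float32 cannot represent most of the expected values exactly, so the comparison needs a tolerance rather than `==`. The sketch below illustrates the second point with the standard library only; `to_float32` is a hypothetical helper that round-trips a Python float through 4 bytes, and the relative tolerance of `1e-5` is chosen to match the default `rtol` of `np.allclose`.

```python
import math
import struct


def to_float32(x):
    """Hypothetical helper: round a Python float to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]


expected = 1234567.89
stored = to_float32(expected)

print(stored)  # 1234567.875
print(stored == expected)  # False
print(math.isclose(stored, expected, rel_tol=1e-5))  # True
```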