Skrub release 0.6.0
🚀 Highlights
- Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations across multiple tables with machine learning pipelines. DataOps can be combined into complex data plans that can be used to train and tune machine learning models. The DataOps plans can then be exported as Learners (skrub.SkrubLearner), standalone objects that can be applied to new data. More details about DataOps can be found in the user guide and in the examples.
- The TableReport has been improved with many new features. Series are now supported directly. It is now possible to skip computing column associations and generating plots when the number of columns in the dataframe exceeds a user-defined threshold. Columns with high cardinality and sorted columns are now highlighted in the report.
- selectors, ApplyToCols and ApplyToFrame are now available, providing flexible utilities for selecting the columns to which a transformer should be applied. For more details, see the user guide and the example.
- The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. See the example.
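The squashing idea can be sketched in plain NumPy: robustly center and scale a column, then apply a smooth clipping function so extreme outliers saturate instead of dominating. This is a minimal illustration under assumed formulas (median/IQR scaling and a soft clip); skrub's SquashingScaler may use different scaling and clipping functions.

```python
import numpy as np

def squash_scale(x, max_abs=3.0):
    """Robustly rescale a 1-D array, then smoothly clip it into (-max_abs, max_abs).

    A minimal sketch of the idea: center on the median, scale by the
    interquartile range, then squash outliers with a smooth clipping
    function instead of a hard cutoff. Not skrub's exact formula.
    """
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    z = (x - med) / iqr if iqr > 0 else x - med
    # Smooth clipping: close to the identity near 0, saturates at +/- max_abs.
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)

values = np.array([-1000.0, -1.0, 0.0, 0.5, 1.0, 2.0, 1e6])
scaled = squash_scale(values)
# Every output stays strictly within (-3, 3), even for the extreme outliers.
```

Unlike a hard clip, the soft clip keeps the transformation monotone and differentiable everywhere, which is friendlier to gradient-based models.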
🎆 New Features
- The Skrub DataOps are a new mechanism for building machine-learning pipelines that handle multiple tables and for easily describing their hyperparameter spaces. Main PR: #1233 by Jérôme Dockès. Additional contributions: Vincent Maladiere provided crucial help by trying the DataOps on many use cases and datasets, giving feedback and suggesting improvements, improving the examples (including creating all their figures) and adding jitter to the parallel-coordinate plots; Riccardo Cappuzzo experimented with the DataOps, suggested improvements and improved the examples; Gaël Varoquaux, Guillaume Lemaitre, Adrin Jalali, Olivier Grisel and others helped define the requirements and the public API through many discussions. See the examples for an introduction.
- The selectors module provides utilities for selecting columns to which a transformer should be applied in a flexible way. The module was created in #895 by Jérôme Dockès and added to the public API in #1341 by Jérôme Dockès.
- The DropUninformative transformer is now available. This transformer employs different heuristics to detect columns that are not likely to bring useful information for training a model. The current implementation includes detection of columns that contain only a single value (constant columns), only missing values, or all unique values (such as IDs). #1313 by Riccardo Cappuzzo.
- get_config(), set_config() and config_context() are now available to configure settings for dataframe display and expressions. patch_display() and unpatch_display() are deprecated and will be removed in the next release of skrub. #1427 by Vincent Maladiere. The global configuration includes the parameter `cardinality_threshold`, which controls the threshold above which users are warned about high-cardinality columns in their dataset. #1498 by rouk1. Additionally, the parameter `float_precision` controls the number of significant digits displayed for floating-point values in reports. #1470 by George S.
- Added the SquashingScaler, a transformer that robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks. #1310 by Vincent Maladiere and David Holzmüller.
- `datasets.toy_order()` is now available to create a toy dataframe and corresponding targets for examples. #1485 by Antoine Canaguier-Durand.
- ApplyToCols and ApplyToFrame are now available to apply transformers to a set of columns independently and jointly, respectively. #1478 by Vincent Maladiere.
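The heuristics that DropUninformative is described as using above can be sketched in plain pandas: flag columns that are constant, entirely missing, or entirely unique (ID-like). This is an illustration of the listed heuristics, not skrub's actual implementation.

```python
import pandas as pd

def uninformative_columns(df):
    """Flag columns unlikely to help a model: constant columns,
    all-missing columns, and all-unique (ID-like) columns.
    A plain-pandas sketch of the heuristics, not skrub's code.
    """
    flagged = []
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            flagged.append(col)  # only missing values
        elif non_null.nunique() == 1 and df[col].isna().sum() == 0:
            flagged.append(col)  # a single constant value
        elif df[col].is_unique and df[col].notna().all():
            flagged.append(col)  # every value distinct, e.g. an ID
    return flagged

df = pd.DataFrame({
    "id": [1, 2, 3, 4],            # all unique -> ID-like
    "constant": ["a"] * 4,         # single value
    "empty": [None] * 4,           # only missing values
    "useful": ["x", "y", "x", "z"],
})
uninformative_columns(df)  # flags "id", "constant" and "empty"
```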
Changes
- ⚠️ The default high-cardinality encoder for both TableVectorizer and tabular_learner() (now tabular_pipeline()) has been changed from GapEncoder to StringEncoder. #1354 by Riccardo Cappuzzo.
- The tabular_learner function has been deprecated in favor of tabular_pipeline(), to honor its scikit-learn pipeline heritage and remove the ambiguity with the DataOps Learner. #1493 by Vincent Maladiere.
- StringEncoder now exposes the stop_words argument, which is passed to the underlying vectorizer (TfidfVectorizer or HashingVectorizer). #1415 by Vincent Maladiere.
- A new parameter `max_association_columns` has been added to the TableReport to skip association computation when the number of columns exceeds the specified value. #1304 by Victoria Shevchenko.
- The packaging dependency was removed. #1307 by Jovan Stojanovic.
- TextEncoder, StringEncoder and GapEncoder now compute the total standard-deviation norm during training (a global constant) and normalize their vector outputs by dividing all entries element-wise by it. #1274 by Vincent Maladiere.
- The `DropIfTooManyNulls` transformer has been replaced by the DropUninformative transformer and will be removed in a future release. #1313 by Riccardo Cappuzzo.
- The `concat_horizontal()` function was replaced with `concat()`. Horizontal or vertical concatenation is now controlled by the axis parameter. #1334 by Parasa V Prajwal.
- The TableVectorizer and Cleaner now accept a `datetime_format` parameter for specifying the format to use when parsing datetime columns. #1358 by Riccardo Cappuzzo.
- The `SimpleCleaner` has been removed. Use Cleaner instead. #1370 by Riccardo Cappuzzo.
- The periodic encoding for `day_in_year` has been removed from the DatetimeEncoder as it was redundant. The feature itself is still added if the flag is set to True. #1396 by Riccardo Cappuzzo.
- The naming scheme used for the features generated by TextEncoder, StringEncoder, MinHashEncoder and DatetimeEncoder has been standardized. Features generated by all encoders now have indices in the range `[0, n_components - 1]` rather than `[1, n_components]`. Additionally, columns with an empty name are assigned a default name that depends on the encoder used. #1405 by Riccardo Cappuzzo.
- The optional dependencies 'dev', 'doc', 'lint' and 'test' have been coalesced into 'dev'. #1404 by Vincent Maladiere.
- The TableReport now supports Series in addition to Dataframes. #1420 by Vitor Pohlenz.
- The Cleaner now exposes a parameter to convert numerical values to float32. #1440 by Riccardo Cappuzzo.
- The TableReport now shows if columns are sorted. #1512 by Dea María Léon.
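The configuration API introduced in this release (get_config, set_config, config_context) can be used to adjust the global parameters mentioned above. A usage sketch, assuming the API mirrors scikit-learn's configuration functions (in particular, that get_config() returns a dict of current settings); exact defaults and return types are assumptions:

```python
import skrub

# Show more significant digits for floats in report output.
skrub.set_config(float_precision=4)

# Temporarily raise the threshold above which high-cardinality columns
# trigger a warning; the previous value is restored on exiting the block.
with skrub.config_context(cardinality_threshold=100):
    print(skrub.get_config()["cardinality_threshold"])
```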
Bugfixes
- Fixed a bug that caused the StringEncoder and TextEncoder to raise an exception if the input column was a Categorical datatype. #1401 by Riccardo Cappuzzo.
New Contributors
- @thomass-dev made their first contribution in #1215
- @semyonbok made their first contribution in #1269
- @gabrielapgomezji made their first contribution in #1300
- @ogrisel made their first contribution in #1309
- @victoris93 made their first contribution in #1304
- @pvprajwal made their first contribution in #1334
- @fritshermans made their first contribution in #1352
- @ArturoAmorQ made their first contribution in #1397
- @vitorpohlenz made their first contribution in #1393
- @emilienbattel09 made their first contribution in #1466
- @gmhaber made their first contribution in #1484
- @georgescutelnicu made their first contribution in #1470
- @lionelkusch made their first contribution in #1500
- @MarieSacksick made their first contribution in #1499
- @DeaMariaLeon made their first contribution in #1512
- @canag made their first contribution in #1485
- @dholzmueller made their first contribution in #1310
Full Changelog: 0.5.4...0.6.0