Skrub release 0.7.0
✨ Highlights
- Data Ops can now be tuned with Optuna.
- It is now possible to pass extra named arguments to an estimator through DataOp.skb.apply.
- TableReport now supports NumPy arrays.
- The minimum supported version of Python has been increased to 3.10, and the minimum supported versions of scikit-learn and requests are now 1.4.2 and 2.27.1 respectively.
- Added support for the upcoming Pandas 3.0.
- A lot of bugs have been fixed.
16 new contributors participated in this release 🎉
New features
- It is now possible to tune the choices in a DataOp with Optuna. #1661 by @jeromedockes.
- DataOp.skb.apply now allows passing extra named arguments to the estimator's methods through the parameters fit_kwargs, predict_kwargs, etc. #1642 by @jeromedockes.
- TableReport now displays the mean statistic for boolean columns. #1647 by @abenechehab.
- DataOp.skb.get_vars allows inspecting all the variables, or all the named dataops, in a DataOp. This lets us easily know what keys should be present in the environment dictionary we pass to DataOp.skb.eval or to SkrubLearner.fit, SkrubLearner.predict, etc. #1646 by @jeromedockes.
- DataOp.skb.iter_cv_splits iterates over the training and testing environments produced by a CV splitter -- similar to DataOp.skb.train_test_split but for multiple cross-validation splits. #1653 by @jeromedockes.
- TableReport now supports np.array. #1676 by @Nismamjad1.
- DataOp.skb.full_report now accepts a new parameter, title, that is displayed in the html report. #1654 by @MarieSacksick.
- TableReport now includes the open_tab parameter, which lets the user select which tab should be opened when the TableReport is rendered. #1737 by @rcap107.
Changes
- The minimum supported version of Python has been increased to 3.10. Additionally, the minimum supported versions of scikit-learn and requests are 1.4.2 and 2.27.1 respectively. Support for Python 3.14 has been added. #1572 by @rcap107.
- The DataOp.skb.full_report method now deletes reports created with output_dir=None after 7 days. #1657 by @dierickxsimon.
- The tabular_pipeline uses a SquashingScaler instead of a StandardScaler for centering and scaling numerical features when linear models are used. #1644 by @dierickxsimon.
- The transformer ToFloat, previously called ToFloat32, is now public. #1687 by @MarieSacksick.
- Improved the error message raised when a Polars lazyframe is passed to TableReport, clarifying that .collect() must be called first. #1767 by @fatiben2002.
- Computing the associations in TableReport is now deterministic and can be controlled by the new parameter subsampling_seed of the global configuration. #1775 by @thomass-dev.
- Added cast_to_str parameter to Cleaner to prevent unintended conversion of list/object-like columns to strings unless explicitly enabled. #1789 by @PilliSiddharth.
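Before upgrading, the minimum versions listed in the changes above can be checked in the current environment. The following is a rough sketch using only the standard library; the helper names parse_release and check_requirements are hypothetical and not part of skrub, and the parser is simplified (it ignores pre-release suffixes such as rc1).

```python
import sys
from importlib import metadata

# Minimums quoted from the release notes above.
MIN_PYTHON = (3, 10)
MIN_PACKAGES = {"scikit-learn": (1, 4, 2), "requests": (2, 27, 1)}


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple.

    Simplified on purpose: "1.5.0rc1" becomes (1, 5, 0).
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements():
    """Map each requirement to True (ok), False (too old) or None (not installed)."""
    status = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for name, minimum in MIN_PACKAGES.items():
        try:
            installed = parse_release(metadata.version(name))
        except metadata.PackageNotFoundError:
            status[name] = None
            continue
        status[name] = installed >= minimum
    return status


# Print one line per requirement, e.g. the name followed by True, False or None.
for name, ok in check_requirements().items():
    print(name, ok)
```

For anything beyond a quick check, a dedicated version parser (for example the third-party packaging library) handles pre-releases and other edge cases that this sketch skips.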
Bugfixes
- The skrub.cross_validate function now raises a specific exception if the wrong variable type is passed. #1799 by @emassoulie.
- Fixed various issues with some transformers by adding get_feature_names_out to all single column transformers. #1666 by @rcap107.
- Issues occurring when DataOp.skb.apply was passed a DataOp as the estimator have been fixed in #1671 by @jeromedockes.
- TableReport could raise an error while trying to check if Polars columns with some dtypes (lists, structs) are sorted. It also did not indicate Polars columns sorted in descending order. Fixed in #1673 by @jeromedockes.
- Fixed nightly checks and added support for upcoming library versions, including Pandas v3.0. #1664 by @auguste-probabl and @rcap107.
- Fixed the use of TableReport and Cleaner with Polars dataframes containing a column with an empty string as its name. #1722 by @MarieSacksick.
- Fixed an issue where TableReport would fail when computing associations for Polars dataframes if PyArrow was not installed. #1742 by @rcap107.
- Fixed an issue in the Data Ops report generation in cases where the DataOp contained escape characters or spanned multiple lines. #1764 by @rcap107.
- Added get_feature_names_out to Cleaner for consistency with the TableVectorizer and other transformers. #1762 by @rcap107.
- Improved the error message raised when TextEncoder is used without the optional transformers dependencies. #1769 by @fxzhou22.
- Accessing .skb.applied_estimator on a DataOp after calling .skb.set_name(), .skb.set_description(), .skb.mark_as_X() or .skb.mark_as_y() used to raise an error. This has been fixed in #1782 by @jeromedockes.
- Fixed potential issues that could arise in ParamSearch.plot_results when NaN values were present in the cross-validation results. #1800 by @rcap107.
New Contributors
- @csejourne made their first contribution in #1503
- @divakaivan made their first contribution in #1598
- @star1327p made their first contribution in #1599
- @abenechehab made their first contribution in #1647
- @dierickxsimon made their first contribution in #1644
- @kudos07 made their first contribution in #1670
- @DimitriPapadopoulos made their first contribution in #1692
- @auguste-probabl made their first contribution in #1664
- @amirakahub made their first contribution in #1715
- @Nismamjad1 made their first contribution in #1676
- @Alispirale made their first contribution in #1717
- @basile-desjuzeur made their first contribution in #1685
- @fatiben2002 made their first contribution in #1767
- @fxzhou22 made their first contribution in #1769
- @PilliSiddharth made their first contribution in #1789
- @emassoulie made their first contribution in #1799
Full Changelog: 0.6.2...0.7.0