Skrub release 0.7.0

@rcap107 released this 12 Dec 09:40
· 94 commits to main since this release
345b18d

Release 0.7.0

✨ Highlights

  • Data Ops can now be tuned with Optuna.
  • It is now possible to pass extra named arguments to an estimator through DataOp.skb.apply.
  • The TableReport now supports NumPy arrays.
  • The minimum supported version of Python has been increased to 3.10, and the minimum supported versions of scikit-learn and requests are now 1.4.2 and 2.27.1 respectively.
  • Added support for the upcoming Pandas 3.0.
  • Many bugs have been fixed, notably in TableReport, Cleaner and DataOp.skb.apply.

16 new contributors participated in this release 🎉

New features

  • It is now possible to tune the choices in a DataOp with Optuna. #1661 by @jeromedockes.
  • DataOp.skb.apply now allows passing extra named arguments to the estimator's methods through the parameters fit_kwargs, predict_kwargs etc. #1642 by @jeromedockes.
  • TableReport now displays the mean statistic for boolean columns. #1647 by @abenechehab.
  • DataOp.skb.get_vars allows inspecting all the variables, or all the named dataops, in a DataOp. This lets us easily know what keys should be present in the environment dictionary we pass to DataOp.skb.eval or to SkrubLearner.fit, SkrubLearner.predict, etc. #1646 by @jeromedockes.
  • DataOp.skb.iter_cv_splits iterates over the training and testing environments produced by a CV splitter -- similar to DataOp.skb.train_test_split but for multiple cross-validation splits. #1653 by @jeromedockes.
  • TableReport now accepts NumPy arrays (np.ndarray) as input. #1676 by @Nismamjad1.
  • DataOp.skb.full_report now accepts a new parameter, title, that is displayed in the html report. #1654 by @MarieSacksick.
  • TableReport now includes the open_tab parameter, which lets the user select which tab should be opened when the TableReport is rendered. #1737 by @rcap107.

Changes

  • The minimum supported version of Python has been increased to 3.10. Additionally, the minimum supported versions of scikit-learn and requests are 1.4.2 and 2.27.1 respectively. Support for Python 3.14 has been added. #1572 by @rcap107.
  • The DataOp.skb.full_report method now deletes reports created with output_dir=None after 7 days. #1657 by @dierickxsimon.
  • The tabular_pipeline uses a SquashingScaler instead of a StandardScaler for centering and scaling numerical features when linear models are used. #1644 by @dierickxsimon.
  • The transformer ToFloat, previously called ToFloat32, is now public. #1687 by @MarieSacksick.
  • Improved the error message raised when a Polars LazyFrame is passed to TableReport, clarifying that .collect() must be called first. #1767 by @fatiben2002.
  • Computing the associations in TableReport is now deterministic and can be controlled by the new parameter subsampling_seed of the global configuration. #1775 by @thomass-dev.
  • Added the cast_to_str parameter to Cleaner to prevent unintended conversion of list/object-like columns to strings unless explicitly enabled. #1789 by @PilliSiddharth.

Bugfixes

  • The skrub.cross_validate function now raises a specific exception if the wrong variable type is passed. #1799 by @emassoulie.
  • Fixed various issues with some transformers by adding get_feature_names_out to all single column transformers. #1666 by @rcap107.
  • Issues occurring when DataOp.skb.apply was passed a DataOp as the estimator have been fixed in #1671 by @jeromedockes.
  • TableReport could raise an error while checking whether Polars columns with some dtypes (lists, structs) are sorted, and it did not indicate Polars columns sorted in descending order. Fixed in #1673 by @jeromedockes.
  • Fixed nightly checks and added support for upcoming library versions, including Pandas v3.0. #1664 by @auguste-probabl and @rcap107.
  • Fixed the use of TableReport and Cleaner with Polars dataframes containing a column with empty string as name. #1722 by @MarieSacksick.
  • Fixed an issue where TableReport would fail when computing associations for Polars dataframes if PyArrow was not installed. #1742 by @rcap107.
  • Fixed an issue in the Data Ops report generation in cases where the DataOp contained escape characters or spanned multiple lines. #1764 by @rcap107.
  • Added get_feature_names_out to Cleaner for consistency with the TableVectorizer and other transformers. #1762 by @rcap107.
  • Improved the error message raised when TextEncoder is used without the optional transformers dependencies. #1769 by @fxzhou22.
  • Accessing .skb.applied_estimator on a DataOp after calling .skb.set_name(), .skb.set_description(), .skb.mark_as_X() or .skb.mark_as_y() used to raise an error. This has been fixed in #1782 by @jeromedockes.
  • Fixed potential issues that could arise in ParamSearch.plot_results when NaN values were present in the cross-validation results. #1800 by @rcap107.

New Contributors

Full Changelog: 0.6.2...0.7.0