feat: add scikit-learn compatible estimators with pipeline and grid search support

statmlben · statmlben · commit 99b2aad8364e · 2025-09-29T21:55:51.000+08:00
diff --git a/README.md b/README.md
@@ -9,7 +9,6 @@
 - Paper: [NeurIPS | 2023](https://openreview.net/pdf?id=3pEBW2UPAD)
 <!-- - Open Source: [MIT license](https://opensource.org/licenses/MIT) -->
 
-
 The **ReHLine** solver has four appealing
 "linear properties":
 
@@ -18,6 +17,16 @@ The **ReHLine** solver has four appealing
 - The optimization algorithm has a provable linear convergence rate.
 - The per-iteration computational complexity is linear in the sample size.
 
+
+## ✨ New Features: Scikit-Learn Compatible Estimators
+
+`ReHLine` now ships scikit-learn compatible estimators: `plq_Ridge_Classifier` and `plq_Ridge_Regressor` follow the scikit-learn estimator API and work with the rest of the scikit-learn ecosystem.
+
+This means you can:
+- Drop `ReHLine` estimators directly into an existing scikit-learn `Pipeline`.
+- Tune hyperparameters with `GridSearchCV`.
+- Use standard scikit-learn evaluation metrics and cross-validation tools.
+
 <!-- 
 ## 📝 Formulation
 
@@ -57,7 +66,3 @@ benchmark code and results at the
 |[RidgeHuber](https://github.com/softmin/ReHLine-benchmark/tree/main/benchmark_Huber) | [Result](https://rehline-python.readthedocs.io/en/latest/_static/benchmark/benchmark_Huber.html)|
 |[SVM](https://github.com/softmin/ReHLine-benchmark/tree/main/benchmark_SVM) | [Result](https://rehline-python.readthedocs.io/en/latest/_static/benchmark/benchmark_SVM.html)|
 |[Smoothed SVM](https://github.com/softmin/ReHLine-benchmark/tree/main/benchmark_sSVM) | [Result](https://rehline-python.readthedocs.io/en/latest/_static/benchmark/benchmark_sSVM.html)|
-
-## 🧾 Overview of Results
-
-![](./figs/res.png)
diff --git a/doc/source/autoapi/rehline/index.rst b/doc/source/autoapi/rehline/index.rst
@@ -501,12 +501,12 @@ Classes
    and ridge penalty, compatible with the scikit-learn API.
 
    This wrapper makes ``plqERM_Ridge`` behave as a classifier:
-   - Accepts arbitrary binary labels in the original label space.
-   - Computes class weights on original labels (if ``class_weight`` is set).
-   - Encodes labels with ``LabelEncoder`` into {0,1}, then maps to {-1,+1} for training.
-   - Supports optional intercept fitting (via an augmented constant feature).
-   - Provides standard methods ``fit``, ``predict``, and ``decision_function``.
-   - Integrates with scikit-learn ecosystem (e.g., GridSearchCV, Pipeline).
+       - Accepts arbitrary binary labels in the original label space.
+       - Computes class weights on original labels (if ``class_weight`` is set).
+       - Encodes labels with ``LabelEncoder`` into {0,1}, then maps to {-1,+1} for training.
+       - Supports optional intercept fitting (via an augmented constant feature).
+       - Provides standard methods ``fit``, ``predict``, and ``decision_function``.
+       - Integrates with scikit-learn ecosystem (e.g., GridSearchCV, Pipeline).
 
    Parameters
    ----------
diff --git a/doc/source/clean_notebooks.py b/doc/source/clean_notebooks.py
@@ -0,0 +1,50 @@
+import json
+import sys
+from pathlib import Path
+
+def clean_notebook(file_path):
+    """Removes the 'id' field from all cells in a Jupyter notebook."""
+    try:
+        with open(file_path, 'r', encoding='utf-8') as f:
+            notebook = json.load(f)
+
+        changes_made = False
+        if 'cells' in notebook and isinstance(notebook['cells'], list):
+            for cell in notebook['cells']:
+                if isinstance(cell, dict) and 'id' in cell:
+                    del cell['id']
+                    changes_made = True
+
+        if changes_made:
+            with open(file_path, 'w', encoding='utf-8') as f:
+                json.dump(notebook, f, indent=1, ensure_ascii=False)
+                f.write('\n')  # Add a newline at the end of the file
+            print(f"Cleaned: {file_path}")
+        else:
+            print(f"No changes needed: {file_path}")
+
+    except Exception as e:
+        print(f"Error processing {file_path}: {e}", file=sys.stderr)
+
+def main():
+    if len(sys.argv) != 2:
+        print("Usage: python clean_notebooks.py <directory>", file=sys.stderr)
+        sys.exit(1)
+
+    target_dir = Path(sys.argv[1])
+    if not target_dir.is_dir():
+        print(f"Error: {target_dir} is not a valid directory.", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Searching for notebooks in {target_dir}...")
+    notebook_files = list(target_dir.rglob('*.ipynb'))
+
+    if not notebook_files:
+        print("No notebook files found.")
+        return
+
+    for notebook_file in notebook_files:
+        clean_notebook(notebook_file)
+
+if __name__ == "__main__":
+    main()
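Note on `clean_notebooks.py`: a minimal sketch of the same id-stripping step, run against a throwaway notebook in a temporary directory. The helper name `strip_cell_ids` and the two-cell `demo` notebook are illustrative assumptions and not part of the commit; only the serialization settings (`indent=1`, `ensure_ascii=False`, trailing newline) are taken from the script above.

```python
import json
import tempfile
from pathlib import Path


def strip_cell_ids(notebook):
    """Delete the 'id' key from each cell dict; return how many were removed."""
    removed = 0
    cells = notebook.get("cells")
    if isinstance(cells, list):
        for cell in cells:
            if isinstance(cell, dict) and "id" in cell:
                del cell["id"]
                removed += 1
    return removed


# Hypothetical two-cell notebook, used only for this illustration.
demo = {
    "cells": [
        {"cell_type": "markdown", "id": "aaaa1111", "metadata": {},
         "source": ["# Demo"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["print('hi')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.ipynb"
    path.write_text(json.dumps(demo, indent=1), encoding="utf-8")

    # Load, strip ids in memory, then write back with the script's settings.
    notebook = json.loads(path.read_text(encoding="utf-8"))
    removed = strip_cell_ids(notebook)
    path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )

    # Re-read the file to confirm what ended up on disk.
    cleaned = json.loads(path.read_text(encoding="utf-8"))

print(f"ids removed: {removed}")
```

One caveat, stated as an assumption rather than something this commit documents: notebooks that declare `nbformat_minor` 5 or later are generally expected to carry cell ids, so Jupyter tooling may regenerate them on the next save, and the script would then need to be re-run.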
diff --git a/doc/source/examples/.ipynb_checkpoints/FairSVM-checkpoint.ipynb b/doc/source/examples/.ipynb_checkpoints/FairSVM-checkpoint.ipynb
@@ -2,7 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "50711dda-105e-4714-937b-e8be06370605",
    "metadata": {},
    "source": [
     "# **FairSVM**\n",
@@ -38,7 +37,6 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "e66268fa-403d-402b-9ea1-fbfe7573af40",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -62,7 +60,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6a576a09-b700-49cd-b500-219f3a6e40b0",
    "metadata": {},
    "source": [
     "## SVM as baseline"
@@ -71,7 +68,6 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "15531796-3a45-42b3-8a99-da0343be9d4d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -84,7 +80,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "79bb275b-2dfd-4608-83e3-b4b4eb0fdb72",
    "metadata": {},
    "source": [
     "## FairSVM"
@@ -93,7 +88,6 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "c43509f7-031b-4620-bc5e-fb5aea2ef1c2",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -111,7 +105,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "794ede1f-13a4-4889-b6d9-f19a61faa510",
    "metadata": {},
    "source": [
     "## Results"
@@ -120,7 +113,6 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "05dc3921-1837-474e-9a6d-4555a94ddc30",
    "metadata": {},
    "outputs": [
     {
@@ -159,7 +151,6 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "ad5a863e-fbbb-4caf-876d-374f3ca9b891",
    "metadata": {},
    "outputs": [
     {
diff --git a/doc/source/examples/CQR.ipynb b/doc/source/examples/CQR.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## **Ridge Composite Quantile Regression**\n",
+    "## Ridge Composite Quantile Regression\n",
     "\n",
     "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n",
     "\n",
diff --git a/doc/source/examples/FairSVM.ipynb b/doc/source/examples/FairSVM.ipynb
@@ -2,10 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "50711dda-105e-4714-937b-e8be06370605",
    "metadata": {},
    "source": [
-    "# **FairSVM**\n",
+    "# FairSVM\n",
     "\n",
     "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n",
     "\n",
@@ -36,14 +35,12 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7bf9272115591bbf",
    "metadata": {},
    "source": []
   },
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "e66268fa-403d-402b-9ea1-fbfe7573af40",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -67,7 +64,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "6a576a09-b700-49cd-b500-219f3a6e40b0",
    "metadata": {},
    "source": [
     "## SVM as baseline"
@@ -76,7 +72,6 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "15531796-3a45-42b3-8a99-da0343be9d4d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -89,7 +84,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "79bb275b-2dfd-4608-83e3-b4b4eb0fdb72",
    "metadata": {},
    "source": [
     "## FairSVM"
@@ -98,7 +92,6 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "c43509f7-031b-4620-bc5e-fb5aea2ef1c2",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -116,7 +109,6 @@
   },
   {
    "cell_type": "markdown",
-   "id": "794ede1f-13a4-4889-b6d9-f19a61faa510",
    "metadata": {},
    "source": [
     "## Results"
@@ -125,7 +117,6 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "id": "05dc3921-1837-474e-9a6d-4555a94ddc30",
    "metadata": {},
    "outputs": [
     {
@@ -164,7 +155,6 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "ad5a863e-fbbb-4caf-876d-374f3ca9b891",
    "metadata": {},
    "outputs": [
     {
diff --git a/doc/source/examples/QR.ipynb b/doc/source/examples/QR.ipynb
@@ -2,10 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "e3a11293-4739-476e-a513-48a256d425a2",
    "metadata": {},
    "source": [
-    "## **Ridge Quantile Regression**\n",
+    "## Ridge Quantile Regression\n",
     "\n",
     "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n",
     "\n",
@@ -24,7 +23,6 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "b2dd4ce5-bc27-41a4-89ab-7920d393f377",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -46,7 +44,6 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "80129ee6-f886-4e27-a764-630f15826bca",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -63,7 +60,6 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "1d8b90e9-6af9-4856-9751-6fe6fbc7665c",
    "metadata": {},
    "outputs": [
     {
@@ -98,7 +94,7 @@
    ]
   }
  ],
-  "metadata": {
+ "metadata": {
   "colab": {
    "provenance": []
   },
@@ -112,4 +108,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
diff --git a/doc/source/examples/SVM.ipynb b/doc/source/examples/SVM.ipynb
@@ -2,10 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "fbcb401d-6ca6-4933-abd5-f8f504282416",
    "metadata": {},
    "source": [
-    "# **SVM**\n",
+    "# SVM\n",
     "\n",
     "[![Slides](https://img.shields.io/badge/🦌-ReHLine-blueviolet)](https://rehline-python.readthedocs.io/en/latest/)\n",
     "\n",
@@ -21,7 +20,6 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "id": "2dd1c096-e0df-492f-be63-8ac272007237",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -42,7 +40,6 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "id": "aece9fbe-f9be-40ae-8179-b44849fb0fd3",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -56,7 +53,6 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "id": "93719987-c6b3-4a9b-9b40-c35e5bf90ef0",
    "metadata": {},
    "outputs": [
     {
diff --git a/doc/source/index.rst b/doc/source/index.rst
@@ -48,6 +48,16 @@ The proposed **ReHLine** solver has appealing exhibits appealing properties:
     * - **Super-Efficient**
       - The optimization algorithm has a provable **LINEAR** convergence rate, and the per-iteration computational complexity is **LINEAR** in the sample size.
 
+✨ New Features: Scikit-Learn Compatible Estimators
+---------------------------------------------------
+
+``ReHLine`` now ships scikit-learn compatible estimators: ``plq_Ridge_Classifier`` and ``plq_Ridge_Regressor`` follow the scikit-learn estimator API and work with the rest of the scikit-learn ecosystem.
+
+This means you can:
+   - Drop ``ReHLine`` estimators directly into an existing scikit-learn ``Pipeline``.
+   - Tune hyperparameters with ``GridSearchCV``.
+   - Use standard scikit-learn evaluation metrics and cross-validation tools.
+
 🔨 Installation
 ---------------
 
diff --git a/doc/source/tutorials.rst b/doc/source/tutorials.rst
@@ -41,17 +41,16 @@ List of Tutorials
    - | `plqERM_Ridge <./autoapi/rehline/index.html#rehline.plqERM_Ridge>`_
    - | Empirical Risk Minimization (ERM) with a piecewise linear-quadratic (PLQ) objective with a ridge penalty.
 
- * - `ReHLine: Scikit-learn Compatible Estimators Powered by ReHLine <./examples/Sklearn_Mixin.ipynb>`_
-   - | `plqERM_Ridge <./autoapi/rehline/index.html#rehline.plq_Ridge_Classifier>`_
-   - | `plqERM_Ridge <./autoapi/rehline/index.html#rehline.plq_Ridge_Regressor>`_
+ * - `ReHLine: Scikit-learn Compatible Estimators <./tutorials/ReHLine_sklearn.rst>`_
+   - | `plq_Ridge_Classifier <./autoapi/rehline/index.html#rehline.plq_Ridge_Classifier>`_ `plq_Ridge_Regressor <./autoapi/rehline/index.html#rehline.plq_Ridge_Regressor>`_
    - | Scikit-learn compatible estimators framework for empirical risk minimization problem.
 
  * - `ReHLine: Ridge Composite Quantile Regression <./examples/CQR.ipynb>`_
    - | `CQR_Ridge <./autoapi/rehline/index.html#rehline.CQR_Ridge>`_
    - | Composite Quantile Regression (CQR) with a ridge penalty.
 
  * - `ReHLine: Matrix Factorization <./tutorials/ReHLine_MF.rst>`_
-   - | `plqMF_Ridge <./autoapi/rehline/index.html#rehline.plqERM_Ridge>`_
+   - | `plqMF_Ridge <./autoapi/rehline/index.html#rehline.plqMF_Ridge>`_
    - | Matrix Factorization (MF) with a piecewise linear-quadratic (PLQ) objective with a ridge penalty.
 
 .. toctree::
@@ -62,5 +61,6 @@ List of Tutorials
    ./tutorials/ReHLine_ERM
    ./tutorials/loss
    ./tutorials/constraint
+   ./tutorials/ReHLine_sklearn
    ./tutorials/warmstart
 
diff --git a/doc/source/tutorials/ReHLine_sklearn.rst b/doc/source/tutorials/ReHLine_sklearn.rst