Commits (81):

- 2ec9530: WIP Add module about clustering (May 22, 2025)
- 8ab8ff8: Iter on Kmeans exercise (May 27, 2025)
- d828ada: Synch exercise notebooks (May 27, 2025)
- d94eb14: Add notebooks on hdbscan and feature engineering (May 27, 2025)
- fd41577: Reworked the k-means intro notebook to use penguins dataset (ogrisel, May 28, 2025)
- 90bdfed: Rerender the first notebook (ogrisel, May 28, 2025)
- 8ccd658: Add some missing cell markers (ogrisel, May 28, 2025)
- 0a2dfa3: Rerender the first notebook (ogrisel, May 28, 2025)
- 83bc291: More missing markers (ogrisel, May 28, 2025)
- 60d32eb: Rerender the first notebook (ogrisel, May 28, 2025)
- 2621cec: Improve phrasing / fix typos (ogrisel, May 28, 2025)
- 57fafec: Typo (ogrisel, May 28, 2025)
- 51a0d13: Rerender the first notebook (ogrisel, May 28, 2025)
- 6b26222: Iter on Olivier's work (May 30, 2025)
- fe58a3e: General rewording (Jun 2, 2025)
- 7374ae5: Apply suggestions from code review (ArturoAmorQ, Jun 2, 2025)
- fd03b1a: Rephrasing in cluster_kmeans_sol_01.py (ogrisel, Jun 3, 2025)
- da9a3ec: Resynchronize exercise and fix CI (Jun 3, 2025)
- d4ad40c: Wording (Jun 3, 2025)
- 8ab3e29: Use MAE to score predicted house prices (Jun 4, 2025)
- 52f244a: Solve plotly DeprecationWarning (Jun 4, 2025)
- dec3453: Prefer make_column_transformer as per #831 (Jun 4, 2025)
- 00c41a7: Iter on hdbscan notebook (Jun 5, 2025)
- ffe1855: Remove redundant paragraph (Jun 5, 2025)
- 26cd3d2: Rename exercise and solution (Jun 5, 2025)
- 50e9bc0: Add exercise and solution using AMI (Jun 5, 2025)
- d7e03e6: Fix exercise (Jun 6, 2025)
- 900f0da: Small improvements to the solution of exercise 02 (ogrisel, Jun 6, 2025)
- 87d438f: Add the skrub dependency (ogrisel, Jun 6, 2025)
- 166868c: Expand analysis a bit (ogrisel, Jun 6, 2025)
- 74245d3: Improvements in the HDBSCAN notebook (ogrisel, Jun 6, 2025)
- 9bfe2f2: Reworded analysis of the BBC text clustering notebook + use cross-val… (ogrisel, Jun 6, 2025)
- a765e53: Improvements in the supervised metrics notebook (ogrisel, Jun 6, 2025)
- 024545f: Add discussion on silhouette for hdbscan (Jun 9, 2025)
- 2647274: Fix warning and plot not rendering (Jun 9, 2025)
- bd32b87: Add intro, overview and sections (Jun 10, 2025)
- 27f4e52: Iter discussion on silhouette for hdbscan (Jun 10, 2025)
- b8f6646: Tweaks (Jun 10, 2025)
- d328631: Add first quiz on clustering and related images (Jun 11, 2025)
- c32eabe: Wording tweaks (Jun 11, 2025)
- 89110ea: Add second quiz on clustering (Jun 20, 2025)
- d4b975e: Apply suggestions from code review (ArturoAmorQ, Jun 24, 2025)
- b016d8b: Update jupyter-book/clustering/clustering_module_take_away.md (ArturoAmorQ, Jun 24, 2025)
- a72be00: Synchronize quizzes from review (Jun 24, 2025)
- 60a58c4: Merge branch 'clustering_module' of github.com:ArturoAmorQ/scikit-lea… (Jun 24, 2025)
- aff2dba: Synchronize notebooks (Jun 24, 2025)
- 0c161ee: Add clustering wrap-up quiz (Jul 15, 2025)
- 54ff951: Fix bug in wrap-up quiz (Jul 15, 2025)
- 3fa71e9: Add wrap-up quiz to toc (Jul 15, 2025)
- 5c7b28c: Fix a couple of bugs (Aug 12, 2025)
- 83fc894: Feature branch to update to 1.6 (#813) (ogrisel, May 27, 2025)
- 49b4474: FIX Penguin figures not rendering (#828) (ArturoAmorQ, May 27, 2025)
- c7009bc: Missing notebook sync (#838) (ogrisel, May 27, 2025)
- 9a3dd25: MAINT Changed the use of ColumnTransformer to make_column_transformer… (SebastienMelo, May 27, 2025)
- 110f5cc: Fix typos (#839) (omahs, Jun 2, 2025)
- 36ca1c5: minor improvements in wording and import statement order (#841) (rouk1, Jun 2, 2025)
- 8ca0182: Minor typos fixups (#842) (davidjsonn, Jun 9, 2025)
- 4f9b633: MTN px parallel render fix (#843) (SebastienMelo, Jun 12, 2025)
- 4aa8c13: MAINT Update matplotlib to v3.10.3 (#846) (brospars, Jun 17, 2025)
- 1210659: MTN Heat map explanation (#833) (SebastienMelo, Jul 9, 2025)
- ff94dea: MTN Bias variance quizz (#849) (SebastienMelo, Jul 9, 2025)
- e1a5f07: MNT Add info about the estimators html diagram (#844) (ArturoAmorQ, Jul 15, 2025)
- a7df7ad: Update notebooks (Jul 15, 2025)
- c962329: Add cross-validation diagram to GridSearchCV notebook (#847) (student-ChestaVashishtha, Jul 30, 2025)
- 00ddb9b: MTN Synchronized the quizzes for module 1 and 7 (SebastienMelo, Aug 7, 2025)
- 271555a: MTN Fix the parallel plots (SebastienMelo, Aug 7, 2025)
- 0df2714: MTN Hyperparameter tuning with grid search (SebastienMelo, Aug 13, 2025)
- d263961: MTN Made the distinction between predictor and transformer clearer (#… (SebastienMelo, Aug 13, 2025)
- 090c715: MTN Added model state to the glossary (#857) (SebastienMelo, Aug 13, 2025)
- e5d4bf6: MTN Proposal for explantation of what are iterations (#859) (SebastienMelo, Sep 10, 2025)
- b943bed: FIX HistGradientBoosting fitting time too long (#860) (SebastienMelo, Sep 17, 2025)
- 2c8b60a: Update README and Adding License txt file for Bike Rides Dataset (#858) (student-ChestaVashishtha, Sep 26, 2025)
- 29ad752: Remove introduction paragraph meant for maintainers (#863) (SebastienMelo, Oct 2, 2025)
- a4bd6ea: Improve wording in definition of numerical features (#861) (SebastienMelo, Oct 2, 2025)
- e39860f: Explicit that fitting time is measured in seconds (#862) (SebastienMelo, Oct 2, 2025)
- a516bdc: Add dataset credits and licenses (#864) (SebastienMelo, Oct 9, 2025)
- 22bb873: Update notebooks (Oct 9, 2025)
- 8e3fe4b: Change BBC news to Wikinews (Oct 23, 2025)
- da12c87: Merge branch 'INRIA:main' into clustering_module (ArturoAmorQ, Oct 23, 2025)
- d8fe672: Update corresponding notebook (Oct 23, 2025)
- b235eaa: Add credits to newly added datasets (Oct 23, 2025)
1,251 changes: 1,251 additions & 0 deletions datasets/bbc_news.csv

Large diffs are not rendered by default.

5,882 changes: 5,882 additions & 0 deletions datasets/rfm_segmentation.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment-dev.yml
@@ -7,6 +7,7 @@ dependencies:
- matplotlib-base
- seaborn >= 0.13
- plotly >= 5.10
- skrub
- jupytext
- beautifulsoup4
- IPython
1 change: 1 addition & 0 deletions environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pandas >= 1
- matplotlib-base
- seaborn >= 0.13
- skrub
- jupyterlab
- notebook
- plotly >= 5.10
1,004 changes: 1,004 additions & 0 deletions figures/clustering_quiz_kmeans_not_scaled.svg
980 changes: 980 additions & 0 deletions figures/clustering_quiz_kmeans_scaled.svg
17 changes: 17 additions & 0 deletions jupyter-book/_toc.yml
@@ -236,3 +236,20 @@ parts:
chapters:
- file: python_scripts/dev_features_importance
- file: interpretation/interpretation_quiz
- caption: 🚧 Clustering
chapters:
- file: clustering/clustering_module_intro
- file: clustering/clustering_kmeans_index
sections:
- file: python_scripts/clustering_kmeans
- file: python_scripts/clustering_ex_01
- file: python_scripts/clustering_sol_01
- file: python_scripts/clustering_supervised_metrics
- file: python_scripts/clustering_ex_02
- file: python_scripts/clustering_sol_02
- file: clustering/clustering_quiz_m4_01
- file: clustering/clustering_assumptions_index
sections:
- file: python_scripts/clustering_hdbscan
- file: python_scripts/clustering_transformer
- file: clustering/clustering_module_take_away
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_assumptions_index.md
@@ -0,0 +1,5 @@
# Clustering when k-means assumptions fail

```{tableofcontents}

```
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_kmeans_index.md
@@ -0,0 +1,5 @@
# K-means

```{tableofcontents}

```
56 changes: 56 additions & 0 deletions jupyter-book/clustering/clustering_module_intro.md
@@ -0,0 +1,56 @@
# Module overview

## What you will learn

<!-- Give in plain English what the module is about -->

In the previous module, we introduced the need for hyperparameter tuning.
To support this, we emphasized the role of a separate **validation set** for
selecting the best hyperparameters, and the use of **nested cross-validation**
to reliably estimate the generalization performance of the tuned models and the
stability of the hyperparameter selection.

In this module we present k-means clustering, including aspects like cluster
stability and evaluation metrics such as the silhouette score and inertia.
We also introduce supervised clustering metrics that leverage annotated data to
assess clustering quality.
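A minimal sketch of the two unsupervised metrics mentioned above, using
synthetic blobs as a hypothetical stand-in for the datasets used in the
notebooks:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy data with 3 well-separated groups; scaling matters for k-means
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia (WCSS): sum of squared distances of samples to their centroid
print(f"inertia (WCSS): {kmeans.inertia_:.2f}")
# Silhouette score: between -1 and 1, higher means better-separated clusters
print(f"silhouette score: {silhouette_score(X, kmeans.labels_):.2f}")
```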

Finally, we discuss what to do when the assumptions of k-means do not hold, such
as using HDBSCAN for non-convex clusters, and show how k-means can still be
useful as a feature engineering step in a supervised learning pipeline, by using
distances to centroids as features.


## Before getting started

<!-- Give the required skills for the module -->

The technical skills required to follow this module are:

- skills acquired during the "The Predictive Modeling Pipeline" module, namely
  basic usage of scikit-learn;
- skills acquired during the "Selecting The Best Model" module, mainly the
  concept of validation curves and the notion of model stability.

<!-- Point to resources to learning these skills -->

## Objectives and time schedule

<!-- Give the learning objectives -->

The objectives of this module are the following:

- apply k-means clustering and assess its behavior across different settings
- evaluate cluster quality using unsupervised metrics such as silhouette score
and WCSS (also known as inertia)
- interpret and compute supervised clustering metrics (e.g., AMI, ARI,
V-measure) when ground truth labels are available
- understand the limitations of k-means and identify cases where its assumptions
(e.g., convex, isotropic clusters) do not hold
- use HDBSCAN as an alternative clustering method suited for irregular or
non-convex cluster shapes
- integrate k-means into a supervised learning pipeline by using distances to
centroids as features.
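As a small illustration of the supervised metrics named above, all three are
invariant to a permutation of the cluster labels (the toy labels below are
hypothetical):

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    v_measure_score,
)

true_labels = [0, 0, 1, 1, 2, 2]
# Same partition, but with the cluster ids permuted: a perfect clustering
pred_labels = [2, 2, 0, 0, 1, 1]

print(adjusted_rand_score(true_labels, pred_labels))          # 1.0
print(adjusted_mutual_info_score(true_labels, pred_labels))   # 1.0
print(v_measure_score(true_labels, pred_labels))              # 1.0
```

ARI and AMI are additionally adjusted for chance, so random label assignments
score close to 0 on average.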

<!-- Give the investment in time -->

The estimated time to go through this module is about 6 hours.
26 changes: 26 additions & 0 deletions jupyter-book/clustering/clustering_module_take_away.md
@@ -0,0 +1,26 @@
# Main take-away

## Wrap-up

<!-- Quick wrap-up for the module -->

In this module, we presented clustering as an unsupervised learning task,
focusing on k-means and how to evaluate its results using both internal
(unsupervised) and supervised metrics.

We explored the concept of cluster stability, addressed the limitations of
k-means when clusters are not convex, and introduced HDBSCAN as an alternative.

Finally, we showed how clustering can be integrated into supervised pipelines
through feature engineering.

## To go further

<!-- Some extra links of content to go further -->

You can refer to the following scikit-learn examples, which are related to
the concepts covered in this module:

- [Adjustment for chance in clustering performance evaluation](https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html)
- [Demonstration of k-means assumptions](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)
- [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html)
89 changes: 89 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_01.md
@@ -0,0 +1,89 @@
# ✅ Quiz M4.01

```{admonition} Question
Imagine you work for a music streaming platform that hosts a vast library of
songs, playlists, and podcasts. You have access to detailed listening data from
millions of users. For each user, you know their most-listened genres, the
devices they use, their average session length, and how often they explore new
content.

You want to segment users based on their listening patterns to improve
personalized recommendations, without relying on rigid, predefined labels like
"pop fan" or "casual listener" which may fail to capture the complexity of
their behavior.

What kind of problem are you dealing with?

- a) a supervised task
- b) an unsupervised task
- c) a classification task
- d) a clustering task

_Select all answers that apply_
```

+++

```{admonition} Question
The plots below show the cluster labels found by k-means with 3 clusters; the
two runs differ only in the scaling step. Based on this, which conclusion can be
drawn?

![K-means not scaled](../../figures/clustering_quiz_kmeans_not_scaled.svg)
![K-means scaled](../../figures/clustering_quiz_kmeans_scaled.svg)

- a) without scaling, cluster assignment is dominated by the feature in the vertical axis
- b) without scaling, cluster assignment is dominated by the feature in the horizontal axis
- c) without scaling, both features contribute equally to cluster assignment

_Select a single answer_
```

+++

```{admonition} Question
Which of the following statements correctly describe factors that affect the
stability of K-Means clustering across different resamplings of the data?

- a) K-Means can produce different results on resampled datasets due to
sensitivity to initialization
- b) If data is unevenly distributed, the stability improves when increasing the
parameter `n_init` in the "k-means++" initialization
- c) Stability across resamplings of the data is guaranteed after feature
scaling
- d) Increasing the number of clusters typically reduces the variability of
results across resamples

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe how WCSS (within-cluster
sum of squares, or inertia) behaves in K-means clustering?

- a) For a fixed number of clusters, WCSS is lower when clusters are compact
- b) For a fixed number of clusters, WCSS is lower for larger clusters
- c) For a fixed number of clusters, lower WCSS implies lower computational cost
during training
- d) WCSS always decreases as the number of clusters increases

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe differences between
supervised and unsupervised clustering metrics?

- a) Supervised clustering metrics such as ARI and AMI require access to ground
truth labels to evaluate clustering performance
- b) WCSS and the silhouette score evaluate internal cluster structure without
needing reference labels
- c) V-measure is zero when labels are assigned completely at random
- d) Supervised clustering metrics are not useful if the number of clusters does
not match the number of predefined classes

_Select all answers that apply_
```
45 changes: 45 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_02.md
@@ -0,0 +1,45 @@
# ✅ Quiz M4.02

```{admonition} Question
If we increase `min_cluster_size` in HDBSCAN, what happens to the number of
points labeled as noise?

- a) it decreases
- b) it increases
- c) it stays the same
- d) HDBSCAN fails to converge

_Select a single answer_

```

+++

```{admonition} Question
What happens to k-means centroids in the presence of outliers?

- a) they move towards the outliers
- b) they are not sensitive to outliers
- c) if a centroid is initialized on an outlier, it may remain isolated in
subsequent iterations

_Select all answers that apply_

```

+++

```{admonition} Question
A `KMeans` instance with `n_clusters=10` is used to transform the latitude and
longitude features in a supervised learning pipeline. Given that the original
dataset consists of `n_features` features, including those two, how many
features are passed to the final estimator of the pipeline?

- a) `n_features` + 10
- b) `n_features` + 8
- c) `n_features` - 2
- d) `n_features`

_Select a single answer_

```