Commits (81):

- 2ec9530: WIP Add module about clustering (May 22, 2025)
- 8ab8ff8: Iter on Kmeans exercise (May 27, 2025)
- d828ada: Synch exercise notebooks (May 27, 2025)
- d94eb14: Add notebooks on hdbscan and feature engineering (May 27, 2025)
- fd41577: Reworked the k-means intro notebook to use penguins dataset (ogrisel, May 28, 2025)
- 90bdfed: Rerender the first notebook (ogrisel, May 28, 2025)
- 8ccd658: Add some missing cell markers (ogrisel, May 28, 2025)
- 0a2dfa3: Rerender the first notebook (ogrisel, May 28, 2025)
- 83bc291: More missing markers (ogrisel, May 28, 2025)
- 60d32eb: Rerender the first notebook (ogrisel, May 28, 2025)
- 2621cec: Improve phrasing / fix typos (ogrisel, May 28, 2025)
- 57fafec: Typo (ogrisel, May 28, 2025)
- 51a0d13: Rerender the first notebook (ogrisel, May 28, 2025)
- 6b26222: Iter on Olivier's work (May 30, 2025)
- fe58a3e: General rewording (Jun 2, 2025)
- 7374ae5: Apply suggestions from code review (ArturoAmorQ, Jun 2, 2025)
- fd03b1a: Rephrasing in cluster_kmeans_sol_01.py (ogrisel, Jun 3, 2025)
- da9a3ec: Resynchronize exercise and fix CI (Jun 3, 2025)
- d4ad40c: Wording (Jun 3, 2025)
- 8ab3e29: Use MAE to score predicted house prices (Jun 4, 2025)
- 52f244a: Solve plotly DeprecationWarning (Jun 4, 2025)
- dec3453: Prefer make_column_transformer as per #831 (Jun 4, 2025)
- 00c41a7: Iter on hdbscan notebook (Jun 5, 2025)
- ffe1855: Remove redundant paragraph (Jun 5, 2025)
- 26cd3d2: Rename exercise and solution (Jun 5, 2025)
- 50e9bc0: Add exercise and solution using AMI (Jun 5, 2025)
- d7e03e6: Fix exercise (Jun 6, 2025)
- 900f0da: Small improvements to the solution of exercise 02 (ogrisel, Jun 6, 2025)
- 87d438f: Add the skrub dependency (ogrisel, Jun 6, 2025)
- 166868c: Expand analysis a bit (ogrisel, Jun 6, 2025)
- 74245d3: Improvements in the HDBSCAN notebook (ogrisel, Jun 6, 2025)
- 9bfe2f2: Reworded analysis of the BBC text clustering notebook + use cross-val… (ogrisel, Jun 6, 2025)
- a765e53: Improvements in the supervised metrics notebook (ogrisel, Jun 6, 2025)
- 024545f: Add discussion on silhouette for hdbscan (Jun 9, 2025)
- 2647274: Fix warning and plot not rendering (Jun 9, 2025)
- bd32b87: Add intro, overview and sections (Jun 10, 2025)
- 27f4e52: Iter discussion on silhouette for hdbscan (Jun 10, 2025)
- b8f6646: Tweaks (Jun 10, 2025)
- d328631: Add first quiz on clustering and related images (Jun 11, 2025)
- c32eabe: Wording tweaks (Jun 11, 2025)
- 89110ea: Add second quiz on clustering (Jun 20, 2025)
- d4b975e: Apply suggestions from code review (ArturoAmorQ, Jun 24, 2025)
- b016d8b: Update jupyter-book/clustering/clustering_module_take_away.md (ArturoAmorQ, Jun 24, 2025)
- a72be00: Synchronize quizzes from review (Jun 24, 2025)
- 60a58c4: Merge branch 'clustering_module' of github.com:ArturoAmorQ/scikit-lea… (Jun 24, 2025)
- aff2dba: Synchronize notebooks (Jun 24, 2025)
- 0c161ee: Add clustering wrap-up quiz (Jul 15, 2025)
- 54ff951: Fix bug in wrap-up quiz (Jul 15, 2025)
- 3fa71e9: Add wrap-up quiz to toc (Jul 15, 2025)
- 5c7b28c: Fix a couple of bugs (Aug 12, 2025)
- 83fc894: Feature branch to update to 1.6 (#813) (ogrisel, May 27, 2025)
- 49b4474: FIX Penguin figures not rendering (#828) (ArturoAmorQ, May 27, 2025)
- c7009bc: Missing notebook sync (#838) (ogrisel, May 27, 2025)
- 9a3dd25: MAINT Changed the use of ColumnTransformer to make_column_transformer… (SebastienMelo, May 27, 2025)
- 110f5cc: Fix typos (#839) (omahs, Jun 2, 2025)
- 36ca1c5: minor improvements in wording and import statement order (#841) (rouk1, Jun 2, 2025)
- 8ca0182: Minor typos fixups (#842) (davidjsonn, Jun 9, 2025)
- 4f9b633: MTN px parallel render fix (#843) (SebastienMelo, Jun 12, 2025)
- 4aa8c13: MAINT Update matplotlib to v3.10.3 (#846) (brospars, Jun 17, 2025)
- 1210659: MTN Heat map explanation (#833) (SebastienMelo, Jul 9, 2025)
- ff94dea: MTN Bias variance quizz (#849) (SebastienMelo, Jul 9, 2025)
- e1a5f07: MNT Add info about the estimators html diagram (#844) (ArturoAmorQ, Jul 15, 2025)
- a7df7ad: Update notebooks (Jul 15, 2025)
- c962329: Add cross-validation diagram to GridSearchCV notebook (#847) (student-ChestaVashishtha, Jul 30, 2025)
- 00ddb9b: MTN Synchronized the quizzes for module 1 and 7 (SebastienMelo, Aug 7, 2025)
- 271555a: MTN Fix the parallel plots (SebastienMelo, Aug 7, 2025)
- 0df2714: MTN Hyperparameter tuning with grid search (SebastienMelo, Aug 13, 2025)
- d263961: MTN Made the distinction between predictor and transformer clearer (#… (SebastienMelo, Aug 13, 2025)
- 090c715: MTN Added model state to the glossary (#857) (SebastienMelo, Aug 13, 2025)
- e5d4bf6: MTN Proposal for explantation of what are iterations (#859) (SebastienMelo, Sep 10, 2025)
- b943bed: FIX HistGradientBoosting fitting time too long (#860) (SebastienMelo, Sep 17, 2025)
- 2c8b60a: Update README and Adding License txt file for Bike Rides Dataset (#858) (student-ChestaVashishtha, Sep 26, 2025)
- 29ad752: Remove introduction paragraph meant for maintainers (#863) (SebastienMelo, Oct 2, 2025)
- a4bd6ea: Improve wording in definition of numerical features (#861) (SebastienMelo, Oct 2, 2025)
- e39860f: Explicit that fitting time is measured in seconds (#862) (SebastienMelo, Oct 2, 2025)
- a516bdc: Add dataset credits and licenses (#864) (SebastienMelo, Oct 9, 2025)
- 22bb873: Update notebooks (Oct 9, 2025)
- 8e3fe4b: Change BBC news to Wikinews (Oct 23, 2025)
- da12c87: Merge branch 'INRIA:main' into clustering_module (ArturoAmorQ, Oct 23, 2025)
- d8fe672: Update corresponding notebook (Oct 23, 2025)
- b235eaa: Add credits to newly added datasets (Oct 23, 2025)
1,251 changes: 1,251 additions & 0 deletions datasets/bbc_news.csv

Large diffs are not rendered by default.

5,882 changes: 5,882 additions & 0 deletions datasets/rfm_segmentation.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment-dev.yml
@@ -7,6 +7,7 @@ dependencies:
- matplotlib-base
- seaborn >= 0.13
- plotly >= 5.10
- skrub
- jupytext
- beautifulsoup4
- IPython
1 change: 1 addition & 0 deletions environment.yml
@@ -8,6 +8,7 @@ dependencies:
- pandas >= 1
- matplotlib-base
- seaborn >= 0.13
- skrub
- jupyterlab
- notebook
- plotly >= 5.10
1,004 changes: 1,004 additions & 0 deletions figures/clustering_quiz_kmeans_not_scaled.svg
980 changes: 980 additions & 0 deletions figures/clustering_quiz_kmeans_scaled.svg
17 changes: 17 additions & 0 deletions jupyter-book/_toc.yml
@@ -236,3 +236,20 @@ parts:
chapters:
- file: python_scripts/dev_features_importance
- file: interpretation/interpretation_quiz
- caption: 🚧 Clustering
chapters:
- file: clustering/clustering_module_intro
- file: clustering/clustering_kmeans_index
sections:
- file: python_scripts/clustering_kmeans
- file: python_scripts/clustering_ex_01
- file: python_scripts/clustering_sol_01
- file: python_scripts/clustering_supervised_metrics
- file: python_scripts/clustering_ex_02
- file: python_scripts/clustering_sol_02
- file: clustering/clustering_quiz_m4_01
- file: clustering/clustering_assumptions_index
sections:
- file: python_scripts/clustering_hdbscan
- file: python_scripts/clustering_transformer
- file: clustering/clustering_module_take_away
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_assumptions_index.md
@@ -0,0 +1,5 @@
# Clustering when k-means assumptions fail

```{tableofcontents}

```
5 changes: 5 additions & 0 deletions jupyter-book/clustering/clustering_kmeans_index.md
@@ -0,0 +1,5 @@
# K-means

```{tableofcontents}

```
56 changes: 56 additions & 0 deletions jupyter-book/clustering/clustering_module_intro.md
@@ -0,0 +1,56 @@
# Module overview

## What you will learn

<!-- Give in plain English what the module is about -->

In the previous module, we introduced the need for hyperparameter tuning.
To support this, we emphasized the role of a separate **validation set** for
selecting the best hyperparameters, and the use of **nested cross-validation**
to reliably estimate the generalization performance of the tuned models and the
stability of the hyperparameter selection.

In this module we present k-means clustering, including aspects like cluster
stability and evaluation metrics such as the silhouette score and inertia.
We also introduce supervised clustering metrics that leverage annotated data to
assess clustering quality.
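A minimal sketch of the two unsupervised metrics mentioned above, using
synthetic blobs as a hypothetical stand-in for the datasets used in the
notebooks:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy data with 3 well-separated groups; scaling matters for k-means
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia (WCSS): sum of squared distances of samples to their centroid
print(f"inertia (WCSS): {kmeans.inertia_:.2f}")
# Silhouette score: between -1 and 1, higher means better-separated clusters
print(f"silhouette score: {silhouette_score(X, kmeans.labels_):.2f}")
```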

Finally, we discuss what to do when the assumptions of k-means do not hold, such
as using HDBSCAN for non-convex clusters, and show how k-means can still be
useful as a feature engineering step in a supervised learning pipeline, by using
distances to centroids as features.


## Before getting started

<!-- Give the required skills for the module -->

The technical skills required to follow this module are:

- skills acquired during the "The Predictive Modeling Pipeline" module, namely
  basic usage of scikit-learn;
- skills acquired during the "Selecting The Best Model" module, mainly the
  concept of validation curves and the notion of model stability.

<!-- Point to resources to learning these skills -->

## Objectives and time schedule

<!-- Give the learning objectives -->

The objectives of this module are the following:

- apply k-means clustering and assess its behavior across different settings
- evaluate cluster quality using unsupervised metrics such as silhouette score
and WCSS (also known as inertia)
- interpret and compute supervised clustering metrics (e.g., AMI, ARI,
V-measure) when ground truth labels are available
- understand the limitations of k-means and identify cases where its assumptions
(e.g., convex, isotropic clusters) do not hold
- use HDBSCAN as an alternative clustering method suited for irregular or
non-convex cluster shapes
- integrate k-means into a supervised learning pipeline by using distances to
centroids as features.
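As a small illustration of the supervised metrics named above, all three are
invariant to a permutation of the cluster labels (the toy labels below are
hypothetical):

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    v_measure_score,
)

true_labels = [0, 0, 1, 1, 2, 2]
# Same partition, but with the cluster ids permuted: a perfect clustering
pred_labels = [2, 2, 0, 0, 1, 1]

print(adjusted_rand_score(true_labels, pred_labels))          # 1.0
print(adjusted_mutual_info_score(true_labels, pred_labels))   # 1.0
print(v_measure_score(true_labels, pred_labels))              # 1.0
```

ARI and AMI are additionally adjusted for chance, so random label assignments
score close to 0 on average.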

<!-- Give the investment in time -->

The estimated time to go through this module is about 6 hours.
26 changes: 26 additions & 0 deletions jupyter-book/clustering/clustering_module_take_away.md
@@ -0,0 +1,26 @@
# Main take-away

## Wrap-up

<!-- Quick wrap-up for the module -->

In this module, we presented clustering as an unsupervised learning task,
focusing on k-means and how to evaluate its results using both internal
(unsupervised) and supervised metrics.

We explored the concept of cluster stability, addressed the limitations of
k-means when clusters are not convex, and introduced HDBSCAN as an alternative.

Finally, we showed how clustering can be integrated into supervised pipelines
through feature engineering.

## To go further

<!-- Some extra links of content to go further -->

You can refer to the following scikit-learn examples, which are related to
the concepts covered in this module:

- [Adjustment for chance in clustering performance evaluation](https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html)
- [Demonstration of k-means assumptions](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html)
- [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html)
89 changes: 89 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_01.md
@@ -0,0 +1,89 @@
# ✅ Quiz M4.01

```{admonition} Question
Imagine you work for a music streaming platform that hosts a vast library of
songs, playlists, and podcasts. You have access to detailed listening data from
millions of users. For each user, you know their most-listened genres, the
devices they use, their average session length, and how often they explore new
content.

You want to segment users based on their listening patterns to improve
personalized recommendations, without relying on rigid, predefined labels like
"pop fan" or "casual listener" which may fail to capture the complexity of
their behavior.

What kind of problem are you dealing with?

- a) a supervised task
- b) an unsupervised task
- c) a classification task
- d) a clustering task

_Select all answers that apply_
```

+++

```{admonition} Question
The plots below show the cluster labels found by k-means with 3 clusters; the
two runs differ only in the scaling step. Based on this, which conclusion can be
drawn?

![K-means not scaled](../../figures/clustering_quiz_kmeans_not_scaled.svg)
![K-means scaled](../../figures/clustering_quiz_kmeans_scaled.svg)

- a) without scaling, cluster assignment is dominated by the feature in the vertical axis
- b) without scaling, cluster assignment is dominated by the feature in the horizontal axis
- c) without scaling, both features contribute equally to cluster assignment

_Select a single answer_
```

+++

```{admonition} Question
Which of the following statements correctly describe factors that affect the
stability of K-Means clustering across different resamplings of the data?

- a) K-Means can produce different results on resampled datasets due to
sensitivity to initialization
- b) If data is unevenly distributed, the stability improves when increasing the
parameter `n_init` in the "k-means++" initialization
- c) Stability across resamplings of the data is guaranteed after feature
scaling
- d) Increasing the number of clusters typically reduces the variability of
results across resamples

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe how WCSS (within-cluster
sum of squares, or inertia) behaves in K-means clustering?

- a) For a fixed number of clusters, WCSS is lower when clusters are compact
- b) For a fixed number of clusters, WCSS is lower for larger clusters
- c) For a fixed number of clusters, lower WCSS implies lower computational cost
during training
- d) WCSS always decreases as the number of clusters increases

_Select all answers that apply_
```

+++

```{admonition} Question
Which of the following statements correctly describe differences between
supervised and unsupervised clustering metrics?

- a) Supervised clustering metrics such as ARI and AMI require access to ground
truth labels to evaluate clustering performance
- b) WCSS and the silhouette score evaluate internal cluster structure without
needing reference labels
- c) V-measure is zero when labels are assigned completely at random
- d) Supervised clustering metrics are not useful if the number of clusters does
not match the number of predefined classes

_Select all answers that apply_
```
45 changes: 45 additions & 0 deletions jupyter-book/clustering/clustering_quiz_m4_02.md
@@ -0,0 +1,45 @@
# ✅ Quiz M4.02

```{admonition} Question
If we increase `min_cluster_size` in HDBSCAN, what happens to the number of
points labeled as noise?

- a) it decreases
- b) it increases
- c) it stays the same
- d) HDBSCAN fails to converge

_Select a single answer_

```

+++

```{admonition} Question
What happens to k-means centroids in the presence of outliers?

- a) they move towards the outliers
- b) they are not sensitive to outliers
- c) if a centroid is initialized on an outlier, it may remain isolated in
subsequent iterations

_Select all answers that apply_

```

+++

```{admonition} Question
A `KMeans` instance with `n_clusters=10` is used to transform the latitude and
longitude features in a supervised learning pipeline. Given that the original
dataset consists of `n_features` features, including those two, how many
features are passed to the final estimator of the pipeline?

- a) `n_features` + 10
- b) `n_features` + 8
- c) `n_features` - 2
- d) `n_features`

_Select a single answer_

```