WIP Add module about clustering #836
Open: ArturoAmorQ wants to merge 81 commits into INRIA:main from ArturoAmorQ:clustering_module
Changes from 41 commits
2ec9530
WIP Add module about clustering
8ab8ff8
Iter on Kmeans exercise
d828ada
Synch exercise notebooks
d94eb14
Add notebooks on hdbscan and feature engineering
fd41577
Reworked the k-means intro notebook to use penguins dataset
ogrisel 90bdfed
Rerender the first notebook
ogrisel 8ccd658
Add some missing cell markers
ogrisel 0a2dfa3
Rerender the first notebook
ogrisel 83bc291
More missing markers
ogrisel 60d32eb
Rerender the first notebook
ogrisel 2621cec
Improve phrasing / fix typos
ogrisel 57fafec
Typo
ogrisel 51a0d13
Rerender the first notebook
ogrisel 6b26222
Iter on Olivier's work
fe58a3e
General rewording
7374ae5
Apply suggestions from code review
ArturoAmorQ fd03b1a
Rephrasing in cluster_kmeans_sol_01.py
ogrisel da9a3ec
Resynchronize exercise and fix CI
d4ad40c
Wording
8ab3e29
Use MAE to score predicted house prices
52f244a
Solve plotly DeprecationWarning
dec3453
Prefer make_column_transformer as per #831
00c41a7
Iter on hdbscan notebook
ffe1855
Remove redundant paragraph
26cd3d2
Rename exercise and solution
50e9bc0
Add exercise and solution using AMI
d7e03e6
Fix exercise
900f0da
Small improvements to the solution of exercise 02
ogrisel 87d438f
Add the skrub dependency
ogrisel 166868c
Expand analysis a bit
ogrisel 74245d3
Improvements in the HDBSCAN notebook
ogrisel 9bfe2f2
Reworded analysis of the BBC text clustering notebook + use cross-val…
ogrisel a765e53
Improvements in the supervised metrics notebook
ogrisel 024545f
Add discussion on silhouette for hdbscan
2647274
Fix warning and plot not rendering
bd32b87
Add intro, overview and sections
27f4e52
Iter discussion on silhouette for hdbscan
b8f6646
Tweaks
d328631
Add first quiz on clustering and related images
c32eabe
Wording tweaks
89110ea
Add second quiz on clustering
d4b975e
Apply suggestions from code review
ArturoAmorQ b016d8b
Update jupyter-book/clustering/clustering_module_take_away.md
ArturoAmorQ a72be00
Synchronize quizzes from review
60a58c4
Merge branch 'clustering_module' of github.com:ArturoAmorQ/scikit-lea…
aff2dba
Synchronize notebooks
0c161ee
Add clustering wrap-up quiz
54ff951
Fix bug in wrap-up quiz
3fa71e9
Add wrap-up quiz to toc
5c7b28c
Fix a couple of bugs
83fc894
Feature branch to update to 1.6 (#813)
ogrisel 49b4474
FIX Penguin figures not rendering (#828)
ArturoAmorQ c7009bc
Missing notebook sync (#838)
ogrisel 9a3dd25
MAINT Changed the use of ColumnTransformer to make_column_transformer…
SebastienMelo 110f5cc
Fix typos (#839)
omahs 36ca1c5
minor improvements in wording and import statement order (#841)
rouk1 8ca0182
Minor typos fixups (#842)
davidjsonn 4f9b633
MTN px parallel render fix (#843)
SebastienMelo 4aa8c13
MAINT Update matplotlib to v3.10.3 (#846)
brospars 1210659
MTN Heat map explanation (#833)
SebastienMelo ff94dea
MTN Bias variance quizz (#849)
SebastienMelo e1a5f07
MNT Add info about the estimators html diagram (#844)
ArturoAmorQ a7df7ad
Update notebooks
c962329
Add cross-validation diagram to GridSearchCV notebook (#847)
student-ChestaVashishtha 00ddb9b
MTN Synchronized the quizzes for module 1 and 7
SebastienMelo 271555a
MTN Fix the parallel plots
SebastienMelo 0df2714
MTN Hyperparameter tuning with grid search
SebastienMelo d263961
MTN Made the distinction between predictor and transformer clearer (#…
SebastienMelo 090c715
MTN Added model state to the glossary (#857)
SebastienMelo e5d4bf6
MTN Proposal for explantation of what are iterations (#859)
SebastienMelo b943bed
FIX HistGradientBoosting fitting time too long (#860)
SebastienMelo 2c8b60a
Update README and Adding License txt file for Bike Rides Dataset (#858)
student-ChestaVashishtha 29ad752
Remove introduction paragraph meant for maintainers (#863)
SebastienMelo a4bd6ea
Improve wording in definition of numerical features (#861)
SebastienMelo e39860f
Explicit that fitting time is measured in seconds (#862)
SebastienMelo a516bdc
Add dataset credits and licenses (#864)
SebastienMelo 22bb873
Update notebooks
8e3fe4b
Change BBC news to Wikinews
da12c87
Merge branch 'INRIA:main' into clustering_module
ArturoAmorQ d8fe672
Update corresponding notebook
b235eaa
Add credits to newly added datasets
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # Clustering when k-means assumptions fail | ||
|
|
||
| ```{tableofcontents} | ||
|
|
||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # K-means | ||
|
|
||
| ```{tableofcontents} | ||
|
|
||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| # Module overview | ||
|
|
||
| ## What you will learn | ||
|
|
||
| <!-- Give in plain English what the module is about --> | ||
|
|
||
| In the previous module, we introduced the need for hyperparameter tuning. | ||
| To support this, we emphasized the role of a separate **validation set** for | ||
| selecting the best hyperparameters, and the use of **nested cross-validation** | ||
| to more reliably estimate their generalization performance and stability. | ||
|
|
||
| In this module we present KMeans clustering, including aspects like cluster | ||
| stability and evaluation metrics such as silhouette score and inertia. | ||
| We also introduce supervised clustering metrics that leverage annotated data to | ||
| assess clustering quality. | ||
|
||
|
|
||
| Finally, we discuss what to do when the assumptions of KMeans do not hold, such | ||
| as using HDBSCAN for non-convex clusters, and show how KMeans can still be | ||
| useful as a feature engineering step in a supervised learning pipeline, by using | ||
|
||
| distances to centroids. | ||
|
|
||
|
|
||
| ## Before getting started | ||
|
|
||
| <!-- Give the required skills for the module --> | ||
|
|
||
| The technical skills required to complete this module are: | ||
|
|
||
| - skills acquired during the "The Predictive Modeling Pipeline" module with | ||
| basic usage of scikit-learn; | ||
| - skills acquired during the "Selecting The Best Model" module, mainly the | ||
| concepts of validation curves and stability. | ||
|
|
||
| <!-- Point to resources to learning these skills --> | ||
|
|
||
| ## Objectives and time schedule | ||
|
|
||
| <!-- Give the learning objectives --> | ||
|
|
||
| The objectives of this module are the following: | ||
|
|
||
| - apply k-means clustering and assess its behavior across different settings | ||
| - evaluate cluster quality using unsupervised metrics such as silhouette score | ||
| and WCSS (also known as inertia) | ||
| - interpret and compute supervised clustering metrics (e.g., AMI, ARI, | ||
| V-measure) when ground truth labels are available | ||
| - understand the limitations of k-means and identify cases where its assumptions | ||
| (e.g., convex, isotropic clusters) do not hold | ||
| - use HDBSCAN as an alternative clustering method suited for irregular or | ||
| non-convex cluster shapes | ||
| - integrate k-means into a supervised learning pipeline by using distances to | ||
| centroids as features. | ||
|
||
|
|
||
| <!-- Give the investment in time --> | ||
|
|
||
| The estimated time to go through this module is about 6 hours. | ||
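As a rough illustration of the first two objectives listed in this overview, here is a minimal sketch of scoring k-means with WCSS (inertia) and the silhouette score. It is not taken from the module notebooks: the synthetic `make_blobs` data, the range of `n_clusters` values and the `scores` dictionary are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic blobs stand in for a real dataset (assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Scale features so that no single feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for n_clusters in range(2, 7):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X_scaled)
    scores[n_clusters] = {
        "wcss": kmeans.inertia_,  # within-cluster sum of squares (inertia)
        "silhouette": silhouette_score(X_scaled, labels),
    }

for n_clusters, metrics in scores.items():
    print(
        f"k={n_clusters}: WCSS={metrics['wcss']:.1f}, "
        f"silhouette={metrics['silhouette']:.3f}"
    )
```

Neither metric needs ground truth labels, which is why both can be used to compare settings on unlabeled data.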
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # Main take-away | ||
|
|
||
| ## Wrap-up | ||
|
|
||
| <!-- Quick wrap-up for the module --> | ||
|
|
||
| In this module, we presented the framework used in unsupervised learning with | ||
| clustering, focusing on KMeans and how to evaluate its results using both | ||
| internal and supervised metrics. | ||
|
||
|
|
||
| We explored the concept of cluster stability, addressed the limitations of | ||
| KMeans when clusters are not convex, and introduced HDBSCAN as an alternative. | ||
|
||
|
|
||
| Finally, we showed how clustering can be integrated into supervised pipelines | ||
| through feature engineering. | ||
|
||
|
|
||
| ## To go further | ||
|
|
||
| <!-- Some extra links of content to go further --> | ||
|
|
||
| You can refer to the following scikit-learn examples, which are related to | ||
| the concepts covered in this module: | ||
|
|
||
| - [Adjustment for chance in clustering performance evaluation](https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html) | ||
| - [Demonstration of k-means assumptions](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html) | ||
| - [Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| # ✅ Quiz M4.01 | ||
|
|
||
| ```{admonition} Question | ||
| Imagine you work for a music streaming platform that hosts a vast library of | ||
| songs, playlists, and podcasts. You have access to detailed listening data from | ||
| millions of users. For each user, you know their most-listened genres, the | ||
| devices they use, their average session length, and how often they explore new | ||
| content. | ||
|
|
||
| You want to segment users based on their listening patterns to improve | ||
| personalized recommendations, without relying on rigid, predefined labels like | ||
| "pop fan" or "casual listener" which may fail to capture the complexity of | ||
| their behavior. | ||
|
|
||
| What kind of problem are you dealing with? | ||
|
|
||
| - a) a supervised task | ||
| - b) an unsupervised task | ||
| - c) a classification task | ||
| - d) a clustering task | ||
|
|
||
| _Select all answers that apply_ | ||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| The plots below show the cluster labels found by K-means with 3 clusters, | ||
| differing only in the scaling step. Based on this, which conclusion can be drawn? | ||
|
|
||
|  | ||
|  | ||
|
|
||
| - a) without scaling, cluster assignment is dominated by the feature in the vertical axis | ||
| - b) without scaling, cluster assignment is dominated by the feature in the horizontal axis | ||
| - c) without scaling, both features contribute equally to cluster assignment | ||
|
|
||
| _Select a single answer_ | ||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| Which of the following statements correctly describe factors that affect the | ||
| stability of K-Means clustering across different resamplings of the data? | ||
|
|
||
| - a) K-Means can produce different results on resampled datasets due to | ||
| sensitivity to initialization | ||
| - b) If data is unevenly distributed, the stability improves when increasing the | ||
| parameter `n_init` in the "k-means++" initialization | ||
| - c) Stability across resamplings of the data is guaranteed after feature | ||
| scaling | ||
| - d) Increasing the number of clusters typically reduces the variability of | ||
| results across resamples | ||
|
|
||
| _Select all answers that apply_ | ||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| Which of the following statements correctly describe how WCSS (within-cluster | ||
| sum of squares, or inertia) behaves in K-means clustering? | ||
|
|
||
| - a) For a fixed number of clusters, WCSS is lower when clusters are compact | ||
| - b) For a fixed number of clusters, WCSS is lower for larger clusters | ||
| - c) For a fixed number of clusters, lower WCSS implies lower computational cost | ||
| during training | ||
| - d) WCSS always decreases as the number of clusters increases | ||
|
|
||
| _Select all answers that apply_ | ||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| Which of the following statements correctly describe differences between | ||
| supervised and unsupervised clustering metrics? | ||
|
|
||
| - a) Supervised clustering metrics such as ARI and AMI require access to ground | ||
| truth labels to evaluate clustering performance | ||
| - b) WCSS and the silhouette score evaluate internal cluster structure without | ||
| needing reference labels | ||
| - c) V-measure is zero when labels are assigned completely at random | ||
| - d) Supervised clustering metrics are not useful if the number of clusters does | ||
| not match the number of predefined classes | ||
|
|
||
| _Select all answers that apply_ | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| # ✅ Quiz M4.02 | ||
|
|
||
| ```{admonition} Question | ||
| If we increase `min_cluster_size` in HDBSCAN, what happens to the number of | ||
| points labeled as noise? | ||
|
|
||
| - a) it decreases | ||
| - b) it increases | ||
| - c) it stays the same | ||
| - d) HDBSCAN fails to converge | ||
|
|
||
| _Select a single answer_ | ||
|
|
||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| What happens to k-means centroids in the presence of outliers? | ||
|
|
||
| - a) they move towards the outliers | ||
| - b) they are not sensitive to outliers | ||
| - c) if a centroid is initialized on an outlier, it may remain isolated in | ||
| subsequent iterations | ||
|
|
||
| _Select all answers that apply_ | ||
|
|
||
| ``` | ||
|
|
||
| +++ | ||
|
|
||
| ```{admonition} Question | ||
| A `KMeans` instance with `n_clusters=10` is used to transform the latitude and | ||
| longitude in a supervised learning pipeline. Provided the original dataset has | ||
| `n_features` features, including those two, how many features are passed to | ||
| the final estimator of the pipeline? | ||
|
|
||
| - a) `n_features` + 10 | ||
| - b) `n_features` + 8 | ||
| - c) `n_features` - 2 | ||
| - d) `n_features` | ||
|
|
||
| _Select a single answer_ | ||
|
|
||
| ``` |