scikit-learn-contrib
diff --git a/‎notebooks/demo_clustering.ipynb‎
Lines changed: 240 additions & 0 deletions b/‎notebooks/demo_clustering.ipynb‎
Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Skope-Rules Demo: application to cluster description"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook shows a demo of skope-rules applied to identify segments after a clustering analysis. The dataset is available here: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset/data. It describes football players and their performance attributes.\n",
+    "\n",
+    "This notebook performs a hierarchical clustering with 4 segments. Skope-rules is used to interpret each segments by a 1(segment)-vs-all approach.\n",
+    "\n",
+    "The notebook is structured into 4 parts:\n",
+    "1. Imports\n",
+    "2. Data preparation\n",
+    "3. Clustering of data\n",
+    "4. Interpretation of clusters"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/jms/SquareNetFishing/envclean/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (23,35) have mixed types. Specify dtype option on import or set low_memory=False.\n",
+      "  interactivity=interactivity, compiler=compiler, result=result)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Import skope-rules\n",
+    "from skrules import SkopeRules\n",
+    "\n",
+    "# Import librairies\n",
+    "import pandas as pd\n",
+    "from sklearn.cluster import AgglomerativeClustering\n",
+    "import warnings\n",
+    "\n",
+    "# Import Titanic data\n",
+    "data = pd.read_csv('../data/CompleteDataset.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Data preparation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = data.query(\"Overall>=85\") # Select players with an overall attribute larger than 85/100.\n",
+    "\n",
+    "column_to_keep = ['Name', 'Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control',\n",
+    "       'Composure', 'Crossing', 'Curve', 'Dribbling', 'Finishing',\n",
+    "       'Free kick accuracy', 'GK diving', 'GK handling', 'GK kicking',\n",
+    "       'GK positioning', 'GK reflexes', 'Heading accuracy', 'Preferred Positions']\n",
+    "data = data[column_to_keep] # Keep only performance attributes and names.\n",
+    "\n",
+    "data.columns = [x.replace(' ', '_') for x in data.columns] # Replace white spaces in the column names\n",
+    "\n",
+    "feature_names = data.drop(['Name', 'Preferred_Positions'], axis=1).columns.tolist()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Clustering of data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "clust = AgglomerativeClustering(n_clusters=4) #with euclidian distance and ward linkage\n",
+    "\n",
+    "data['cluster'] = clust.fit_predict(data.drop(['Name', 'Preferred_Positions'], axis=1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Interpretation of clusters"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The 4 clusters obtained have to be interpreted. It can be done with a graphical analysis (parallel coordinates, mean comparison, reasoning with examples of players). This notebook presents an approach which consists in looking for a way to separate a cluster from the rest of population (and to repeat the process for each cluster). \n",
+    "\n",
+    "With this 1-vs-all approach, it becomes a supervized binary classification task. Skope-rules is very useful because a good interpretation of a cluster is based on a simple expression of the frontier which isolates the cluster. And that is skope-rules's scope!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Cluster 0:\n",
+      "[('Heading_accuracy > 58.5 and Free_kick_accuracy > 56.0 and Agility <= 81.5', (0.93548387096774188, 0.8529411764705882, 10))]\n",
+      "Cluster 1:\n",
+      "[('Agility > 81.5 and Aggression <= 76.5 and Heading_accuracy <= 84.0', (1.0, 0.77419354838709675, 2))]\n",
+      "Cluster 2:\n",
+      "[('Heading_accuracy > 82.5 and Curve <= 61.5', (1.0, 0.7857142857142857, 8)), ('Heading_accuracy > 82.5 and Free_kick_accuracy <= 56.5', (1.0, 0.7857142857142857, 2))]\n",
+      "Cluster 3:\n",
+      "[('GK_reflexes > 59.0', (1.0, 1.0, 2))]\n"
+     ]
+    }
+   ],
+   "source": [
+    "warnings.filterwarnings('ignore') #To deals with warning raised by max_samples=1 (see below).\n",
+    "#With max_samples=1, there is no Out-Of-Bag sample to evaluate performance (it is evaluated on all samples. \n",
+    "#As there are less than 100 samples and this is a clustering-oriented task, the risk of overfitting is not \n",
+    "#dramatic here.\n",
+    "\n",
+    "i_cluster = 0\n",
+    "for i_cluster in range(4):\n",
+    "    X_train = data.drop(['Name', 'Preferred_Positions', 'cluster'], axis=1)\n",
+    "    y_train = (data['cluster']==i_cluster)*1\n",
+    "    skope_rules_clf = SkopeRules(feature_names=feature_names, random_state=42, n_estimators=5,\n",
+    "                                   recall_min=0.5, precision_min=0.5,\n",
+    "                                   max_samples=1., max_depth=3)\n",
+    "    skope_rules_clf.fit(X_train, y_train)\n",
+    "    print('Cluster '+str(i_cluster)+':')\n",
+    "    #print(data.query('cluster=='+str(i_cluster))[['Name', 'Preferred_Positions']])\n",
+    "    print(skope_rules_clf.rules_)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In cluster 0, we find players with **good heading and free kick accuracy**, but which are **not the best agile players** (<= 81/100). This rule is a good description of cluster 0: it captures 85% of cluster 1, with a precision of 93% (7% of players described by the rule are not in cluster 0. The third term of the performance term (10) is the number of time that this rule was extracted from the trees built during skope-rules' fitting.\n",
+    "\n",
+    "In cluster 1, we find **very agile players** which are **not** the **most aggressive** and **accurate with their heads**. This rule is very precise but misses 23% of this cluster.\n",
+    "\n",
+    "In cluster 2, we find players **accurate with their heads** but with **less skills for dribbling** (Curve). Note that an alternative rule is proposed here, with similar performances.\n",
+    "\n",
+    "In cluster 3, we find players who have **good goal-keeper reflexes**. This rule perfectly defines the cluster (100% precision, 100% recall). This is the goal-keeper cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "5 players from cluster 0:\n",
+      "['M. Hamšík', 'Alex Sandro', 'Casemiro', 'K. Benzema', 'Z. Ibrahimović']\n",
+      "\n",
+      "5 players from cluster 1:\n",
+      "['H. Mkhitaryan', 'David Silva', 'F. Ribéry', 'J. Rodríguez', 'P. Dybala']\n",
+      "\n",
+      "5 players from cluster 2:\n",
+      "['Pepe', 'K. Glik', 'G. Chiellini', 'V. Kompany', 'Piqué']\n",
+      "\n",
+      "5 players from cluster 3:\n",
+      "['M. ter Stegen', 'D. Subašić', 'M. Neuer', 'K. Navas', 'H. Lloris']\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "for i_cluster in range(4):\n",
+    "    print('5 players from cluster '+str(i_cluster)+':')\n",
+    "    print(data.query(\"cluster==\"+str(i_cluster))['Name'].sample(5, random_state=42).tolist()) # Get 5 random players per cluster\n",
+    "    print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In brief, **cluster 0** tends to concentrate strikers and midfielders talented with their heads. **Cluster 1** tends to group other midfielders. **Cluster 2** focuses on defenders while goal-keepers are gathered in **cluster 3**!\n",
+    "\n",
+    "For a visual analysis (kind of parallel coordinates) of these clusters, you can check this Kaggle kernel: https://www.kaggle.com/malimo1024/clustering-top-players"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}