diff --git a/1-Introduction/01-defining-data-science/notebook.ipynb b/1-Introduction/01-defining-data-science/notebook.ipynb
index cf3988e85..75ca03058 100644
--- a/1-Introduction/01-defining-data-science/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/notebook.ipynb
@@ -1,419 +1,1020 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "source": [
- "# Challenge: Analyzing Text about Data Science\r\n",
- "\r\n",
- "In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code, you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data. \r\n",
- "\r\n",
- "## Goal\r\n",
- "\r\n",
- "In this lesson, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\r\n",
- "\r\n",
- "As a text, I will use the page on Data Science from Wikipedia:"
- ],
- "metadata": {}
- },
- {
- "cell_type": "markdown",
- "source": [],
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "source": [
- "url = 'https://en.wikipedia.org/wiki/Data_science'"
- ],
- "outputs": [],
- "metadata": {}
- },
- {
- "cell_type": "markdown",
- "source": [
- "## Step 1: Getting the Data\r\n",
- "\r\n",
- "First step in every data science process is getting the data. We will use `requests` library to do that:"
- ],
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "source": [
- "import requests\r\n",
- "\r\n",
- "text = requests.get(url).content.decode('utf-8')\r\n",
- "print(text[:1000])"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "\n",
- "\n",
- "
"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dTc8xEMzIUW0"
+ },
+ "source": [
+ "## Analyzing Real Data\n",
+ "\n",
+ "Mean and variance are very important when analyzing real-world data. Let's load the data about baseball players from [SOCR MLB Height/Weight Data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "df.boxplot(column='Height', by='Role', figsize=(10,8))\n",
- "plt.xticks(rotation='vertical')\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> **Note**: This diagram suggests, that on average, the heights of first basemen are higher than heights of second basemen. Later we will learn how we can test this hypothesis more formally, and how to demonstrate that our data is statistically significant to show that. \n",
- "\n",
- "Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values: "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "hIjGQv-0IUW6",
+ "outputId": "e26ca71a-2697-46a0-f599-986f71b961e0"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Mean = 73.6972920696325\n",
+ "Variance = 5.316798081118074\n",
+ "Standard Deviation = 2.3058183105175645\n"
+ ]
+ }
],
- "text/plain": [
- " Height Weight Count\n",
- "Role \n",
- "Catcher 72.723684 204.328947 76\n",
- "Designated_Hitter 74.222222 220.888889 18\n",
- "First_Baseman 74.000000 213.109091 55\n",
- "Outfielder 73.010309 199.113402 194\n",
- "Relief_Pitcher 74.374603 203.517460 315\n",
- "Second_Baseman 71.362069 184.344828 58\n",
- "Shortstop 71.903846 182.923077 52\n",
- "Starting_Pitcher 74.719457 205.163636 221\n",
- "Third_Baseman 73.044444 200.955556 45"
+ "source": [
+ "mean = df['Height'].mean()\n",
+ "var = df['Height'].var()\n",
+ "std = df['Height'].std()\n",
+ "print(f\"Mean = {mean}\\nVariance = {var}\\nStandard Deviation = {std}\")"
]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df.groupby('Role').agg({ 'Height' : 'mean', 'Weight' : 'mean', 'Age' : 'count'}).rename(columns={ 'Age' : 'Count'})"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's test the hypothesis that First Basemen are taller than Second Basemen. The simplest way to do this is to test the confidence intervals:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Conf=0.85, 1st basemen height: 73.62..74.38, 2nd basemen height: 71.04..71.69\n",
- "Conf=0.90, 1st basemen height: 73.56..74.44, 2nd basemen height: 70.99..71.73\n",
- "Conf=0.95, 1st basemen height: 73.47..74.53, 2nd basemen height: 70.92..71.81\n"
- ]
- }
- ],
- "source": [
- "for p in [0.85,0.9,0.95]:\n",
- " m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)\n",
- " m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)\n",
- " print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can see that the intervals do not overlap.\n",
- "\n",
- "A statistically more correct way to prove the hypothesis is to use a **Student t-test**:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "T-value = 7.65\n",
- "P-value: 9.137321189738925e-12\n"
- ]
- }
- ],
- "source": [
- "from scipy.stats import ttest_ind\n",
- "\n",
- "tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)\n",
- "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The two values returned by the `ttest_ind` function are:\n",
- "* p-value can be considered as the probability of two distributions having the same mean. In our case, it is very low, meaning that there is strong evidence supporting that first basemen are taller.\n",
- "* t-value is the intermediate value of normalized mean difference that is used in the t-test, and it is compared against a threshold value for a given confidence value."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Simulating a Normal Distribution with the Central Limit Theorem\n",
- "\n",
- "The pseudo-random generator in Python is designed to give us a uniform distribution. If we want to create a generator for normal distribution, we can use the central limit theorem. To get a normally distributed value we will just compute a mean of a uniform-generated sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_CtgdwXxIUW7"
+ },
+ "source": [
+ "In addition to the mean, it makes sense to look at the median and the quartiles. They can be visualized using a **box plot**:"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "def normal_random(sample_size=100):\n",
- " sample = [random.uniform(0,1) for _ in range(sample_size) ]\n",
- " return sum(sample)/sample_size\n",
- "\n",
- "sample = [normal_random() for _ in range(100)]\n",
- "plt.figure(figsize=(10,6))\n",
- "plt.hist(sample)\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Correlation and Evil Baseball Corp\n",
- "\n",
- "Correlation allows us to find relations between data sequences. In our toy example, let's pretend there is an evil baseball corporation that pays its players according to their height - the taller the player is, the more money he/she gets. Suppose there is a base salary of $1000, and an additional bonus from $0 to $100, depending on height. We will take the real players from MLB, and compute their imaginary salaries:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[(74, 1075.2469071629068), (74, 1075.2469071629068), (72, 1053.7477908306478), (72, 1053.7477908306478), (73, 1064.4973489967772), (69, 1021.4991163322591), (69, 1021.4991163322591), (71, 1042.9982326645181), (76, 1096.746023495166), (71, 1042.9982326645181)]\n"
- ]
- }
- ],
- "source": [
- "heights = df['Height']\n",
- "salaries = 1000+(heights-heights.min())/(heights.max()-heights.mean())*100\n",
- "print(list(zip(heights, salaries))[:10])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's now compute covariance and correlation of those sequences. `np.cov` will give us a so-called **covariance matrix**, which is an extension of covariance to multiple variables. The element $M_{ij}$ of the covariance matrix $M$ is a correlation between input variables $X_i$ and $X_j$, and diagonal values $M_{ii}$ is the variance of $X_{i}$. Similarly, `np.corrcoef` will give us the **correlation matrix**."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Covariance matrix:\n",
- "[[ 5.31679808 57.15323023]\n",
- " [ 57.15323023 614.37197275]]\n",
- "Covariance = 57.153230230544736\n",
- "Correlation = 1.0\n"
- ]
- }
- ],
- "source": [
- "print(f\"Covariance matrix:\\n{np.cov(heights, salaries)}\")\n",
- "print(f\"Covariance = {np.cov(heights, salaries)[0,1]}\")\n",
- "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A correlation equal to 1 means that there is a strong **linear relation** between two variables. We can visually see the linear relation by plotting one value against the other:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "plt.figure(figsize=(10,2))\n",
+ "plt.boxplot(df['Height'], vert=False, showmeans=True)\n",
+ "plt.grid(color='gray', linestyle='dotted')\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(figsize=(10,6))\n",
- "plt.scatter(heights,salaries)\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's see what happens if the relation is not linear. Suppose that our corporation decided to hide the obvious linear dependency between heights and salaries, and introduced some non-linearity into the formula, such as `sin`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Correlation = 0.9835304456670837\n"
- ]
- }
- ],
- "source": [
- "salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100\n",
- "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this case, the correlation is slightly smaller, but it is still quite high. Now, to make the relation even less obvious, we might want to add some extra randomness by adding some random variable to the salary. Let's see what happens:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Correlation = 0.9363097848296155\n"
- ]
- }
- ],
- "source": [
- "salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100+np.random.random(size=len(heights))*20-10\n",
- "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RqV-Mw1SIUW8"
+ },
+ "source": [
+ "We can also make box plots of subsets of our dataset, for example, grouped by player role."
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure(figsize=(10,6))\n",
- "plt.scatter(heights, salaries)\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> Can you guess why the dots line up into vertical lines like this?\n",
- "\n",
- "We have observed the correlation between an artificially engineered concept like salary and the observed variable *height*. Let's also see if the two observed variables, such as height and weight, correlate too:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([[ 1., nan],\n",
- " [nan, nan]])"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 806
+ },
+ "id": "TFRfrvyKIUW8",
+ "outputId": "3c8c28e4-1788-4df4-9468-834e2e51e6e4"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ "
"
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "df.boxplot(column='Height', by='Role', figsize=(10,8))\n",
+ "plt.xticks(rotation='vertical')\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "np.corrcoef(df['Height'],df['Weight'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Unfortunately, we did not get any results - only some strange `nan` values. This is due to the fact that some of the values in our series are undefined, represented as `nan`, which causes the result of the operation to be undefined as well. By looking at the matrix we can see that `Weight` is the problematic column, because self-correlation between `Height` values has been computed.\n",
- "\n",
- "> This example shows the importance of **data preparation** and **cleaning**. Without proper data we cannot compute anything.\n",
- "\n",
- "Let's use `fillna` method to fill the missing values, and compute the correlation: "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "array([[1. , 0.52959196],\n",
- " [0.52959196, 1. ]])"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6XHFyajmIUW8"
+ },
+ "source": [
+ "> **Note**: This diagram suggests that, on average, first basemen are taller than second basemen. Later we will learn how to test this hypothesis more formally, and how to show that the difference is statistically significant.\n",
+ "\n",
+ "Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values:"
]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "np.corrcoef(df['Height'],df['Weight'].fillna(method='pad'))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There is indeed a correlation, but not such a strong one as in our artificial example. Indeed, if we look at the scatter plot of one value against the other, the relation would be much less obvious:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- "
\n"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 17
+ }
+ ],
+ "source": [
+ "df.groupby('Role').agg({ 'Height' : 'mean', 'Weight' : 'mean', 'Age' : 'count'}).rename(columns={ 'Age' : 'Count'})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O3H--SzfIUXC"
+ },
+ "source": [
+ "Let's test the hypothesis that first basemen are taller than second basemen. The simplest way to do this is to compare the confidence intervals of the two mean heights:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "XsO80SsfIUXC",
+ "outputId": "424f53ff-729b-4f21-addf-ecca08d4fb04"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Conf=0.85, 1st basemen height: 73.62..74.38, 2nd basemen height: 71.04..71.69\n",
+ "Conf=0.90, 1st basemen height: 73.56..74.44, 2nd basemen height: 70.99..71.73\n",
+ "Conf=0.95, 1st basemen height: 73.47..74.53, 2nd basemen height: 70.92..71.81\n"
+ ]
+ }
+ ],
+ "source": [
+ "for p in [0.85,0.9,0.95]:\n",
+ " m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)\n",
+ " m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)\n",
+ " print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
+ ]
+ },
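+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a sketch of what the intervals above represent, assuming `mean_confidence_interval` (defined elsewhere in the notebook) uses the usual Student-t construction, the interval for a sample of size $n$ with mean $\\bar{x}$ and sample standard deviation $s$ at confidence $p$ is:\n",
+ "\n",
+ "$$\\bar{x} \\pm t_{(1+p)/2,\\,n-1} \\cdot \\frac{s}{\\sqrt{n}}$$\n",
+ "\n",
+ "Here $t_{(1+p)/2,\\,n-1}$ denotes the corresponding quantile of the t-distribution with $n-1$ degrees of freedom. A higher confidence $p$ gives a larger quantile, which is why the printed intervals widen from 0.85 to 0.95."
+ ]
+ },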
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "b8DU4ZUGIUXD"
+ },
+ "source": [
+ "We can see that the intervals do not overlap.\n",
+ "\n",
+ "A statistically more rigorous way to test the hypothesis is to use a **Student t-test**:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Qwra5Iz7IUXD",
+ "outputId": "2d5f359b-5707-4ce0-9c95-ef8d7ca8b791"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "T-value = 7.65\n",
+ "P-value: 9.137321189738959e-12\n"
+ ]
+ }
+ ],
+ "source": [
+ "from scipy.stats import ttest_ind\n",
+ "\n",
+ "tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)\n",
+ "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "brelF-6LIUXE"
+ },
+ "source": [
+ "The two values returned by the `ttest_ind` function are:\n",
+ "* p-value is the probability of observing a difference in means at least this large if the two distributions actually had the same mean. In our case it is very low, which is strong evidence that first basemen are taller.\n",
+ "* t-value is the normalized difference of the means used in the t-test; it is compared against a threshold value for a given confidence level."
+ ]
+ },
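+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because the call above passes `equal_var=False`, SciPy performs Welch's variant of the t-test, which does not assume equal variances. As a reminder of what the t-value measures, with sample means $\\bar{x}_1, \\bar{x}_2$, sample variances $s_1^2, s_2^2$ and sample sizes $n_1, n_2$ (notation introduced here for illustration):\n",
+ "\n",
+ "$$t = \\frac{\\bar{x}_1 - \\bar{x}_2}{\\sqrt{\\frac{s_1^2}{n_1} + \\frac{s_2^2}{n_2}}}$$\n",
+ "\n",
+ "The larger $|t|$ is, the less plausible it is that the two groups share the same mean, and the smaller the p-value."
+ ]
+ },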
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CtPClLpDIUXE"
+ },
+ "source": [
+ "## Simulating a Normal Distribution with the Central Limit Theorem\n",
+ "\n",
+ "The pseudo-random generator in Python is designed to give us a uniform distribution. If we want to create a generator for a normal distribution, we can use the central limit theorem: to get an approximately normally distributed value, we just compute the mean of a sample drawn from the uniform generator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 606
+ },
+ "id": "0vSxU_l1IUXF",
+ "outputId": "f4ae9bbd-95da-4f70-ca9f-740e8e382bbd"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "def normal_random(sample_size=100):\n",
+ " sample = [random.uniform(0,1) for _ in range(sample_size) ]\n",
+ " return sum(sample)/sample_size\n",
+ "\n",
+ "sample = [normal_random() for _ in range(100)]\n",
+ "plt.figure(figsize=(10,6))\n",
+ "plt.hist(sample)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
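+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Why this works, as a short sketch: a single uniform value $U$ on $[0,1]$ has mean $\\frac{1}{2}$ and variance $\\frac{1}{12}$. By the central limit theorem, the mean of $n$ independent such values is approximately normal:\n",
+ "\n",
+ "$$\\bar{U}_n \\approx \\mathcal{N}\\left(\\frac{1}{2}, \\frac{1}{12n}\\right)$$\n",
+ "\n",
+ "With the default `sample_size=100`, the standard deviation is $\\sqrt{1/1200} \\approx 0.029$, so almost all generated values should fall between roughly 0.4 and 0.6."
+ ]
+ },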
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rdUV9KxXIUXF"
+ },
+ "source": [
+ "## Correlation and Evil Baseball Corp\n",
+ "\n",
+ "Correlation allows us to find relations between data sequences. In our toy example, let's pretend there is an evil baseball corporation that pays its players according to their height: the taller the player, the more money they get. Suppose there is a base salary of $1000 plus a bonus that grows linearly with height (nominally $0 to $100; the formula below divides by `max - mean` rather than `max - min`, so the tallest players get somewhat more). We will take the real MLB players and compute their imaginary salaries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "oNlWGq_VIUXF",
+ "outputId": "4bc2c9a7-cf4b-4d0b-e87c-dfe7a7fd08e9"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "[(74, 1075.2469071629068), (74, 1075.2469071629068), (72, 1053.7477908306478), (72, 1053.7477908306478), (73, 1064.4973489967772), (69, 1021.4991163322591), (69, 1021.4991163322591), (71, 1042.9982326645181), (76, 1096.746023495166), (71, 1042.9982326645181)]\n"
+ ]
+ }
+ ],
+ "source": [
+ "heights = df['Height']\n",
+ "salaries = 1000+(heights-heights.min())/(heights.max()-heights.mean())*100\n",
+ "print(list(zip(heights, salaries))[:10])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "F5FZaXFAIUXG"
+ },
+ "source": [
+ "Let's now compute the covariance and correlation of those sequences. `np.cov` gives us the so-called **covariance matrix**, which extends covariance to multiple variables. The element $M_{ij}$ of the covariance matrix $M$ is the covariance between input variables $X_i$ and $X_j$, and each diagonal value $M_{ii}$ is the variance of $X_{i}$. Similarly, `np.corrcoef` gives us the **correlation matrix**."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Oux7B-KfIUXG",
+ "outputId": "f115362b-63d9-48b8-caee-f348c86bb40a"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Covariance matrix:\n",
+ "[[ 5.31679808 57.15323023]\n",
+ " [ 57.15323023 614.37197275]]\n",
+ "Covariance = 57.1532302305447\n",
+ "Correlation = 1.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"Covariance matrix:\\n{np.cov(heights, salaries)}\")\n",
+ "print(f\"Covariance = {np.cov(heights, salaries)[0,1]}\")\n",
+ "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
+ ]
+ },
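+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The correlation printed above can be recovered by hand from the covariance matrix, since the correlation coefficient is the covariance normalized by both standard deviations (values rounded from the output above):\n",
+ "\n",
+ "$$\\rho_{XY} = \\frac{\\mathrm{cov}(X,Y)}{\\sqrt{\\mathrm{var}(X)\\,\\mathrm{var}(Y)}} = \\frac{57.153}{\\sqrt{5.317 \\times 614.372}} \\approx 1.0$$"
+ ]
+ },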
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JV8ZTH1rIUXG"
+ },
+ "source": [
+ "A correlation equal to 1 means that there is a perfect **linear relation** between the two variables. We can see the linear relation visually by plotting one value against the other:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 607
+ },
+ "id": "Sz9DY39RIUXH",
+ "outputId": "ef9fbd38-1b0f-41d0-f7c6-d385fcffe94f"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "plt.figure(figsize=(10,6))\n",
+ "plt.scatter(heights,salaries)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "1n-bXVOYIUXH"
+ },
+ "source": [
+ "Let's see what happens if the relation is not linear. Suppose that our corporation decided to hide the obvious linear dependency between heights and salaries, and introduced some non-linearity into the formula, such as `sin`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "sdS7DPgcIUXH",
+ "outputId": "97f0efaf-ef7a-4b8d-95a9-b01d4c917c4f"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Correlation = 0.9835304456670827\n"
+ ]
+ }
+ ],
+ "source": [
+ "salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100\n",
+ "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "gtPLXZGzIUXI"
+ },
+ "source": [
+ "In this case, the correlation is slightly smaller, but it is still quite high. Now, to make the relation even less obvious, we might want to add some extra randomness by adding a random term to the salary. Let's see what happens:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "gYKdvnQYIUXI",
+ "outputId": "51359d22-9747-4bdd-e08a-3cf388ac5258"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Correlation = 0.9332592309527241\n"
+ ]
+ }
+ ],
+ "source": [
+ "salaries = 1000+np.sin((heights-heights.min())/(heights.max()-heights.mean()))*100+np.random.random(size=len(heights))*20-10\n",
+ "print(f\"Correlation = {np.corrcoef(heights, salaries)[0,1]}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 607
+ },
+ "id": "-ZYLE3UAIUXJ",
+ "outputId": "fb604472-ce5b-4f77-c407-6f69de3f9cc5"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "plt.figure(figsize=(10,6))\n",
+ "plt.scatter(heights, salaries)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "U_hRjJkMIUXJ"
+ },
+ "source": [
+ "> Can you guess why the dots line up into vertical lines like this?\n",
+ "\n",
+ "We have observed the correlation between an artificially engineered variable, salary, and the observed variable *height*. Let's also see whether two observed variables, such as height and weight, correlate:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "ba1ty3GoIUXK",
+ "outputId": "56ecb028-8d9a-4777-b028-e72fc2028498"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[ 1., nan],\n",
+ " [nan, nan]])"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 27
+ }
+ ],
+ "source": [
+ "np.corrcoef(df['Height'],df['Weight'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0o3CEkTHIUXL"
+ },
+ "source": [
+ "Unfortunately, we did not get a usable result, only `nan` values. This happens because some of the values in our series are undefined (represented as `nan`), which makes the result of the operation undefined as well. Looking at the matrix, we can see that `Weight` is the problematic column, because the self-correlation of `Height` has been computed correctly.\n",
+ "\n",
+ "> This example shows the importance of **data preparation** and **cleaning**. Without proper data we cannot compute anything.\n",
+ "\n",
+ "Let's use the `fillna` method to fill the missing values, and then compute the correlation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "C1wzEgHGIUXM",
+ "outputId": "f953f94a-e9d6-41a3-e53f-7c5a72e38f8c"
+ },
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "array([[1. , 0.52959196],\n",
+ " [0.52959196, 1. ]])"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 28
+ }
+ ],
+ "source": [
+ "np.corrcoef(df['Height'],df['Weight'].fillna(method='pad'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "59HPAOScIUXM"
+ },
+ "source": [
+ "There is indeed a correlation, but it is not as strong as in our artificial example. If we look at the scatter plot of one value against the other, the relation is much less obvious:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 607
+ },
+ "id": "LR27XJe9IUXM",
+ "outputId": "a1d4bb37-9c83-49b3-b299-d8b8bc365fd6"
+ },
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/plain": [
+ ""
+ ],
+ "image/png": "\n"
+ },
+ "metadata": {}
+ }
+ ],
+ "source": [
+ "plt.figure(figsize=(10,6))\n",
+ "plt.scatter(df['Height'],df['Weight'])\n",
+ "plt.xlabel('Height')\n",
+ "plt.ylabel('Weight')\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "O-i0GjqIIUXM"
+ },
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "In this notebook we have learnt how to perform basic operations on data to compute statistical functions. We now know how to use a sound apparatus of math and statistics to test hypotheses, and how to compute confidence intervals for arbitrary variables given a data sample."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Challenge\n"
+ ],
+ "metadata": {
+ "id": "z9YubMYAuAG7"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Use the sample code in the notebook to test the following hypotheses:\n",
+ "\n",
+ "1. First basemen are older than second basemen\n",
+ "2. First basemen are taller than third basemen\n",
+ "3. Shortstops are taller than second basemen\n"
+ ],
+ "metadata": {
+ "id": "q0cxF4-2uQg5"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### First basemen are older than second basemen"
+ ],
+ "metadata": {
+ "id": "eqa9jPuIuWrz"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's test the hypothesis that first basemen are older than second basemen. The simplest way to do this is to compare the confidence intervals:"
+ ],
+ "metadata": {
+ "id": "sDiN7_wpvPaZ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "for p in [0.85,0.9,0.95]:\n",
+ " m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Age']],p)\n",
+ " m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Age']],p)\n",
+ " print(f'Conf={p:.2f}, 1st basemen age: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen age: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
+ ],
+ "metadata": {
+ "id": "17SWW8n1uL-s",
+ "outputId": "5a4a6929-c3f0-4257-b7b4-b7257d7e773b",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 30,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Conf=0.85, 1st basemen age: 28.56..30.39, 2nd basemen age: 28.18..29.87\n",
+ "Conf=0.90, 1st basemen age: 28.42..30.53, 2nd basemen age: 28.06..29.99\n",
+ "Conf=0.95, 1st basemen age: 28.22..30.73, 2nd basemen age: 27.87..30.18\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can see that the intervals overlap, so the confidence intervals alone do not support the hypothesis.\n",
+ "\n",
+ "A statistically more rigorous way to test the hypothesis is to use a **Student t-test**:\n"
+ ],
+ "metadata": {
+ "id": "_kjnTN57voE1"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Age']], df.loc[df['Role']=='Second_Baseman',['Age']],equal_var=False)\n",
+ "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
+ ],
+ "metadata": {
+ "id": "MHUMgyBGv1Wo",
+ "outputId": "09f8e24e-6f67-4eaf-a6c1-f58260f31064",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 31,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "T-value = 0.53\n",
+ "P-value: 0.6005513264471434\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "* The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. It is not the probability that the two groups have the same mean. In our case, the null hypothesis is that there is no difference in mean age between first basemen and second basemen. A p-value of 0.60 means that, if the null hypothesis were true, there would be a 60% probability of obtaining a result at least as extreme as the one observed. In general, if the p-value is less than a predefined significance level (for example, 0.05), the null hypothesis is rejected and the difference between the groups is considered statistically significant. Since our p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis, so we cannot conclude that there is a significant difference in ages between the two groups of players.\n"
+ ],
+ "metadata": {
+ "id": "-Qx9SQL7wETA"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "* The t-value is a statistic that measures the size of the difference between two groups in relation to the variation in the sample data. In other words, the t-value is the calculated difference expressed in units of standard error. The larger the magnitude of the t-value, the greater the evidence against the null hypothesis.\n",
+ "\n",
+ " In our case, the t-value of 0.53 indicates that the difference between the two groups is small in relation to the variation in the sample data. Such a small t-value means there is not enough evidence to reject the null hypothesis and conclude that there is a significant difference between the two groups."
+ ],
+ "metadata": {
+ "id": "fVtApZXSy7-k"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### First basemen are taller than third basemen"
+ ],
+ "metadata": {
+ "id": "5m3HHZvDzcch"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's test the hypothesis that first basemen are taller than third basemen. The simplest way to do this is to compare the confidence intervals:"
+ ],
+ "metadata": {
+ "id": "THSfoe-r0BAn"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "for p in [0.85,0.9,0.95]:\n",
+ " m1, h1 = mean_confidence_interval(df.loc[df['Role']=='First_Baseman',['Height']],p)\n",
+ " m3, h3 = mean_confidence_interval(df.loc[df['Role']=='Third_Baseman',['Height']],p)\n",
+ " print(f'Conf={p:.2f}, 1st basemen height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 3rd basemen height: {m3-h3[0]:.2f}..{m3+h3[0]:.2f}')"
+ ],
+ "metadata": {
+ "id": "HtLzYWJ5v3KO",
+ "outputId": "2b93373b-5669-4f67-ccf5-58fc8e611b8c",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 32,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Conf=0.85, 1st basemen height: 73.62..74.38, 3rd basemen height: 72.58..73.51\n",
+ "Conf=0.90, 1st basemen height: 73.56..74.44, 3rd basemen height: 72.51..73.58\n",
+ "Conf=0.95, 1st basemen height: 73.47..74.53, 3rd basemen height: 72.40..73.68\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can see that the intervals do not overlap, which supports the hypothesis.\n",
+ "\n",
+ "Let's confirm this with a **Student t-test**:"
+ ],
+ "metadata": {
+ "id": "WeSVWGn10FDZ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "tval, pval = ttest_ind(df.loc[df['Role']=='First_Baseman',['Height']], df.loc[df['Role']=='Third_Baseman',['Height']],equal_var=False)\n",
+ "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
+ ],
+ "metadata": {
+ "id": "j__x6Szdz67t",
+ "outputId": "60441609-435f-40d5-bc1c-f446935f39ae",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 37,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "T-value = 2.32\n",
+ "P-value: 0.02285634157510527\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "* A t-value of 2.32 indicates that there is a moderate difference in heights between the two groups in relation to the variation in the sample data.\n",
+ "* A p-value of 0.023 indicates that, assuming the null hypothesis is true (i.e., that there is no difference in heights between the two groups), there is a 2.3% probability of obtaining a result at least as extreme as the one observed. Since this is less than the 0.05 significance level, we reject the null hypothesis and conclude that the difference in heights is statistically significant."
+ ],
+ "metadata": {
+ "id": "RY5vm27t1IaE"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Shortstops are taller than second basemen"
+ ],
+ "metadata": {
+ "id": "DKbjAJeX1kIu"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
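+ "Let's test the hypothesis that shortstops are taller than second basemen. As before, we start by comparing the confidence intervals:"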
+ ],
+ "metadata": {
+ "id": "d-Vwzkzw1p0W"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "for p in [0.85,0.9,0.95]:\n",
+ " m1, h1 = mean_confidence_interval(df.loc[df['Role']=='Shortstop',['Height']],p)\n",
+ " m2, h2 = mean_confidence_interval(df.loc[df['Role']=='Second_Baseman',['Height']],p)\n",
+ " print(f'Conf={p:.2f}, Shortstop height: {m1-h1[0]:.2f}..{m1+h1[0]:.2f}, 2nd basemen height: {m2-h2[0]:.2f}..{m2+h2[0]:.2f}')"
+ ],
+ "metadata": {
+ "id": "CtimHO1V0Zu1",
+ "outputId": "abdb0d0c-5460-4e1e-b970-003f6ce30ad1",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 36,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Conf=0.85, Shortstop height: 71.54..72.27, 2nd basemen height: 71.04..71.69\n",
+ "Conf=0.90, Shortstop height: 71.49..72.32, 2nd basemen height: 70.99..71.73\n",
+ "Conf=0.95, Shortstop height: 71.40..72.40, 2nd basemen height: 70.92..71.81\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can see that the intervals overlap, so the confidence intervals alone do not support the hypothesis.\n",
+ "\n",
+ "Let's check this with a **Student t-test**:"
+ ],
+ "metadata": {
+ "id": "7IiPnuFR2NRW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "tval, pval = ttest_ind(df.loc[df['Role']=='Shortstop',['Height']], df.loc[df['Role']=='Second_Baseman',['Height']],equal_var=False)\n",
+ "print(f\"T-value = {tval[0]:.2f}\\nP-value: {pval[0]}\")"
+ ],
+ "metadata": {
+ "id": "xWDTpnDI127B",
+ "outputId": "ce861612-fb9b-4f23-9bbd-a43cae0657a0",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": 38,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "T-value = 1.62\n",
+ "P-value: 0.10763413630751067\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "* A t-value of 1.62 indicates that there is a small difference in heights between the two groups in relation to the variation in the sample data.\n",
+ "* A p-value of about 0.108 indicates that, assuming the null hypothesis is true (i.e., that there is no difference in heights between the two groups), there is a 10.8% probability of obtaining a result at least as extreme as the one observed. Since this is greater than the 0.05 significance level, there is not enough evidence to reject the null hypothesis, so we cannot conclude that shortstops are taller than second basemen."
+ ],
+ "metadata": {
+ "id": "NE6XLRMX23qY"
+ }
+ }
+ ],
+ "metadata": {
+ "interpreter": {
+ "hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.12"
+ },
+ "colab": {
+ "provenance": [],
+ "toc_visible": true,
+ "include_colab_link": true
}
- ],
- "source": [
- "plt.figure(figsize=(10,6))\n",
- "plt.scatter(df['Height'],df['Weight'])\n",
- "plt.xlabel('Height')\n",
- "plt.ylabel('Weight')\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Conclusion\n",
- "\n",
- "In this notebook we have learnt how to perform basic operations on data to compute statistical functions. We now know how to use a sound apparatus of math and statistics in order to prove some hypotheses, and how to compute confidence intervals for arbitrary variables given a data sample. "
- ]
- }
- ],
- "metadata": {
- "interpreter": {
- "hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
},
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.12"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file