|
52 | 52 | }, |
53 | 53 | { |
54 | 54 | "cell_type": "code", |
55 | | - "execution_count": 7, |
| 55 | + "execution_count": null, |
56 | 56 | "id": "ac3e8958-5d9c-4e80-9a6f-fd343a3d4dd5", |
57 | 57 | "metadata": { |
58 | 58 | "tags": [] |
|
89 | 89 | "id": "2ba80b72-4efc-4369-9acc-525613671e7b", |
90 | 90 | "metadata": {}, |
91 | 91 | "source": [ |
92 | | - "On Avocado dataset (how original). If you cloned git repo, is in /data, else go Kaggle" |
| 92 | + "Predict average price, avocado dataset (how original). If you ggit cloned repo, is in /data, else go Kaggle" |
93 | 93 | ] |
94 | 94 | }, |
95 | 95 | { |
|
108 | 108 | }, |
109 | 109 | "source": [ |
110 | 110 | "*Quick desc / scope of dataset :* \n", |
111 | | - "No EDA, this exercise have been made a million times\n", |
| 111 | + "No EDA, this exercise have been made a million times \n", |
112 | 112 | "Years 2015 to 2018 \n", |
113 | 113 | "Two avocado types : organic or conventional \n", |
114 | 114 | "Region = region of consumption \n", |
|
117 | 117 | }, |
118 | 118 | { |
119 | 119 | "cell_type": "code", |
120 | | - "execution_count": 8, |
| 120 | + "execution_count": null, |
121 | 121 | "id": "888a85f7-5e40-4e90-8a35-3cb1435d1460", |
122 | 122 | "metadata": { |
123 | 123 | "tags": [] |
124 | 124 | }, |
125 | | - "outputs": [ |
126 | | - { |
127 | | - "name": "stdout", |
128 | | - "output_type": "stream", |
129 | | - "text": [ |
130 | | - "root\n", |
131 | | - " |-- _c0: integer (nullable = true)\n", |
132 | | - " |-- Date: timestamp (nullable = true)\n", |
133 | | - " |-- AveragePrice: double (nullable = true)\n", |
134 | | - " |-- Total Volume: double (nullable = true)\n", |
135 | | - " |-- 4046: double (nullable = true)\n", |
136 | | - " |-- 4225: double (nullable = true)\n", |
137 | | - " |-- 4770: double (nullable = true)\n", |
138 | | - " |-- Total Bags: double (nullable = true)\n", |
139 | | - " |-- Small Bags: double (nullable = true)\n", |
140 | | - " |-- Large Bags: double (nullable = true)\n", |
141 | | - " |-- XLarge Bags: double (nullable = true)\n", |
142 | | - " |-- type: string (nullable = true)\n", |
143 | | - " |-- year: integer (nullable = true)\n", |
144 | | - " |-- region: string (nullable = true)\n", |
145 | | - "\n", |
146 | | - "+---+-------------------+------------+------------+-------+---------+-----+----------+----------+----------+-----------+------------+----+------+\n", |
147 | | - "|_c0| Date|AveragePrice|Total Volume| 4046| 4225| 4770|Total Bags|Small Bags|Large Bags|XLarge Bags| type|year|region|\n", |
148 | | - "+---+-------------------+------------+------------+-------+---------+-----+----------+----------+----------+-----------+------------+----+------+\n", |
149 | | - "| 0|2015-12-27 00:00:00| 1.33| 64236.62|1036.74| 54454.85|48.16| 8696.87| 8603.62| 93.25| 0.0|conventional|2015|Albany|\n", |
150 | | - "| 1|2015-12-20 00:00:00| 1.35| 54876.98| 674.28| 44638.81|58.33| 9505.56| 9408.07| 97.49| 0.0|conventional|2015|Albany|\n", |
151 | | - "| 2|2015-12-13 00:00:00| 0.93| 118220.22| 794.7|109149.67|130.5| 8145.35| 8042.21| 103.14| 0.0|conventional|2015|Albany|\n", |
152 | | - "| 3|2015-12-06 00:00:00| 1.08| 78992.15| 1132.0| 71976.41|72.58| 5811.16| 5677.4| 133.76| 0.0|conventional|2015|Albany|\n", |
153 | | - "+---+-------------------+------------+------------+-------+---------+-----+----------+----------+----------+-----------+------------+----+------+\n", |
154 | | - "only showing top 4 rows\n", |
155 | | - "\n" |
156 | | - ] |
157 | | - } |
158 | | - ], |
| 125 | + "outputs": [], |
159 | 126 | "source": [ |
160 | 127 | "# Cache table/dataframe for re-usable table with .cache()\n", |
161 | 128 | "# caching operation takes place only when a Spark action (count, show, take or write) is also performed on the same dataframe\n", |
|
181 | 148 | }, |
182 | 149 | { |
183 | 150 | "cell_type": "code", |
184 | | - "execution_count": 9, |
| 151 | + "execution_count": null, |
185 | 152 | "id": "e0068bc2-270c-4e43-beeb-082b404ce297", |
186 | 153 | "metadata": { |
187 | 154 | "tags": [] |
|
201 | 168 | "id": "b840a5b1-8bd7-4c73-a8c9-133e4983e8dd", |
202 | 169 | "metadata": {}, |
203 | 170 | "source": [ |
204 | | - "- Steps differs a bit from sklearn. Search for 'transformers' and 'estimators'\n", |
| 171 | + "- Steps differs a bit from sklearn. Search for Spark 'transformers' and 'estimators'\n", |
205 | 172 | "- No EDA, has been done a million times on this dataset. \n", |
206 | 173 | "- Format data \n", |
207 | | - "-Feature creation from 'Date' : yy and mm \n", |
208 | | - "-Drop columns : Total Bags, Total Volume (strong corr with respective subcategories) ; could also be done in pipeline tho ?\n", |
209 | | - "- Pipeline (encode etc...) \n", |
210 | | - "-One hot encoding categorical 'region' (before that, use StringIndexer) \n", |
211 | | - "-Drop transformed columns: Date, region. Note : unlike scikit-learn col transf, pyspark adds new col when transforming \n", |
212 | | - "- Consolidate all remaining features in a single vector using VectorAssembler\n", |
213 | | - "- Scale numerical features using StandardScaler <- would be earlier in a sklearn pipeline\n", |
214 | | - "- Predict" |
| 174 | + "-Feature creation from 'Date' & 'Year' : yy and mm \n", |
| 175 | + "-Optional : Drop columns : Total Bags, Total Volume (strong corr with respective subcategories) \n", |
| 176 | + "- Build Pipeline (encode etc...) \n", |
| 177 | + "-StringIndexer to convert categorical in caetgory indices \n", |
| 178 | + "-One hot encoding categorical 'region' \n", |
| 179 | + "-VectorAssembler, used encoded features into a single vector \n", |
| 180 | + "-StandardScaler on features vector <- would be earlier in sklearn pipeline \n", |
| 181 | + "-define regressor (here, randomForest) \n", |
| 182 | + "-build Pipeline()\n", |
| 183 | + "- Simple model, no cv/search param" |
215 | 184 | ] |
216 | 185 | }, |
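A hedged sketch of the pipeline the steps above describe, assuming Spark 3.x; the column names (`type_idx`, `region_ohe`, the numeric list, etc.) are illustrative placeholders, not necessarily the notebook's actual identifiers:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor

# StringIndexer turns categorical columns into category indices
indexer = StringIndexer(inputCols=["type", "region"], outputCols=["type_idx", "region_idx"])
# one-hot encode the indexed categories
encoder = OneHotEncoder(inputCols=["type_idx", "region_idx"], outputCols=["type_ohe", "region_ohe"])

# numeric columns assumed from the schema shown earlier
numeric = ["Small Bags", "Large Bags", "XLarge Bags", "Year Index", "Month"]
assembler = VectorAssembler(inputCols=numeric + ["type_ohe", "region_ohe"], outputCol="features_raw")

# scaling applies to the assembled vector, later than it would in sklearn
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

rf = RandomForestRegressor(featuresCol="features", labelCol="AveragePrice")
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, rf])
```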
217 | 186 | { |
|
226 | 195 | }, |
227 | 196 | { |
228 | 197 | "cell_type": "code", |
229 | | - "execution_count": 10, |
| 198 | + "execution_count": null, |
230 | 199 | "id": "ea5b4865-062b-491a-bf10-1242d46d358c", |
231 | 200 | "metadata": {}, |
232 | | - "outputs": [ |
233 | | - { |
234 | | - "name": "stdout", |
235 | | - "output_type": "stream", |
236 | | - "text": [ |
237 | | - "+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n", |
238 | | - "|AveragePrice|Medium Size|Large Size|XLarge Size|Small Bags|Large Bags|XLarge Bags| type|year|region|Year Index|Month|\n", |
239 | | - "+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n", |
240 | | - "| 1.33| 1036.74| 54454.85| 48.16| 8603.62| 93.25| 0.0|conventional|2015|Albany| 15| 12|\n", |
241 | | - "| 1.35| 674.28| 44638.81| 58.33| 9408.07| 97.49| 0.0|conventional|2015|Albany| 15| 12|\n", |
242 | | - "| 0.93| 794.7| 109149.67| 130.5| 8042.21| 103.14| 0.0|conventional|2015|Albany| 15| 12|\n", |
243 | | - "| 1.08| 1132.0| 71976.41| 72.58| 5677.4| 133.76| 0.0|conventional|2015|Albany| 15| 12|\n", |
244 | | - "+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n", |
245 | | - "only showing top 4 rows\n", |
246 | | - "\n" |
247 | | - ] |
248 | | - } |
249 | | - ], |
| 201 | + "outputs": [], |
250 | 202 | "source": [ |
251 | 203 | "# convert 'year' yyyy to yy (yyyy - 2000, since we have 2015-2018 values)\n", |
252 | 204 | "df = df.withColumn('Year Index', col('Year') - 2000)\n", |
|
276 | 228 | }, |
277 | 229 | { |
278 | 230 | "cell_type": "code", |
279 | | - "execution_count": 14, |
| 231 | + "execution_count": null, |
280 | 232 | "id": "382272ea-07aa-43a4-af0f-681b332af34d", |
281 | 233 | "metadata": {}, |
282 | 234 | "outputs": [], |
|
330 | 282 | "id": "c3332499-66a1-4f79-be00-bcefcbda212a", |
331 | 283 | "metadata": {}, |
332 | 284 | "source": [ |
333 | | - "Crude attempt, no cv, some default rf parameters. \n", |
| 285 | + "Crude attempt, no cv, some arbitrary randomForest parameters. \n", |
334 | 286 | "For parameters tuning, look up for pyspark.ml.tuning / CrossValidator, ParamGridBuilder. Not used here" |
335 | 287 | ] |
336 | 288 | }, |
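Since CrossValidator / ParamGridBuilder are pointed at but not used, a minimal sketch of what the tuning could look like, assuming the `pipeline`, `rf`, and `train` names from the surrounding cells; the grid values are arbitrary:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator = RegressionEvaluator(predictionCol='prediction',
                                labelCol='AveragePrice', metricName='rmse')

# small, arbitrary grid over two RandomForest hyperparameters
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [20, 50])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)  # best model picked by rmse across the folds
```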
337 | 289 | { |
338 | 290 | "cell_type": "code", |
339 | | - "execution_count": 18, |
| 291 | + "execution_count": null, |
340 | 292 | "id": "ae2ebec7-8379-45bd-b375-faac5c64824c", |
341 | 293 | "metadata": { |
342 | 294 | "tags": [] |
343 | 295 | }, |
344 | | - "outputs": [ |
345 | | - { |
346 | | - "data": { |
347 | | - "text/plain": [ |
348 | | - "0.1975694758480664" |
349 | | - ] |
350 | | - }, |
351 | | - "execution_count": 18, |
352 | | - "metadata": {}, |
353 | | - "output_type": "execute_result" |
354 | | - } |
355 | | - ], |
| 296 | + "outputs": [], |
356 | 297 | "source": [ |
357 | 298 | "from pyspark.ml.evaluation import RegressionEvaluator\n", |
358 | 299 | "\n", |
|
365 | 306 | "\n", |
366 | 307 | "# apply the model to the test set\n", |
367 | 308 | "prediction = model.transform(test)\n", |
368 | | - "eval = RegressionEvaluator(predictionCol='prediction',\n", |
| 309 | + "eval_ = RegressionEvaluator(predictionCol='prediction',\n", |
369 | 310 | " labelCol='AveragePrice', metricName='rmse')\n", |
370 | 311 | "\n", |
371 | | - "eval.evaluate(prediction)" |
| 312 | + "eval_.evaluate(prediction)" |
372 | 313 | ] |
373 | 314 | }, |
374 | 315 | { |
375 | 316 | "cell_type": "markdown", |
376 | 317 | "id": "5a769698-04bc-4eda-9edc-63a4bfd11d25", |
377 | 318 | "metadata": {}, |
378 | 319 | "source": [ |
379 | | - "For reference, original article, using Linear regression + cv : rmse of .28" |
| 320 | + "For reference, original article, using Linear regression + cv/gridSearch : rmse of .28" |
380 | 321 | ] |
381 | 322 | } |
382 | 323 | ], |
|