Commit 55e7f1f

complete notebook example w/ spark.ml pipeline
1 parent d4578a9 commit 55e7f1f

File tree

1 file changed: +212 -24 lines changed

spark-cluster.ipynb

Lines changed: 212 additions & 24 deletions
@@ -5,7 +5,7 @@
 "id": "707cc39c-31f3-408e-a61f-8fd612627910",
 "metadata": {},
 "source": [
-"## Spark cluster (standalone) - Prediction notebook"
+"## Spark cluster (standalone) - Prediction with a Pipeline notebook"
 ]
 },
 {
@@ -15,13 +15,20 @@
 "tags": []
 },
 "source": [
-"> Dockerized env : [JupyterLab server => Spark (master <-> 1 worker) ] \n",
-"`docker-compose.yml` was (slightly) adapted from this [article](https://towardsdatascience.com/first-steps-in-machine-learning-with-apache-spark-672fe31799a3) \n",
-"\n",
-"> Original notebook is heavily modified : \n",
+"> Dockerized env : [JupyterLab server => Spark (master <-> 1 worker) ] "
+]
+},
+{
+"cell_type": "markdown",
+"id": "ead98478-3591-4f6f-8d2d-5b65bbe18d30",
+"metadata": {
+"tags": []
+},
+"source": [
+"`docker-compose.yml` was (slightly) adapted from this [article](https://towardsdatascience.com/first-steps-in-machine-learning-with-apache-spark-672fe31799a3), whereas the original notebook was heavily modified: \n",
 "-random forest regressor instead of the article's linreg \n",
 "-use of a Pipeline (pyspark.ml.pipeline) to streamline the whole prediction process \n",
-"-no more sql-type queries"
+"-no more sql-type queries (personal preference)"
 ]
 },
 {
@@ -45,15 +52,15 @@
 },
 {
 "cell_type": "code",
-"execution_count": 26,
+"execution_count": 7,
 "id": "ac3e8958-5d9c-4e80-9a6f-fd343a3d4dd5",
 "metadata": {
 "tags": []
 },
 "outputs": [],
 "source": [
 "from pyspark.sql import SparkSession\n",
-"import pyspark.sql.functions as F\n",
+"\n",
 "\n",
 "# SparkSession\n",
 "URL_SPARK = \"spark://spark:7077\"\n",
@@ -74,7 +81,7 @@
 "tags": []
 },
 "source": [
-"### Run example - pyspark.sql / pyspark.ml"
+"### Run example - pyspark.sql / pyspark.ml, build an ML Pipeline"
 ]
 },
 {
@@ -110,7 +117,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 51,
+"execution_count": 8,
 "id": "888a85f7-5e40-4e90-8a35-3cb1435d1460",
 "metadata": {
 "tags": []
@@ -152,44 +159,225 @@
 "source": [
 "# Cache table/dataframe for re-usable table with .cache()\n",
 "# caching operation takes place only when a Spark action (count, show, take or write) is also performed on the same dataframe\n",
-"df_avocado = spark.read.csv(\n",
+"df = spark.read.csv(\n",
 " \"data/avocado.csv\", \n",
 " header=True, \n",
 " inferSchema=True\n",
 ").cache() # cache transformation\n",
 "\n",
-"df_avocado.printSchema()\n",
-"df_avocado.show(4) # call show() from the cached df_avocado. df_avocado cached in memory right after we call the action (show)"
+"df.printSchema()\n",
+"df.show(4) # call show() from the cached df; df is cached in memory right after we call the action (show)"
 ]
 },
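Aside: since .cache() is lazy, a minimal sketch of when the cache is actually materialized and how to release it (the df_cached name is just for the sketch, independent of the df used above):

    # same read as in the cell above, marked for caching
    df_cached = spark.read.csv("data/avocado.csv", header=True, inferSchema=True).cache()

    print(df_cached.is_cached)  # True: the plan is marked for caching, but nothing is stored yet
    df_cached.count()           # first action: the CSV is read and the rows are materialized in memory
    df_cached.show(4)           # served from the cached rows, no re-read of the file
    df_cached.unpersist()       # release the cached copy when it is no longer needed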
 {
 "cell_type": "markdown",
 "id": "75c92350-8990-4534-81cc-4c9da00a40b1",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "source": [
 "#### Steps overview"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": 9,
+"id": "e0068bc2-270c-4e43-beeb-082b404ce297",
+"metadata": {
+"tags": []
+},
+"outputs": [],
+"source": [
+"import pyspark.sql.functions as F\n",
+"from pyspark.sql.functions import col\n",
+"\n",
+"from pyspark.ml.feature import StringIndexer, OneHotEncoder, StandardScaler, VectorAssembler\n",
+"from pyspark.ml.regression import RandomForestRegressor\n",
+"from pyspark.ml.pipeline import Pipeline"
+]
+},
 {
 "cell_type": "markdown",
-"id": "a4734d8a-4399-4980-82c3-05940fbf888e",
+"id": "b840a5b1-8bd7-4c73-a8c9-133e4983e8dd",
 "metadata": {},
 "source": [
-"- No EDA, has been done a million times, we jump to the implementation directly\n",
-"- feature creation from 'Date' : --> yy and mm\n",
-"- one hot encoding categorical 'region'\n",
-"- scale numerical features ()\n",
-"- Drop columns : _c0 (index), Total Bags, Total Volume (strong corr with respective subcategories)\n",
-"- Drop transformed columns (if not aready done): Date, region"
+"- Steps differ a bit from sklearn's. Look up 'transformers' and 'estimators' (see the short transformer/estimator sketch below)\n",
+"- No EDA; it has been done a million times on this dataset. \n",
+"- Format data \n",
+"    - Feature creation from 'Date' : yy and mm \n",
+"    - Drop columns : Total Bags, Total Volume (strong corr with their respective subcategories) ; could also be done in the pipeline (see the SQLTransformer sketch further below)\n",
+"- Pipeline (encode etc...) \n",
+"    - One-hot encode the categorical 'region' (preceded by StringIndexer) \n",
+"    - Drop transformed columns: Date, region. Note : unlike scikit-learn column transformers, pyspark adds new columns when transforming \n",
+"- Consolidate all remaining features into a single vector using VectorAssembler\n",
+"- Scale numerical features using StandardScaler <- would come earlier in a sklearn pipeline\n",
+"- Predict"
+]
+},
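Aside on the 'transformers' vs 'estimators' point above: a Transformer only exposes .transform(), while an Estimator exposes .fit() and returns a fitted Model, which is itself a Transformer. A minimal sketch reusing the avocado df (variable names are just for the sketch):

    from pyspark.ml.feature import StringIndexer

    # Estimator: must see the data to learn the category -> index mapping
    indexer = StringIndexer(inputCol="region", outputCol="region_str")
    indexer_model = indexer.fit(df)            # returns a StringIndexerModel (a Transformer)

    # Transformer: applies the learned mapping, adding a new column
    indexed = indexer_model.transform(df)
    indexed.select("region", "region_str").show(3)

A Pipeline is itself an Estimator: pipeline.fit(train) returns a PipelineModel, which is the Transformer used for prediction later in the notebook.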
+{
+"cell_type": "markdown",
+"id": "ee3ac793-eff8-4d36-ae00-96dd66fd2d19",
+"metadata": {
+"tags": []
+},
+"source": [
+"#### Format Data"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
-"id": "95442a63-b7d7-409a-95f2-6650cbe6bc5b",
+"execution_count": 10,
+"id": "ea5b4865-062b-491a-bf10-1242d46d358c",
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n",
+"|AveragePrice|Medium Size|Large Size|XLarge Size|Small Bags|Large Bags|XLarge Bags| type|year|region|Year Index|Month|\n",
+"+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n",
+"| 1.33| 1036.74| 54454.85| 48.16| 8603.62| 93.25| 0.0|conventional|2015|Albany| 15| 12|\n",
+"| 1.35| 674.28| 44638.81| 58.33| 9408.07| 97.49| 0.0|conventional|2015|Albany| 15| 12|\n",
+"| 0.93| 794.7| 109149.67| 130.5| 8042.21| 103.14| 0.0|conventional|2015|Albany| 15| 12|\n",
+"| 1.08| 1132.0| 71976.41| 72.58| 5677.4| 133.76| 0.0|conventional|2015|Albany| 15| 12|\n",
+"+------------+-----------+----------+-----------+----------+----------+-----------+------------+----+------+----------+-----+\n",
+"only showing top 4 rows\n",
+"\n"
+]
+}
+],
+"source": [
+"# convert 'year' yyyy to yy (yyyy - 2000, since we have 2015-2018 values)\n",
+"df = df.withColumn('Year Index', col('Year') - 2000)\n",
+"\n",
+"# extract month from the 'Date' timestamp column\n",
+"df = df.withColumn('Month', F.month('Date'))\n",
+"\n",
+"# drop \"useless\" columns (per multiple notebooks on this particular dataset) : \"Total Bags\", \"Total Volume\" & the index (_c0)\n",
+"# /!\\ Optional : not strictly necessary, as we will assemble a vector of only the features we're interested in later in the pipeline\n",
+"drop_cols = (\"Total Bags\", \"Total Volume\", \"_c0\", \"Date\") # we keep \"year\" just to show we do not need to drop it.\n",
+"df = df.drop(*drop_cols)\n",
+"\n",
+"# rename the avocado size columns for clarity \n",
+"df = df.withColumnRenamed(\"4046\", \"Medium Size\").withColumnRenamed(\"4225\", \"Large Size\").withColumnRenamed(\"4770\", \"XLarge Size\")\n",
+"df.show(4)"
+]
+},
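Aside on the 'could also be done in the pipeline' question from the overview: one option (a sketch, not what this commit does; the select_cols name is illustrative) is pyspark.ml.feature.SQLTransformer, which turns the column selection and renaming into a pipeline stage fed the raw dataframe:

    from pyspark.ml.feature import SQLTransformer

    # __THIS__ is replaced by the stage's input DataFrame at transform time
    select_cols = SQLTransformer(statement="""
        SELECT AveragePrice,
               `4046` AS `Medium Size`, `4225` AS `Large Size`, `4770` AS `XLarge Size`,
               `Small Bags`, `Large Bags`, `XLarge Bags`,
               type, region,
               year - 2000 AS `Year Index`, month(Date) AS Month
        FROM __THIS__
    """)

    # could then be prepended to the stages, e.g. Pipeline(stages=[select_cols, str_indexer, ...])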
+{
+"cell_type": "markdown",
+"id": "4dfcd69e-fbdb-4eff-97fb-9b6e3a58a06f",
+"metadata": {
+"tags": []
+},
+"source": [
+"#### Build Pipeline (encode, vectorize, scale)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 14,
+"id": "382272ea-07aa-43a4-af0f-681b332af34d",
 "metadata": {},
 "outputs": [],
-"source": []
+"source": [
+"# 1. Must use StringIndexer before encoding categorical features (OHE input is category indices)\n",
+"str_indexer = StringIndexer(inputCols=['region','type'], outputCols=['region_str', 'type_str'])\n",
+"\n",
+"# 2. Encode categorical features. \n",
+"# Unlike sklearn, transformations add new columns and keep the inputs, so we track the changes via output column names.\n",
+"# Spark's OHE differs from sklearn's OneHotEncoder, which keeps all categories (Spark drops the last one by default); the output vectors are sparse (see the dropLast sketch below)\n",
+"ohe = OneHotEncoder(\n",
+" inputCols=['region_str','type_str'], \n",
+" outputCols=['region_str_ohe','type_str_ohe']\n",
+")\n",
+"\n",
+"# 3. Assemble the (used) features into one vector\n",
+"assembler = VectorAssembler(\n",
+" inputCols=[\n",
+" 'Medium Size',\n",
+" 'Large Size',\n",
+" 'XLarge Size',\n",
+" 'Small Bags',\n",
+" 'Large Bags',\n",
+" 'XLarge Bags',\n",
+" 'region_str_ohe',\n",
+" 'type_str_ohe'\n",
+" ], outputCol='features')\n",
+"\n",
+"# 4. Standardize numerical features\n",
+"scaler = StandardScaler(inputCol='features',outputCol='scaled_features')\n",
+"\n",
+"# 5. Define the regressor\n",
+"rf_regressor = RandomForestRegressor(featuresCol='scaled_features',labelCol='AveragePrice', numTrees=50, maxDepth=15)\n",
+"\n",
+"# 6. Build the Pipeline\n",
+"pipeline = Pipeline(stages=[str_indexer, ohe, assembler, scaler, rf_regressor])\n"
+]
+},
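Aside on the OHE comment above: Spark's OneHotEncoder drops the last category by default (dropLast=True), which is why its vectors have one slot fewer than sklearn's. A small sketch (ohe_full is illustrative) of keeping every category if a sklearn-like encoding is preferred:

    from pyspark.ml.feature import OneHotEncoder

    # dropLast=False keeps all categories, like sklearn's default OneHotEncoder
    ohe_full = OneHotEncoder(
        inputCols=['region_str', 'type_str'],
        outputCols=['region_str_ohe', 'type_str_ohe'],
        dropLast=False,
    )
    # the outputs are SparseVectors either way; dropLast only changes their length

For a tree-based model like the random forest used here the choice matters little; for linear models the default dropLast=True avoids the redundant, collinear column.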
+{
+"cell_type": "markdown",
+"id": "b81532fd-1edf-4a7b-a665-676ee9cb21cd",
+"metadata": {
+"tags": []
+},
+"source": [
+"#### Simple random forest modeling : train-test split, apply the Pipeline to train & test, evaluate"
+]
+},
+{
+"cell_type": "markdown",
+"id": "c3332499-66a1-4f79-be00-bcefcbda212a",
+"metadata": {},
+"source": [
+"Crude attempt: no CV, mostly default RF parameters. \n",
+"For parameter tuning, look up pyspark.ml.tuning (CrossValidator, ParamGridBuilder); not used here, but see the sketch below"
+]
+},
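A hedged sketch of what that tuning could look like with pyspark.ml.tuning, reusing the pipeline and the train split defined in the next cell (the grid values are illustrative, not from the notebook):

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    param_grid = (
        ParamGridBuilder()
        .addGrid(rf_regressor.numTrees, [20, 50, 100])   # illustrative values
        .addGrid(rf_regressor.maxDepth, [5, 10, 15])
        .build()
    )

    cv = CrossValidator(
        estimator=pipeline,                # the whole Pipeline is tuned end to end
        estimatorParamMaps=param_grid,
        evaluator=RegressionEvaluator(labelCol='AveragePrice', metricName='rmse'),
        numFolds=3,
    )

    cv_model = cv.fit(train)               # trains one model per fold and per grid point
    best_model = cv_model.bestModel        # a fitted PipelineModel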
+{
+"cell_type": "code",
+"execution_count": 18,
+"id": "ae2ebec7-8379-45bd-b375-faac5c64824c",
+"metadata": {
+"tags": []
+},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"0.1975694758480664"
+]
+},
+"execution_count": 18,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"from pyspark.ml.evaluation import RegressionEvaluator\n",
+"\n",
+"# split\n",
+"train, test = df.randomSplit([.8,.2])\n",
+"\n",
+"# fit the model\n",
+"model = pipeline.fit(train)\n",
+"\n",
+"\n",
+"# apply the model to the test set\n",
+"prediction = model.transform(test)\n",
+"evaluator = RegressionEvaluator(predictionCol='prediction',\n",
+" labelCol='AveragePrice', metricName='rmse')\n",
+"\n",
+"evaluator.evaluate(prediction)"
+]
+},
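The same RegressionEvaluator can report other metrics by overriding metricName at evaluate time; a small sketch using the evaluator and prediction DataFrame from the cell above:

    rmse = evaluator.evaluate(prediction)                               # 'rmse', as configured
    r2 = evaluator.evaluate(prediction, {evaluator.metricName: "r2"})
    mae = evaluator.evaluate(prediction, {evaluator.metricName: "mae"})
    print(f"rmse={rmse:.3f}  r2={r2:.3f}  mae={mae:.3f}")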
+{
+"cell_type": "markdown",
+"id": "5a769698-04bc-4eda-9edc-63a4bfd11d25",
+"metadata": {},
+"source": [
+"For reference, the original article, using linear regression + CV, reports an RMSE of 0.28"
+]
 }
 ],
 "metadata": {
