Merge pull request #71 from LittleLittleCloud/u/xiaoyun/tuner

luisquintanilla · web-flow · commit a9ab32bc2de6 · 2022-10-24T10:57:07.000-04:00
add AutoML - HPO and tuner notebook
diff --git a/machine-learning/06-AutoML HPO and tuner.ipynb b/machine-learning/06-AutoML HPO and tuner.ipynb
@@ -0,0 +1,325 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    }
+   },
+   "source": [
+    "Hyper Parameter Optimization, aka HPO, is to find a well-performed hyper-parameter on a given search space. The most well-known HPO is grid-search but it only performs well on tiny search space. To resolve hpo on large search space, a lot of algorithms are applied. For example, bayesian optimization is designed for optimizing expensive, black box functions which is very suitable for hpo task. Cost-Frugal optimization on the other hand, taking the training cost into consideration and is aimed to find a better solution within limited cost.\n",
+    "\n",
+    "One thing to note is even though hpo is a very activate research field and a lot of algorithms have been invented in the last few years, there's still lacking a general, all-in-one hpo alogrithm that performs well on all datasets. So the best way to find out the right hpo algorithm is always try different hpos on your dataset.\n",
+    "\n",
+    "AutoML.Net provides several hpos for you to try out, and you can configure and replace different hpos easily for `AutoMLExperiment` via setting different tuner. In this notebook, we'll go through the following topics.\n",
+    "- Available tuners in AutoML.Net, and how to use it.\n",
+    "- Comparing the performance for those tuners."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "// install dependencies and import using statement\n",
+    "#i \"nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json\"\n",
+    "#r \"nuget: Plotly.NET.Interactive, 3.0.2\"\n",
+    "#r \"nuget: Plotly.NET.CSharp, 0.0.1\"\n",
+    "\n",
+    "// make sure you are using Microsoft.ML.AutoML later than 0.20.0.\n",
+    "#r \"nuget: Microsoft.ML.AutoML, 0.20.0-preview.22514.1\"\n",
+    "#r \"nuget: Microsoft.Data.Analysis, 0.20.0-preview.22514.1\"\n",
+    "// Import usings.\n",
+    "using System;\n",
+    "using System.IO;\n",
+    "using System.Net;\n",
+    "using Microsoft.ML;\n",
+    "using Microsoft.ML.AutoML;\n",
+    "using Microsoft.ML.Data;\n",
+    "using Microsoft.ML.SearchSpace;\n",
+    "using Newtonsoft.Json;\n",
+    "using Microsoft.ML.AutoML.CodeGen;\n",
+    "using Microsoft.Data.Analysis;\n",
+    "using Microsoft.ML.SearchSpace.Option;"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Available Tuners in AutoML.Net\n",
+    "For now, those tuners are available in AutoML.Net\n",
+    "- CostFrugalTuner: low-cost HPO algorithm, this is an implementation of [Frugal Optimization for Cost-related Hyperparameters](https://arxiv.org/abs/2005.01571).\n",
+    "- SMAC: Bayesian optimziation using random forest as regression model.\n",
+    "- EciCostFrugalTuner: CostFrugalTuner for hierarchical search space. This will be used as default tuner if `AutoMLExperiment.SetPipeline` get called.\n",
+    "- GridSearch\n",
+    "- RandomSearch\n",
+    "\n",
+    "The following section shows how to use different tuner in `AutoMLExperiment`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var context = new MLContext(1);\n",
+    "var experiment = context.Auto().CreateExperiment();\n",
+    "\n",
+    "// use EciCostFrugalTuner\n",
+    "// Note: EciCostFrugalTuner will be set as default tuner if you call \n",
+    "// experiment.SetPipeline()\n",
+    "experiment.SetEciCostFrugalTuner();\n",
+    "\n",
+    "// use CostFrugalTuner\n",
+    "experiment.SetCostFrugalTuner();\n",
+    "\n",
+    "// use SMAC\n",
+    "experiment.SetSmacTuner();\n",
+    "\n",
+    "// use GridSearch\n",
+    "experiment.SetGridSearchTuner(step: 10);\n",
+    "\n",
+    "// use RandomSearch\n",
+    "experiment.SetRandomSearchTuner(seed: 1);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Compare GridSearch and EciCostFrugal on titanic dataset\n",
+    "\n",
+    "The following section shows how different hpo effect automl performance, by comparing metric trend from GridSearch and EciCostFrugal on titanic dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    }
+   },
+   "source": [
+    "## Download titanic if necessary"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "string EnsureDataSetDownloaded(string fileName)\n",
+    "{\n",
+    "\n",
+    "\t// This is the path if the repo has been checked out.\n",
+    "\tvar filePath = Path.Combine(Directory.GetCurrentDirectory(),\"data\", fileName);\n",
+    "\n",
+    "\tif (!File.Exists(filePath))\n",
+    "\t{\n",
+    "\t\t// This is the path if the file has already been downloaded.\n",
+    "\t\tfilePath = Path.Combine(Directory.GetCurrentDirectory(), fileName);\n",
+    "\t}\n",
+    "\n",
+    "\tif (!File.Exists(filePath))\n",
+    "\t{\n",
+    "\t\tusing (var client = new WebClient())\n",
+    "\t\t{\n",
+    "\t\t\tclient.DownloadFile($\"https://raw.githubusercontent.com/dotnet/csharp-notebooks/main/machine-learning/data/{fileName}\", filePath);\n",
+    "\t\t}\n",
+    "\t\tConsole.WriteLine($\"Downloaded {fileName}  to : {filePath}\");\n",
+    "\t}\n",
+    "\telse\n",
+    "\t{\n",
+    "\t\tConsole.WriteLine($\"{fileName} found here: {filePath}\");\n",
+    "\t}\n",
+    "\n",
+    "\treturn filePath;\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var trainDataPath = EnsureDataSetDownloaded(\"titanic-train.csv\");\n",
+    "var df = DataFrame.LoadCsv(trainDataPath);\n",
+    "\n",
+    "var trainTestSplit = context.Data.TrainTestSplit(df, 0.1);\n",
+    "df.Head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Construct pipeline and AutoMLExperiment"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "var pipeline = context.Auto().Featurizer(df, excludeColumns: new[]{\"Survived\"})\n",
+    "                        .Append(context.Transforms.Conversion.ConvertType(\"Survived\", \"Survived\", DataKind.Boolean))\n",
+    "\t\t\t\t\t    .Append(context.Auto().BinaryClassification(labelColumnName: \"Survived\"));\n",
+    "// Configure AutoML\n",
+    "var monitor = new NotebookMonitor(pipeline);\n",
+    "\n",
+    "var experiment = context.Auto().CreateExperiment()\n",
+    "                    .SetPipeline(pipeline)\n",
+    "                    .SetTrainingTimeInSeconds(10)\n",
+    "                    .SetDataset(trainTestSplit.TrainSet, trainTestSplit.TestSet)\n",
+    "                    .SetBinaryClassificationMetric(BinaryClassificationMetric.Accuracy, \"Survived\", \"PredictedLabel\")\n",
+    "                    .SetMonitor(monitor);\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run HPO using GridSearch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "experiment.SetGridSearchTuner(step: 10);\n",
+    "await experiment.RunAsync();\n",
+    "var gridSearchTrial = monitor.CompletedTrials.ToArray();\n",
+    "monitor.CompletedTrials.Clear();"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run HPO using EciCostFrugal"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "experiment.SetEciCostFrugalTuner();\n",
+    "await experiment.RunAsync();\n",
+    "var eciSearchTrials = monitor.CompletedTrials.ToArray();\n",
+    "monitor.CompletedTrials.Clear();"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Compare HPO performace among GridSearch, EciCostFrugal"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "vscode": {
+     "languageId": "dotnet-interactive.csharp"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "using Plotly.NET;\n",
+    "\n",
+    "var gridSearchChart = Chart2D.Chart.Line<int, float, string>(gridSearchTrial.Select(t => t.TrialSettings.TrialId), gridSearchTrial.Select(t => (float)t.Metric), Name: \"grid_search\");\n",
+    "var eciCfoSearchChart = Chart2D.Chart.Line<int, float, string>(eciSearchTrials.Select(t => t.TrialSettings.TrialId), eciSearchTrials.Select(t => (float)t.Metric), Name: \"eci_cfo\");\n",
+    "var combineChart = Chart.Combine(new[]{ gridSearchChart, eciCfoSearchChart});\n",
+    "combineChart.Display()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".NET (C#)",
+   "language": "C#",
+   "name": ".net-csharp"
+  },
+  "language_info": {
+   "file_extension": ".cs",
+   "mimetype": "text/x-csharp",
+   "name": "C#",
+   "pygments_lexer": "csharp",
+   "version": "9.0"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}