|
| 1 | +--- |
| 2 | +title: Deploy and run MLflow models in Spark jobs |
| 3 | +titleSuffix: Azure Machine Learning |
| 4 | +description: Learn to deploy your MLflow model in Spark jobs to perform inference. |
| 5 | +services: machine-learning |
| 6 | +ms.service: machine-learning |
| 7 | +ms.subservice: core |
| 8 | +author: santiagxf |
| 9 | +ms.author: fasantia |
| 10 | +ms.reviewer: mopeakande |
| 11 | +ms.date: 12/30/2022 |
| 12 | +ms.topic: how-to |
| 13 | +ms.custom: deploy, mlflow, devplatv2, no-code-deployment, devx-track-azurecli, cliv2, event-tier1-build-2022 |
| 14 | +--- |
| 15 | + |
| 16 | +# Deploy and run MLflow models in Spark jobs |
| 17 | + |
| 18 | +In this article, learn how to deploy and run your [MLflow](https://www.mlflow.org) model in Spark jobs to perform inference over large amounts of data or as part of data wrangling jobs. |
| 19 | + |
| 20 | + |
| 21 | +## About this example |
| 22 | + |
| 23 | +This example shows how you can deploy an MLflow model registered in Azure Machine Learning to Spark jobs running in [managed Spark clusters (preview)](how-to-submit-spark-jobs.md), Azure Databricks, or Azure Synapse Analytics, to perform inference over large amounts of data. |
| 24 | + |
| 25 | +It uses an MLflow model based on the [Diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). This dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained from n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline (regression). |
| 26 | + |
| 27 | +The model has been trained using a `scikit-learn` regressor and all the required preprocessing has been packaged as a pipeline, making this model an end-to-end pipeline that goes from raw data to predictions. |
| 28 | + |
| 29 | +The information in this article is based on code samples contained in the [azureml-examples](https://github.com/azure/azureml-examples) repository. To run the commands locally without having to copy/paste YAML and other files, clone the repo and then change to the `sdk/python/using-mlflow/deploy` directory inside it. |
| 30 | + |
| 31 | +```azurecli |
| 32 | +git clone https://github.com/Azure/azureml-examples --depth 1 |
| 33 | +cd azureml-examples/sdk/python/using-mlflow/deploy |
| 34 | +``` |
| 35 | + |
| 36 | +## Prerequisites |
| 37 | + |
| 38 | +Before following the steps in this article, make sure you have the following prerequisites: |
| 39 | + |
| 40 | +- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/). |
| 41 | +- An MLflow model registered in your workspace. This example registers a model trained on the [Diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). |
| 42 | +- Install the MLflow SDK package `mlflow` and the Azure Machine Learning plug-in for MLflow `azureml-mlflow`. |
| 43 | + |
| 44 | + ```bash |
| 45 | + pip install mlflow azureml-mlflow |
| 46 | + ``` |
| 47 | + |
| 48 | +- If you aren't running in Azure Machine Learning compute, configure the MLflow tracking URI or MLflow's registry URI to point to the workspace you are working on. See [Track runs using MLflow with Azure Machine Learning](how-to-use-mlflow-cli-runs.md#set-up-tracking-environment) for more details. |
| 49 | + |
| 50 | + |
| 51 | +### Connect to your workspace |
| 52 | + |
| 53 | +First, let's connect to the Azure Machine Learning workspace where your model is registered. |
| 54 | +
|
| 55 | +# [Azure Machine Learning compute](#tab/aml) |
| 56 | +
|
| 57 | +Tracking is already configured for you. Your default credentials will also be used when working with MLflow. |
| 58 | +
|
| 59 | +# [Remote compute](#tab/remote) |
| 60 | +
|
| 61 | +**Configure tracking URI** |
| 62 | +
|
| 63 | +You need to configure MLflow to point to the Azure Machine Learning MLflow tracking URI. The tracking URI has the protocol `azureml://`. You can use MLflow to configure it. |
| 64 | +
|
| 65 | +```python |
| | +import mlflow |
| | + |
| 66 | +azureml_tracking_uri = "<AZUREML_TRACKING_URI>" |
| 67 | +mlflow.set_tracking_uri(azureml_tracking_uri) |
| 68 | +``` |
| 69 | +
|
| 70 | +There are multiple ways to get the Azure Machine Learning MLflow tracking URI. Refer to [Set up tracking environment](how-to-use-mlflow-cli-runs.md) to see all the alternatives. |
| 71 | +
|
| 72 | +> [!TIP] |
| 73 | +> When working in shared environments, such as an Azure Databricks or Azure Synapse Analytics cluster, set the environment variable `MLFLOW_TRACKING_URI` at the cluster level. Every session running in the cluster then points to the desired tracking URI automatically, and you don't have to configure it on a per-session basis.
| 74 | +
|
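As a minimal sketch of the tip above, the following helper reads `MLFLOW_TRACKING_URI` from the environment and checks that it uses the `azureml://` protocol before a session relies on it. The function name `resolve_tracking_uri` is hypothetical and isn't part of MLflow or Azure Machine Learning. It only illustrates one way to fail early when the cluster-level variable is missing or malformed.

```python
import os

# Environment variable that MLflow reads to find the tracking server.
TRACKING_URI_VAR = "MLFLOW_TRACKING_URI"


def resolve_tracking_uri(env=None):
    """Return the tracking URI found in `env` (hypothetical helper).

    Raises RuntimeError if the variable is unset and ValueError if the
    value doesn't use the azureml:// protocol.
    """
    env = os.environ if env is None else env
    uri = env.get(TRACKING_URI_VAR, "").strip()
    if not uri:
        raise RuntimeError(f"{TRACKING_URI_VAR} is not set")
    if not uri.startswith("azureml://"):
        raise ValueError(f"Expected an azureml:// URI, got: {uri}")
    return uri
```

When the variable is set for the whole cluster, MLflow picks it up on its own. A check like this is only useful as a guard at the top of a notebook or job.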
| 75 | +**Configure authentication** |
| 76 | +
|
| 77 | +Once tracking is configured, you also need to configure how to authenticate to the associated workspace. For interactive jobs where there's a user connected to the session, you can rely on interactive authentication.
| 78 | +
|
| 79 | +For those scenarios where unattended execution is required, you'll have to configure a service principal to communicate with Azure Machine Learning. |
| 80 | + |
| 81 | +```python |
| 82 | +import os |
| 83 | +
|
| 84 | +os.environ["AZURE_TENANT_ID"] = "<AZURE_TENANT_ID>" |
| 85 | +os.environ["AZURE_CLIENT_ID"] = "<AZURE_CLIENT_ID>" |
| 86 | +os.environ["AZURE_CLIENT_SECRET"] = "<AZURE_CLIENT_SECRET>" |
| 87 | +``` |
| 88 | + |
| 89 | +> [!TIP] |
| 90 | +> When working in shared environments, it's better to configure these environment variables for the entire cluster. As a best practice, manage them as secrets in an instance of Azure Key Vault. For instance, in Azure Databricks, you can use secrets to set these variables as follows: `AZURE_CLIENT_SECRET={{secrets/<scope-name>/<secret-name>}}`. See [Reference a secret in an environment variable](https://learn.microsoft.com/azure/databricks/security/secrets/secrets#reference-a-secret-in-an-environment-variable) for how to do it in Azure Databricks or refer to similar documentation in your platform. |
| 91 | + |
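For unattended execution, it can help to verify that the three service principal variables are present before the job tries to reach the workspace. The following is a minimal sketch. The helper `missing_credentials` is a hypothetical name introduced here for illustration, and it only checks that the variables are set, not that the values are valid.

```python
import os

# Variables used for service principal authentication in this article.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing service principal settings:", ", ".join(missing))
```

Running the check at the start of a job turns a late authentication failure into an immediate, readable message.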
| 92 | +--- |
| 93 | + |
| 94 | +### Registering the model |
| 95 | + |
| 96 | +We need a model registered in the Azure Machine Learning registry to perform inference. In this case, we already have a local copy of the model in the repository, so we only need to publish the model to the registry in the workspace. You can skip this step if the model you are trying to deploy is already registered. |
| 97 | + |
| 98 | +```python |
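| | +from mlflow.tracking import MlflowClient |
| | + |
| | +# Client used to talk to the model registry configured for the workspace |
| | +mlflow_client = MlflowClient() |
| | + |
| 99 | +model_name = 'sklearn-diabetes' |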
| 100 | +model_local_path = "sklearn-diabetes/model" |
| 101 | +
|
| 102 | +registered_model = mlflow_client.create_model_version( |
| 103 | + name=model_name, source=f"file://{model_local_path}" |
| 104 | +) |
| 105 | +version = registered_model.version |
| 106 | +``` |
| 107 | + |
| 108 | +Alternatively, if your model was logged inside of a run, you can register it directly. |
| 109 | + |
| 110 | +> [!TIP] |
| 111 | +> To register the model, you need to know the location where the model has been stored. If you are using the `autolog` feature of MLflow, the path depends on the type and framework of the model being used. We recommend checking the job's output to identify the name of this folder. You can look for the folder that contains a file named `MLmodel`. If you are logging your models manually using `log_model`, then the path is the argument you pass to that method. For example, if you log the model using `mlflow.sklearn.log_model(my_model, "classifier")`, then the path where the model is stored is `classifier`.
| 112 | +
|
| 113 | +```python |
| 114 | +model_name = 'sklearn-diabetes' |
| 115 | +
|
| 116 | +registered_model = mlflow_client.create_model_version( |
| 117 | + name=model_name, source=f"runs:/{RUN_ID}/{MODEL_PATH}"
| 118 | +) |
| 119 | +version = registered_model.version |
| 120 | +``` |
| 121 | +
|
| 122 | +> [!NOTE] |
| 123 | +> The path `MODEL_PATH` is the location where the model has been stored in the run. |
| 124 | +
|
| 125 | +--- |
| 126 | +
|
| 127 | +### Get input data to score |
| 128 | +
|
| 129 | +We'll need some input data to run our jobs on. In this example, we'll download sample data from the internet and place it in shared storage used by the Spark cluster.
| 130 | +
|
| 131 | +```python |
| 132 | +import urllib.request
| 133 | +
|
| 134 | +urllib.request.urlretrieve("https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv", "/tmp/data") |
| 135 | +``` |
| 136 | +
|
| 137 | +Move the data to a mounted storage account available to the entire cluster. |
| 138 | +
|
| 139 | +```python |
| 140 | +dbutils.fs.mv("file:/tmp/data", "dbfs:/") |
| 141 | +``` |
| 142 | +
|
| 143 | +> [!IMPORTANT] |
| 144 | +> The previous code uses `dbutils`, which is a tool available in Azure Databricks clusters. Use the appropriate tool depending on the platform you are using.
| 145 | +
|
| 146 | +The input data is then placed in the following folder: |
| 147 | +
|
| 148 | +```python |
| 149 | +input_data_path = "dbfs:/data" |
| 150 | +``` |
| 151 | +
|
| 152 | +## Run the model in Spark clusters |
| 153 | +
|
| 154 | +The following section explains how to run MLflow models registered in Azure Machine Learning in Spark jobs. |
| 155 | +
|
| 156 | +1. Configure the model URI. The following URI refers to the latest version of a model named `heart-classifier`.
| 157 | +
|
| 158 | + ```python |
| 159 | + model_uri = "models:/heart-classifier/latest" |
| 160 | + ``` |
| 161 | +
|
| 162 | +1. Load the model as a UDF. A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment.
| 163 | +
|
| 164 | + ```python |
| 165 | + predict_function = mlflow.pyfunc.spark_udf(spark, model_uri, env_manager="local") |
| 166 | + ``` |
| 167 | +
|
| 168 | + > [!TIP] |
| 169 | + > Use the argument `result_type` to control the type returned by the `predict()` function. |
| 170 | +
|
| 171 | +1. Read the data you want to score: |
| 172 | +
|
| 173 | + ```python |
| 174 | + df = spark.read.option("header", "true").option("inferSchema", "true").csv(input_data_path).drop("target") |
| 175 | + ``` |
| 176 | +
|
| 177 | + In our case, the input data is in CSV format and placed in the folder `dbfs:/data/`. We're also dropping the column `target`, because this dataset includes the target variable to predict. In production scenarios, your data won't have this column.
| 178 | +
|
| 179 | +1. Run the function `predict_function` and place the predictions on a new column. In this case, we're placing the predictions in the column `predictions`. |
| 180 | + |
| 181 | + ```python |
| 182 | + df.withColumn("predictions", predict_function(*df.columns)) |
| 183 | + ``` |
| 184 | + |
| 185 | + > [!TIP] |
| 186 | + > The `predict_function` receives the required columns as arguments. In our case, the model expects all the columns of the data frame, so `df.columns` is used. If your model requires a subset of the columns, you can pass them manually. If your model has a signature, the input types need to be compatible with the types the signature expects. |
| 187 | + |
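When the model needs only a subset of the columns, as the last tip mentions, one option is to check the data frame's columns against the expected list before calling the UDF. The following is a minimal sketch in plain Python. The helper `select_model_columns` and the column names in the usage note are hypothetical and aren't taken from the model's actual signature.

```python
def select_model_columns(available, required):
    """Return the required columns in the order the model expects them.

    `available` is the list of columns in the data frame (for example,
    `df.columns`) and `required` is the list of columns the model needs.
    Raises ValueError if any required column is absent.
    """
    available_set = set(available)
    missing = [name for name in required if name not in available_set]
    if missing:
        raise ValueError(f"Input data is missing columns: {missing}")
    return list(required)
```

With such a helper, the scoring call could look like `df.withColumn("predictions", predict_function(*select_model_columns(df.columns, ["age", "chol"])))`, where `age` and `chol` stand in for whatever columns your model requires.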
| 188 | + |
| 189 | +## Run the model in a standalone Spark job in Azure Machine Learning |
| 190 | + |
| 191 | +Azure Machine Learning supports the creation of a standalone Spark job, and the creation of a reusable Spark component that can be used in [Azure Machine Learning pipelines](concept-ml-pipelines.md). In this example, we'll deploy a scoring job that runs as an Azure Machine Learning standalone Spark job and uses an MLflow model to perform inference. |
| 192 | +
|
| 193 | +> [!NOTE] |
| 194 | +> To learn more about Spark jobs in Azure Machine Learning, see [Submit Spark jobs in Azure Machine Learning (preview)](how-to-submit-spark-jobs.md). |
| 195 | +
|
| 196 | +1. A Spark job requires a Python script that takes arguments. Create a scoring script: |
| 197 | +
|
| 198 | + __score.py__ |
| 199 | +
|
| 200 | + ```python |
| 201 | + import argparse |
| | + |
| | + import mlflow |
| | + from pyspark.sql import SparkSession |
| 202 | + |
| 203 | + parser = argparse.ArgumentParser() |
| 204 | + parser.add_argument("--model") |
| 205 | + parser.add_argument("--input_data") |
| 206 | + parser.add_argument("--scored_data") |
| 207 | + |
| 208 | + args = parser.parse_args() |
| 209 | + print(args.model) |
| 210 | + print(args.input_data) |
| 211 | + |
| | + # Get the Spark session (reuses the active session if one already exists) |
| | + spark = SparkSession.builder.getOrCreate() |
| | + |
| 212 | + # Load the model as a UDF |
| 213 | + predict_function = mlflow.pyfunc.spark_udf(spark, args.model, env_manager="local") |
| 214 | + |
| 215 | + # Read the data you want to score |
| 216 | + df = spark.read.option("header", "true").option("inferSchema", "true").csv(args.input_data).drop("target") |
| 217 | + |
| 218 | + # Run the function `predict_function` and place the predictions on a new column |
| 219 | + scored_data = df.withColumn("predictions", predict_function(*df.columns)) |
| 220 | + |
| 221 | + # Save the predictions as CSV files in the output folder |
| 222 | + scored_data.write.mode("overwrite").csv(args.scored_data, header=True) |
| 223 | + ``` |
| 224 | + |
| 225 | + The above script takes three arguments: `--model`, `--input_data`, and `--scored_data`. The first two are inputs that represent the model to run and the input data. The last one is an output: the folder where the predictions are placed. |
| 226 | +
|
| 227 | +1. Create a job definition: |
| 228 | +
|
| 229 | + __mlflow-score-spark-job.yml__ |
| 230 | +
|
| 231 | + ```yml |
| 232 | + $schema: http://azureml/sdk-2-0/SparkJob.json |
| 233 | + type: spark |
| 234 | + |
| 235 | + code: ./src |
| 236 | + entry: |
| 237 | + file: score.py |
| 238 | + |
| 239 | + conf: |
| 240 | + spark.driver.cores: 1 |
| 241 | + spark.driver.memory: 2g |
| 242 | + spark.executor.cores: 2 |
| 243 | + spark.executor.memory: 2g |
| 244 | + spark.executor.instances: 2 |
| 245 | + |
| 246 | + inputs: |
| 247 | + model: |
| 248 | + type: mlflow_model |
| 249 | + path: azureml:heart-classifier@latest |
| 250 | + input_data: |
| 251 | + type: uri_file |
| 252 | + path: https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv |
| 253 | + mode: direct |
| 254 | + |
| 255 | + outputs: |
| 256 | + scored_data: |
| 257 | + type: uri_folder |
| 258 | + |
| 259 | + args: >- |
| 260 | + --model ${{inputs.model}} |
| 261 | + --input_data ${{inputs.input_data}} |
| 262 | + --scored_data ${{outputs.scored_data}} |
| 263 | + |
| 264 | + identity: |
| 265 | + type: user_identity |
| 266 | + |
| 267 | + resources: |
| 268 | + instance_type: standard_e4s_v3 |
| 269 | + runtime_version: "3.2" |
| 270 | + ``` |
| 271 | +
|
| 272 | + > [!TIP] |
| 273 | + > To use an attached Synapse Spark pool, define the `compute` property in the sample YAML specification file shown above instead of the `resources` property.
| 274 | +
|
| 275 | +1. The YAML file shown above can be used in the `az ml job create` command, with the `--file` parameter, to create a standalone Spark job as shown:
| 276 | +
|
| 277 | + ```azurecli |
| 278 | + az ml job create -f mlflow-score-spark-job.yml |
| 279 | + ``` |
| 280 | +
|
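The argument handling in the scoring script can be tried on its own, without Spark or MLflow, to confirm that the values passed through `args` in the job definition reach the script as expected. The following is a minimal sketch. The function `parse_scoring_args` and the use of `required=True` are assumptions added for illustration and aren't part of the original script.

```python
import argparse


def parse_scoring_args(argv=None):
    """Parse the three arguments the scoring script expects."""
    parser = argparse.ArgumentParser(description="Score data with an MLflow model")
    # Inputs: the model to run and the data to score.
    parser.add_argument("--model", required=True)
    parser.add_argument("--input_data", required=True)
    # Output: the folder where the predictions are written.
    parser.add_argument("--scored_data", required=True)
    return parser.parse_args(argv)


# Simulate the command line that the job definition would produce.
args = parse_scoring_args(
    ["--model", "models:/heart-classifier/latest",
     "--input_data", "heart.csv",
     "--scored_data", "scored"]
)
print(args.model, args.input_data, args.scored_data)
```

Marking the arguments as required makes the script stop with a usage message when the job definition omits one of them, instead of failing later with a `None` path.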
| 281 | +## Next steps |
| 282 | +
|
| 283 | +- [Deploy MLflow models to batch endpoints](how-to-mlflow-batch.md) |
| 284 | +- [Deploy MLflow models to online endpoint](how-to-deploy-mlflow-models-online-endpoints.md) |
| 285 | +- [Using MLflow models for no-code deployment](how-to-log-mlflow-models.md) |