Commit 1d201b8

sample added
1 parent 89b7299 commit 1d201b8

1 file changed: 1 addition, 0 deletions
{"cells":[{"cell_type":"markdown","source":["# Demonstration: Train a ML model with AutoML\n","\n","## Introduction\n","\n","This notebook is automatically generated by the Fabric low-code AutoML wizard based on your selections. Whether you're building a regression model, a classifier, or another machine-learning solution, this tool simplifies the process by transforming your goals into executable code. You can easily modify any settings or code snippets to better align with your requirements.\n","\n","### What is FLAML?\n","\n","[FLAML (Fast and Lightweight Automated Machine Learning)](https://aka.ms/fabric-automl) is an open-source AutoML library designed to quickly and efficiently find the best machine learning models and hyperparameters. FLAML optimizes for speed, accuracy, and cost, making it an excellent choice for a wide range of machine-learning tasks.\n","\n","### Steps in this notebook\n","\n","1. **Load the data**: Import your dataset.\n","2. **Generate features**: Automatically transform and preprocess your data to improve model performance.\n","3. **Use AutoML to find your best model**: Use FLAML to automatically select the most suitable model and optimize its parameters.\n","4. **Save the final machine learning model**: Store the trained model for future use.\n","5. **Generate predictions**: Use the saved model to predict outcomes on new data.\n","\n","> [!IMPORTANT]\n","> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**\n"],"metadata":{},"id":"d8d36bfe-0884-4c73-a24f-175233d98bdf"},{"cell_type":"code","source":["%pip install scikit-learn==1.5.1\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"592531fe-7a06-4837-a5eb-2650113cbf13"},{"cell_type":"markdown","source":["### Default notebook optimization\n","\n","This cell configures the logging and warning settings to reduce unnecessary output and focus on critical information. It suppresses specific warnings and logs from the underlying libraries, ensuring a cleaner and more readable notebook experience."],"metadata":{},"id":"14223c8d-f82a-44ef-a466-e03ebcc6b430"},{"cell_type":"code","source":["import logging\n","import warnings\n"," \n","logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)\n","logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)\n","warnings.simplefilter('ignore', category=FutureWarning)\n","warnings.simplefilter('ignore', category=UserWarning)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"9878de39-d1c1-485b-9058-e429715b5cd8"},{"cell_type":"markdown","source":["## Step 1: Load the Data\n","\n","This cell is responsible for importing the raw data from the specified source into the notebook environment. 
```python
import re
import pandas as pd
import numpy as np

df = spark.read.format("delta").load(
    "Tables/2020orders"
).cache()
# Convert to pandas according to the selected models
X = df.limit(100000).toPandas()  # Use df.toPandas() to use all the data
# Replace unsupported characters in column names with underscores to avoid
# invalid characters during model training and saving
X = X.rename(columns=lambda c: re.sub('[^A-Za-z0-9_]+', '_', c))

target_col = re.sub('[^A-Za-z0-9_]+', '_', "price")
```

```python
display(X)
```

## Step 2: Generate features

Featurization is the process of transforming raw data into a format optimized for training a machine learning model. It ensures the model can access the most relevant information, significantly impacting its accuracy and performance.

This step applies various techniques to refine the data, enhance its quality, and make it compatible with the selected algorithms, helping the model learn patterns more effectively.

```python
# Handle class imbalance
import matplotlib.pyplot as plt

distribution = X[target_col].value_counts(normalize=True)
dominant_class_proportion = distribution.max()

distribution.plot(kind='bar')
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Proportion")
plt.show()

if dominant_class_proportion > 0.8:
    print(f"The dataset is imbalanced. The dominant class has {dominant_class_proportion * 100:.2f}% of the samples.")
    print("You may need to handle class imbalance before training the model.")
    print("You can use techniques such as oversampling, undersampling, or class weights.")
    print("For more information, see https://aka.ms/smote-example")
else:
    print("The dataset is balanced.")
```
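If the check above flags an imbalance, one of the techniques it mentions is oversampling. Here is a minimal SMOTE sketch, assuming the `imbalanced-learn` package is installed (`%pip install imbalanced-learn`), the target holds class labels rather than continuous values, and all feature columns are numeric (see https://aka.ms/smote-example for a fuller walkthrough):

```python
# A minimal sketch of SMOTE oversampling; assumes imbalanced-learn is installed
# and the feature columns are numeric.
from imblearn.over_sampling import SMOTE

features = X.drop(columns=[target_col])
labels = X[target_col]

# Oversample minority classes until they match the majority class.
smote = SMOTE(random_state=41)
features_resampled, labels_resampled = smote.fit_resample(features, labels)

print(f"Before: {labels.value_counts().to_dict()}")
print(f"After:  {labels_resampled.value_counts().to_dict()}")
```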
```python
# Set up functions if needed for featurization
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


def create_fillna_processor(
    df, mean_features=None, median_features=None, mode_features=None
):
    """
    Create a ColumnTransformer that fills missing values in a DataFrame using different strategies
    based on the skewness of the numerical features and the specified feature lists.

    Parameters:
        df (pd.DataFrame): The input DataFrame.
        mean_features (list, optional): List of features to impute using the mean strategy. Defaults to None.
        median_features (list, optional): List of features to impute using the median strategy. Defaults to None.
        mode_features (list, optional): List of features to impute using the mode strategy. Defaults to None.

    Returns:
        ColumnTransformer: A fitted ColumnTransformer that can be used to transform the DataFrame.
        list: List of all features supported by SimpleImputer in the DataFrame.
        list: List of datetime features in the DataFrame.
    """
    if mean_features is None:
        mean_features = []
    if median_features is None:
        median_features = []
    if mode_features is None:
        mode_features = []
    all_features = mean_features + median_features + mode_features
    # Group features by their imputation needs: low-skew numeric columns get the
    # mean, high-skew numeric columns get the median, everything else the mode
    mean_features = [
        col
        for col in df.select_dtypes(include=["number"]).columns
        if df[col].skew(skipna=True) <= 1 and col not in all_features
    ] + mean_features
    median_features = [
        col
        for col in df.select_dtypes(include=["number"]).columns
        if df[col].skew(skipna=True) > 1 and col not in all_features
    ] + median_features
    all_features = mean_features + median_features
    datetime_features = df.select_dtypes(include=["datetime"]).columns.tolist()
    mode_features = [col for col in df.columns.tolist() if col not in all_features + datetime_features]

    transformers = []

    if mean_features:
        transformers.append(
            ("mean_imputer", SimpleImputer(strategy="mean"), mean_features)
        )
    if median_features:
        transformers.append(
            ("median_imputer", SimpleImputer(strategy="median"), median_features)
        )
    if mode_features:
        transformers.append(
            ("mode_imputer", SimpleImputer(strategy="most_frequent"), mode_features)
        )

    column_transformer = ColumnTransformer(transformers=transformers)
    all_features = mean_features + median_features + mode_features

    return column_transformer.fit(df), all_features, datetime_features


def fillna(df, processor, all_features, datetime_features):
    """
    Fill missing values in a DataFrame using a specified processor and mode imputation.

    Parameters:
        df (pd.DataFrame): The input DataFrame with missing values.
        processor (object): An object with a `transform` method that processes the DataFrame.
        all_features (list): List of all features supported by SimpleImputer in the DataFrame.
        datetime_features (list): List of datetime features in the DataFrame.

    Returns:
        pd.DataFrame: A DataFrame with missing values filled.
    """
    filled_array = processor.transform(df)
    filled_df = pd.DataFrame(filled_array, columns=all_features)
    if datetime_features:
        # Reset the index so the datetime columns align with the imputed array,
        # and assign the ffill result (ffill is not in-place)
        datetime_data = df[datetime_features].reset_index(drop=True).ffill()
        filled_df = pd.concat([datetime_data, filled_df], axis=1)
    for col in df.columns:
        # Fill any remaining gaps with the column's mode
        filled_df[col] = filled_df[col].fillna(filled_df[col].mode()[0])

    return filled_df
```
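To see what these helpers do before applying them to the real dataset, you can run them on a small, self-contained toy frame (the column values below are made up for illustration):

```python
# A small illustration of the helpers above on a hypothetical toy DataFrame:
# the numeric column gets mean/median imputation, the string column gets mode imputation.
import pandas as pd

toy = pd.DataFrame({
    "Count": [1.0, 2.0, None, 4.0],
    "Style": ["A", None, "A", "B"],
})

processor, features, dt_features = create_fillna_processor(toy)
print(fillna(toy, processor, features, dt_features))
```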
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Convert object columns to the nearest supported dtype
X = X.convert_dtypes()
X = X.dropna(axis=1, how='all')

# Select columns for model training
X = X.select_dtypes(include=['number', 'datetime', 'category'])

# You may need to update the test_size based on your scenario
X_train, X_test = train_test_split(X, test_size=0.2, random_state=41)

mean_features, median_features, mode_features = [], [], []

preprocessor, all_features, datetime_features = create_fillna_processor(X_train, mean_features, median_features, mode_features)
X_train = fillna(X_train, preprocessor, all_features, datetime_features)
X_test = fillna(X_test, preprocessor, all_features, datetime_features)

y_train = X_train.pop(target_col)
y_test = X_test.pop(target_col)

display(X_train[:10])
```

## Step 3: Use AutoML to find your best model

We will now use FLAML's AutoML to automatically find the best machine learning model for our data. AutoML (Automated Machine Learning) simplifies the model selection process by automatically testing and tuning various algorithms and configurations, helping us quickly identify the most effective model with minimal manual effort.

### Tracking results with experiments in Fabric

Experiments in Fabric let you track the results of your AutoML process, providing a comprehensive view of all the metrics and parameters from your trials.

```python
# MLflow logging setup
import mlflow

mlflow.autolog(exclusive=False)
mlflow.set_experiment("exp-test")
```

#### Configure the AutoML trial and settings

These configurations are driven by the AutoML mode and task selected in the wizard. For example, if you select "quick prototype", you'll see a setting for time budget.

```python
# Import the AutoML class from the FLAML package
import flaml
from flaml import AutoML

# Define AutoML settings
settings = {
    "time_budget": 120,  # Total running time in seconds
    "task": "binary",
    "log_file_name": "flaml_experiment.log",  # FLAML log file
    "eval_method": "cv",
    "n_splits": 3,
    "max_iter": 10,
    "force_cancel": True,
    "seed": 41,  # Random seed
    "mlflow_exp_name": "exp-test",  # MLflow experiment name
    "use_spark": True,  # Whether to use Spark for distributed training
    "n_concurrent_trials": 3,  # The maximum number of concurrent trials
    "verbose": 1,
    "featurization": "auto",
}

if flaml.__version__ > "2.3.3":
    settings["entrypoint"] = "low-code"

# Create an AutoML instance
automl = AutoML(**settings)
```

#### Run the AutoML trial

Run the AutoML trial, with all trials tracked as experiment runs. The trial is performed on the processed dataset, using the `price` variable as the target and applying the defined configurations for optimal model selection.

```python
with mlflow.start_run(nested=True, run_name="exp-test-AutoMLModel"):
    automl.fit(
        X_train=X_train,
        y_train=y_train,  # Target column of the training data
    )
```
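After the run completes, FLAML records the winner directly on the `automl` object. A quick, optional inspection using standard FLAML `AutoML` properties:

```python
# Optional inspection of the trial outcome (standard FLAML AutoML attributes)
print(f"Best estimator: {automl.best_estimator}")  # e.g. 'lgbm' or 'xgboost'
print(f"Best loss:      {automl.best_loss:.4f}")   # 1 - optimization metric
print(f"Best config:    {automl.best_config}")     # tuned hyperparameters
print(f"Training time:  {automl.best_config_train_time:.2f}s")
```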
## Step 4: Save the final machine learning model

Upon completing the AutoML trial, you can now save the final, tuned model as an ML model in Fabric.

```python
model_path = f"runs:/{automl.best_run_id}/model"

# Register the model in the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name="exp-test-AutoMLModel")

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")
```

## Step 5: Generate predictions

Microsoft Fabric lets you operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine. You can generate batch predictions directly from the Microsoft Fabric notebook or from a given ML model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict).
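As an aside before the `PREDICT` steps below: the registered model can also be loaded through MLflow's generic pyfunc API and scored against the pandas test set in the notebook session. A minimal sketch, assuming the registration above succeeded:

```python
# A minimal sketch using MLflow's pyfunc API as an alternative to PREDICT;
# scores the pandas test set directly in the notebook session.
import mlflow.pyfunc

loaded_model = mlflow.pyfunc.load_model(
    f"models:/exp-test-AutoMLModel/{registered_model.version}"
)
pandas_predictions = loaded_model.predict(X_test)
print(pandas_predictions[:10])
```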
1. Generate predictions.

```python
from synapse.ml.predict import MLFlowTransformer

model_name = "exp-test-AutoMLModel"
feature_cols = X_train.columns.to_list()
model = MLFlowTransformer(
    inputCols=feature_cols,
    outputCol=target_col,
    modelName=model_name,
    modelVersion=registered_model.version,
)

df_test = spark.createDataFrame(X_test)
batch_predictions = model.transform(df_test)
```

```python
display(batch_predictions)
```

2. Save the predictions to a table.

```python
saved_name = "2020orders_predictions".replace(".", "_")
batch_predictions.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save(f"Tables/{saved_name}")
```
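To confirm the write succeeded, you can read the Delta table back from the lakehouse; a small optional check:

```python
# Optional check: read the saved predictions back from the lakehouse table
saved_predictions = spark.read.format("delta").load(f"Tables/{saved_name}")
print(f"Saved {saved_predictions.count()} prediction rows.")
display(saved_predictions.limit(10))
```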
