From 1d201b8111e87ea2a3341b9f6069dd0b523ce65b Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:16:45 -0600
Subject: [PATCH 13/31] sample added
---
.../DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb | 1 +
1 file changed, 1 insertion(+)
create mode 100644 Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
diff --git a/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb b/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
new file mode 100644
index 0000000..31b6c55
--- /dev/null
+++ b/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
@@ -0,0 +1 @@
+{"cells":[{"cell_type":"markdown","source":["# Demonstration: Train a ML model with AutoML\n","\n","## Introduction\n","\n","This notebook is automatically generated by the Fabric low-code AutoML wizard based on your selections. Whether you're building a regression model, a classifier, or another machine-learning solution, this tool simplifies the process by transforming your goals into executable code. You can easily modify any settings or code snippets to better align with your requirements.\n","\n","### What is FLAML?\n","\n","[FLAML (Fast and Lightweight Automated Machine Learning)](https://aka.ms/fabric-automl) is an open-source AutoML library designed to quickly and efficiently find the best machine learning models and hyperparameters. FLAML optimizes for speed, accuracy, and cost, making it an excellent choice for a wide range of machine-learning tasks.\n","\n","### Steps in this notebook\n","\n","1. **Load the data**: Import your dataset.\n","2. **Generate features**: Automatically transform and preprocess your data to improve model performance.\n","3. **Use AutoML to find your best model**: Use FLAML to automatically select the most suitable model and optimize its parameters.\n","4. **Save the final machine learning model**: Store the trained model for future use.\n","5. **Generate predictions**: Use the saved model to predict outcomes on new data.\n","\n","> [!IMPORTANT]\n","> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**\n"],"metadata":{},"id":"d8d36bfe-0884-4c73-a24f-175233d98bdf"},{"cell_type":"code","source":["%pip install scikit-learn==1.5.1\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"592531fe-7a06-4837-a5eb-2650113cbf13"},{"cell_type":"markdown","source":["### Default notebook optimization\n","\n","This cell configures the logging and warning settings to reduce unnecessary output and focus on critical information. It suppresses specific warnings and logs from the underlying libraries, ensuring a cleaner and more readable notebook experience."],"metadata":{},"id":"14223c8d-f82a-44ef-a466-e03ebcc6b430"},{"cell_type":"code","source":["import logging\n","import warnings\n"," \n","logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)\n","logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)\n","warnings.simplefilter('ignore', category=FutureWarning)\n","warnings.simplefilter('ignore', category=UserWarning)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"9878de39-d1c1-485b-9058-e429715b5cd8"},{"cell_type":"markdown","source":["## Step 1: Load the Data\n","\n","This cell is responsible for importing the raw data from the specified source into the notebook environment. 
The data could come from various sources, such as a file or table in your lakehouse.\n","\n","Once loaded, this data will serve as the input for subsequent steps, such as data transformation, model training, and evaluation."],"metadata":{},"id":"67153540-7117-4adb-9766-b701ff7fc616"},{"cell_type":"code","source":["import re\n","import pandas as pd\n","import numpy as np\n","\n","df = spark.read.format(\"delta\").load(\n"," \"Tables/2020orders\"\n",").cache()\n","# Transform to pandas according to the selected models\n","X = df.limit(100000).toPandas() # Use df.toPandas() to use all the data\n","X = X.rename(columns = lambda c:re.sub('[^A-Za-z0-9_]+', '_', c)) # Replace not supported characters in column name with underscore to avoid invalid character for model training and saving\n","\n","target_col = re.sub('[^A-Za-z0-9_]+', '_', \"price\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"63113dbc-16ab-4932-97c2-b0f54cfe9b3f"},{"cell_type":"code","source":["display(X)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ae621756-f044-4553-8509-d64973d5d903"},{"cell_type":"markdown","source":["## Step 2: Generate features\n","\n","Featurization is the process of transforming raw data into a format optimized for training a machine learning model. It ensures the model can access the most relevant information, significantly impacting its accuracy and performance.\n","\n","This step applies various techniques to refine the data, enhance its quality, and make it compatible with the selected algorithms, helping the model learn patterns more effectively."],"metadata":{},"id":"761a4b6e-6698-4bd3-948c-3e5274efbaad"},{"cell_type":"code","source":["# Handle class imbalance\n","import matplotlib.pyplot as plt\n","\n","\n","distribution = X[target_col].value_counts(normalize=True)\n","dominant_class_proportion = distribution.max()\n","\n","distribution.plot(kind='bar')\n","plt.title(\"Class Distribution\")\n","plt.xlabel(\"Class\")\n","plt.ylabel(\"Proportion\")\n","plt.show()\n","\n","if dominant_class_proportion > 0.8:\n"," print(f\"The dataset is imbalanced. The dominant class has {dominant_class_proportion * 100:.2f}% of the samples.\")\n"," print(\"You may need to handle class imbalance before training the model.\")\n"," print(\"You can use techniques such as oversampling, undersampling, or using class weights to handle class imbalance.\")\n"," print(\"For more information, see https://aka.ms/smote-example\")\n","else:\n"," print(\"The dataset is balanced.\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"d7e7a55b-434d-42c3-b457-88b89dd57461"},{"cell_type":"code","source":["# Set Functions if needed for Featurization\n","def create_fillna_processor(\n"," df, mean_features=None, median_features=None, mode_features=None\n","):\n"," \"\"\"\n"," Create a ColumnTransformer that fills missing values in a DataFrame using different strategies\n"," based on the skewness of the numerical features and the specified feature lists.\n","\n"," Parameters:\n"," df (pd.DataFrame): The input DataFrame.\n"," mean_features (list, optional): List of features to impute using the mean strategy. Defaults to None.\n"," median_features (list, optional): List of features to impute using the median strategy. 
Defaults to None.\n"," mode_features (list, optional): List of features to impute using the mode strategy. Defaults to None.\n","\n"," Returns:\n"," ColumnTransformer: A fitted ColumnTransformer that can be used to transform the DataFrame.\n"," list: List of all features supported by SimpleImputer in the DataFrame.\n"," list: List of datetime features in the DataFrame.\n"," \"\"\"\n"," if mean_features is None:\n"," mean_features = []\n"," if median_features is None:\n"," median_features = []\n"," if mode_features is None:\n"," mode_features = []\n"," all_features = mean_features + median_features + mode_features\n"," # Group features by their imputation needs\n"," mean_features = [\n"," col\n"," for col in df.select_dtypes(include=[\"number\"]).columns\n"," if df[col].skew(skipna=True) <= 1 and col not in all_features\n"," ] + mean_features\n"," median_features = [\n"," col\n"," for col in df.select_dtypes(include=[\"number\"]).columns\n"," if df[col].skew(skipna=True) > 1 and col not in all_features\n"," ] + median_features\n"," all_features = mean_features + median_features\n"," datetime_features = df.select_dtypes(include=[\"datetime\"]).columns.tolist()\n"," mode_features = [col for col in df.columns.tolist() if col not in all_features + datetime_features]\n","\n"," transformers = []\n","\n"," if mean_features:\n"," transformers.append(\n"," (\"mean_imputer\", SimpleImputer(strategy=\"mean\"), mean_features)\n"," )\n"," if median_features:\n"," transformers.append(\n"," (\"median_imputer\", SimpleImputer(strategy=\"median\"), median_features)\n"," )\n"," if mode_features:\n"," transformers.append(\n"," (\"mode_imputer\", SimpleImputer(strategy=\"most_frequent\"), mode_features)\n"," )\n","\n"," column_transformer = ColumnTransformer(transformers=transformers)\n"," all_features = mean_features + median_features + mode_features\n","\n"," return column_transformer.fit(df), all_features, datetime_features\n","\n","\n","def fillna(df, processor, all_features, datetime_features):\n"," \"\"\"\n"," Fill missing values in a DataFrame using a specified processor and mode imputation.\n","\n"," Parameters:\n"," df (pd.DataFrame): The input DataFrame with missing values.\n"," processor (object): An object with a `transform` method that processes the DataFrame.\n"," all_features (list): List of all features supported by SimpleImputer in the DataFrame.\n"," datetime_features (list): List of datetime features in the DataFrame.\n","\n"," Returns:\n"," pd.DataFrame: A DataFrame with missing values filled.\n"," \"\"\"\n"," filled_array = processor.transform(df)\n"," filled_df = pd.DataFrame(filled_array, columns=all_features)\n"," if datetime_features:\n"," datetime_data = df[datetime_features]\n"," datetime_data.ffill()\n"," filled_df = pd.concat([datetime_data, filled_df], axis=1)\n"," for col in df.columns:\n"," filled_df[col].fillna(filled_df[col].mode()[0], inplace=True)\n","\n"," return filled_df\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"e37b9c43-7220-4b0c-9fa3-6ad9226dc85e"},{"cell_type":"code","source":["from sklearn.pipeline import Pipeline\n","from sklearn.impute import SimpleImputer\n","from sklearn.compose import ColumnTransformer\n","\n","\n","# convert object type to nearest dtype\n","X = X.convert_dtypes()\n","X = X.dropna(axis=1, how='all')\n","\n","# select columns for model training\n","X = X.select_dtypes(include=['number', 'datetime', 'category'])\n","\n","from sklearn.model_selection import 
train_test_split\n","\n","# You may need to update the test_size based on your scenario\n","X_train, X_test = train_test_split(X, test_size=0.2, random_state=41)\n","\n","mean_features, median_features, mode_features = [], [], []\n"," \n","preprocessor, all_features, datetime_features = create_fillna_processor(X_train, mean_features, median_features, mode_features)\n","X_train = fillna(X_train, preprocessor, all_features, datetime_features)\n","X_test = fillna(X_test, preprocessor, all_features, datetime_features)\n"," \n","y_train = X_train.pop(target_col)\n","y_test = X_test.pop(target_col)\n","\n","display(X_train[:10])\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"c9c6728f-7385-4c76-8284-6708d67bc5c7"},{"cell_type":"markdown","source":["## Step 3: Use AutoML to find your best model\n","\n","We will now use FLAML's AutoML to automatically find the best machine learning model for our data. AutoML (Automated Machine Learning) simplifies the model selection process by automatically testing and tuning various algorithms and configurations, helping us quickly identify the most effective model with minimal manual effort."],"metadata":{},"id":"3b4c43b4-9416-43d9-9ed8-a8d32858250d"},{"cell_type":"markdown","source":["### Tracking results with experiments in Fabric\n","\n","Experiments in Fabric let you track the results of your AutoML process, providing a comprehensive view of all the metrics and parameters from your trials."],"metadata":{},"id":"f287fb60-1e24-45f9-9493-1c563c797702"},{"cell_type":"code","source":["# MLFlow Logging Related\n","\n","import mlflow\n","\n","mlflow.autolog(exclusive=False)\n","mlflow.set_experiment(\"exp-test\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ff2e3568-ce88-4a63-8bf8-c768a6cfdc3c"},{"cell_type":"markdown","source":["#### Configure the AutoML trial and settings\n","\n","These configurations are driven by the AutoML mode and task selected in the wizard. For example, if you select \"quick prototype\", you'll see a setting for time budget."],"metadata":{},"id":"4f02f65d-bc49-4090-b00b-2bb28d59e754"},{"cell_type":"code","source":["# Import the AutoML class from the FLAML package\n","import flaml\n","from flaml import AutoML\n","\n","# Define AutoML settings\n","settings = {\n"," \"time_budget\": 120, # Total running time in seconds\n"," \"task\": \"binary\", \n"," \"log_file_name\": \"flaml_experiment.log\", # FLAML log file\n"," \"eval_method\": \"cv\",\n"," \"n_splits\": 3,\n"," \"max_iter\": 10, \n"," \"force_cancel\": True, \n"," \"seed\": 41 , # Random seed \n"," \"mlflow_exp_name\": \"exp-test\", # MLflow experiment name\n"," \"use_spark\": True, # whether to use Spark for distributed training\n"," \"n_concurrent_trials\": 3, # the maximum number of concurrent trials \n"," \"verbose\": 1, \n"," \"featurization\": \"auto\", \n","}\n","\n","if flaml.__version__ > \"2.3.3\":\n"," settings[\"entrypoint\"] = \"low-code\"\n","\n","# Create an AutoML instance\n","automl = AutoML(**settings)\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"d05dcde3-bf5f-43c5-a6fa-01e0a07affab"},{"cell_type":"markdown","source":["#### Run the AutoML trial\n","\n","Run the AutoML trial, with all trials being tracked as experiment runs. 
The trial is performed on the processed dataset, using the `Exited` variable as the target, and applying the defined configurations for optimal model selection."],"metadata":{},"id":"fc13e255-3bfb-4b54-9337-7f0fd070dbbc"},{"cell_type":"code","source":["with mlflow.start_run(nested=True, run_name=\"exp-test-AutoMLModel\"):\n"," automl.fit(\n"," X_train=X_train, \n"," y_train=y_train, # target column of the training data \n"," )"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"6c995371-878a-40be-a6ca-106181976ace"},{"cell_type":"markdown","source":["## Step 4: Save the final machine learning model\n","\n","Upon completing the AutoML trial, you can now save the final, tuned model as an ML model in Fabric."],"metadata":{},"id":"0d052eef-0756-411e-8ab2-7fabd7a6076a"},{"cell_type":"code","source":["model_path = f\"runs:/{automl.best_run_id}/model\"\n","\n","# Register the model to the MLflow registry\n","registered_model = mlflow.register_model(model_uri=model_path, name=\"exp-test-AutoMLModel\")\n","\n","# Print the registered model's name and version\n","print(f\"Model '{registered_model.name}' version {registered_model.version} registered successfully.\")"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"2ce45e61-6094-4faa-9c9a-e6350bc4de6b"},{"cell_type":"markdown","source":["## Step 5: Generate predictions"],"metadata":{},"id":"b628aab7-22c6-47e6-8b79-a7767b519830"},{"cell_type":"markdown","source":["Microsoft Fabric lets you operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine. You can generate batch predictions directly from the Microsoft Fabric notebook or from a given ML model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict)."],"metadata":{},"id":"993e8880-f55e-438c-8d2d-fb7215e63c63"},{"cell_type":"markdown","source":["1. Generate predictions."],"metadata":{},"id":"aa12ec97-d582-4a43-88c3-ddde42b7b44b"},{"cell_type":"code","source":["model_name = \"exp-test-AutoMLModel\"\n","from synapse.ml.predict import MLFlowTransformer\n","\n","feature_cols = X_train.columns.to_list()\n","model = MLFlowTransformer(\n"," inputCols=feature_cols,\n"," outputCol=target_col,\n"," modelName=model_name,\n"," modelVersion=registered_model.version,\n",")\n","\n","df_test = spark.createDataFrame(X_test)\n","batch_predictions = model.transform(df_test)\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"3c6f2b3a-ad30-4cf3-9740-9da5b90a859e"},{"cell_type":"code","source":["display(batch_predictions)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"1af8b16c-cdb4-4add-8df5-5c179fffdb95"},{"cell_type":"markdown","source":["2. 
Save the predictions to a table."],"metadata":{},"id":"2642ffad-253b-4ea9-ac34-9ad0c3690f34"},{"cell_type":"code","source":["saved_name = \"2020orders_predictions\".replace(\".\", \"_\")\n","batch_predictions.write.mode(\"overwrite\").format(\"delta\").option(\"overwriteSchema\", \"true\").save(f\"Tables/{saved_name}\")"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"fb16d367-0570-427c-a04a-2980b6e5d014"}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"automl_config":{"lakehouseInfo":{"lakehouseName":"lake_samples","lakehouseId":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4","workspaceId":"98ea70b8-712f-49ac-9250-d737780bb594","state":"ready","errMsg":""},"tableInfo":{"type":"table","tableInfo":{"name":"2020orders","fullAbfsPath":"abfss://98ea70b8-712f-49ac-9250-d737780bb594@onelake.dfs.fabric.microsoft.com/3b406a22-8d06-40ef-9f97-8c2ab976f7a4/Tables/2020orders","type":"MANAGED","format":"","isDeltaTable":true,"relativePath":"Tables/2020orders"},"columns":[{"name":"ID","type":"string","nullable":true},{"name":"Count","type":"integer","nullable":true},{"name":"Date","type":"string","nullable":true},{"name":"Name","type":"string","nullable":true},{"name":"Style","type":"string","nullable":true},{"name":"price","type":"double","nullable":true},{"name":"tax","type":"double","nullable":true}]},"trainData":{"predictColumn":"price","enableFeaturization":true,"mappingColumns":[{"name":"ID","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Count","type":"integer","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Date","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Name","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Style","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"price","type":"double","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"tax","type":"double","nullable":true,"valueType":"Auto","imputationMethod":"Auto"}]},"mlModel":{"task":"Binary Classification","mode":"QuickProto","duration":"-1","metric":"","endEarly":false},"finalDetails":{"parallelizationMethod":"trainMultiple","notebookName":"AutoML Sample Test - Demo ","experimentName":"exp-test","modelName":"exp-test-AutoMLModel","model":{"modelSelection":"","modelInput":"exp-test-AutoMLModel","modelType":"CreateNew"}},"step":5},"microsoft":{"language":"python","language_group":"synapse_pyspark","ms_spell_check":{"ms_spell_check_language":"en"}},"nteract":{"version":"nteract-front-end@1.0.0"},"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{"spark.synapse.nbs.session.timeout":"1200000"}}},"dependencies":{"lakehouse":{"default_lakehouse":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4","default_lakehouse_name":"lake_samples","known_lakehouses":[{"id":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4"}],"default_lakehouse_workspace_id":"98ea70b8-712f-49ac-9250-d737780bb594"}}},"nbformat":4,"nbformat_minor":5}
\ No newline at end of file
From 332b65368e2656c12b5fa59a04dfe91855cd4828 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Sat, 3 May 2025 05:17:10 +0000
Subject: [PATCH 14/31] Fix notebook format issues
---
.../How_AutoML/Train_MLmodel_AutoML.ipynb | 733 +++++++++++++++++-
1 file changed, 732 insertions(+), 1 deletion(-)
diff --git a/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb b/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
index 31b6c55..6d80469 100644
--- a/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
+++ b/Workloads-Specific/DataScience/How_AutoML/Train_MLmodel_AutoML.ipynb
@@ -1 +1,732 @@
-{"cells":[{"cell_type":"markdown","source":["# Demonstration: Train a ML model with AutoML\n","\n","## Introduction\n","\n","This notebook is automatically generated by the Fabric low-code AutoML wizard based on your selections. Whether you're building a regression model, a classifier, or another machine-learning solution, this tool simplifies the process by transforming your goals into executable code. You can easily modify any settings or code snippets to better align with your requirements.\n","\n","### What is FLAML?\n","\n","[FLAML (Fast and Lightweight Automated Machine Learning)](https://aka.ms/fabric-automl) is an open-source AutoML library designed to quickly and efficiently find the best machine learning models and hyperparameters. FLAML optimizes for speed, accuracy, and cost, making it an excellent choice for a wide range of machine-learning tasks.\n","\n","### Steps in this notebook\n","\n","1. **Load the data**: Import your dataset.\n","2. **Generate features**: Automatically transform and preprocess your data to improve model performance.\n","3. **Use AutoML to find your best model**: Use FLAML to automatically select the most suitable model and optimize its parameters.\n","4. **Save the final machine learning model**: Store the trained model for future use.\n","5. **Generate predictions**: Use the saved model to predict outcomes on new data.\n","\n","> [!IMPORTANT]\n","> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**\n"],"metadata":{},"id":"d8d36bfe-0884-4c73-a24f-175233d98bdf"},{"cell_type":"code","source":["%pip install scikit-learn==1.5.1\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"592531fe-7a06-4837-a5eb-2650113cbf13"},{"cell_type":"markdown","source":["### Default notebook optimization\n","\n","This cell configures the logging and warning settings to reduce unnecessary output and focus on critical information. It suppresses specific warnings and logs from the underlying libraries, ensuring a cleaner and more readable notebook experience."],"metadata":{},"id":"14223c8d-f82a-44ef-a466-e03ebcc6b430"},{"cell_type":"code","source":["import logging\n","import warnings\n"," \n","logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)\n","logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)\n","warnings.simplefilter('ignore', category=FutureWarning)\n","warnings.simplefilter('ignore', category=UserWarning)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"9878de39-d1c1-485b-9058-e429715b5cd8"},{"cell_type":"markdown","source":["## Step 1: Load the Data\n","\n","This cell is responsible for importing the raw data from the specified source into the notebook environment. 
The data could come from various sources, such as a file or table in your lakehouse.\n","\n","Once loaded, this data will serve as the input for subsequent steps, such as data transformation, model training, and evaluation."],"metadata":{},"id":"67153540-7117-4adb-9766-b701ff7fc616"},{"cell_type":"code","source":["import re\n","import pandas as pd\n","import numpy as np\n","\n","df = spark.read.format(\"delta\").load(\n"," \"Tables/2020orders\"\n",").cache()\n","# Transform to pandas according to the selected models\n","X = df.limit(100000).toPandas() # Use df.toPandas() to use all the data\n","X = X.rename(columns = lambda c:re.sub('[^A-Za-z0-9_]+', '_', c)) # Replace not supported characters in column name with underscore to avoid invalid character for model training and saving\n","\n","target_col = re.sub('[^A-Za-z0-9_]+', '_', \"price\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"63113dbc-16ab-4932-97c2-b0f54cfe9b3f"},{"cell_type":"code","source":["display(X)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ae621756-f044-4553-8509-d64973d5d903"},{"cell_type":"markdown","source":["## Step 2: Generate features\n","\n","Featurization is the process of transforming raw data into a format optimized for training a machine learning model. It ensures the model can access the most relevant information, significantly impacting its accuracy and performance.\n","\n","This step applies various techniques to refine the data, enhance its quality, and make it compatible with the selected algorithms, helping the model learn patterns more effectively."],"metadata":{},"id":"761a4b6e-6698-4bd3-948c-3e5274efbaad"},{"cell_type":"code","source":["# Handle class imbalance\n","import matplotlib.pyplot as plt\n","\n","\n","distribution = X[target_col].value_counts(normalize=True)\n","dominant_class_proportion = distribution.max()\n","\n","distribution.plot(kind='bar')\n","plt.title(\"Class Distribution\")\n","plt.xlabel(\"Class\")\n","plt.ylabel(\"Proportion\")\n","plt.show()\n","\n","if dominant_class_proportion > 0.8:\n"," print(f\"The dataset is imbalanced. The dominant class has {dominant_class_proportion * 100:.2f}% of the samples.\")\n"," print(\"You may need to handle class imbalance before training the model.\")\n"," print(\"You can use techniques such as oversampling, undersampling, or using class weights to handle class imbalance.\")\n"," print(\"For more information, see https://aka.ms/smote-example\")\n","else:\n"," print(\"The dataset is balanced.\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"d7e7a55b-434d-42c3-b457-88b89dd57461"},{"cell_type":"code","source":["# Set Functions if needed for Featurization\n","def create_fillna_processor(\n"," df, mean_features=None, median_features=None, mode_features=None\n","):\n"," \"\"\"\n"," Create a ColumnTransformer that fills missing values in a DataFrame using different strategies\n"," based on the skewness of the numerical features and the specified feature lists.\n","\n"," Parameters:\n"," df (pd.DataFrame): The input DataFrame.\n"," mean_features (list, optional): List of features to impute using the mean strategy. Defaults to None.\n"," median_features (list, optional): List of features to impute using the median strategy. 
Defaults to None.\n"," mode_features (list, optional): List of features to impute using the mode strategy. Defaults to None.\n","\n"," Returns:\n"," ColumnTransformer: A fitted ColumnTransformer that can be used to transform the DataFrame.\n"," list: List of all features supported by SimpleImputer in the DataFrame.\n"," list: List of datetime features in the DataFrame.\n"," \"\"\"\n"," if mean_features is None:\n"," mean_features = []\n"," if median_features is None:\n"," median_features = []\n"," if mode_features is None:\n"," mode_features = []\n"," all_features = mean_features + median_features + mode_features\n"," # Group features by their imputation needs\n"," mean_features = [\n"," col\n"," for col in df.select_dtypes(include=[\"number\"]).columns\n"," if df[col].skew(skipna=True) <= 1 and col not in all_features\n"," ] + mean_features\n"," median_features = [\n"," col\n"," for col in df.select_dtypes(include=[\"number\"]).columns\n"," if df[col].skew(skipna=True) > 1 and col not in all_features\n"," ] + median_features\n"," all_features = mean_features + median_features\n"," datetime_features = df.select_dtypes(include=[\"datetime\"]).columns.tolist()\n"," mode_features = [col for col in df.columns.tolist() if col not in all_features + datetime_features]\n","\n"," transformers = []\n","\n"," if mean_features:\n"," transformers.append(\n"," (\"mean_imputer\", SimpleImputer(strategy=\"mean\"), mean_features)\n"," )\n"," if median_features:\n"," transformers.append(\n"," (\"median_imputer\", SimpleImputer(strategy=\"median\"), median_features)\n"," )\n"," if mode_features:\n"," transformers.append(\n"," (\"mode_imputer\", SimpleImputer(strategy=\"most_frequent\"), mode_features)\n"," )\n","\n"," column_transformer = ColumnTransformer(transformers=transformers)\n"," all_features = mean_features + median_features + mode_features\n","\n"," return column_transformer.fit(df), all_features, datetime_features\n","\n","\n","def fillna(df, processor, all_features, datetime_features):\n"," \"\"\"\n"," Fill missing values in a DataFrame using a specified processor and mode imputation.\n","\n"," Parameters:\n"," df (pd.DataFrame): The input DataFrame with missing values.\n"," processor (object): An object with a `transform` method that processes the DataFrame.\n"," all_features (list): List of all features supported by SimpleImputer in the DataFrame.\n"," datetime_features (list): List of datetime features in the DataFrame.\n","\n"," Returns:\n"," pd.DataFrame: A DataFrame with missing values filled.\n"," \"\"\"\n"," filled_array = processor.transform(df)\n"," filled_df = pd.DataFrame(filled_array, columns=all_features)\n"," if datetime_features:\n"," datetime_data = df[datetime_features]\n"," datetime_data.ffill()\n"," filled_df = pd.concat([datetime_data, filled_df], axis=1)\n"," for col in df.columns:\n"," filled_df[col].fillna(filled_df[col].mode()[0], inplace=True)\n","\n"," return filled_df\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"e37b9c43-7220-4b0c-9fa3-6ad9226dc85e"},{"cell_type":"code","source":["from sklearn.pipeline import Pipeline\n","from sklearn.impute import SimpleImputer\n","from sklearn.compose import ColumnTransformer\n","\n","\n","# convert object type to nearest dtype\n","X = X.convert_dtypes()\n","X = X.dropna(axis=1, how='all')\n","\n","# select columns for model training\n","X = X.select_dtypes(include=['number', 'datetime', 'category'])\n","\n","from sklearn.model_selection import 
train_test_split\n","\n","# You may need to update the test_size based on your scenario\n","X_train, X_test = train_test_split(X, test_size=0.2, random_state=41)\n","\n","mean_features, median_features, mode_features = [], [], []\n"," \n","preprocessor, all_features, datetime_features = create_fillna_processor(X_train, mean_features, median_features, mode_features)\n","X_train = fillna(X_train, preprocessor, all_features, datetime_features)\n","X_test = fillna(X_test, preprocessor, all_features, datetime_features)\n"," \n","y_train = X_train.pop(target_col)\n","y_test = X_test.pop(target_col)\n","\n","display(X_train[:10])\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"c9c6728f-7385-4c76-8284-6708d67bc5c7"},{"cell_type":"markdown","source":["## Step 3: Use AutoML to find your best model\n","\n","We will now use FLAML's AutoML to automatically find the best machine learning model for our data. AutoML (Automated Machine Learning) simplifies the model selection process by automatically testing and tuning various algorithms and configurations, helping us quickly identify the most effective model with minimal manual effort."],"metadata":{},"id":"3b4c43b4-9416-43d9-9ed8-a8d32858250d"},{"cell_type":"markdown","source":["### Tracking results with experiments in Fabric\n","\n","Experiments in Fabric let you track the results of your AutoML process, providing a comprehensive view of all the metrics and parameters from your trials."],"metadata":{},"id":"f287fb60-1e24-45f9-9493-1c563c797702"},{"cell_type":"code","source":["# MLFlow Logging Related\n","\n","import mlflow\n","\n","mlflow.autolog(exclusive=False)\n","mlflow.set_experiment(\"exp-test\")\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"ff2e3568-ce88-4a63-8bf8-c768a6cfdc3c"},{"cell_type":"markdown","source":["#### Configure the AutoML trial and settings\n","\n","These configurations are driven by the AutoML mode and task selected in the wizard. For example, if you select \"quick prototype\", you'll see a setting for time budget."],"metadata":{},"id":"4f02f65d-bc49-4090-b00b-2bb28d59e754"},{"cell_type":"code","source":["# Import the AutoML class from the FLAML package\n","import flaml\n","from flaml import AutoML\n","\n","# Define AutoML settings\n","settings = {\n"," \"time_budget\": 120, # Total running time in seconds\n"," \"task\": \"binary\", \n"," \"log_file_name\": \"flaml_experiment.log\", # FLAML log file\n"," \"eval_method\": \"cv\",\n"," \"n_splits\": 3,\n"," \"max_iter\": 10, \n"," \"force_cancel\": True, \n"," \"seed\": 41 , # Random seed \n"," \"mlflow_exp_name\": \"exp-test\", # MLflow experiment name\n"," \"use_spark\": True, # whether to use Spark for distributed training\n"," \"n_concurrent_trials\": 3, # the maximum number of concurrent trials \n"," \"verbose\": 1, \n"," \"featurization\": \"auto\", \n","}\n","\n","if flaml.__version__ > \"2.3.3\":\n"," settings[\"entrypoint\"] = \"low-code\"\n","\n","# Create an AutoML instance\n","automl = AutoML(**settings)\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"d05dcde3-bf5f-43c5-a6fa-01e0a07affab"},{"cell_type":"markdown","source":["#### Run the AutoML trial\n","\n","Run the AutoML trial, with all trials being tracked as experiment runs. 
The trial is performed on the processed dataset, using the `Exited` variable as the target, and applying the defined configurations for optimal model selection."],"metadata":{},"id":"fc13e255-3bfb-4b54-9337-7f0fd070dbbc"},{"cell_type":"code","source":["with mlflow.start_run(nested=True, run_name=\"exp-test-AutoMLModel\"):\n"," automl.fit(\n"," X_train=X_train, \n"," y_train=y_train, # target column of the training data \n"," )"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"6c995371-878a-40be-a6ca-106181976ace"},{"cell_type":"markdown","source":["## Step 4: Save the final machine learning model\n","\n","Upon completing the AutoML trial, you can now save the final, tuned model as an ML model in Fabric."],"metadata":{},"id":"0d052eef-0756-411e-8ab2-7fabd7a6076a"},{"cell_type":"code","source":["model_path = f\"runs:/{automl.best_run_id}/model\"\n","\n","# Register the model to the MLflow registry\n","registered_model = mlflow.register_model(model_uri=model_path, name=\"exp-test-AutoMLModel\")\n","\n","# Print the registered model's name and version\n","print(f\"Model '{registered_model.name}' version {registered_model.version} registered successfully.\")"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"2ce45e61-6094-4faa-9c9a-e6350bc4de6b"},{"cell_type":"markdown","source":["## Step 5: Generate predictions"],"metadata":{},"id":"b628aab7-22c6-47e6-8b79-a7767b519830"},{"cell_type":"markdown","source":["Microsoft Fabric lets you operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine. You can generate batch predictions directly from the Microsoft Fabric notebook or from a given ML model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict)."],"metadata":{},"id":"993e8880-f55e-438c-8d2d-fb7215e63c63"},{"cell_type":"markdown","source":["1. Generate predictions."],"metadata":{},"id":"aa12ec97-d582-4a43-88c3-ddde42b7b44b"},{"cell_type":"code","source":["model_name = \"exp-test-AutoMLModel\"\n","from synapse.ml.predict import MLFlowTransformer\n","\n","feature_cols = X_train.columns.to_list()\n","model = MLFlowTransformer(\n"," inputCols=feature_cols,\n"," outputCol=target_col,\n"," modelName=model_name,\n"," modelVersion=registered_model.version,\n",")\n","\n","df_test = spark.createDataFrame(X_test)\n","batch_predictions = model.transform(df_test)\n"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"3c6f2b3a-ad30-4cf3-9740-9da5b90a859e"},{"cell_type":"code","source":["display(batch_predictions)"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"1af8b16c-cdb4-4add-8df5-5c179fffdb95"},{"cell_type":"markdown","source":["2. 
Save the predictions to a table."],"metadata":{},"id":"2642ffad-253b-4ea9-ac34-9ad0c3690f34"},{"cell_type":"code","source":["saved_name = \"2020orders_predictions\".replace(\".\", \"_\")\n","batch_predictions.write.mode(\"overwrite\").format(\"delta\").option(\"overwriteSchema\", \"true\").save(f\"Tables/{saved_name}\")"],"outputs":[],"execution_count":null,"metadata":{"microsoft":{"language":"python","language_group":"synapse_pyspark"}},"id":"fb16d367-0570-427c-a04a-2980b6e5d014"}],"metadata":{"kernel_info":{"name":"synapse_pyspark"},"kernelspec":{"display_name":"Synapse PySpark","language":"Python","name":"synapse_pyspark"},"language_info":{"name":"python"},"automl_config":{"lakehouseInfo":{"lakehouseName":"lake_samples","lakehouseId":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4","workspaceId":"98ea70b8-712f-49ac-9250-d737780bb594","state":"ready","errMsg":""},"tableInfo":{"type":"table","tableInfo":{"name":"2020orders","fullAbfsPath":"abfss://98ea70b8-712f-49ac-9250-d737780bb594@onelake.dfs.fabric.microsoft.com/3b406a22-8d06-40ef-9f97-8c2ab976f7a4/Tables/2020orders","type":"MANAGED","format":"","isDeltaTable":true,"relativePath":"Tables/2020orders"},"columns":[{"name":"ID","type":"string","nullable":true},{"name":"Count","type":"integer","nullable":true},{"name":"Date","type":"string","nullable":true},{"name":"Name","type":"string","nullable":true},{"name":"Style","type":"string","nullable":true},{"name":"price","type":"double","nullable":true},{"name":"tax","type":"double","nullable":true}]},"trainData":{"predictColumn":"price","enableFeaturization":true,"mappingColumns":[{"name":"ID","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Count","type":"integer","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Date","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Name","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"Style","type":"string","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"price","type":"double","nullable":true,"valueType":"Auto","imputationMethod":"Auto"},{"name":"tax","type":"double","nullable":true,"valueType":"Auto","imputationMethod":"Auto"}]},"mlModel":{"task":"Binary Classification","mode":"QuickProto","duration":"-1","metric":"","endEarly":false},"finalDetails":{"parallelizationMethod":"trainMultiple","notebookName":"AutoML Sample Test - Demo ","experimentName":"exp-test","modelName":"exp-test-AutoMLModel","model":{"modelSelection":"","modelInput":"exp-test-AutoMLModel","modelType":"CreateNew"}},"step":5},"microsoft":{"language":"python","language_group":"synapse_pyspark","ms_spell_check":{"ms_spell_check_language":"en"}},"nteract":{"version":"nteract-front-end@1.0.0"},"spark_compute":{"compute_id":"/trident/default","session_options":{"conf":{"spark.synapse.nbs.session.timeout":"1200000"}}},"dependencies":{"lakehouse":{"default_lakehouse":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4","default_lakehouse_name":"lake_samples","known_lakehouses":[{"id":"3b406a22-8d06-40ef-9f97-8c2ab976f7a4"}],"default_lakehouse_workspace_id":"98ea70b8-712f-49ac-9250-d737780bb594"}}},"nbformat":4,"nbformat_minor":5}
\ No newline at end of file
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "d8d36bfe-0884-4c73-a24f-175233d98bdf",
+ "metadata": {},
+ "source": [
+ "# Demonstration: Train a ML model with AutoML\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "This notebook is automatically generated by the Fabric low-code AutoML wizard based on your selections. Whether you're building a regression model, a classifier, or another machine-learning solution, this tool simplifies the process by transforming your goals into executable code. You can easily modify any settings or code snippets to better align with your requirements.\n",
+ "\n",
+ "### What is FLAML?\n",
+ "\n",
+ "[FLAML (Fast and Lightweight Automated Machine Learning)](https://aka.ms/fabric-automl) is an open-source AutoML library designed to quickly and efficiently find the best machine learning models and hyperparameters. FLAML optimizes for speed, accuracy, and cost, making it an excellent choice for a wide range of machine-learning tasks.\n",
+ "\n",
+ "### Steps in this notebook\n",
+ "\n",
+ "1. **Load the data**: Import your dataset.\n",
+ "2. **Generate features**: Automatically transform and preprocess your data to improve model performance.\n",
+ "3. **Use AutoML to find your best model**: Use FLAML to automatically select the most suitable model and optimize its parameters.\n",
+ "4. **Save the final machine learning model**: Store the trained model for future use.\n",
+ "5. **Generate predictions**: Use the saved model to predict outcomes on new data.\n",
+ "\n",
+ "> [!IMPORTANT]\n",
+ "> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "592531fe-7a06-4837-a5eb-2650113cbf13",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%pip install scikit-learn==1.5.1\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "14223c8d-f82a-44ef-a466-e03ebcc6b430",
+ "metadata": {},
+ "source": [
+ "### Default notebook optimization\n",
+ "\n",
+ "This cell configures the logging and warning settings to reduce unnecessary output and focus on critical information. It suppresses specific warnings and logs from the underlying libraries, ensuring a cleaner and more readable notebook experience."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9878de39-d1c1-485b-9058-e429715b5cd8",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import logging\n",
+ "import warnings\n",
+ " \n",
+ "logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)\n",
+ "logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)\n",
+ "warnings.simplefilter('ignore', category=FutureWarning)\n",
+ "warnings.simplefilter('ignore', category=UserWarning)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "67153540-7117-4adb-9766-b701ff7fc616",
+ "metadata": {},
+ "source": [
+ "## Step 1: Load the Data\n",
+ "\n",
+ "This cell is responsible for importing the raw data from the specified source into the notebook environment. The data could come from various sources, such as a file or table in your lakehouse.\n",
+ "\n",
+ "Once loaded, this data will serve as the input for subsequent steps, such as data transformation, model training, and evaluation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "63113dbc-16ab-4932-97c2-b0f54cfe9b3f",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import re\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "df = spark.read.format(\"delta\").load(\n",
+ " \"Tables/2020orders\"\n",
+ ").cache()\n",
+ "# Transform to pandas according to the selected models\n",
+ "X = df.limit(100000).toPandas() # Use df.toPandas() to use all the data\n",
+ "X = X.rename(columns = lambda c:re.sub('[^A-Za-z0-9_]+', '_', c)) # Replace not supported characters in column name with underscore to avoid invalid character for model training and saving\n",
+ "\n",
+ "target_col = re.sub('[^A-Za-z0-9_]+', '_', \"price\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ae621756-f044-4553-8509-d64973d5d903",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "display(X)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "761a4b6e-6698-4bd3-948c-3e5274efbaad",
+ "metadata": {},
+ "source": [
+ "## Step 2: Generate features\n",
+ "\n",
+ "Featurization is the process of transforming raw data into a format optimized for training a machine learning model. It ensures the model can access the most relevant information, significantly impacting its accuracy and performance.\n",
+ "\n",
+ "This step applies various techniques to refine the data, enhance its quality, and make it compatible with the selected algorithms, helping the model learn patterns more effectively."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d7e7a55b-434d-42c3-b457-88b89dd57461",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Handle class imbalance\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "\n",
+ "distribution = X[target_col].value_counts(normalize=True)\n",
+ "dominant_class_proportion = distribution.max()\n",
+ "\n",
+ "distribution.plot(kind='bar')\n",
+ "plt.title(\"Class Distribution\")\n",
+ "plt.xlabel(\"Class\")\n",
+ "plt.ylabel(\"Proportion\")\n",
+ "plt.show()\n",
+ "\n",
+ "if dominant_class_proportion > 0.8:\n",
+ " print(f\"The dataset is imbalanced. The dominant class has {dominant_class_proportion * 100:.2f}% of the samples.\")\n",
+ " print(\"You may need to handle class imbalance before training the model.\")\n",
+ " print(\"You can use techniques such as oversampling, undersampling, or using class weights to handle class imbalance.\")\n",
+ " print(\"For more information, see https://aka.ms/smote-example\")\n",
+ "else:\n",
+ " print(\"The dataset is balanced.\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e37b9c43-7220-4b0c-9fa3-6ad9226dc85e",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Set Functions if needed for Featurization\n",
+ "def create_fillna_processor(\n",
+ " df, mean_features=None, median_features=None, mode_features=None\n",
+ "):\n",
+ " \"\"\"\n",
+ " Create a ColumnTransformer that fills missing values in a DataFrame using different strategies\n",
+ " based on the skewness of the numerical features and the specified feature lists.\n",
+ "\n",
+ " Parameters:\n",
+ " df (pd.DataFrame): The input DataFrame.\n",
+ " mean_features (list, optional): List of features to impute using the mean strategy. Defaults to None.\n",
+ " median_features (list, optional): List of features to impute using the median strategy. Defaults to None.\n",
+ " mode_features (list, optional): List of features to impute using the mode strategy. Defaults to None.\n",
+ "\n",
+ " Returns:\n",
+ " ColumnTransformer: A fitted ColumnTransformer that can be used to transform the DataFrame.\n",
+ " list: List of all features supported by SimpleImputer in the DataFrame.\n",
+ " list: List of datetime features in the DataFrame.\n",
+ " \"\"\"\n",
+ " if mean_features is None:\n",
+ " mean_features = []\n",
+ " if median_features is None:\n",
+ " median_features = []\n",
+ " if mode_features is None:\n",
+ " mode_features = []\n",
+ " all_features = mean_features + median_features + mode_features\n",
+ " # Group features by their imputation needs\n",
+ " mean_features = [\n",
+ " col\n",
+ " for col in df.select_dtypes(include=[\"number\"]).columns\n",
+ " if df[col].skew(skipna=True) <= 1 and col not in all_features\n",
+ " ] + mean_features\n",
+ " median_features = [\n",
+ " col\n",
+ " for col in df.select_dtypes(include=[\"number\"]).columns\n",
+ " if df[col].skew(skipna=True) > 1 and col not in all_features\n",
+ " ] + median_features\n",
+ " all_features = mean_features + median_features\n",
+ " datetime_features = df.select_dtypes(include=[\"datetime\"]).columns.tolist()\n",
+ " mode_features = [col for col in df.columns.tolist() if col not in all_features + datetime_features]\n",
+ "\n",
+ " transformers = []\n",
+ "\n",
+ " if mean_features:\n",
+ " transformers.append(\n",
+ " (\"mean_imputer\", SimpleImputer(strategy=\"mean\"), mean_features)\n",
+ " )\n",
+ " if median_features:\n",
+ " transformers.append(\n",
+ " (\"median_imputer\", SimpleImputer(strategy=\"median\"), median_features)\n",
+ " )\n",
+ " if mode_features:\n",
+ " transformers.append(\n",
+ " (\"mode_imputer\", SimpleImputer(strategy=\"most_frequent\"), mode_features)\n",
+ " )\n",
+ "\n",
+ " column_transformer = ColumnTransformer(transformers=transformers)\n",
+ " all_features = mean_features + median_features + mode_features\n",
+ "\n",
+ " return column_transformer.fit(df), all_features, datetime_features\n",
+ "\n",
+ "\n",
+ "def fillna(df, processor, all_features, datetime_features):\n",
+ " \"\"\"\n",
+ " Fill missing values in a DataFrame using a specified processor and mode imputation.\n",
+ "\n",
+ " Parameters:\n",
+ " df (pd.DataFrame): The input DataFrame with missing values.\n",
+ " processor (object): An object with a `transform` method that processes the DataFrame.\n",
+ " all_features (list): List of all features supported by SimpleImputer in the DataFrame.\n",
+ " datetime_features (list): List of datetime features in the DataFrame.\n",
+ "\n",
+ " Returns:\n",
+ " pd.DataFrame: A DataFrame with missing values filled.\n",
+ " \"\"\"\n",
+ " filled_array = processor.transform(df)\n",
+ " filled_df = pd.DataFrame(filled_array, columns=all_features)\n",
+ " if datetime_features:\n",
+ " datetime_data = df[datetime_features]\n",
+ " datetime_data.ffill()\n",
+ " filled_df = pd.concat([datetime_data, filled_df], axis=1)\n",
+ " for col in df.columns:\n",
+ " filled_df[col].fillna(filled_df[col].mode()[0], inplace=True)\n",
+ "\n",
+ " return filled_df\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c9c6728f-7385-4c76-8284-6708d67bc5c7",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.pipeline import Pipeline\n",
+ "from sklearn.impute import SimpleImputer\n",
+ "from sklearn.compose import ColumnTransformer\n",
+ "\n",
+ "\n",
+ "# convert object type to nearest dtype\n",
+ "X = X.convert_dtypes()\n",
+ "X = X.dropna(axis=1, how='all')\n",
+ "\n",
+ "# select columns for model training\n",
+ "X = X.select_dtypes(include=['number', 'datetime', 'category'])\n",
+ "\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "# You may need to update the test_size based on your scenario\n",
+ "X_train, X_test = train_test_split(X, test_size=0.2, random_state=41)\n",
+ "\n",
+ "mean_features, median_features, mode_features = [], [], []\n",
+ " \n",
+ "preprocessor, all_features, datetime_features = create_fillna_processor(X_train, mean_features, median_features, mode_features)\n",
+ "X_train = fillna(X_train, preprocessor, all_features, datetime_features)\n",
+ "X_test = fillna(X_test, preprocessor, all_features, datetime_features)\n",
+ " \n",
+ "y_train = X_train.pop(target_col)\n",
+ "y_test = X_test.pop(target_col)\n",
+ "\n",
+ "display(X_train[:10])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b4c43b4-9416-43d9-9ed8-a8d32858250d",
+ "metadata": {},
+ "source": [
+ "## Step 3: Use AutoML to find your best model\n",
+ "\n",
+ "We will now use FLAML's AutoML to automatically find the best machine learning model for our data. AutoML (Automated Machine Learning) simplifies the model selection process by automatically testing and tuning various algorithms and configurations, helping us quickly identify the most effective model with minimal manual effort."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f287fb60-1e24-45f9-9493-1c563c797702",
+ "metadata": {},
+ "source": [
+ "### Tracking results with experiments in Fabric\n",
+ "\n",
+ "Experiments in Fabric let you track the results of your AutoML process, providing a comprehensive view of all the metrics and parameters from your trials."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff2e3568-ce88-4a63-8bf8-c768a6cfdc3c",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# MLFlow Logging Related\n",
+ "\n",
+ "import mlflow\n",
+ "\n",
+ "mlflow.autolog(exclusive=False)\n",
+ "mlflow.set_experiment(\"exp-test\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f02f65d-bc49-4090-b00b-2bb28d59e754",
+ "metadata": {},
+ "source": [
+ "#### Configure the AutoML trial and settings\n",
+ "\n",
+ "These configurations are driven by the AutoML mode and task selected in the wizard. For example, if you select \"quick prototype\", you'll see a setting for time budget."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d05dcde3-bf5f-43c5-a6fa-01e0a07affab",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Import the AutoML class from the FLAML package\n",
+ "import flaml\n",
+ "from flaml import AutoML\n",
+ "\n",
+ "# Define AutoML settings\n",
+ "settings = {\n",
+ " \"time_budget\": 120, # Total running time in seconds\n",
+ " \"task\": \"binary\", \n",
+ " \"log_file_name\": \"flaml_experiment.log\", # FLAML log file\n",
+ " \"eval_method\": \"cv\",\n",
+ " \"n_splits\": 3,\n",
+ " \"max_iter\": 10, \n",
+ " \"force_cancel\": True, \n",
+ " \"seed\": 41 , # Random seed \n",
+ " \"mlflow_exp_name\": \"exp-test\", # MLflow experiment name\n",
+ " \"use_spark\": True, # whether to use Spark for distributed training\n",
+ " \"n_concurrent_trials\": 3, # the maximum number of concurrent trials \n",
+ " \"verbose\": 1, \n",
+ " \"featurization\": \"auto\", \n",
+ "}\n",
+ "\n",
+ "if flaml.__version__ > \"2.3.3\":\n",
+ " settings[\"entrypoint\"] = \"low-code\"\n",
+ "\n",
+ "# Create an AutoML instance\n",
+ "automl = AutoML(**settings)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc13e255-3bfb-4b54-9337-7f0fd070dbbc",
+ "metadata": {},
+ "source": [
+ "#### Run the AutoML trial\n",
+ "\n",
+ "Run the AutoML trial, with all trials being tracked as experiment runs. The trial is performed on the processed dataset, using the `Exited` variable as the target, and applying the defined configurations for optimal model selection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c995371-878a-40be-a6ca-106181976ace",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "with mlflow.start_run(nested=True, run_name=\"exp-test-AutoMLModel\"):\n",
+ " automl.fit(\n",
+ " X_train=X_train, \n",
+ " y_train=y_train, # target column of the training data \n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d052eef-0756-411e-8ab2-7fabd7a6076a",
+ "metadata": {},
+ "source": [
+ "## Step 4: Save the final machine learning model\n",
+ "\n",
+ "Upon completing the AutoML trial, you can now save the final, tuned model as an ML model in Fabric."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2ce45e61-6094-4faa-9c9a-e6350bc4de6b",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "model_path = f\"runs:/{automl.best_run_id}/model\"\n",
+ "\n",
+ "# Register the model to the MLflow registry\n",
+ "registered_model = mlflow.register_model(model_uri=model_path, name=\"exp-test-AutoMLModel\")\n",
+ "\n",
+ "# Print the registered model's name and version\n",
+ "print(f\"Model '{registered_model.name}' version {registered_model.version} registered successfully.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b628aab7-22c6-47e6-8b79-a7767b519830",
+ "metadata": {},
+ "source": [
+ "## Step 5: Generate predictions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "993e8880-f55e-438c-8d2d-fb7215e63c63",
+ "metadata": {},
+ "source": [
+ "Microsoft Fabric lets you operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine. You can generate batch predictions directly from the Microsoft Fabric notebook or from a given ML model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aa12ec97-d582-4a43-88c3-ddde42b7b44b",
+ "metadata": {},
+ "source": [
+ "1. Generate predictions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3c6f2b3a-ad30-4cf3-9740-9da5b90a859e",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "model_name = \"exp-test-AutoMLModel\"\n",
+ "from synapse.ml.predict import MLFlowTransformer\n",
+ "\n",
+ "feature_cols = X_train.columns.to_list()\n",
+ "model = MLFlowTransformer(\n",
+ " inputCols=feature_cols,\n",
+ " outputCol=target_col,\n",
+ " modelName=model_name,\n",
+ " modelVersion=registered_model.version,\n",
+ ")\n",
+ "\n",
+ "df_test = spark.createDataFrame(X_test)\n",
+ "batch_predictions = model.transform(df_test)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1af8b16c-cdb4-4add-8df5-5c179fffdb95",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "display(batch_predictions)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2642ffad-253b-4ea9-ac34-9ad0c3690f34",
+ "metadata": {},
+ "source": [
+ "2. Save the predictions to a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fb16d367-0570-427c-a04a-2980b6e5d014",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "saved_name = \"2020orders_predictions\".replace(\".\", \"_\")\n",
+ "batch_predictions.write.mode(\"overwrite\").format(\"delta\").option(\"overwriteSchema\", \"true\").save(f\"Tables/{saved_name}\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "automl_config": {
+ "finalDetails": {
+ "experimentName": "exp-test",
+ "model": {
+ "modelInput": "exp-test-AutoMLModel",
+ "modelSelection": "",
+ "modelType": "CreateNew"
+ },
+ "modelName": "exp-test-AutoMLModel",
+ "notebookName": "AutoML Sample Test - Demo ",
+ "parallelizationMethod": "trainMultiple"
+ },
+ "lakehouseInfo": {
+ "errMsg": "",
+ "lakehouseId": "3b406a22-8d06-40ef-9f97-8c2ab976f7a4",
+ "lakehouseName": "lake_samples",
+ "state": "ready",
+ "workspaceId": "98ea70b8-712f-49ac-9250-d737780bb594"
+ },
+ "mlModel": {
+ "duration": "-1",
+ "endEarly": false,
+ "metric": "",
+ "mode": "QuickProto",
+ "task": "Binary Classification"
+ },
+ "step": 5,
+ "tableInfo": {
+ "columns": [
+ {
+ "name": "ID",
+ "nullable": true,
+ "type": "string"
+ },
+ {
+ "name": "Count",
+ "nullable": true,
+ "type": "integer"
+ },
+ {
+ "name": "Date",
+ "nullable": true,
+ "type": "string"
+ },
+ {
+ "name": "Name",
+ "nullable": true,
+ "type": "string"
+ },
+ {
+ "name": "Style",
+ "nullable": true,
+ "type": "string"
+ },
+ {
+ "name": "price",
+ "nullable": true,
+ "type": "double"
+ },
+ {
+ "name": "tax",
+ "nullable": true,
+ "type": "double"
+ }
+ ],
+ "tableInfo": {
+ "format": "",
+ "fullAbfsPath": "abfss://98ea70b8-712f-49ac-9250-d737780bb594@onelake.dfs.fabric.microsoft.com/3b406a22-8d06-40ef-9f97-8c2ab976f7a4/Tables/2020orders",
+ "isDeltaTable": true,
+ "name": "2020orders",
+ "relativePath": "Tables/2020orders",
+ "type": "MANAGED"
+ },
+ "type": "table"
+ },
+ "trainData": {
+ "enableFeaturization": true,
+ "mappingColumns": [
+ {
+ "imputationMethod": "Auto",
+ "name": "ID",
+ "nullable": true,
+ "type": "string",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "Count",
+ "nullable": true,
+ "type": "integer",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "Date",
+ "nullable": true,
+ "type": "string",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "Name",
+ "nullable": true,
+ "type": "string",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "Style",
+ "nullable": true,
+ "type": "string",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "price",
+ "nullable": true,
+ "type": "double",
+ "valueType": "Auto"
+ },
+ {
+ "imputationMethod": "Auto",
+ "name": "tax",
+ "nullable": true,
+ "type": "double",
+ "valueType": "Auto"
+ }
+ ],
+ "predictColumn": "price"
+ }
+ },
+ "dependencies": {
+ "lakehouse": {
+ "default_lakehouse": "3b406a22-8d06-40ef-9f97-8c2ab976f7a4",
+ "default_lakehouse_name": "lake_samples",
+ "default_lakehouse_workspace_id": "98ea70b8-712f-49ac-9250-d737780bb594",
+ "known_lakehouses": [
+ {
+ "id": "3b406a22-8d06-40ef-9f97-8c2ab976f7a4"
+ }
+ ]
+ }
+ },
+ "kernel_info": {
+ "name": "synapse_pyspark"
+ },
+ "kernelspec": {
+ "display_name": "Synapse PySpark",
+ "language": "Python",
+ "name": "synapse_pyspark"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark",
+ "ms_spell_check": {
+ "ms_spell_check_language": "en"
+ }
+ },
+ "nteract": {
+ "version": "nteract-front-end@1.0.0"
+ },
+ "spark_compute": {
+ "compute_id": "/trident/default",
+ "session_options": {
+ "conf": {
+ "spark.synapse.nbs.session.timeout": "1200000"
+ }
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
From 6263f41f37e832e1bcd9d325777ad19775863206 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:35:20 -0600
Subject: [PATCH 15/31] quick demo
---
Workloads-Specific/DataScience/How_AutoML/README.md | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Workloads-Specific/DataScience/How_AutoML/README.md b/Workloads-Specific/DataScience/How_AutoML/README.md
index 1db1a5c..8df0d3c 100644
--- a/Workloads-Specific/DataScience/How_AutoML/README.md
+++ b/Workloads-Specific/DataScience/How_AutoML/README.md
@@ -14,7 +14,11 @@ Last updated: 2025-05-03
-> Click to see notebook generated [Train a ML model with AutoML](./Train_MLmodel_AutoML.ipynb)
+Click to see the generated notebook: [Train a ML model with AutoML](./Train_MLmodel_AutoML.ipynb)
+
+> Run the generated notebook:
+
+https://github.com/user-attachments/assets/6dfedbac-beb7-4025-9a42-f98dade7f431
Total Visitors
From 1dd2fee3d6f49afe84b65929950a173fe70edefe Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Sat, 3 May 2025 05:35:37 +0000
Subject: [PATCH 16/31] Fix Markdown syntax issues
---
Workloads-Specific/DataScience/How_AutoML/README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Workloads-Specific/DataScience/How_AutoML/README.md b/Workloads-Specific/DataScience/How_AutoML/README.md
index 8df0d3c..4eb0160 100644
--- a/Workloads-Specific/DataScience/How_AutoML/README.md
+++ b/Workloads-Specific/DataScience/How_AutoML/README.md
@@ -18,7 +18,7 @@ Click to see notebook generated [Train a ML model with AutoML](./Train_MLmodel_A
> Run the generated notebook:
-https://github.com/user-attachments/assets/6dfedbac-beb7-4025-9a42-f98dade7f431
+
Total Visitors
From 2cd208677c8f26bd979dc3271d0d8865ccf876b1 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:40:06 -0600
Subject: [PATCH 17/31] ds workload
---
Workloads-Specific/DataScience/BestPractices.md | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/Workloads-Specific/DataScience/BestPractices.md b/Workloads-Specific/DataScience/BestPractices.md
index b11bb47..9e2ba79 100644
--- a/Workloads-Specific/DataScience/BestPractices.md
+++ b/Workloads-Specific/DataScience/BestPractices.md
@@ -30,6 +30,10 @@ Last updated: 2025-05-03
> Ensure that your data science workflows in Microsoft Fabric are built for rapid experimentation, efficient model management, and seamless deployment. Each element should be managed with clear versioning, detailed documentation, and reproducible environments, enabling a smooth transition from experimentation to production.
+
+

+
+
## ML Model Management
> Use model registries integrated within Fabric to store and version your models. Include a descriptive README, link relevant experiment IDs, and attach performance metrics such as accuracy, AUC, and confusion matrices. For example, link your production-ready model (v#.#) from a registered repository along with its associated validation metrics and deployment instructions.
From a95cc80efa4b5201b8d365eeaa0ee6c929c36142 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:47:41 -0600
Subject: [PATCH 18/31] in progress
---
.../RealTimeIntelligence/BestPractices.md | 32 +++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/Workloads-Specific/RealTimeIntelligence/BestPractices.md b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
index 6369ab1..35ef5f6 100644
--- a/Workloads-Specific/RealTimeIntelligence/BestPractices.md
+++ b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
@@ -15,6 +15,38 @@ Last updated: 2025-05-03
+
+Table of Content (Click to expand)
+
+- [Structured Eventhouse Implementation](#structured-eventhouse-implementation)
+- [Interactive Real-Time Dashboard Creation](#interactive-real-time-dashboard-creation)
+- [Efficient Eventstream Management](#efficient-eventstream-management)
+- [Dynamic Activator Configuration](#dynamic-activator-configuration)
+
+
+
+> Ensure that your real-time intelligence system in Microsoft Fabric is designed for both rapid ingestion and instantaneous analysis. By structuring your Eventhouse, leveraging powerful KQL query sets, building dynamic dashboards, managing high-throughput event streams, and configuring rule-based Activator triggers, you can achieve actionable insights and automated responses as events occur.
+
+
+

+
+
+## Structured Eventhouse Implementation
+
+> Design your Eventhouse to serve as the backbone of your real-time data ingestion. Organize event data using defined schemas, partitioning strategies, and indexing to optimize for both immediate query performance and historical analysis. This approach enhances data governance and ensures that critical event details are captured for quick retrieval and auditing. For example: `Create dedicated partitions in Eventhouse based on time windows or event type. For instance, set up policies to automatically archive older events while retaining a hot partition for current data. This enables rapid detection of anomalies and supports retrospective analysis when patterns or trends need to be reviewed.`
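+
+Archiving older events while keeping recent data hot is typically expressed as KQL retention and caching policies. A minimal Python sketch using the `azure-kusto-data` client is shown below; the query URI, database, and table names are placeholders, not values from this repository:
+
+```python
+from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
+
+cluster = "https://<your-eventhouse-query-uri>"  # copy from the Eventhouse item in Fabric
+client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(cluster))
+
+database = "lake_events"  # hypothetical KQL database name
+
+# Keep a year of history, but serve only the most recent 7 days from hot cache for fast queries
+client.execute_mgmt(database, ".alter-merge table Events policy retention softdelete = 365d")
+client.execute_mgmt(database, ".alter table Events policy caching hot = 7d")
+```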
+
+## Interactive Real-Time Dashboard Creation
+
+> Build dashboards that dynamically update as new data flows in. Utilize real-time visualizations, clear metric hierarchies, and fast refresh cycles to ensure stakeholders receive immediate feedback on key performance indicators (KPIs) and system health. This empowers decision-makers to respond quickly to emerging issues. For example, implement drill-down capabilities so that clicking on an alert leads to detailed logs derived from the Eventhouse via KQL queries.
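+
+The drill-down behind such an alert is ultimately a scoped KQL query. A minimal sketch of running one from Python with `azure-kusto-data` follows; the cluster URI, database, table, and column names are placeholders:
+
+```python
+from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
+
+client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(
+    "https://<your-eventhouse-query-uri>"))  # placeholder
+
+drill_down = """
+Events
+| where Severity == 'error' and Timestamp > ago(15m)
+| project Timestamp, DeviceId, Message
+| order by Timestamp desc
+"""
+
+# Print the detail rows a dashboard drill-through would surface for the alert window
+for row in client.execute("lake_events", drill_down).primary_results[0]:
+    print(row["Timestamp"], row["DeviceId"], row["Message"])
+```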
+
+## Efficient Eventstream Management
+
+> Configure Eventstream with dynamic scaling and load balancing. For example, integrate pre-processing steps that filter out noise and enrich events before they enter the Eventhouse, and monitor key metrics (such as latency and event volume) to automatically adjust resource allocation based on current demand.
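+
+Eventstream transformations are normally configured in the Eventstream editor itself; as a rough PySpark analogue of the same filter-and-enrich idea (using the built-in `rate` source as a stand-in for a real event feed and a hypothetical output table, assuming a default Lakehouse is attached), a sketch might look like:
+
+```python
+from pyspark.sql import functions as F
+
+# Stand-in streaming source; a real setup reads from the Eventstream's custom endpoint or an Event Hub
+events = (
+    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
+    .withColumnRenamed("value", "event_id")
+)
+
+cleaned = (
+    events
+    .filter(F.col("event_id").isNotNull())  # drop malformed or noisy rows
+    .withColumn("severity", F.when(F.col("event_id") % 50 == 0, "error").otherwise("info"))
+    .withColumn("ingested_at", F.current_timestamp())  # enrichment used later for latency tracking
+)
+
+# Land the cleaned events in a Delta table that the Eventhouse or a dashboard can consume
+(cleaned.writeStream
+    .format("delta")
+    .option("checkpointLocation", "Files/checkpoints/events_clean")  # hypothetical path
+    .outputMode("append")
+    .toTable("events_clean"))
+```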
+
+## Dynamic Activator Configuration
+
+> Implement Activator to respond to events with rule-based triggers that can automatically initiate workflows, send notifications, or activate remediation processes. Ensure that your activation rules are flexible and customizable so that actions can be fine-tuned to the specific nuances of your environment. For example, set up Activator rules that trigger alerts or automated remedial actions when certain thresholds are reached, such as a sudden spike in error events or a dip in transaction volumes; the rule can send an SMS or email alert when abnormal patterns are detected and automatically adjust system parameters via an integrated ITSM tool.
+
Total Visitors

From cf17ab6c1fadc27f887276962f4bbc290a097e40 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Sat, 3 May 2025 05:48:01 +0000
Subject: [PATCH 19/31] Fix Markdown syntax issues
---
Workloads-Specific/RealTimeIntelligence/BestPractices.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Workloads-Specific/RealTimeIntelligence/BestPractices.md b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
index 35ef5f6..b576110 100644
--- a/Workloads-Specific/RealTimeIntelligence/BestPractices.md
+++ b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
@@ -31,7 +31,7 @@ Last updated: 2025-05-03
-## Structured Eventhouse Implementation
+## Structured Eventhouse Implementation
> Design your Eventhouse to serve as the backbone of your real-time data ingestion. Organize event data using defined schemas, partitioning strategies, and indexing to optimize for both immediate query performance and historical analysis. This approach enhances data governance and ensures that critical event details are captured for quick retrieval and auditing. For example: `Create dedicated partitions in Eventhouse based on time windows or event type. For instance, set up policies to automatically archive older events while retaining a hot partition for current data. This enables rapid detection of anomalies and supports retrospective analysis when patterns or trends need to be reviewed.`
From a3be59fdf3220f7ce668527d2e3a0e64ea9d6729 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:50:49 -0600
Subject: [PATCH 20/31] no needed those 2
---
README.md | 2 --
1 file changed, 2 deletions(-)
diff --git a/README.md b/README.md
index 4d17b77..ee7b623 100644
--- a/README.md
+++ b/README.md
@@ -205,9 +205,7 @@ Click to read more about [Microsoft Purview for Fabric - Overview](./Workloads-S
- [Data Science - Best Practices Overview](./Workloads-Specific/DataScience/BestPractices.md)
- [Real-Time Intelligence - Best Practices Overview](./Workloads-Specific/RealTimeIntelligence/BestPractices.md) - in progress
- [Power Bi - Best Practices Overview](./Workloads-Specific/PowerBi/BestPractices.md)
-- [Copilot - Best Practices Overview](./Workloads-Specific/Copilot/BestPractices.md) - in progress
- [Purview - Best Practices Overview](./Workloads-Specific/Purview/BestPractices.md) - in progress
-- [OneLake - Best Practices Overview](./Workloads-Specific/OneLake/BestPractices.md) - in progress
Total Visitors
From fb07f4d4408a7795c663d4c9bb5766a972c805e6 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:51:07 -0600
Subject: [PATCH 21/31] no need
---
Workloads-Specific/Copilot/BestPractices.md | 21 ---------------------
1 file changed, 21 deletions(-)
delete mode 100644 Workloads-Specific/Copilot/BestPractices.md
diff --git a/Workloads-Specific/Copilot/BestPractices.md b/Workloads-Specific/Copilot/BestPractices.md
deleted file mode 100644
index be4da10..0000000
--- a/Workloads-Specific/Copilot/BestPractices.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Copilot - Best Practices Overview
-
-Costa Rica
-
-[](https://github.com)
-[](https://github.com/)
-[brown9804](https://github.com/brown9804)
-
-Last updated: 2025-05-03
-
-----------
-
-
-List of References (Click to expand)
-
-
-
-
-
Total Visitors
-

-
From daf0dd185e29d49bd1d192cb3c22f182ee525916 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Fri, 2 May 2025 23:51:20 -0600
Subject: [PATCH 22/31] no need
---
Workloads-Specific/OneLake/BestPractices.md | 21 ---------------------
1 file changed, 21 deletions(-)
delete mode 100644 Workloads-Specific/OneLake/BestPractices.md
diff --git a/Workloads-Specific/OneLake/BestPractices.md b/Workloads-Specific/OneLake/BestPractices.md
deleted file mode 100644
index 7ccee2f..0000000
--- a/Workloads-Specific/OneLake/BestPractices.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# OneLake - Best Practices Overview
-
-Costa Rica
-
-[](https://github.com)
-[](https://github.com/)
-[brown9804](https://github.com/brown9804)
-
-Last updated: 2025-05-03
-
-----------
-
-
-List of References (Click to expand)
-
-
-
-
-
Total Visitors
-

-
From 65701acd0fc885d4319851f4e30b00ca32a18ee5 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 08:58:52 -0600
Subject: [PATCH 23/31] moved
---
.../FabricActivatorRulePipeline/README.md | 132 ++++++++++++++++++
1 file changed, 132 insertions(+)
create mode 100644 Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
diff --git a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
new file mode 100644
index 0000000..99d14b7
--- /dev/null
+++ b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
@@ -0,0 +1,132 @@
+# Microsoft Fabric: Automating Pipeline Execution with Activator
+
+Costa Rica
+
+[](https://github.com/)
+[brown9804](https://github.com/brown9804)
+
+Last updated: 2025-04-21
+
+----------
+
+> This process shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
+>
+> 1. **First Pipeline**: The process starts with a pipeline that ends with a `Copy Data` activity. This activity uploads data into the `Lakehouse`.
+> 2. **Event Stream Setup**: An `Event Stream` is configured in Activator to monitor the Lakehouse for file creation or data upload events.
+> 3. **Triggering the Second Pipeline**: Once the event is detected (e.g., a file is uploaded), the Event Stream triggers the second pipeline to continue the workflow.
+
+
+List of References (Click to expand)
+
+- [Activate Fabric items](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/data-activator/activator-trigger-fabric-items)
+- [Create a rule in Fabric Activator](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/data-activator/activator-create-activators)
+
+
+
+
+List of Content (Click to expand)
+
+- [Set Up the First Pipeline](#set-up-the-first-pipeline)
+- [Configure Activator to Detect the Event](#configure-activator-to-detect-the-event)
+- [Set Up the Second Pipeline](#set-up-the-second-pipeline)
+- [Define the Rule in Activator](#define-the-rule-in-activator)
+- [Test the Entire Workflow](#test-the-entire-workflow)
+- [Troubleshooting If Needed](#troubleshooting-if-needed)
+
+
+
+> [!NOTE]
+> This code generates random data with fields such as id, name, age, email, and created_at, organizes it into a PySpark DataFrame, and saves it to a specified Lakehouse path using the Delta format. Click here to see the [example script](./GeneratesRandomData.ipynb)
+
+
+
+## Set Up the First Pipeline
+
+1. **Create the Pipeline**:
+ - In [Microsoft Fabric](https://app.fabric.microsoft.com/), create the first pipeline that performs the required tasks.
+ - Add a `Copy Data` activity as the final step in the pipeline.
+
+2. **Generate the Trigger File**:
+ - Configure the `Copy Data` activity to create a trigger file in a specific location, such as `Azure Data Lake Storage (ADLS)` or `OneLake`.
+ - Ensure the file name and path are consistent and predictable (e.g., `trigger_file.json` in a specific folder).
+3. **Publish and Test**: Publish the pipeline and test it to ensure the trigger file is created successfully.
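+
+While prototyping, you can also produce the trigger file from a notebook instead of the `Copy Data` activity. The sketch below uses `mssparkutils`, the file-system helper available in Fabric notebooks; the folder, file name, and payload are hypothetical:
+
+```python
+import json
+from datetime import datetime, timezone
+
+from notebookutils import mssparkutils  # available by default in Fabric notebooks
+
+trigger_payload = {
+    "pipeline": "first-pipeline",  # hypothetical name of the producing pipeline
+    "completed_at": datetime.now(timezone.utc).isoformat(),
+}
+
+# Writes Files/triggers/trigger_file.json in the default Lakehouse;
+# this is the location Activator will watch for the file-created event.
+mssparkutils.fs.put(
+    "Files/triggers/trigger_file.json",
+    json.dumps(trigger_payload),
+    True,  # overwrite if the file already exists
+)
+```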
+
+
+
+## Configure Activator to Detect the Event
+
+> [!TIP]
+> Event options:
+
+
+
+1. **Set Up an Event**:
+ - Create a new event to monitor the location where the trigger file is created (e.g., ADLS or OneLake). Click on `Real-Time`:
+
+
+
+ - Choose the appropriate event type, such as `File Created`.
+
+
+
+
+
+ - Add a source:
+
+
+
+
+
+
+
+2. **Test Event Detection**:
+ - Save the event and test it by manually running the first pipeline to ensure Activator detects the file creation.
+ - Check the **Event Details** screen in Activator to confirm the event is logged.
+
+
+
+## Set Up the Second Pipeline
+
+1. **Create the Pipeline**:
+ - In Microsoft Fabric, create the second pipeline that performs the next set of tasks.
+ - Ensure it is configured to accept external triggers.
+2. **Publish the Pipeline**: Publish the second pipeline and ensure it is ready to be triggered.
+
+
+
+## Define the Rule in Activator
+
+1. **Set up the Activator**:
+
+
+
+2. **Create a New Rule**:
+ - In `Activator`, create a rule that responds to the event you just configured.
+ - Set the condition to match the event details (e.g., file name, path, or metadata).
+3. **Set the Action**:
+ - Configure the rule to trigger the second pipeline.
+ - Specify the pipeline name and pass any required parameters.
+4. **Save and Activate**:
+ - Save the rule and activate it.
+ - Ensure the rule is enabled and ready to respond to the event.
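+
+Before relying on the Activator rule, you can confirm that the second pipeline accepts external triggers by starting it on demand through the Fabric REST API. The sketch below assumes the Job Scheduler endpoint (`.../jobs/instances?jobType=Pipeline`) and uses placeholder workspace and item IDs; verify the call shape against the current API reference for your tenant:
+
+```python
+import requests
+from azure.identity import DefaultAzureCredential
+
+workspace_id = "<workspace-guid>"            # placeholder
+pipeline_id = "<second-pipeline-item-guid>"  # placeholder
+
+token = DefaultAzureCredential().get_token("https://api.fabric.microsoft.com/.default").token
+
+response = requests.post(
+    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{pipeline_id}/jobs/instances",
+    params={"jobType": "Pipeline"},
+    headers={"Authorization": f"Bearer {token}"},
+)
+response.raise_for_status()
+
+# A 202 response means the run was accepted; the Location header points to the job instance for polling
+print(response.status_code, response.headers.get("Location"))
+```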
+
+
+
+## Test the Entire Workflow
+
+1. **Run the First Pipeline**: Execute the first pipeline and verify that the trigger file is created.
+2. **Monitor Activator**: Check the `Event Details` and `Rule Activation Details` in Activator to ensure the event is detected and the rule is activated.
+3. **Verify the Second Pipeline**: Confirm that the second pipeline is triggered and runs successfully.
+
+
+
+## Troubleshooting (If Needed)
+
+- If the second pipeline does not trigger:
+ 1. Double-check the rule configuration in Activator.
+ 2. Review the logs in Activator for any errors or warnings.
+
+
+
Total Visitors
+

+
From 75ddc5ba7563a1eb9cbaf66f8311e3f10442dc93 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 08:59:13 -0600
Subject: [PATCH 24/31] moved
---
.../FabricActivatorRulePipeline/README.md | 132 ------------------
1 file changed, 132 deletions(-)
delete mode 100644 Monitoring-Observability/FabricActivatorRulePipeline/README.md
diff --git a/Monitoring-Observability/FabricActivatorRulePipeline/README.md b/Monitoring-Observability/FabricActivatorRulePipeline/README.md
deleted file mode 100644
index 99d14b7..0000000
--- a/Monitoring-Observability/FabricActivatorRulePipeline/README.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Microsoft Fabric: Automating Pipeline Execution with Activator
-
-Costa Rica
-
-[](https://github.com/)
-[brown9804](https://github.com/brown9804)
-
-Last updated: 2025-04-21
-
-----------
-
-> This process shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
->
-> 1. **First Pipeline**: The process starts with a pipeline that ends with a `Copy Data` activity. This activity uploads data into the `Lakehouse`.
-> 2. **Event Stream Setup**: An `Event Stream` is configured in Activator to monitor the Lakehouse for file creation or data upload events.
-> 3. **Triggering the Second Pipeline**: Once the event is detected (e.g., a file is uploaded), the Event Stream triggers the second pipeline to continue the workflow.
-
-
-List of References (Click to expand)
-
-- [Activate Fabric items](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/data-activator/activator-trigger-fabric-items)
-- [Create a rule in Fabric Activator](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/data-activator/activator-create-activators)
-
-
-
-
-List of Content (Click to expand)
-
-- [Set Up the First Pipeline](#set-up-the-first-pipeline)
-- [Configure Activator to Detect the Event](#configure-activator-to-detect-the-event)
-- [Set Up the Second Pipeline](#set-up-the-second-pipeline)
-- [Define the Rule in Activator](#define-the-rule-in-activator)
-- [Test the Entire Workflow](#test-the-entire-workflow)
-- [Troubleshooting If Needed](#troubleshooting-if-needed)
-
-
-
-> [!NOTE]
-> This code generates random data with fields such as id, name, age, email, and created_at, organizes it into a PySpark DataFrame, and saves it to a specified Lakehouse path using the Delta format. Click here to see the [example script](./GeneratesRandomData.ipynb)
-
-
-
-## Set Up the First Pipeline
-
-1. **Create the Pipeline**:
- - In [Microsoft Fabric](https://app.fabric.microsoft.com/), create the first pipeline that performs the required tasks.
- - Add a `Copy Data` activity as the final step in the pipeline.
-
-2. **Generate the Trigger File**:
- - Configure the `Copy Data` activity to create a trigger file in a specific location, such as `Azure Data Lake Storage (ADLS)` or `OneLake`.
- - Ensure the file name and path are consistent and predictable (e.g., `trigger_file.json` in a specific folder).
-3. **Publish and Test**: Publish the pipeline and test it to ensure the trigger file is created successfully.
-
-
-
-## Configure Activator to Detect the Event
-
-> [!TIP]
-> Event options:
-
-
-
-1. **Set Up an Event**:
- - Create a new event to monitor the location where the trigger file is created (e.g., ADLS or OneLake). Click on `Real-Time`:
-
-
-
- - Choose the appropriate event type, such as `File Created`.
-
-
-
-
-
- - Add a source:
-
-
-
-
-
-
-
-2. **Test Event Detection**:
- - Save the event and test it by manually running the first pipeline to ensure Activator detects the file creation.
- - Check the **Event Details** screen in Activator to confirm the event is logged.
-
-
-
-## Set Up the Second Pipeline
-
-1. **Create the Pipeline**:
- - In Microsoft Fabric, create the second pipeline that performs the next set of tasks.
- - Ensure it is configured to accept external triggers.
-2. **Publish the Pipeline**: Publish the second pipeline and ensure it is ready to be triggered.
-
-
-
-## Define the Rule in Activator
-
-1. **Setup the Activator**:
-
-
-
-2. **Create a New Rule**:
- - In `Activator`, create a rule that responds to the event you just configured.
- - Set the condition to match the event details (e.g., file name, path, or metadata).
-3. **Set the Action**:
- - Configure the rule to trigger the second pipeline.
- - Specify the pipeline name and pass any required parameters.
-3. **Save and Activate**:
- - Save the rule and activate it.
- - Ensure the rule is enabled and ready to respond to the event.
-
-
-
-## Test the Entire Workflow
-
-1. **Run the First Pipeline**: Execute the first pipeline and verify that the trigger file is created.
-2. **Monitor Activator**: Check the `Event Details` and `Rule Activation Details` in Activator to ensure the event is detected and the rule is activated.
-3. **Verify the Second Pipeline**: Confirm that the second pipeline is triggered and runs successfully.
-
-
-
-## Troubleshooting (If Needed)
-
-- If the second pipeline does not trigger:
- 1. Double-check the rule configuration in Activator.
- 2. Review the logs in Activator for any errors or warnings.
-
-
-
Total Visitors
-

-
From 17e2e02f25ea421b62a572f6506044a70ec7a7a3 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 08:59:31 -0600
Subject: [PATCH 25/31] moved
---
.../GeneratesRandomData.ipynb | 107 ------------------
1 file changed, 107 deletions(-)
delete mode 100644 Monitoring-Observability/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
diff --git a/Monitoring-Observability/FabricActivatorRulePipeline/GeneratesRandomData.ipynb b/Monitoring-Observability/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
deleted file mode 100644
index 6cc6a2c..0000000
--- a/Monitoring-Observability/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
+++ /dev/null
@@ -1,107 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8d820f25-3c2e-45b3-8a08-af78f0d45e1d",
- "metadata": {
- "microsoft": {
- "language": "python",
- "language_group": "synapse_pyspark"
- }
- },
- "outputs": [],
- "source": [
- "# Generates Dummy json file in Files/\n",
- "\n",
- "# Import necessary libraries\n",
- "from pyspark.sql import SparkSession\n",
- "from pyspark.sql.types import *\n",
- "import random\n",
- "from datetime import datetime, timedelta\n",
- "\n",
- "# Initialize Spark session (if not already initialized)\n",
- "spark = SparkSession.builder.appName(\"GenerateRandomData\").getOrCreate()\n",
- "\n",
- "# Function to generate random data\n",
- "def generate_random_data(num_entries):\n",
- " data = []\n",
- " for i in range(1, num_entries + 1):\n",
- " name = f\"User{i}\"\n",
- " entry = {\n",
- " \"id\": i,\n",
- " \"name\": name,\n",
- " \"age\": random.randint(18, 65),\n",
- " \"email\": f\"{name.lower()}@example.com\",\n",
- " \"created_at\": (datetime.now() - timedelta(days=random.randint(0, 365))).strftime(\"%Y-%m-%d %H:%M:%S\")\n",
- " }\n",
- " data.append(entry)\n",
- " return data\n",
- "\n",
- "# Generate 10 random entries\n",
- "random_data = generate_random_data(10)\n",
- "\n",
- "# Define schema for the DataFrame\n",
- "schema = StructType([\n",
- " StructField(\"id\", IntegerType(), True),\n",
- " StructField(\"name\", StringType(), True),\n",
- " StructField(\"age\", IntegerType(), True),\n",
- " StructField(\"email\", StringType(), True),\n",
- " StructField(\"created_at\", StringType(), True)\n",
- "])\n",
- "\n",
- "# Create a DataFrame from the random data\n",
- "df_random_data = spark.createDataFrame(random_data, schema=schema)\n",
- "\n",
- "# Write the DataFrame to the Lakehouse in the specified path\n",
- "output_path = \"abfss://{WORKSPACE-NAME}@onelake.dfs.fabric.microsoft.com/raw_Bronze.Lakehouse/Files/random_data\" # Replace {WORKSPACE-NAME}\n",
- "df_random_data.write.format(\"delta\").mode(\"overwrite\").save(output_path)\n",
- "\n",
- "print(f\"Random data has been saved to the Lakehouse at '{output_path}'.\")"
- ]
- }
- ],
- "metadata": {
- "application/vnd.jupyter.widget-state+json": {
- "version": "1.0"
- },
- "dependencies": {},
- "kernel_info": {
- "name": "synapse_pyspark"
- },
- "kernelspec": {
- "display_name": "Synapse PySpark",
- "language": "Python",
- "name": "synapse_pyspark"
- },
- "language_info": {
- "name": "python"
- },
- "microsoft": {
- "language": "python",
- "language_group": "synapse_pyspark",
- "ms_spell_check": {
- "ms_spell_check_language": "en"
- }
- },
- "nteract": {
- "version": "nteract-front-end@1.0.0"
- },
- "spark_compute": {
- "compute_id": "/trident/default",
- "session_options": {
- "conf": {
- "spark.synapse.nbs.session.timeout": "1200000"
- }
- }
- },
- "widgets": {
- "application/vnd.jupyter.widget-state+json": {
- "state": {},
- "version": "1.0"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
From 59ae01f5d0b2d08318e5d558e0e3fc9a39c91ee8 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 08:59:55 -0600
Subject: [PATCH 26/31] moved
---
.../GeneratesRandomData.ipynb | 107 ++++++++++++++++++
1 file changed, 107 insertions(+)
create mode 100644 Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
diff --git a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/GeneratesRandomData.ipynb b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
new file mode 100644
index 0000000..6cc6a2c
--- /dev/null
+++ b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/GeneratesRandomData.ipynb
@@ -0,0 +1,107 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8d820f25-3c2e-45b3-8a08-af78f0d45e1d",
+ "metadata": {
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Generates Dummy json file in Files/\n",
+ "\n",
+ "# Import necessary libraries\n",
+ "from pyspark.sql import SparkSession\n",
+ "from pyspark.sql.types import *\n",
+ "import random\n",
+ "from datetime import datetime, timedelta\n",
+ "\n",
+ "# Initialize Spark session (if not already initialized)\n",
+ "spark = SparkSession.builder.appName(\"GenerateRandomData\").getOrCreate()\n",
+ "\n",
+ "# Function to generate random data\n",
+ "def generate_random_data(num_entries):\n",
+ " data = []\n",
+ " for i in range(1, num_entries + 1):\n",
+ " name = f\"User{i}\"\n",
+ " entry = {\n",
+ " \"id\": i,\n",
+ " \"name\": name,\n",
+ " \"age\": random.randint(18, 65),\n",
+ " \"email\": f\"{name.lower()}@example.com\",\n",
+ " \"created_at\": (datetime.now() - timedelta(days=random.randint(0, 365))).strftime(\"%Y-%m-%d %H:%M:%S\")\n",
+ " }\n",
+ " data.append(entry)\n",
+ " return data\n",
+ "\n",
+ "# Generate 10 random entries\n",
+ "random_data = generate_random_data(10)\n",
+ "\n",
+ "# Define schema for the DataFrame\n",
+ "schema = StructType([\n",
+ " StructField(\"id\", IntegerType(), True),\n",
+ " StructField(\"name\", StringType(), True),\n",
+ " StructField(\"age\", IntegerType(), True),\n",
+ " StructField(\"email\", StringType(), True),\n",
+ " StructField(\"created_at\", StringType(), True)\n",
+ "])\n",
+ "\n",
+ "# Create a DataFrame from the random data\n",
+ "df_random_data = spark.createDataFrame(random_data, schema=schema)\n",
+ "\n",
+ "# Write the DataFrame to the Lakehouse in the specified path\n",
+ "output_path = \"abfss://{WORKSPACE-NAME}@onelake.dfs.fabric.microsoft.com/raw_Bronze.Lakehouse/Files/random_data\" # Replace {WORKSPACE-NAME}\n",
+ "df_random_data.write.format(\"delta\").mode(\"overwrite\").save(output_path)\n",
+ "\n",
+ "print(f\"Random data has been saved to the Lakehouse at '{output_path}'.\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "application/vnd.jupyter.widget-state+json": {
+ "version": "1.0"
+ },
+ "dependencies": {},
+ "kernel_info": {
+ "name": "synapse_pyspark"
+ },
+ "kernelspec": {
+ "display_name": "Synapse PySpark",
+ "language": "Python",
+ "name": "synapse_pyspark"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "microsoft": {
+ "language": "python",
+ "language_group": "synapse_pyspark",
+ "ms_spell_check": {
+ "ms_spell_check_language": "en"
+ }
+ },
+ "nteract": {
+ "version": "nteract-front-end@1.0.0"
+ },
+ "spark_compute": {
+ "compute_id": "/trident/default",
+ "session_options": {
+ "conf": {
+ "spark.synapse.nbs.session.timeout": "1200000"
+ }
+ }
+ },
+ "widgets": {
+ "application/vnd.jupyter.widget-state+json": {
+ "state": {},
+ "version": "1.0"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
From dbe7dc6b211d93648ee7c0a0d4410425f72ba57c Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 09:01:29 -0600
Subject: [PATCH 27/31] ref path changed
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index ee7b623..3ca7bc0 100644
--- a/README.md
+++ b/README.md
@@ -173,7 +173,7 @@ Click to read more about [Microsoft Purview for Fabric - Overview](./Workloads-S
- **Microsoft [Fabric Capacity Metrics](https://github.com/MicrosoftCloudEssentials-LearningHub/Fabric-EnterpriseFramework/blob/main/Monitoring-Observability/MonitorUsage.md#microsoft-fabric-capacity-metrics-app) app**: Powerful tool for administrators to `monitor and manage their capacity usage`. It provides detailed insights into `capacity utilization, throttling, and system events, helping to optimize performance and resource allocation`. By tracking these metrics, admins can make informed decisions to ensure efficient use of resources.
- **Admin Monitoring**: Configure and use the [Admin Monitoring Workspace](https://github.com/MicrosoftCloudEssentials-LearningHub/Fabric-EnterpriseFramework/blob/main/Monitoring-Observability/MonitorUsage.md#admin-monitoring) it's a centralized hub for `tracking and analyzing usage metrics across the organization`. It includes `pre-built reports and semantic models that provide insights into feature adoption, performance, and compliance`. This workspace helps administrators maintain the health and efficiency of their Fabric environment by offering a comprehensive `view of usage patterns and system events`.
- **Monitor Hub**: Access and utilize the [Monitor Hub](https://github.com/MicrosoftCloudEssentials-LearningHub/Fabric-EnterpriseFramework/blob/main/Monitoring-Observability/MonitorUsage.md#monitor-hub). Allows users to `view and track the status of activities across all workspaces they have permissions for`. It provides a detailed overview of operations, `including dataset refreshes, Spark job runs, and other activities`. With features like historical views, customizable displays, and filtering options, the Monitor Hub helps ensure smooth operations and timely interventions when needed.
-- **Event Hub Integration**: Use Event Hub to capture and analyze events for real-time monitoring. For example, leverage it for [Automating pipeline execution with Activator](./Monitoring-Observability/FabricActivatorRulePipeline/)
+- **Event Hub Integration**: Use Event Hub to capture and analyze events for real-time monitoring. For example, leverage it for [Automating pipeline execution with Activator](./Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/)
- **Alerting**: Configure alerts for critical events and thresholds to ensure timely responses to issues. For example, [Steps to Configure Capacity Alerts](./Monitoring-Observability/StepsCapacityAlert.md)
## Cost Management
From dcdfa2bd53c7a598d30b2f080fc61e3d0739ab15 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 09:03:05 -0600
Subject: [PATCH 28/31] title c
---
.../RealTimeIntelligence/FabricActivatorRulePipeline/README.md | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
index 99d14b7..0dbf7ac 100644
--- a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
+++ b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
@@ -1,4 +1,4 @@
-# Microsoft Fabric: Automating Pipeline Execution with Activator
+# Demonstration: Automating Pipeline Execution with Activator
Costa Rica
@@ -10,7 +10,6 @@ Last updated: 2025-04-21
----------
> This process shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
->
> 1. **First Pipeline**: The process starts with a pipeline that ends with a `Copy Data` activity. This activity uploads data into the `Lakehouse`.
> 2. **Event Stream Setup**: An `Event Stream` is configured in Activator to monitor the Lakehouse for file creation or data upload events.
> 3. **Triggering the Second Pipeline**: Once the event is detected (e.g., a file is uploaded), the Event Stream triggers the second pipeline to continue the workflow.
From 8a82f55842c855c9a14b3f730722791eadb98dd2 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Sat, 3 May 2025 15:03:24 +0000
Subject: [PATCH 29/31] Fix Markdown syntax issues
---
.../RealTimeIntelligence/FabricActivatorRulePipeline/README.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
index 0dbf7ac..256c6e7 100644
--- a/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
+++ b/Workloads-Specific/RealTimeIntelligence/FabricActivatorRulePipeline/README.md
@@ -10,6 +10,7 @@ Last updated: 2025-04-21
----------
> This process shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
+>
> 1. **First Pipeline**: The process starts with a pipeline that ends with a `Copy Data` activity. This activity uploads data into the `Lakehouse`.
> 2. **Event Stream Setup**: An `Event Stream` is configured in Activator to monitor the Lakehouse for file creation or data upload events.
> 3. **Triggering the Second Pipeline**: Once the event is detected (e.g., a file is uploaded), the Event Stream triggers the second pipeline to continue the workflow.
From fb84aa4aa5d05576fcf4988ebb17bb5f3434e95c Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 09:07:14 -0600
Subject: [PATCH 30/31] in place
---
.../RealTimeIntelligence/BestPractices.md | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/Workloads-Specific/RealTimeIntelligence/BestPractices.md b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
index b576110..99a3fc8 100644
--- a/Workloads-Specific/RealTimeIntelligence/BestPractices.md
+++ b/Workloads-Specific/RealTimeIntelligence/BestPractices.md
@@ -13,6 +13,10 @@ Last updated: 2025-05-03
List of References (Click to expand)
+- [Real-Time Intelligence documentation in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/)
+- [What is Real-Time Intelligence?](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview)
+- [Implement medallion architecture in Real-Time Intelligence](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/architecture-medallion)
+
@@ -45,7 +49,11 @@ Last updated: 2025-05-03
## Dynamic Activator Configuration
-> Implement Activator to respond to events with rule-based triggers that can automatically initiate workflows, send notifications, or activate remediation processes. Ensure that your activation rules are flexible and customizable so that actions can be fine-tuned to the specific nuances of your environment. For example, set up Activator rules that trigger alerts or automated remedial actions when certain thresholds are reached, such as a sudden spike in error events or a dip in transaction volumes; the rule can send an SMS or email alert when abnormal patterns are detected and automatically adjust system parameters via an integrated ITSM tool.
+> Implement Activator to respond to events with rule-based triggers that can automatically initiate workflows, send notifications, or activate remediation processes. Ensure that your activation rules are flexible and customizable so that actions can be fine-tuned to the specific nuances of your environment. For example, set up Activator rules that trigger alerts or automated remedial actions when certain thresholds are reached, such as a sudden spike in error events or a dip in transaction volumes; the rule can send an SMS or email alert when abnormal patterns are detected and automatically adjust system parameters via an integrated ITSM tool.
+
+Click to read [Demonstration: Automating Pipeline Execution with Activator](./FabricActivatorRulePipeline): Shows how to set up Microsoft Fabric Activator to automate workflows by detecting file creation events in a storage system and triggering another pipeline to run.
+
+
Total Visitors
From 3b56667ee555ef289c4afda2e3abb533a0b0cbb4 Mon Sep 17 00:00:00 2001
From: Timna Brown <24630902+brown9804@users.noreply.github.com>
Date: Sat, 3 May 2025 09:07:38 -0600
Subject: [PATCH 31/31] pending purvire
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 3ca7bc0..712eb3d 100644
--- a/README.md
+++ b/README.md
@@ -203,7 +203,7 @@ Click to read more about [Microsoft Purview for Fabric - Overview](./Workloads-S
- [Data Engineering - Best Practices Overview](./Workloads-Specific/DataEngineering/BestPractices.md)
- [Data Warehouse - Best Practices Overview](./Workloads-Specific/DataWarehouse/BestPractices.md)
- [Data Science - Best Practices Overview](./Workloads-Specific/DataScience/BestPractices.md)
-- [Real-Time Intelligence - Best Practices Overview](./Workloads-Specific/RealTimeIntelligence/BestPractices.md) - in progress
+- [Real-Time Intelligence - Best Practices Overview](./Workloads-Specific/RealTimeIntelligence/BestPractices.md)
- [Power Bi - Best Practices Overview](./Workloads-Specific/PowerBi/BestPractices.md)
- [Purview - Best Practices Overview](./Workloads-Specific/Purview/BestPractices.md) - in progress