Commit eb825de

Merge pull request #267699 from midesa/main
Open Dataset changes for new runtimes & limitations
2 parents 858781e + 6404c12

File tree

4 files changed: +31 -40 lines changed

articles/synapse-analytics/machine-learning/quickstart-gallery-sample-notebook.md

Lines changed: 4 additions & 22 deletions
@@ -4,9 +4,9 @@ description: Learn how to use a sample notebook from the Synapse Analytics galle
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
-ms.date: 06/11/2021
-author: WilliamDAssafMSFT
-ms.author: wiassaf
+ms.date: 02/29/2024
+author: midesa
+ms.author: midesa
 ms.custom: mode-other
 ---

@@ -27,7 +27,7 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 1. Open your workspace and select **Learn** from the home page.
 1. In the **Knowledge center**, select **Browse gallery**.
 1. In the gallery, select **Notebooks**.
-1. Find and select the notebook "Data Exploration and ML Modeling - NYC taxi predict using Spark MLib".
+1. Find and select a notebook from the gallery.

 :::image type="content" source="media\quickstart-gallery-sample-notebook\gallery-select-ml-notebook.png" alt-text="Select the machine learning sample notebook in the gallery.":::

@@ -38,24 +38,6 @@ This notebook demonstrates the basic steps used in creating a model: **data impo

 1. In the **Attach to** menu in the open notebook, select your Apache Spark pool.

-## Run the notebook
-
-The notebook is divided into multiple cells that each perform a specific function.
-You can run each cell manually, run the cells sequentially, or select **Run all** to run all the cells.
-
-Here are descriptions for each of the cells in the notebook:
-
-1. Import PySpark functions that the notebook uses.
-1. **Ingest Data** - Ingest data from the Azure Open Dataset **NycTlcYellow** into a local dataframe for processing. The code extracts data within a specific time period; you can modify the start and end dates to get different data.
-1. Downsample the dataset to make development faster. You can modify this step to change the sample size or the sampling seed.
-1. **Exploratory Data Analysis** - Display charts to view the data. This can give you an idea of what data prep might be needed before creating the model.
-1. **Data Prep and Featurization** - Filter out outlier data discovered through visualization and create some useful derived variables.
-1. **Data Prep and Featurization Part 2** - Drop unneeded columns and create some additional features.
-1. **Encoding** - Convert string variables to numbers that the Logistic Regression model expects.
-1. **Generation of Testing and Training Data Sets** - Split the data into separate testing and training data sets. You can modify the fraction and randomizing seed used to split the data.
-1. **Train the Model** - Train a Logistic Regression model and display its "Area under ROC" metric to see how well the model is working. This step also saves the trained model in case you want to use it elsewhere.
-1. **Evaluate and Visualize** - Plot the model's ROC curve to further evaluate the model.
-
 ## Save the notebook

 Save your notebook by selecting **Publish** on the workspace command bar.
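
For reference, the modeling steps that the removed walkthrough describes (train/test split, Logistic Regression training, and the "Area under ROC" metric) look roughly like this in PySpark MLlib. This is a minimal sketch, not the notebook's actual code: the DataFrame `taxi_df`, the label column `tipped`, and the feature column names are assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble assumed numeric columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["passengerCount", "tripDistance", "fareAmount"],
    outputCol="features")
data = assembler.transform(taxi_df)  # taxi_df: assumed prepared DataFrame

# Split into training and test sets; the fraction and seed are tunable.
train, test = data.randomSplit([0.7, 0.3], seed=1234)

# Train a Logistic Regression model on the assumed binary label `tipped`.
lr = LogisticRegression(labelCol="tipped", featuresCol="features")
model = lr.fit(train)

# Evaluate with the "Area under ROC" metric mentioned in the cell list.
evaluator = BinaryClassificationEvaluator(labelCol="tipped",
                                          metricName="areaUnderROC")
print(evaluator.evaluate(model.transform(test)))
```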

articles/synapse-analytics/machine-learning/quickstart-integrate-azure-machine-learning.md

Lines changed: 6 additions & 5 deletions
@@ -5,7 +5,7 @@ ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
 ms.reviewer: sngun, garye
-ms.date: 12/16/2021
+ms.date: 02/29/2024
 author: nelgson
 ms.author: negust
 ms.custom: mode-other
@@ -14,8 +14,9 @@ ms.custom: mode-other
 # Quickstart: Create a new Azure Machine Learning linked service in Synapse

 > **IMPORTANT, PLEASE NOTE THE BELOW LIMITATIONS:**
-> - **The Azure ML integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure ML using private endpoints, you can set up a managed AzureML private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
+> - **The Azure Machine Learning integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure Machine Learning using private endpoints, you can set up a managed Azure Machine Learning private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
 > - **The Azure Machine Learning linked service is not supported with self-hosted integration runtimes.** This applies to Synapse workspaces with and without Data Exfiltration Protection.
+> - **The Azure Synapse Spark 3.3 and 3.4 runtimes do not support using the Azure Machine Learning linked service to authenticate to the Azure Machine Learning MLFlow tracking URI.** To learn more about the limitations on these runtimes, see [Azure Synapse Runtime for Apache Spark 3.3](../spark/apache-spark-33-runtime.md) and [Azure Synapse Runtime for Apache Spark 3.4](../spark/apache-spark-34-runtime.md).

 In this quickstart, you'll link an Azure Synapse Analytics workspace to an Azure Machine Learning workspace. Linking these workspaces allows you to leverage Azure Machine Learning from various experiences in Synapse.


@@ -46,13 +47,13 @@ In the following sections, you'll find guidance on how to create an Azure Machin

 This section explains how to create an Azure Machine Learning linked service in Azure Synapse, using the [Azure Synapse workspace Managed Identity](../../data-factory/data-factory-service-identity.md?context=/azure/synapse-analytics/context/context&tabs=synapse-analytics).

-### Give MSI permission to the Azure ML workspace
+### Give MSI permission to the Azure Machine Learning workspace

 1. Navigate to your Azure Machine Learning workspace resource in the Azure portal and select **Access Control**.

 1. Create a role assignment and add your Synapse workspace Managed Service Identity (MSI) as a *contributor* of the Azure Machine Learning workspace. Note that this requires being an owner of the resource group that the Azure Machine Learning workspace belongs to. If you have trouble finding your Synapse workspace MSI, search for the name of the Synapse workspace.

-### Create an Azure ML linked service
+### Create an Azure Machine Learning linked service

 1. In the Synapse workspace where you want to create the new Azure Machine Learning linked service, go to **Manage** > **Linked services**, and create a new linked service with type "Azure Machine Learning".

@@ -94,7 +95,7 @@ This step will create a new Service Principal. If you want to use an existing Se

 ![Assign contributor role](media/quickstart-integrate-azure-machine-learning/quickstart-integrate-azure-machine-learning-createsp-00c.png)

-### Create an Azure ML linked service
+### Create an Azure Machine Learning linked service

 1. In the Synapse workspace where you want to create the new Azure Machine Learning linked service, go to **Manage** > **Linked services**, and create a new linked service with type "Azure Machine Learning".

articles/synapse-analytics/spark/apache-spark-data-visualization-tutorial.md

Lines changed: 12 additions & 9 deletions
@@ -5,7 +5,7 @@ author: midesa
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: machine-learning
-ms.date: 10/20/2020
+ms.date: 02/29/2024
 ms.author: midesa
 ---

@@ -35,14 +35,17 @@ Create an Apache Spark Pool by following the [Create an Apache Spark pool tutori
 3. Because the raw data is in a Parquet format, you can use the Spark context to pull the file into memory as a DataFrame directly. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, we use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.

    ```python
-   from azureml.opendatasets import NycTlcYellow
-   from datetime import datetime
-   from dateutil import parser
-
-   end_date = parser.parse('2018-06-06')
-   start_date = parser.parse('2018-05-01')
-   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-   df = nyc_tlc.to_spark_dataframe()
+   from azureml.opendatasets import NycTlcYellow
+
+   from datetime import datetime
+   from dateutil import parser
+
+   end_date = parser.parse('2018-05-08 00:00:00')
+   start_date = parser.parse('2018-05-01 00:00:00')
+
+   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
+   filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
+
    ```

 4. After the data is read, we'll want to do some initial filtering to clean the dataset. We might remove unneeded columns and add columns that extract important information. In addition, we'll filter out anomalies within the dataset.
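
As a rough illustration of the filtering and derived columns that step 4 describes, a PySpark sketch might look like the following. The column names (`passengerCount`, `tripDistance`, `fareAmount`, `tpepPickupDateTime`) are assumed from the NycTlcYellow dataset, and the thresholds are illustrative, not the tutorial's actual code:

```python
from pyspark.sql.functions import col, dayofweek

# Keep only the columns of interest, drop anomalous rows, and derive a
# day-of-week column; names and thresholds are illustrative assumptions.
df = filtered_df.select(
        "passengerCount", "tripDistance", "fareAmount", "tpepPickupDateTime"
    ).filter(
        (col("passengerCount") > 0)
        & (col("tripDistance") > 0)
        & (col("fareAmount") > 0)
    ).withColumn(
        "day_of_week", dayofweek(col("tpepPickupDateTime"))
    )
```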

articles/synapse-analytics/spark/apache-spark-machine-learning-mllib-notebook.md

Lines changed: 9 additions & 4 deletions
@@ -6,7 +6,7 @@ ms.service: synapse-analytics
 ms.reviewer: sngun
 ms.topic: tutorial
 ms.subservice: machine-learning
-ms.date: 02/15/2022
+ms.date: 02/29/2024
 ms.author: negust
 ms.custom: subject-rbac-steps

@@ -73,10 +73,15 @@ Because the raw data is in a Parquet format, you can use the Spark context to pu
 ```python
 from azureml.opendatasets import NycTlcYellow

-end_date = parser.parse('2018-06-06')
-start_date = parser.parse('2018-05-01')
+from datetime import datetime
+from dateutil import parser
+
+end_date = parser.parse('2018-05-08 00:00:00')
+start_date = parser.parse('2018-05-01 00:00:00')
+
 nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-filtered_df = nyc_tlc.to_spark_dataframe()
+filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
+
 ```

 2. The downside to simple filtering is that, from a statistical perspective, it might introduce bias into the data. Another approach is to use the sampling built into Spark.
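
The built-in sampling mentioned in step 2 is `DataFrame.sample`. A minimal sketch against the `filtered_df` created above, with an illustrative fraction and seed:

```python
# Random sample instead of a hard filter; fraction and seed are illustrative.
sampled_df = filtered_df.sample(withReplacement=False, fraction=0.1, seed=42)
print(sampled_df.count())
```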
