articles/synapse-analytics/machine-learning/quickstart-gallery-sample-notebook.md (+4 −22)
@@ -4,9 +4,9 @@ description: Learn how to use a sample notebook from the Synapse Analytics galle
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
-ms.date: 06/11/2021
-author: WilliamDAssafMSFT
-ms.author: wiassaf
+ms.date: 02/29/2024
+author: midesa
+ms.author: midesa
 ms.custom: mode-other
 ---
@@ -27,7 +27,7 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 1. Open your workspace and select **Learn** from the home page.
 1. In the **Knowledge center**, select **Browse gallery**.
 1. In the gallery, select **Notebooks**.
-1. Find and select the notebook "Data Exploration and ML Modeling - NYC taxi predict using Spark MLib".
+1. Find and select a notebook from the gallery.
 
    :::image type="content" source="media\quickstart-gallery-sample-notebook\gallery-select-ml-notebook.png" alt-text="Select the machine learning sample notebook in the gallery.":::
@@ -38,24 +38,6 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 
 1. In the **Attach to** menu in the open notebook, select your Apache Spark pool.
 
-## Run the notebook
-
-The notebook is divided into multiple cells that each perform a specific function.
-You can manually run each cell, run cells sequentially, or select **Run all** to run all the cells.
-
-Here are descriptions for each of the cells in the notebook:
-
-1. Import PySpark functions that the notebook uses.
-1. **Ingest Date** - Ingest data from the Azure Open Dataset **NycTlcYellow** into a local dataframe for processing. The code extracts data within a specific time period - you can modify the start and end dates to get different data.
-1. Downsample the dataset to make development faster. You can modify this step to change the sample size or the sampling seed.
-1. **Exploratory Data Analysis** - Display charts to view the data. This can give you an idea what data prep might be needed before creating the model.
-1. **Data Prep and Featurization** - Filter out outlier data discovered through visualization and create some useful derived variables.
-1. **Data Prep and Featurization Part 2** - Drop unneeded columns and create some additional features.
-1. **Encoding** - Convert string variables to numbers that the Logistic Regression model is expecting.
-1. **Generation of Testing and Training Data Sets** - Split the data into separate testing and training data sets. You can modify the fraction and randomizing seed used to split the data.
-1. **Train the Model** - Train a Logistic Regression model and display its "Area under ROC" metric to see how well the model is working. This step also saves the trained model in case you want to use it elsewhere.
-1. **Evaluate and Visualize** - Plot the model's ROC curve to further evaluate the model.
-
 ## Save the notebook
 
 Save your notebook by selecting **Publish** on the workspace command bar.
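For context on the walkthrough removed above, here is a minimal PySpark sketch of the workflow those cells described: ingest the **NycTlcYellow** open dataset, downsample, derive features, split into training and test sets, train a Logistic Regression model, and report area under ROC. The date range, the column names `tipAmount`, `fareAmount`, and `tripDistance`, and the use of the `azureml-opendatasets` package are assumptions for illustration, not text from the original notebook.

```python
from datetime import datetime

from azureml.opendatasets import NycTlcYellow
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Ingest a bounded slice of the NycTlcYellow open dataset; widen the
# dates to pull more data (date range assumed for illustration).
raw = NycTlcYellow(
    start_date=datetime(2018, 5, 1), end_date=datetime(2018, 5, 7)
).to_spark_dataframe()

# Downsample for faster development; adjust fraction/seed as needed.
sampled = raw.sample(fraction=0.1, seed=42)

# Derive a binary "tipped" label and keep two numeric features
# (column names assumed from the public dataset schema).
prepped = (
    sampled.withColumn("tipped", when(col("tipAmount") > 0, 1).otherwise(0))
    .select("tipped", "fareAmount", "tripDistance")
    .na.drop()
)

# Assemble features, split the data, and train the model.
assembled = VectorAssembler(
    inputCols=["fareAmount", "tripDistance"], outputCol="features"
).transform(prepped)
train, test = assembled.randomSplit([0.7, 0.3], seed=42)

model = LogisticRegression(labelCol="tipped", featuresCol="features").fit(train)

# Score the held-out test set by area under ROC.
auc = BinaryClassificationEvaluator(
    labelCol="tipped", metricName="areaUnderROC"
).evaluate(model.transform(test))
print(f"Area under ROC: {auc:.3f}")
```

In a Synapse notebook the `spark` session is predefined; the explicit `getOrCreate()` just keeps the sketch self-contained.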
articles/synapse-analytics/machine-learning/quickstart-integrate-azure-machine-learning.md (+2 −1)
@@ -5,7 +5,7 @@ ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
 ms.reviewer: sngun, garye
-ms.date: 12/16/2021
+ms.date: 02/29/2024
 author: nelgson
 ms.author: negust
 ms.custom: mode-other
@@ -16,6 +16,7 @@ ms.custom: mode-other
 > **IMPORTANT, PLEASE NOTE THE BELOW LIMITATIONS:**
 > - **The Azure ML integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure ML using private endpoints, you can set up a managed AzureML private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
 > - **AzureML linked service is not supported with self hosted integration runtimes.** This applies to Synapse workspaces with and without Data Exfiltration Protection.
+> - **The Azure Synapse Spark 3.3 and 3.4 runtimes do not support using the Azure ML Linked Service to authenticate to the Azure Machine Learning MLFlow tracking URI.** To learn more about the limitations on these runtimes, see [Azure Synapse Runtime for Apache Spark 3.3](../spark/apache-spark-33-runtime.md) and [Azure Synapse Runtime for Apache Spark 3.4](../spark/apache-spark-34-runtime.md)
 
 In this quickstart, you'll link an Azure Synapse Analytics workspace to an Azure Machine Learning workspace. Linking these workspaces allows you to leverage Azure Machine Learning from various experiences in Synapse.
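As a possible workaround on the affected runtimes, MLflow can be pointed at the Azure ML tracking server directly instead of authenticating through the linked service. This is a sketch under assumptions, not a documented procedure: it presumes the `azureml-core` and `mlflow` packages are installed on the Spark pool, and the workspace identifiers below are placeholders.

```python
import mlflow
from azureml.core import Workspace

# Placeholder identifiers -- substitute your own workspace details.
ws = Workspace.get(
    name="<aml-workspace-name>",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)

# Set the MLflow tracking URI from the workspace itself rather than
# relying on the Synapse linked service for authentication.
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("synapse-spark-experiment")

# Log to Azure ML as usual once the tracking URI is set.
with mlflow.start_run():
    mlflow.log_metric("example_metric", 1.0)
```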
articles/synapse-analytics/spark/apache-spark-data-visualization-tutorial.md (+11 −8)
@@ -35,14 +35,17 @@ Create an Apache Spark Pool by following the [Create an Apache Spark pool tutori
 3. Because the raw data is in a Parquet format, you can use the Spark context to pull the file into memory as a DataFrame directly. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, we use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.
 4. After the data is read, we'll want to do some initial filtering to clean the dataset. We might remove unneeded columns and add columns that extract important information. In addition, we'll filter out anomalies within the dataset.
 2. The downside to simple filtering is that, from a statistical perspective, it might introduce bias into the data. Another approach is to use the sampling built into Spark (see the sketch after this list).
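A minimal sketch of the read, filter, and sample flow described above. It assumes a Synapse notebook where `spark` is predefined, the public Azure Open Datasets storage path for the NYC TLC yellow taxi data, and the `tripDistance`/`fareAmount` column names.

```python
# Pull the Parquet files straight into a DataFrame; Spark infers the
# datatypes and schema on read (path assumed from Azure Open Datasets).
df = spark.read.parquet(
    "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/"
)

# Initial filtering: drop obvious anomalies. Note that hand-picked
# filters like these can introduce statistical bias.
filtered = df.filter((df.tripDistance > 0) & (df.fareAmount > 0))

# Alternative: Spark's built-in random sampling shrinks the working set
# while keeping the remaining data representative.
sampled = filtered.sample(fraction=0.01, seed=42)
```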