Commit eb825de

Merge pull request #267699 from midesa/main
Open Dataset changes for new runtimes & limitations
2 parents 858781e + 6404c12

File tree

4 files changed: +31 -40 lines changed

articles/synapse-analytics/machine-learning/quickstart-gallery-sample-notebook.md

Lines changed: 4 additions & 22 deletions
@@ -4,9 +4,9 @@ description: Learn how to use a sample notebook from the Synapse Analytics galle
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
-ms.date: 06/11/2021
-author: WilliamDAssafMSFT
-ms.author: wiassaf
+ms.date: 02/29/2024
+author: midesa
+ms.author: midesa
 ms.custom: mode-other
 ---

@@ -27,7 +27,7 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 1. Open your workspace and select **Learn** from the home page.
 1. In the **Knowledge center**, select **Browse gallery**.
 1. In the gallery, select **Notebooks**.
-1. Find and select the notebook "Data Exploration and ML Modeling - NYC taxi predict using Spark MLib".
+1. Find and select a notebook from the gallery.

 :::image type="content" source="media\quickstart-gallery-sample-notebook\gallery-select-ml-notebook.png" alt-text="Select the machine learning sample notebook in the gallery.":::

@@ -38,24 +38,6 @@ This notebook demonstrates the basic steps used in creating a model: **data impo

 1. In the **Attach to** menu in the open notebook, select your Apache Spark pool.

-## Run the notebook
-
-The notebook is divided into multiple cells that each perform a specific function.
-You can run each cell manually, run the cells sequentially, or select **Run all** to run all the cells.
-
-Here are descriptions for each of the cells in the notebook:
-
-1. Import PySpark functions that the notebook uses.
-1. **Ingest Data** - Ingest data from the Azure Open Dataset **NycTlcYellow** into a local dataframe for processing. The code extracts data within a specific time period; you can modify the start and end dates to get different data.
-1. Downsample the dataset to make development faster. You can modify this step to change the sample size or the sampling seed.
-1. **Exploratory Data Analysis** - Display charts to view the data. This can give you an idea of what data prep might be needed before creating the model.
-1. **Data Prep and Featurization** - Filter out outlier data discovered through visualization and create some useful derived variables.
-1. **Data Prep and Featurization Part 2** - Drop unneeded columns and create some additional features.
-1. **Encoding** - Convert string variables to numbers that the Logistic Regression model expects.
-1. **Generation of Testing and Training Data Sets** - Split the data into separate testing and training data sets. You can modify the fraction and randomizing seed used to split the data.
-1. **Train the Model** - Train a Logistic Regression model and display its "Area under ROC" metric to see how well the model is working. This step also saves the trained model in case you want to use it elsewhere.
-1. **Evaluate and Visualize** - Plot the model's ROC curve to further evaluate the model.
-
 ## Save the notebook

 Save your notebook by selecting **Publish** on the workspace command bar.
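
For reference, the modeling steps that the removed walkthrough describes (train/test split, Logistic Regression training, and the "Area under ROC" metric) look roughly like this in PySpark MLlib. This is a minimal sketch, not the notebook's actual code: the DataFrame `taxi_df`, the label column `tipped`, and the feature column names are assumptions.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble assumed numeric columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["passengerCount", "tripDistance", "fareAmount"],
    outputCol="features")
data = assembler.transform(taxi_df)  # taxi_df: assumed prepared DataFrame

# Split into training and test sets; the fraction and seed are tunable.
train, test = data.randomSplit([0.7, 0.3], seed=1234)

# Train a Logistic Regression model on the assumed binary label `tipped`.
lr = LogisticRegression(labelCol="tipped", featuresCol="features")
model = lr.fit(train)

# Evaluate with the "Area under ROC" metric mentioned in the cell list.
evaluator = BinaryClassificationEvaluator(labelCol="tipped",
                                          metricName="areaUnderROC")
print(evaluator.evaluate(model.transform(test)))
```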

articles/synapse-analytics/machine-learning/quickstart-integrate-azure-machine-learning.md

Lines changed: 6 additions & 5 deletions
@@ -5,7 +5,7 @@ ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
 ms.reviewer: sngun, garye
-ms.date: 12/16/2021
+ms.date: 02/29/2024
 author: nelgson
 ms.author: negust
 ms.custom: mode-other
@@ -14,8 +14,9 @@ ms.custom: mode-other
 # Quickstart: Create a new Azure Machine Learning linked service in Synapse

 > **IMPORTANT, PLEASE NOTE THE BELOW LIMITATIONS:**
-> - **The Azure ML integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure ML using private endpoints, you can set up a managed AzureML private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
+> - **The Azure Machine Learning integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure Machine Learning using private endpoints, you can set up a managed Azure Machine Learning private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
 > - **The Azure Machine Learning linked service is not supported with self-hosted integration runtimes.** This applies to Synapse workspaces with and without Data Exfiltration Protection.
+> - **The Azure Synapse Spark 3.3 and 3.4 runtimes do not support using the Azure Machine Learning linked service to authenticate to the Azure Machine Learning MLFlow tracking URI.** To learn more about the limitations on these runtimes, see [Azure Synapse Runtime for Apache Spark 3.3](../spark/apache-spark-33-runtime.md) and [Azure Synapse Runtime for Apache Spark 3.4](../spark/apache-spark-34-runtime.md).

 In this quickstart, you'll link an Azure Synapse Analytics workspace to an Azure Machine Learning workspace. Linking these workspaces allows you to leverage Azure Machine Learning from various experiences in Synapse.


@@ -46,13 +47,13 @@ In the following sections, you'll find guidance on how to create an Azure Machin

 This section explains how to create an Azure Machine Learning linked service in Azure Synapse, using the [Azure Synapse workspace Managed Identity](../../data-factory/data-factory-service-identity.md?context=/azure/synapse-analytics/context/context&tabs=synapse-analytics).

-### Give MSI permission to the Azure ML workspace
+### Give MSI permission to the Azure Machine Learning workspace

 1. Navigate to your Azure Machine Learning workspace resource in the Azure portal and select **Access Control**.

 1. Create a role assignment and add your Synapse workspace Managed Service Identity (MSI) as a *contributor* of the Azure Machine Learning workspace. Note that this requires being an owner of the resource group that the Azure Machine Learning workspace belongs to. If you have trouble finding your Synapse workspace MSI, search for the name of the Synapse workspace.

-### Create an Azure ML linked service
+### Create an Azure Machine Learning linked service

 1. In the Synapse workspace where you want to create the new Azure Machine Learning linked service, go to **Manage** > **Linked services**, and create a new linked service with type "Azure Machine Learning".

@@ -94,7 +95,7 @@ This step will create a new Service Principal. If you want to use an existing Se

 ![Assign contributor role](media/quickstart-integrate-azure-machine-learning/quickstart-integrate-azure-machine-learning-createsp-00c.png)

-### Create an Azure ML linked service
+### Create an Azure Machine Learning linked service

 1. In the Synapse workspace where you want to create the new Azure Machine Learning linked service, go to **Manage** > **Linked services**, and create a new linked service with type "Azure Machine Learning".

articles/synapse-analytics/spark/apache-spark-data-visualization-tutorial.md

Lines changed: 12 additions & 9 deletions
@@ -5,7 +5,7 @@ author: midesa
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: machine-learning
-ms.date: 10/20/2020
+ms.date: 02/29/2024
 ms.author: midesa
 ---

@@ -35,14 +35,17 @@ Create an Apache Spark Pool by following the [Create an Apache Spark pool tutori
 3. Because the raw data is in a Parquet format, you can use the Spark context to pull the file into memory as a DataFrame directly. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, we use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.

    ```python
-   from azureml.opendatasets import NycTlcYellow
-   from datetime import datetime
-   from dateutil import parser
-
-   end_date = parser.parse('2018-06-06')
-   start_date = parser.parse('2018-05-01')
-   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-   df = nyc_tlc.to_spark_dataframe()
+   from azureml.opendatasets import NycTlcYellow
+
+   from datetime import datetime
+   from dateutil import parser
+
+   end_date = parser.parse('2018-05-08 00:00:00')
+   start_date = parser.parse('2018-05-01 00:00:00')
+
+   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
+   filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
+
    ```

 4. After the data is read, we'll want to do some initial filtering to clean the dataset. We might remove unneeded columns and add columns that extract important information. In addition, we'll filter out anomalies within the dataset.
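
As a rough illustration of the filtering and derived columns that step 4 describes, a PySpark sketch might look like the following. The column names (`passengerCount`, `tripDistance`, `fareAmount`, `tpepPickupDateTime`) are assumed from the NycTlcYellow dataset, and the thresholds are illustrative, not the tutorial's actual code:

```python
from pyspark.sql.functions import col, dayofweek

# Keep only the columns of interest, drop anomalous rows, and derive a
# day-of-week column; names and thresholds are illustrative assumptions.
df = filtered_df.select(
        "passengerCount", "tripDistance", "fareAmount", "tpepPickupDateTime"
    ).filter(
        (col("passengerCount") > 0)
        & (col("tripDistance") > 0)
        & (col("fareAmount") > 0)
    ).withColumn(
        "day_of_week", dayofweek(col("tpepPickupDateTime"))
    )
```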

articles/synapse-analytics/spark/apache-spark-machine-learning-mllib-notebook.md

Lines changed: 9 additions & 4 deletions
@@ -6,7 +6,7 @@ ms.service: synapse-analytics
 ms.reviewer: sngun
 ms.topic: tutorial
 ms.subservice: machine-learning
-ms.date: 02/15/2022
+ms.date: 02/29/2024
 ms.author: negust
 ms.custom: subject-rbac-steps

@@ -73,10 +73,15 @@ Because the raw data is in a Parquet format, you can use the Spark context to pu
 ```python
 from azureml.opendatasets import NycTlcYellow

-end_date = parser.parse('2018-06-06')
-start_date = parser.parse('2018-05-01')
+from datetime import datetime
+from dateutil import parser
+
+end_date = parser.parse('2018-05-08 00:00:00')
+start_date = parser.parse('2018-05-01 00:00:00')
+
 nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-filtered_df = nyc_tlc.to_spark_dataframe()
+filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
+
 ```

 2. The downside to simple filtering is that, from a statistical perspective, it might introduce bias into the data. Another approach is to use the sampling built into Spark.
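
The built-in sampling mentioned in step 2 is `DataFrame.sample`. A minimal sketch against the `filtered_df` created above, with an illustrative fraction and seed:

```python
# Random sample instead of a hard filter; fraction and seed are illustrative.
sampled_df = filtered_df.sample(withReplacement=False, fraction=0.1, seed=42)
print(sampled_df.count())
```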
