articles/synapse-analytics/machine-learning/quickstart-gallery-sample-notebook.md (+4 −22)
@@ -4,9 +4,9 @@ description: Learn how to use a sample notebook from the Synapse Analytics galle
 ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
-ms.date: 06/11/2021
-author: WilliamDAssafMSFT
-ms.author: wiassaf
+ms.date: 02/29/2024
+author: midesa
+ms.author: midesa
 ms.custom: mode-other
 ---
@@ -27,7 +27,7 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 1. Open your workspace and select **Learn** from the home page.
 1. In the **Knowledge center**, select **Browse gallery**.
 1. In the gallery, select **Notebooks**.
-1. Find and select the notebook "Data Exploration and ML Modeling - NYC taxi predict using Spark MLib".
+1. Find and select a notebook from the gallery.
 
    :::image type="content" source="media\quickstart-gallery-sample-notebook\gallery-select-ml-notebook.png" alt-text="Select the machine learning sample notebook in the gallery.":::
@@ -38,24 +38,6 @@ This notebook demonstrates the basic steps used in creating a model: **data impo
 
 1. In the **Attach to** menu in the open notebook, select your Apache Spark pool.
 
-## Run the notebook
-
-The notebook is divided into multiple cells that each perform a specific function.
-You can manually run each cell, run cells sequentially, or select **Run all** to run all the cells.
-
-Here are descriptions for each of the cells in the notebook:
-
-1. Import PySpark functions that the notebook uses.
-1. **Ingest Date** - Ingest data from the Azure Open Dataset **NycTlcYellow** into a local dataframe for processing. The code extracts data within a specific time period - you can modify the start and end dates to get different data.
-1. Downsample the dataset to make development faster. You can modify this step to change the sample size or the sampling seed.
-1. **Exploratory Data Analysis** - Display charts to view the data. This can give you an idea what data prep might be needed before creating the model.
-1. **Data Prep and Featurization** - Filter out outlier data discovered through visualization and create some useful derived variables.
-1. **Data Prep and Featurization Part 2** - Drop unneeded columns and create some additional features.
-1. **Encoding** - Convert string variables to numbers that the Logistic Regression model is expecting.
-1. **Generation of Testing and Training Data Sets** - Split the data into separate testing and training data sets. You can modify the fraction and randomizing seed used to split the data.
-1. **Train the Model** - Train a Logistic Regression model and display its "Area under ROC" metric to see how well the model is working. This step also saves the trained model in case you want to use it elsewhere.
-1. **Evaluate and Visualize** - Plot the model's ROC curve to further evaluate the model.
-
 ## Save the notebook
 
 Save your notebook by selecting **Publish** on the workspace command bar.
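For context on the walkthrough removed above, here is a minimal PySpark sketch of the workflow those cells described: ingest the **NycTlcYellow** open dataset, downsample, derive features, split into training and test sets, train a Logistic Regression model, and report area under ROC. The date range, the column names `tipAmount`, `fareAmount`, and `tripDistance`, and the use of the `azureml-opendatasets` package are assumptions for illustration, not text from the original notebook.

```python
from datetime import datetime

from azureml.opendatasets import NycTlcYellow
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Ingest a bounded slice of the NycTlcYellow open dataset; widen the
# dates to pull more data (date range assumed for illustration).
raw = NycTlcYellow(
    start_date=datetime(2018, 5, 1), end_date=datetime(2018, 5, 7)
).to_spark_dataframe()

# Downsample for faster development; adjust fraction/seed as needed.
sampled = raw.sample(fraction=0.1, seed=42)

# Derive a binary "tipped" label and keep two numeric features
# (column names assumed from the public dataset schema).
prepped = (
    sampled.withColumn("tipped", when(col("tipAmount") > 0, 1).otherwise(0))
    .select("tipped", "fareAmount", "tripDistance")
    .na.drop()
)

# Assemble features, split the data, and train the model.
assembled = VectorAssembler(
    inputCols=["fareAmount", "tripDistance"], outputCol="features"
).transform(prepped)
train, test = assembled.randomSplit([0.7, 0.3], seed=42)

model = LogisticRegression(labelCol="tipped", featuresCol="features").fit(train)

# Score the held-out test set by area under ROC.
auc = BinaryClassificationEvaluator(
    labelCol="tipped", metricName="areaUnderROC"
).evaluate(model.transform(test))
print(f"Area under ROC: {auc:.3f}")
```

In a Synapse notebook the `spark` session is predefined; the explicit `getOrCreate()` just keeps the sketch self-contained.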
articles/synapse-analytics/machine-learning/quickstart-integrate-azure-machine-learning.md (+2 −1)
@@ -5,7 +5,7 @@ ms.service: synapse-analytics
 ms.subservice: machine-learning
 ms.topic: quickstart
 ms.reviewer: sngun, garye
-ms.date: 12/16/2021
+ms.date: 02/29/2024
 author: nelgson
 ms.author: negust
 ms.custom: mode-other
@@ -16,6 +16,7 @@ ms.custom: mode-other
 > **IMPORTANT, PLEASE NOTE THE BELOW LIMITATIONS:**
 > - **The Azure ML integration is not currently supported in Synapse Workspaces with Data Exfiltration Protection.** If you are **not** using data exfiltration protection and want to connect to Azure ML using private endpoints, you can set up a managed AzureML private endpoint in your Synapse workspace. [Read more about managed private endpoints](../security/how-to-create-managed-private-endpoints.md)
 > - **AzureML linked service is not supported with self hosted integration runtimes.** This applies to Synapse workspaces with and without Data Exfiltration Protection.
+> - **The Azure Synapse Spark 3.3 and 3.4 runtimes do not support using the Azure ML Linked Service to authenticate to the Azure Machine Learning MLFlow tracking URI.** To learn more about the limitations on these runtimes, see [Azure Synapse Runtime for Apache Spark 3.3](../spark/apache-spark-33-runtime.md) and [Azure Synapse Runtime for Apache Spark 3.4](../spark/apache-spark-34-runtime.md)
 
 In this quickstart, you'll link an Azure Synapse Analytics workspace to an Azure Machine Learning workspace. Linking these workspaces allows you to leverage Azure Machine Learning from various experiences in Synapse.
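As a possible workaround on the affected runtimes, MLflow can be pointed at the Azure ML tracking server directly instead of authenticating through the linked service. This is a sketch under assumptions, not a documented procedure: it presumes the `azureml-core` and `mlflow` packages are installed on the Spark pool, and the workspace identifiers below are placeholders.

```python
import mlflow
from azureml.core import Workspace

# Placeholder identifiers -- substitute your own workspace details.
ws = Workspace.get(
    name="<aml-workspace-name>",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)

# Set the MLflow tracking URI from the workspace itself rather than
# relying on the Synapse linked service for authentication.
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("synapse-spark-experiment")

# Log to Azure ML as usual once the tracking URI is set.
with mlflow.start_run():
    mlflow.log_metric("example_metric", 1.0)
```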
articles/synapse-analytics/spark/apache-spark-data-visualization-tutorial.md (+11 −8)
@@ -35,14 +35,17 @@ Create an Apache Spark Pool by following the [Create an Apache Spark pool tutori
 3. Because the raw data is in a Parquet format, you can use the Spark context to pull the file into memory as a DataFrame directly. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, we use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.
 4. After the data is read, we'll want to do some initial filtering to clean the dataset. We might remove unneeded columns and add columns that extract important information. In addition, we'll filter out anomalies within the dataset.
 2. The downside to simple filtering is that, from a statistical perspective, it might introduce bias into the data. Another approach is to use the sampling built into Spark (see the sketch after this list).
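A minimal sketch of the read, filter, and sample flow described above. It assumes a Synapse notebook where `spark` is predefined, the public Azure Open Datasets storage path for the NYC TLC yellow taxi data, and the `tripDistance`/`fareAmount` column names.

```python
# Pull the Parquet files straight into a DataFrame; Spark infers the
# datatypes and schema on read (path assumed from Azure Open Datasets).
df = spark.read.parquet(
    "wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/"
)

# Initial filtering: drop obvious anomalies. Note that hand-picked
# filters like these can introduce statistical bias.
filtered = df.filter((df.tripDistance > 0) & (df.fareAmount > 0))

# Alternative: Spark's built-in random sampling shrinks the working set
# while keeping the remaining data representative.
sampled = filtered.sample(fraction=0.01, seed=42)
```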