---
title: Getting started with Azure Synapse Analytics
description: Step-by-step guidance to quickly understand the basic concepts in Azure Synapse
services: synapse-analytics
author: saveenr
ms.author: jrasnick
manager: julieMSFT
ms.reviewer: jrasnick
ms.service: synapse-analytics
ms.topic: quickstart
ms.date: 05/19/2020
---

# Getting Started with Azure Synapse Analytics
This tutorial guides you through all the basic steps needed to use Azure Synapse Analytics.

## Prepare a storage account for use with a Synapse workspace
* Open the [Azure portal](https://portal.azure.com)
* Create a new Storage account with the following settings:
    * In the **Basics** tab:
        * **Storage account name** - you can give it any name. In this document we'll refer to it as `contosolake`
        * **Account kind** - must be set to `StorageV2`
        * **Location** - you can pick any location, but it's recommended that your Synapse workspace and ADLS Gen2 account be in the same region
    * In the **Advanced** tab:
        * **Data Lake Storage Gen2** - set to `Enabled`. Azure Synapse only works with storage accounts where this setting is enabled.
* Click **Review + create**. Click **Create**.
* Once the storage account is created, perform these role assignments, or ensure they are already assigned:
    * Assign yourself to the **Owner** role on the storage account
    * Assign yourself to the **Storage Blob Data Owner** role on the storage account
* Create a container. You can give it any name. In this document we will use the name `users`.
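
If you prefer to script these steps, here is a minimal sketch using the Azure SDK for Python (`azure-identity`, `azure-mgmt-storage`, and `azure-storage-file-datalake`). The subscription ID, resource group, and region are placeholders, and the role assignments above still need to be granted separately:

```
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, "<subscription-id>")

# Create a StorageV2 account with the hierarchical namespace enabled,
# which is what the "Data Lake Storage Gen2: Enabled" setting maps to
poller = storage_client.storage_accounts.begin_create(
    "<resource-group>",
    "contosolake",
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,
    },
)
poller.result()

# Create the "users" container (a "file system" in ADLS Gen2 terms)
dls = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net", credential=credential
)
dls.create_file_system("users")
```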

## Create a Synapse workspace
* Open the [Azure portal](https://portal.azure.com) and at the top search for `Synapse`.
* In the search results, under **Services**, click **Azure Synapse Analytics (workspaces preview)**
* Click **+ Add**
* Key settings in the **Basics** tab:
    * **Workspace name** - you can call it anything. In this document we will use `myworkspace`
    * **Region** - match the region of the storage account
* Under **Select Data Lake Storage Gen2**, select the account and container you previously created
    * NOTE: The storage account chosen here will be referred to as the "primary" storage account of the Synapse workspace
* Click **Review + create**. Click **Create**. Your workspace will be ready in a few minutes.
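
Workspace creation can also be scripted. The following is a hedged sketch using `azure-mgmt-synapse` (the exact model shapes can vary by SDK version; the resource group, region, and SQL admin credentials are placeholders):

```
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

credential = DefaultAzureCredential()
synapse_client = SynapseManagementClient(credential, "<subscription-id>")

poller = synapse_client.workspaces.begin_create_or_update(
    "<resource-group>",
    "myworkspace",
    {
        "location": "eastus",  # match the storage account's region
        "identity": {"type": "SystemAssigned"},  # the workspace MSI
        "default_data_lake_storage": {  # the "primary" storage account
            "account_url": "https://contosolake.dfs.core.windows.net",
            "filesystem": "users",
        },
        "sql_administrator_login": "sqladminuser",
        "sql_administrator_login_password": "<strong-password>",
    },
)
workspace = poller.result()
print(workspace.connectivity_endpoints)
```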

## Verify the Synapse workspace MSI has access to the storage account
This may have already been done for you. In any case, you should verify.

* Open the [Azure portal](https://portal.azure.com) and open the primary storage account chosen for your workspace
* Ensure that the following role assignment exists, or create it if it doesn't:
    * Assign `myworkspace` (the workspace MSI always has the same name as the workspace) to the **Storage Blob Data Contributor** role on the storage account
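
To check this programmatically, one option is to list the role assignments at the storage account scope with `azure-mgmt-authorization` (a sketch; the scope string below is a placeholder for your storage account's resource ID):

```
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

credential = DefaultAzureCredential()
auth_client = AuthorizationManagementClient(credential, "<subscription-id>")

scope = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/contosolake"
)

# Each assignment pairs a principal (such as the workspace MSI) with a role definition
for assignment in auth_client.role_assignments.list_for_scope(scope):
    print(assignment.principal_id, assignment.role_definition_id)
```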

## Launch Synapse Studio
Once your Synapse workspace is created, you have two ways to open Synapse Studio:

* Open your Synapse workspace in the [Azure portal](https://portal.azure.com) and at the top of the **Overview** section click **Launch Synapse Studio**
* Go directly to https://web.azuresynapse.net and log in to your workspace.

## Create a SQL pool
* In Synapse Studio, on the left side, navigate to **Manage > SQL pools**
    * NOTE: All Synapse workspaces come with a pre-created pool called **SQL on-demand**.
* Click **+ New** and enter these settings:
    * For **SQL pool name**, enter `SQLDB1`
    * For **Performance level**, use `DW100C`
* Click **Review + create** and then click **Create**
* Your pool will be ready in a few minutes

NOTE:
* A Synapse SQL pool corresponds to what used to be called an "Azure SQL Data Warehouse"
* A SQL pool consumes billable resources as long as it's running, so you can pause the pool when needed to reduce costs (see the sketch after this list)
* When your SQL pool is created, it will be associated with a SQL pool database, also called **SQLDB1**.
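
As a sketch of how you might script the pause, `azure-mgmt-synapse` exposes pause and resume operations on SQL pools (the subscription ID and resource group are placeholders):

```
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

credential = DefaultAzureCredential()
synapse_client = SynapseManagementClient(credential, "<subscription-id>")

# Pause the pool to stop compute billing; resume it when you need it again
synapse_client.sql_pools.begin_pause("<resource-group>", "myworkspace", "SQLDB1").result()
synapse_client.sql_pools.begin_resume("<resource-group>", "myworkspace", "SQLDB1").result()
```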

## Create an Apache Spark pool

* In Synapse Studio, on the left side, click **Manage > Apache Spark pools**
* Click **+ New** and enter these settings:
    * For **Apache Spark pool name**, enter `Spark1`
    * For **Node size**, select `Small`
    * For **Number of nodes**, set the minimum to 3 and the maximum to 3
* Click **Review + create** and then click **Create**
* Your Spark pool will be ready in a few seconds

NOTE:
* Despite the name, a Spark pool is not like a SQL pool. It's just some basic metadata that you use to inform the Synapse workspace how to interact with Spark.
* Because they are metadata, Spark pools cannot be started or stopped.
* When you do any Spark activity in Synapse, you specify a Spark pool to use. The pool informs Synapse how many Spark resources to use. You pay only for the resources that are used. When you actively stop using the pool, the resources will automatically time out and be recycled.
* NOTE: Spark databases are created independently from Spark pools. A workspace always has a Spark database called **default**, and you can create additional Spark databases.

## SQL on-demand pools
SQL on-demand is a special kind of SQL pool that is always available with a Synapse workspace. It allows you to work with SQL without having to create or think about managing a Synapse SQL pool.

NOTE:
* Unlike the other kinds of pools, billing for SQL on-demand is based on the amount of data scanned to run the query, not the number of resources used to execute the query.
* SQL on-demand also has its own kind of SQL on-demand databases that exist independently from any SQL on-demand pool.
* Currently, a workspace always has exactly one SQL on-demand pool named **SQL on-demand**.

## Load the NYC Taxi sample data into the SQLDB1 database

* In Synapse Studio, in the top-most blue menu, click on the **?** icon.
* Select **Getting started > Getting started hub**
* In the card labeled **Query sample data**, select the SQL pool named `SQLDB1`
* Click **Query data**. A notification saying "Loading sample data" will appear and then disappear.
* You'll see a light-blue notification bar near the top of Synapse Studio indicating that data is being loaded into SQLDB1. Wait until it turns green, then dismiss it.

## Explore the NYC taxi data in the SQL pool

* In Synapse Studio, navigate to the **Data** hub
* Navigate to **SQLDB1 > Tables**. You'll see that several tables have been loaded.
* Right-click on the **dbo.Trip** table and select **New SQL script > Select TOP 100 Rows**
* A new SQL script will be created and automatically run
* Notice that at the top of the SQL script, **Connect to** is automatically set to the SQL pool called SQLDB1
* Replace the text of the SQL script with this code and run it:
```
SELECT PassengerCount,
    SUM(TripDistanceMiles) as SumTripDistance,
    AVG(TripDistanceMiles) as AvgTripDistance
FROM dbo.Trip
WHERE TripDistanceMiles > 0 AND PassengerCount > 0
GROUP BY PassengerCount
ORDER BY PassengerCount
```
* This query shows how the total trip distance and average trip distance relate to the number of passengers
* In the SQL script result window, change the **View** to **Chart** to see a visualization of the results as a line chart

## Create a Spark database and load the NYC taxi data into it
We have data available in a SQL pool database. Now we load it into a Spark database.

* In Synapse Studio, navigate to the **Develop** hub
* Click **+** and select **Notebook**
* At the top of the notebook, set the **Attach to** value to `Spark1`
* Click **Add code** to add a notebook code cell and paste the text below:
```
%%spark
spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
val df = spark.read.sqlanalytics("SQLDB1.dbo.Trip")
df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
```
* Navigate to the **Data** hub, click on **Databases**, and select **Refresh**
* Now you should see these databases:
    * SQLDB1 (SQL pool)
    * nyctaxi (Spark)
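
To confirm the load worked, you can run a quick verification cell like this (a minimal sketch using standard Spark SQL):

```
%%pyspark
spark.sql("SHOW TABLES IN nyctaxi").show()
spark.sql("SELECT COUNT(*) AS row_count FROM nyctaxi.trip").show()
```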

## Analyze the NYC Taxi data using Spark and notebooks
* Return to your notebook
* Create a new code cell, enter the text below, and run the cell:
```
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.trip")
display(df)
```
* Run this code to perform the same analysis we did earlier with the SQL pool:
```
%%pyspark
df = spark.sql("""
    SELECT PassengerCount,
        SUM(TripDistanceMiles) as SumTripDistance,
        AVG(TripDistanceMiles) as AvgTripDistance
    FROM nyctaxi.trip
    WHERE TripDistanceMiles > 0 AND PassengerCount > 0
    GROUP BY PassengerCount
    ORDER BY PassengerCount
""")
display(df)
df.write.saveAsTable("nyctaxi.passengercountstats")
```
* In the cell results, click on **Chart** to see the data visualized
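
Note that if you re-run the cell above, the final `saveAsTable` call will fail because the table already exists. One fix, using standard Spark APIs, is to write in overwrite mode:

```
%%pyspark
df.write.mode("overwrite").saveAsTable("nyctaxi.passengercountstats")
```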

## Customize data visualizations with Spark and notebooks

With Spark notebooks, you can control exactly how you render charts. The following code shows a simple example using the popular libraries matplotlib and seaborn. It will render the same kind of chart you saw when running the SQL queries earlier.

```
%%pyspark
import matplotlib.pyplot
import seaborn

seaborn.set(style="whitegrid")
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.toPandas()
seaborn.lineplot(x="PassengerCount", y="SumTripDistance", data=df)
seaborn.lineplot(x="PassengerCount", y="AvgTripDistance", data=df)
matplotlib.pyplot.show()
```
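
As a small optional variation (same table, standard matplotlib/seaborn options), you can label the two series and add a legend so the chart is self-explanatory:

```
%%pyspark
import matplotlib.pyplot as plt
import seaborn

seaborn.set(style="whitegrid")
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats").toPandas()
# Label each series so the legend can distinguish the two lines
ax = seaborn.lineplot(x="PassengerCount", y="SumTripDistance", data=df, label="SumTripDistance")
seaborn.lineplot(x="PassengerCount", y="AvgTripDistance", data=df, label="AvgTripDistance", ax=ax)
ax.set_ylabel("Trip distance (miles)")
plt.legend()
plt.show()
```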

## Load data from a Spark table into a SQL pool table

Earlier we copied data from a SQL pool database into a Spark database. Using Spark, we aggregated the data into the nyctaxi.passengercountstats table. Now run the cell below in a notebook, and it will copy the aggregated table back into the SQL pool database.

```
%%spark
val df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df.write.sqlanalytics("SQLDB1.dbo.PassengerCountStats", Constants.INTERNAL)
```

## Analyze NYC taxi data in Spark databases using SQL on-demand

* Tables in Spark databases are automatically visible and queryable by SQL on-demand
* In Synapse Studio, navigate to the **Develop** hub and create a new SQL script
* Set **Connect to** to **SQL on-demand**
* Paste the following text into the script:
```
SELECT *
FROM nyctaxi.dbo.passengercountstats
```
* Click **Run**
* NOTE: The first time you run this, it will take about 10 seconds for SQL on-demand to gather the SQL resources needed to run your queries. Subsequent queries will not require this time.
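
Because SQL on-demand is a regular SQL endpoint, you can also query it from outside Synapse Studio. Here's a hedged sketch using `pyodbc`; the `-ondemand` endpoint pattern and the interactive Azure AD sign-in are assumptions you should verify against your workspace's **Overview** page:

```
import pyodbc  # assumes the "ODBC Driver 17 for SQL Server" is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"  # assumed endpoint pattern
    "DATABASE=nyctaxi;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;"
)

for row in conn.cursor().execute("SELECT TOP 5 * FROM dbo.passengercountstats"):
    print(row)
```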

## Use pipelines to orchestrate activities

You can orchestrate a wide variety of tasks in Azure Synapse. In this section, you'll see how easy it is.

* In Synapse Studio, navigate to the **Orchestrate** hub
* Click **+** and then select **Pipeline**. A new pipeline will be created.
* Navigate to the **Develop** hub and find any of the notebooks you previously created
* Drag that notebook into the pipeline
* In the pipeline, click **Add trigger > New/edit**
* In **Choose trigger**, click **New**, and then under **Recurrence** set the trigger to run every hour
* Click **OK**
* Click **Publish All**, and the pipeline will run every hour
* If you want to make the pipeline run now, without waiting for the next hour, click **Add trigger > Trigger now**
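
If you want to trigger the pipeline from code rather than from the Studio UI, a hedged sketch using the `azure-synapse-artifacts` package looks like this (the workspace endpoint and pipeline name are placeholders):

```
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://myworkspace.dev.azuresynapse.net",
)

# Start an on-demand run of the published pipeline
run = client.pipeline.create_pipeline_run("Pipeline 1")
print(run.run_id)
```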

## Working with data in a storage account
So far, we've covered scenarios where data resided in databases. Now we'll show how Synapse Analytics can analyze simple files in a storage account. In this scenario, we'll use the storage account and container that we linked the workspace to:

* The name of the storage account: `contosolake`
* The name of the container in the storage account: `users`

### Creating CSV and Parquet files in your storage account
Run the following code in a notebook. It creates a CSV file and a Parquet file in the storage account:

```
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we'll get a single file during write()
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats.csv")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet")
```
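
To sanity-check the files you just wrote, you can read them back in the same notebook (a small sketch; note that the CSV was written without a header row, so Spark assigns default column names):

```
%%pyspark
spark.read.parquet("/NYCTaxi/PassengerCountStats.parquet").show()
spark.read.csv("/NYCTaxi/PassengerCountStats.csv").show()
```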

### Analyzing data in a storage account

* In Synapse Studio, navigate to the **Data** hub
* Select **Linked**
* Navigate to **Storage accounts > workspacename (Primary - contosolake)**
* Click on **users (Primary)**
* You should see a folder called `NYCTaxi`. Inside, you should see two folders: `PassengerCountStats.csv` and `PassengerCountStats.parquet`
* Navigate into the `PassengerCountStats.parquet` folder
* Right-click on the Parquet file inside and select **New notebook**; it will create a notebook with a cell like this:
```
%%pyspark
data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet')
data_path.show(100)
```

* Run the cell to analyze the Parquet file with Spark
* Right-click on the Parquet file inside and select **New SQL script > SELECT TOP 100 rows**; it will create a SQL script like this:
```
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
    FORMAT='PARQUET'
) AS [r];
```

* The script will automatically be attached to **SQL on-demand**. Run the script. Notice that it infers the schema from the Parquet file.

## Visualize data with Power BI

Your data can now be easily analyzed and visualized in Power BI. Synapse offers a unique integration that allows you to link a Power BI workspace to your Synapse workspace. Before starting, first follow the steps in this [quickstart](quickstart-power-bi.md) to link your Power BI workspace.

### Create a Power BI workspace and link it to your Synapse workspace
* Log in to powerbi.microsoft.com
* Create a new Power BI workspace called `NYCTaxiWorkspace1`
* In Synapse Studio, navigate to **Manage > Linked services**
* Click **+ New**, click **Connect to Power BI**, and set these fields:
    * Set **Name** to `NYCTaxiWorkspace1`
    * Set **Workspace name** to `NYCTaxiWorkspace1`
* Click **Create**

### Create a Power BI dataset that uses data in your Synapse workspace
* In Synapse Studio, navigate to **Develop > Power BI**
* Navigate to **NYCTaxiWorkspace1 > Power BI datasets** and click **New Power BI dataset**
* Hover over the SQLDB1 database and select **Download .pbids file**
* Open the downloaded `.pbids` file. This will launch Power BI Desktop and automatically connect it to SQLDB1 in your Synapse workspace.
* If you see a dialog appear called **SQL Server database**:
    * Select **Microsoft account**
    * Click **Sign in** and log in
    * Click **Connect**
* The **Navigator** dialog will open. When it does, check the **PassengerCountStats** table and click **Load**
* A **Connection settings** dialog will appear. Select **DirectQuery** and click **OK**
* Click on the **Report** button on the left
* Add a **Line chart** to your report
    * Drag the **PassengerCount** column to **Visualizations > Axis**
    * Drag the **SumTripDistance** and **AvgTripDistance** columns to **Visualizations > Values**
* In the **Home** tab, click **Publish**
* It will ask you if you want to save your changes. Click **Save**.
* It will ask you to pick a filename. Choose `PassengerAnalysis.pbix` and click **Save**.
* It will ask you to **Select a destination**. Select `NYCTaxiWorkspace1` and click **Select**.
* Wait for publishing to finish.

### Configure authentication for your dataset
* Open https://powerbi.microsoft.com and **Sign in**
* At the left, under **Workspaces**, select the `NYCTaxiWorkspace1` workspace that you published to
* Inside that workspace, you should see a dataset called `PassengerAnalysis` and a report called `PassengerAnalysis`
* Hover over the `PassengerAnalysis` dataset, click the icon with the three dots, and select **Settings**
* In **Data source credentials**, set the authentication method to **OAuth2** and click **Sign in**

### Edit the report in Synapse Studio
* Go back to Synapse Studio and click **Close and refresh**. Now you should see:
    * Under **Power BI datasets**, a new dataset called **PassengerAnalysis**
    * Under **Power BI reports**, a new report called **PassengerAnalysis**
* Click on the **PassengerAnalysis** report.
    * It won't show anything until authentication for the dataset has been configured (see the previous section)
* In Synapse Studio, navigate to **Develop > Power BI > Your workspace name > Power BI reports**
* Close any windows showing the Power BI report
* Refresh the **Power BI reports** node
* Click on the report, and now you can edit the report directly within Synapse Studio

## Monitor activities

* In Synapse Studio, navigate to the **Monitor** hub.
* Here you can see a history of all the activities taking place in the workspace and which ones are active now.
* Explore **Pipeline runs**, **Apache Spark applications**, and **SQL requests** to see what you've already done in the workspace.