Commit 553a805

Merge branch 'patch-3' of https://github.com/saveenr/azure-docs-pr into 20200521_getstarted

2 parents 672f937 + 400bada

File tree: 5 files changed (+116, −116 lines)


articles/synapse-analytics/get-started.md

Lines changed: 69 additions & 61 deletions
@@ -35,7 +35,7 @@ This tutorial will guide you through all the basic steps needed to setup and use
 |**Data Lake Storage Gen2**|`Enabled`| Azure Synapse only works with storage accounts where this setting is enabled.|
 ||||
 
-* Once the storage account is created, make these role assignments or ensure they are already assigned. While in the storage account, select **Access control (IAM)** from the left navigation.
+* Once the storage account is created, select **Access control (IAM)** from the left navigation. Then assign the following roles or ensure they are already assigned.
   * Assign yourself to the **Owner** role on the storage account
   * Assign yourself to the **Storage Blob Data Owner** role on the storage account
 * From the left navigation, select **Containers** and create a container. You can give it any name. Accept the default **Public access level**. In this document, we will call the container `users`. Select **Create**.
@@ -56,7 +56,9 @@ This tutorial will guide you through all the basic steps needed to setup and use
 * Under **Select Data Lake Storage Gen 2** select the account and container you previously created
 
 > [!NOTE]
-> The storage account chosen here will be referred to as the "primary" storage account of the Synapse workspace
+> We refer to the storage account chosen here as the "primary" storage account of the Synapse workspace. This account
+> is used for storing data in Apache Spark tables and for logs created when Spark pools are created or Spark applications
+> run.
 
 * Select **Review + create**. Select **Create**. Your workspace will be ready in a few minutes.
 
@@ -65,9 +67,9 @@ This tutorial will guide you through all the basic steps needed to setup and use
 This may have already been done for you. In any case, you should verify.
 
 * Open the [Azure portal](https://portal.azure.com) and open the primary storage account chosen for your workspace.
-* Ensure that the following assignment exists or create it if it doesn't
-  * Storage Blob Data Contributor role on the storage account to your workspace.
-  * To assign this role to the workspace select the Storage Blob Data Contributor role, leave the default **Assign access to** and in the **Select** box type the name of your workspace. Select **Save**.
+* Select **Access control (IAM)** from the left navigation. Then assign the following roles or ensure they are already assigned.
+* Assign the workspace identity to the **Storage Blob Data Contributor** role on the storage account. The workspace identity has the same name as the workspace. In this document, the workspace name is `myworkspace`, so the workspace identity is also `myworkspace`.
+* Select **Save**.
 
 ## Launch Synapse Studio
 
@@ -86,15 +88,15 @@ Once your Synapse workspace is created, you have two ways to open Synapse Studio
 |**SQL pool name**| `SQLDB1`|
 |**Performance level**|`DW100C`|
 * Select **Review+create** and then select **Create**.
-* Your pool will be ready in a few minutes.
+* Your SQL pool will be ready in a few minutes.
 
 > [!NOTE]
 > A Synapse SQL pool corresponds to what used to be called an "Azure SQL Data Warehouse"
 
 * A SQL pool consumes billable resources as long as it's running. So, you can pause the pool when needed to reduce costs.
 * When your SQL pool is created, it will be associated with a SQL pool database also called **SQLDB1**.
 
-## Create an Apache Spark pool for Azure Synapse Analytics
+## Create an Apache Spark pool
 
 * In Synapse Studio, on the left side select **Manage > Apache Spark pools**
 * Select **+New** and enter these settings:
@@ -117,12 +119,9 @@ Once your Synapse workspace is created, you have two ways to open Synapse Studio
 > [!NOTE]
 > Spark databases are independently created from Spark pools. A workspace always has a Spark DB called **default** and you can create additional Spark databases.
 
-## SQL on-demand pools
+## The SQL on-demand pool
 
-SQL on-demand is a special kind of SQL pool that is always available with a Synapse workspace. It allows you to work with SQL without having to create or think about managing a Synapse SQL pool.
-
-> [!NOTE]
-> Unlike the other kinds of pools, billing for SQL on-demand is based on the amount of data scanned to run the query - and not the number of resources used to execute the query.
+Every workspace comes with a built-in pool called **SQL on-demand** that cannot be deleted. The SQL on-demand pool allows you to work with SQL without having to create or manage a Synapse SQL pool. Unlike the other kinds of pools, billing for SQL on-demand is based on the amount of data scanned to run the query, not the number of resources used to execute the query.
 
 * SQL on-demand also has its own kind of SQL on-demand databases that exist independently from any SQL on-demand pool.
 * Currently a workspace always has exactly one SQL on-demand pool named **SQL on-demand**.
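Because SQL on-demand bills per data scanned rather than per provisioned resource, you can estimate a query's cost directly from the bytes it reads. A minimal sketch of that billing model; the per-terabyte price below is a placeholder, not the actual Azure rate:

```py
# Hypothetical illustration of scanned-data billing.
# PRICE_PER_TB is a made-up placeholder, not the real Azure rate.
PRICE_PER_TB = 5.0

def estimate_query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB) -> float:
    """Estimate query cost in dollars from bytes scanned (decimal terabytes)."""
    tb_scanned = bytes_scanned / 10**12
    return tb_scanned * price_per_tb

# A query scanning 2 TB costs 2 * PRICE_PER_TB under this model
print(estimate_query_cost(2 * 10**12))  # 10.0
```

The same query run twice against the same data is billed twice under this model, which is why pruning the data a query touches (partitioning, column selection) matters more here than with a provisioned pool.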
@@ -157,11 +156,11 @@ SQL on-demand is a special kind of SQL pool that is always available with a Syna
 * This query shows how the total trip distances and average trip distance relate to the number of passengers
 * In the SQL script result window change the **View** to **Chart** to see a visualization of the results as a line chart
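The aggregation that query performs can be illustrated with a toy version in plain pandas. The trip rows and column names below are made up for illustration, not the actual NYC Taxi schema:

```py
import pandas as pd

# Made-up toy trips: (passenger count, trip distance in miles)
trips = pd.DataFrame({
    "PassengerCount": [1, 1, 2, 2, 3],
    "TripDistance":   [2.0, 4.0, 3.0, 5.0, 6.0],
})

# Total and average trip distance per passenger count,
# mirroring the shape of the SQL aggregation
stats = (
    trips.groupby("PassengerCount")["TripDistance"]
    .agg(SumTripDistance="sum", AvgTripDistance="mean")
    .reset_index()
)
print(stats)
```

Plotting `AvgTripDistance` against `PassengerCount` gives the same kind of line chart the **Chart** view renders.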

-## Create a Spark database and load the NYC taxi data into it
+## Load the NYC Taxi Sample data into the Spark nyctaxi database
 
-We have data available in a SQL pool database. Now we load it into a Spark database.
+We have data available in a table in `SQLDB1`. Now we load it into a Spark database named `nyctaxi`.
 
-* In Synapse Studio, navigate to the **Develop hub"
+* In Synapse Studio, navigate to the **Develop** hub
 * Select **+** and select **Notebook**
 * At the top of the notebook, set the **Attach to** value to `Spark1`
 * Select **Add code** to add a notebook code cell and paste the text below:
@@ -173,23 +172,24 @@ We have data available in a SQL pool database. Now we load it into a Spark datab
 df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
 ```
 
-* Navigate to the Data hub, right-click on databases and select **Refresh**
+* Navigate to the **Data** hub, right-click on **Databases** and select **Refresh**
 * Now you should see these databases:
   * SQLDB (SQL pool)
   * nyctaxi (Spark)
 
 ## Analyze the NYC Taxi data using Spark and notebooks
 
 * Return to your notebook
-* Create a new code cell, enter the text below, and run the cell
+* Create a new code cell, enter the text below, and run the cell to examine the NYC taxi data we loaded into the `nyctaxi` Spark DB.
 
 ```py
 %%pyspark
 df = spark.sql("SELECT * FROM nyctaxi.trip")
 display(df)
 ```
 
-* Run this code to perform the same analysis we did earlier with the SQL pool
+* Run the following code to perform the same analysis we did earlier with the SQL pool `SQLDB1`. This code also saves the results of the analysis into a table called `nyctaxi.passengercountstats` and visualizes the results.
 
 ```py
 %%pyspark
@@ -210,9 +210,9 @@ We have data available in a SQL pool database. Now we load it into a Spark datab
 
 ## Customize data visualization with Spark and notebooks
 
-With spark notebooks you can control exactly how render charts. The following
-code shows a simple example using the popular libraries matplotlib and sea-born. It will
-render the same chart you saw when running the SQL queries earlier.
+With notebooks you can control how charts are rendered. The following
+code shows a simple example using the popular libraries `matplotlib` and `seaborn`. It will
+render the same kind of line chart you saw when running the SQL queries earlier.
 
 ```py
 %%pyspark
@@ -229,39 +229,39 @@ matplotlib.pyplot.show()
 
 ## Load data from a Spark table into a SQL pool table
 
-Earlier we copied data from a SQL pool database into a Spark DB. Using
-Spark, we aggregated the data into the nyctaxi.passengercountstats.
-Now run the cell below in a notebook and it will copy the aggregated table back into
-the SQL pool database.
+Earlier we copied data from a SQL pool table `SQLDB1.dbo.Trip` into a Spark table `nyctaxi.trip`. Then, using
+Spark, we aggregated the data into the Spark table `nyctaxi.passengercountstats`. Now we will copy the data
+from `nyctaxi.passengercountstats` into a SQL pool table called `SQLDB1.dbo.PassengerCountStats`.
+
+Run the cell below in your notebook. It will copy the aggregated Spark table back into
+the SQL pool table.
 
 ```scala
 %%spark
 val df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
 df.write.sqlanalytics("SQLDB1.dbo.PassengerCountStats", Constants.INTERNAL)
 ```
 
-## Analyze NYC taxi data in Spark databases using SQL-on demand
+## Analyze NYC taxi data in Spark databases using SQL on-demand
 
-* Tables in Spark databases are automatically visible and queryable by SQL on-demand
-* In Synapse Studio navigate to the Develop hub and create a new SQL script
+* Tables in Spark databases are automatically visible and queryable by SQL on-demand.
+* In Synapse Studio, navigate to the **Develop** hub and create a new SQL script
 * Set **Connect to** to **SQL on-demand**
-* Paste the following text into the script:
+* Paste the following text into the script and run the script.
 
 ```sql
 SELECT *
 FROM nyctaxi.dbo.passengercountstats
 ```
-
-* Select **Run**
-* NOTE: THe first time you run this it will take about 10 seconds for SQL on-demand to gather SQL resources needed to run your queries. Subsequent queries will not require this time.
+* NOTE: The first time you run a query that uses SQL on-demand, it will take about 10 seconds for SQL on-demand to gather the SQL resources needed to run your queries. Subsequent queries will not require this time and will be much faster.
 
-## Use pipelines to orchestrate activities
+## Orchestrate activities with pipelines
 
 You can orchestrate a wide variety of tasks in Azure Synapse. In this section, you'll see how easy it is.
 
-* In Synapse Studio, navigate to the Orchestrate hub.
+* In Synapse Studio, navigate to the **Orchestrate** hub.
 * Select **+** then select **Pipeline**. A new pipeline will be created.
-* Navigate to the Develop hub and find any of the notebooks you previously created.
+* Navigate to the **Develop** hub and find the notebook you previously created.
 * Drag that notebook into the pipeline.
 * In the pipeline select **Add trigger > New/edit**.
 * In **Choose trigger** select **New**, and then in recurrence set the trigger to run every 1 hour.
@@ -271,14 +271,14 @@ You can orchestrate a wide variety of tasks in Azure Synapse. In this section, y
 
 ## Working with data in a storage account
 
-So far, we've covered scenarios were data resided in databases. Now we'll show how Azure Synapse can analyze simple files in a storage account. In this scenario we'll use the storage account and container that we linked the workspace to.
+So far, we've covered scenarios where data resided in databases in the workspace. Now we'll show how to work with files in storage accounts. In this scenario, we'll use the primary storage account of the workspace and the container we specified when creating the workspace.
 
-The name of the storage account: contosolake
-The name of the container in the storage account: users
+* The name of the storage account: `contosolake`
+* The name of the container in the storage account: `users`
 
 ### Creating CSV and Parquet files in your Storage account
 
-Run the the following code in a notebook. It creates a CSV and parquet data in the storage account
+Run the following code in a notebook. It creates a CSV file and a parquet file in the storage account.
 
 ```py
 %%pyspark
@@ -292,39 +292,47 @@ df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet")
 
 * In Synapse Studio, navigate to the **Data** hub
 * Select **Linked**
-* Navigate to **Storage accounts > workspacename (Primary - contosolake)**
+* Navigate to **Storage accounts > myworkspace (Primary - contosolake)**
 * Select **users (Primary)**
-* You should see a folder called `NYCTaxi`. Inside you should see two folders `PassengerCountStats.csv` and `PassengerCountStats.parquet`.
+* You should see a folder called `NYCTaxi`. Inside it you should see two folders: `PassengerCountStats.csv` and `PassengerCountStats.parquet`.
 * Navigate into the `PassengerCountStats.parquet` folder.
-* Right-click on the parquet file inside, and select new notebook, it will create a notebook with a cell like this:
+* Right-click on the parquet file inside, and select **New notebook**. It will create a notebook with a cell like this:
 
 ```py
 %%pyspark
 data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet')
 data_path.show(100)
 ```
 
-* Run the cell to analyze the parquet file with spark.
-* Right-click on the parquet file inside, and select New **SQL script > SELECT TOP 100 rows**, it will create a notebook with a cell like this:
+* Run the cell.
+* Right-click on the parquet file inside, and select **New SQL script > SELECT TOP 100 rows**. It will create a SQL script like this:
 
-```py
+```sql
 SELECT TOP 100 *
 FROM OPENROWSET(
     BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
     FORMAT='PARQUET'
 ) AS [r];
 ```
 
-* The script will be attached to **SQL on-demand** run the script. Notice that it infers the schema from the parquet file.
+* In the script the **Attach to** field will be set to **SQL on-demand**.
+* Run the script. Notice that SQL on-demand infers the schema from the parquet file.
 
 ## Visualize data with Power BI
 
-Your data can now be easily analyzed and visualized in Power BI. Synapse offers a unique integration which allows you to link a Power BI workspace to you Synapse workspace. Before starting, first follow the steps in this [quickstart](quickstart-power-bi.md) to link your Power BI workspace.
+From the NYC taxi data, we created aggregated datasets in two tables:
+* `nyctaxi.passengercountstats`
+* `SQLDB1.dbo.PassengerCountStats`
+
+You can link a Power BI workspace to your Synapse workspace. This allows you to easily get data into your Power BI workspace, and you can edit your Power BI reports directly in your Synapse workspace.
 
-### Create a Power BI Workspace and link it to your Synapse Workspace
+### Create a Power BI Workspace
 
 * Log into [powerbi.microsoft.com](https://powerbi.microsoft.com/).
 * Create a new Power BI workspace called `NYCTaxiWorkspace1`.
+
+### Link your Synapse Workspace to your new Power BI workspace
+
 * In Synapse Studio, navigate to **Manage > Linked Services**.
 * Select **+ New**, select **Connect to Power BI**, and set these fields:
 
@@ -340,8 +348,9 @@ Your data can now be easily analyzed and visualized in Power BI. Synapse offers
 
 * In Synapse Studio, navigate to **Develop > Power BI**.
 * Navigate to **NYCTaxiWorkspace1 > Power BI datasets** and select **New Power BI dataset**.
-* Hover over the SQLDB1 database and select **Download .pbids file**.
-* Open the downloaded `.pbids` file. This will launch Power BI desktop and automatically connect it to SQLDB1 in your synapse workspace.
+* Hover over the `SQLDB1` database and select **Download .pbids file**.
+* Open the downloaded `.pbids` file.
+* This will launch Power BI Desktop and automatically connect it to `SQLDB1` in your Synapse workspace.
 * If you see a dialog appear called **SQL server database**:
   * Select **Microsoft account**.
   * Select **Sign in** and log in.
@@ -361,22 +370,21 @@ Your data can now be easily analyzed and visualized in Power BI. Synapse offers
 ### Configure authentication for your dataset
 
 * Open [powerbi.microsoft.com](https://powerbi.microsoft.com/) and **Sign in**
-* At the left, under **Workspaces** select the the `NYCTaxiWorkspace1` workspace that you published to.
+* At the left, under **Workspaces**, select the `NYCTaxiWorkspace1` workspace.
 * Inside that workspace you should see a dataset called `Passenger Analysis` and a report called `Passenger Analysis`.
 * Hover over the `PassengerAnalysis` dataset, select the icon with the three dots, and select **Settings**.
-* In **Data source credentials** set the Authentication method to **OAuth2** and select **Sign in**.
+* In **Data source credentials** set the **Authentication method** to **OAuth2** and select **Sign in**.
 
 ### Edit a report in Synapse Studio
 
-* Go back to Synapse Studio and select **Close and refresh** now you should see:
-  * Under **Power BI datasets**, a new dataset called **PassengerAnalysis**.
-  * Under **Power BI datasets**, a new report called **PassengerAnalysis**.
-* Click on the **PassengerAnalysis** report.
-* It won't show anything because you still need to configure authentication for the dataset.
-* In Synapse Studio, navigate to **Develop > Power BI > Your workspace name > Power BI reports**.
-* Close any windows showing the Power BI report.
-* Refresh the **Power BI reports** node.
-* Select the report and now you can edit the report directly within Synapse Studio.
+* Go back to Synapse Studio and select **Close and refresh**
+* Navigate to the **Develop** hub
+* Hover over **Power BI**, select the three dots, and refresh the **Power BI reports** node.
+* Now under **Power BI** you should see:
+  * Under **NYCTaxiWorkspace1 > Power BI datasets**, a new dataset called **PassengerAnalysis**.
+  * Under **NYCTaxiWorkspace1 > Power BI reports**, a new report called **PassengerAnalysis**.
+* Click on the **PassengerAnalysis** report.
+* The report will open and now you can edit the report directly within Synapse Studio.
 
 ## Monitor activities
 
0 commit comments
