|
| 1 | +--- |
| 2 | +title: 'Tutorial: Getting started with Azure Synapse Analytics' |
| 3 | +description: Step-by-step guide to quickly understand the basic concepts in Azure Synapse |
| 4 | +services: synapse-analytics |
| 5 | +author: saveenr |
| 6 | +ms.author: saveenr |
| 7 | +manager: julieMSFT |
| 8 | +ms.reviewer: jrasnick |
| 9 | +ms.service: synapse-analytics |
| 10 | +ms.topic: quickstart |
| 11 | +ms.date: 05/19/2020 |
| 12 | +--- |
| 13 | + |
| 14 | +# Getting Started with Azure Synapse Analytics |
| 15 | + |
| 16 | +This tutorial guides you through the basic steps needed to set up and use Azure Synapse Analytics. |
| 17 | + |
| 18 | +## Prepare a storage account for use with a Synapse workspace |
| 19 | + |
| 20 | +* Open the [Azure portal](https://portal.azure.com) |
| 21 | +* Create a new storage account with the following settings: |
| 22 | + * In the **Basics** tab |
| 23 | + |
| 24 | + |Setting | Suggested value | Description | |
| 25 | + |---|---|---| |
| 26 | + |**Storage account name**| You can give it any name.|In this document, we'll refer to it as `contosolake`. |
| 27 | + |**Account kind**|Must be set to `StorageV2`|| |
| 28 | + |**Location**|You can pick any location| We recommend your Synapse workspace and Azure Data Lake Storage (ADLS) Gen2 account are in the same region.| |
| 29 | + |||| |
| 30 | + |
| 31 | + * In the **Advanced** tab |
| 32 | + |
| 33 | + |Setting | Suggested value | Description | |
| 34 | + |---|---|---| |
| 35 | + |**Data Lake Storage Gen2**|`Enabled`| Azure Synapse only works with storage accounts where this setting is enabled.| |
| 36 | + |||| |
| 37 | + |
| 38 | +* Once the storage account is created, make these role assignments or ensure they are already assigned. While in the storage account, select **Access control (IAM)** from the left navigation. |
| 39 | + * Assign yourself to the **Owner** role on the storage account |
| 40 | + * Assign yourself to the **Storage Blob Data Owner** role on the Storage Account |
| 41 | +* From the left navigation, select **Containers** and create a container. You can give it any name. Accept the default **Public access level**. In this document, we will call the container `users`. Select **Create**. |
| 42 | + |
| 43 | +## Create a Synapse workspace |
| 44 | + |
| 45 | +* Open the [Azure portal](https://portal.azure.com) and at the top search for `Synapse`. |
| 46 | +* In the search results under **Services**, select **Azure Synapse Analytics (workspaces preview)** |
| 47 | +* Select **+ Add** |
| 48 | +* **Basics** tab: |
| 49 | + |
| 50 | + |Setting | Suggested value | Description | |
| 51 | + |---|---|---| |
| 52 | + |**Workspace name**|You can call it anything.| In this document, we will use `myworkspace` |
| 53 | + |**Region**|Match the region of the storage account|| |
| 54 | + ||| |
| 55 | + |
| 56 | +* Under **Select Data Lake Storage Gen 2** select the account and container you previously created |
| 57 | + |
| 58 | +> [!NOTE] |
| 59 | +> The storage account chosen here will be referred to as the "primary" storage account of the Synapse workspace |
| 60 | +
|
| 61 | +* Select **Review + create**. Select **Create**. Your workspace will be ready in a few minutes. |
| 62 | + |
| 63 | +## Verify the Synapse workspace MSI has access to the storage account |
| 64 | + |
| 65 | +This may have already been done for you. In any case, you should verify. |
| 66 | + |
| 67 | +* Open the [Azure portal](https://portal.azure.com) and open the primary storage account chosen for your workspace. |
| 68 | +* Ensure that the following assignment exists or create it if it doesn't |
| 69 | + * The **Storage Blob Data Contributor** role on the storage account, assigned to your workspace. |
| 70 | + * To assign this role to the workspace, select the **Storage Blob Data Contributor** role, leave the default **Assign access to** value, and in the **Select** box type the name of your workspace. Select **Save**. |
| 71 | + |
| 72 | +## Launch Synapse Studio |
| 73 | + |
| 74 | +Once your Synapse workspace is created, you have two ways to open Synapse Studio: |
| 75 | +* Open your Synapse workspace in the [Azure portal](https://portal.azure.com) and at the top of the **Overview** section select **Launch Synapse Studio** |
| 76 | +* Go directly to https://web.azuresynapse.net and log in to your workspace. |
| 77 | + |
| 78 | +## Create a SQL pool |
| 79 | + |
| 80 | +* In Synapse Studio, on the left side navigate to **Manage > SQL pools** |
| 81 | +* NOTE: All Synapse workspaces come with a pre-created pool called **SQL on-demand**. |
| 82 | +* Select **+New** and enter these settings: |
| 83 | + |
| 84 | + |Setting | Suggested value | |
| 85 | + |---|---| |
| 86 | + |**SQL pool name**| `SQLDB1`| |
| 87 | + |**Performance level**|`DW100C`| |
| 88 | +* Select **Review+create** and then select **Create**. |
| 89 | +* Your pool will be ready in a few minutes. |
| 90 | + |
| 91 | +> [!NOTE] |
| 92 | +> A Synapse SQL pool corresponds to what used to be called an "Azure SQL Data Warehouse" |
| 93 | +
|
| 94 | +* A SQL pool consumes billable resources as long as it's running. So, you can pause the pool when needed to reduce costs. |
| 95 | +* When your SQL pool is created, it will be associated with a SQL pool database also called **SQLDB1**. |
| 96 | + |
| 97 | +## Create an Apache Spark pool for Azure Synapse Analytics |
| 98 | + |
| 99 | +* In Synapse Studio, on the left side select **Manage > Apache Spark pools** |
| 100 | +* Select **+New** and enter these settings: |
| 101 | + |
| 102 | + |Setting | Suggested value | |
| 103 | + |---|---| |
| 104 | + |**Apache Spark pool name**|`Spark1` |
| 105 | + |**Node size**| `Small`| |
| 106 | + |**Number of nodes**| Set the minimum to 3 and the maximum to 3| |
| 107 | + ||| |
| 108 | + |
| 109 | +* Select **Review+create** and then select **Create**. |
| 110 | +* Your Apache Spark pool will be ready in a few seconds. |
| 111 | + |
| 112 | +> [!NOTE] |
| 113 | +> Despite the name, an Apache Spark pool is not like a SQL pool. It's just some basic metadata that you use to inform the Synapse workspace how to interact with Spark. |
| 114 | +
|
| 115 | +* Because they are metadata, Spark pools cannot be started or stopped. |
| 116 | +* When you do any Spark activity in Synapse, you specify a Spark pool to use. The pool informs Synapse how many Spark resources to use. You pay only for the resources that are used. When you stop actively using the pool, the resources will automatically time out and be recycled. |
| 117 | +> [!NOTE] |
| 118 | +> Spark databases are created independently of Spark pools. A workspace always has a Spark database called **default**, and you can create additional Spark databases. |
| 119 | +
|
| 120 | +## SQL on-demand pools |
| 121 | + |
| 122 | +SQL on-demand is a special kind of SQL pool that is always available with a Synapse workspace. It allows you to work with SQL without having to create or think about managing a Synapse SQL pool. |
| 123 | + |
| 124 | +> [!NOTE] |
| 125 | +> Unlike the other kinds of pools, billing for SQL on-demand is based on the amount of data scanned to run the query - and not the number of resources used to execute the query. |
| 126 | +
|
| 127 | +* SQL on-demand also has its own kind of SQL on-demand databases that exist independently from any SQL on-demand pool. |
| 128 | +* Currently a workspace always has exactly one SQL on-demand pool named **SQL on-demand**. |
| 129 | + |
| 130 | +## Load the NYC Taxi Sample data into the SQLDB1 database |
| 131 | + |
| 132 | +* In Synapse Studio, in the top-most blue menu, select the **?** icon. |
| 133 | +* Select **Getting started > Getting started hub** |
| 134 | +* In the card labeled **Query sample data** select the SQL pool named `SQLDB1` |
| 135 | +* Select **Query data**. You will see a notification saying "Loading sample data" which will appear and then disappear. |
| 136 | +* You'll see a light-blue notification bar near the top of Synapse Studio indicating that data is being loaded into SQLDB1. Wait until it turns green then dismiss it. |
| 137 | + |
| 138 | +## Explore the NYC taxi data in the SQL Pool |
| 139 | + |
| 140 | +* In Synapse Studio, navigate to the **Data** hub |
| 141 | +* Navigate to **SQLDB1 > Tables**. You'll see several tables have been loaded. |
| 142 | +* Right-click on the **dbo.Trip** table and select **New SQL Script > Select TOP 100 Rows** |
| 143 | +* A new SQL script will be created and automatically run. |
| 144 | +* Notice that at the top of the SQL script **Connect to** is automatically set to the SQL pool called SQLDB1. |
| 145 | +* Replace the text of the SQL script with this code and run it. |
| 146 | + |
| 147 | + ```sql |
| 148 | + SELECT PassengerCount, |
| 149 | + SUM(TripDistanceMiles) as SumTripDistance, |
| 150 | + AVG(TripDistanceMiles) as AvgTripDistance |
| 151 | + FROM dbo.Trip |
| 152 | + WHERE TripDistanceMiles > 0 AND PassengerCount > 0 |
| 153 | + GROUP BY PassengerCount |
| 154 | + ORDER BY PassengerCount |
| 155 | + ``` |
| 156 | + |
| 157 | +* This query shows how the total trip distances and average trip distance relate to the number of passengers |
| 158 | +* In the SQL script result window change the **View** to **Chart** to see a visualization of the results as a line chart |
| 159 | + |
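To make the aggregation concrete, the following is a minimal pure-Python sketch of what the query above computes: it filters out rows with a non-positive distance or passenger count, groups by `PassengerCount`, and returns the sum and average of `TripDistanceMiles` per group. The `sample_trips` rows and the `passenger_count_stats` helper are hypothetical and exist only for illustration; they are not part of the NYC Taxi sample data or of any Synapse API.

```python
from collections import defaultdict

# Hypothetical sample rows: (PassengerCount, TripDistanceMiles).
# These values are made up for illustration, not taken from dbo.Trip.
sample_trips = [
    (1, 2.0),
    (1, 4.0),
    (2, 3.0),
    (2, 0.0),  # filtered out: TripDistanceMiles must be > 0
    (0, 5.0),  # filtered out: PassengerCount must be > 0
    (3, 6.0),
]


def passenger_count_stats(trips):
    """Return (PassengerCount, SumTripDistance, AvgTripDistance) tuples,
    ordered by PassengerCount, mirroring the GROUP BY query."""
    distances_by_count = defaultdict(list)
    for passenger_count, distance in trips:
        # WHERE TripDistanceMiles > 0 AND PassengerCount > 0
        if distance > 0 and passenger_count > 0:
            distances_by_count[passenger_count].append(distance)

    stats = []
    # ORDER BY PassengerCount
    for passenger_count in sorted(distances_by_count):
        distances = distances_by_count[passenger_count]
        total = sum(distances)
        stats.append((passenger_count, total, total / len(distances)))
    return stats


if __name__ == "__main__":
    for row in passenger_count_stats(sample_trips):
        print(row)
```

Each returned tuple corresponds to one row of the SQL result, so the chart view plots one point per passenger count for each of the two measures.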
| 160 | +## Create a Spark database and load the NYC taxi data into it |
| 161 | + |
| 162 | +We have data available in a SQL pool database. Now we load it into a Spark database. |
| 163 | + |
| 164 | +* In Synapse Studio, navigate to the **Develop** hub |
| 165 | +* Select **+** and select **Notebook** |
| 166 | +* At the top of the notebook, set the **Attach to** value to `Spark1` |
| 167 | +* Select **Add code** to add a notebook code cell and paste the text below: |
| 168 | +
|
| 169 | + ```scala |
| 170 | + %%spark |
| 171 | + spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi") |
| 172 | + val df = spark.read.sqlanalytics("SQLDB1.dbo.Trip") |
| 173 | + df.write.mode("overwrite").saveAsTable("nyctaxi.trip") |
| 174 | + ``` |
| 175 | +
|
| 176 | + * Navigate to the Data hub, right-click on databases and select **Refresh** |
| 177 | + * Now you should see these databases: |
| 178 | + * SQLDB1 (SQL pool) |
| 179 | + * nyctaxi (Spark) |
| 180 | + |
| 181 | + ## Analyze the NYC Taxi data using Spark and notebooks |
| 182 | +
|
| 183 | + * Return to your notebook |
| 184 | + * Create a new code cell, enter the text below, and run the cell |
| 185 | +
|
| 186 | + ```py |
| 187 | + %%pyspark |
| 188 | + df = spark.sql("SELECT * FROM nyctaxi.trip") |
| 189 | + display(df) |
| 190 | + ``` |
| 191 | +
|
| 192 | + * Run this code to perform the same analysis we did earlier with the SQL pool |
| 193 | +
|
| 194 | + ```py |
| 195 | + %%pyspark |
| 196 | + df = spark.sql(""" |
| 197 | + SELECT PassengerCount, |
| 198 | + SUM(TripDistanceMiles) as SumTripDistance, |
| 199 | + AVG(TripDistanceMiles) as AvgTripDistance |
| 200 | + FROM nyctaxi.trip |
| 201 | + WHERE TripDistanceMiles > 0 AND PassengerCount > 0 |
| 202 | + GROUP BY PassengerCount |
| 203 | + ORDER BY PassengerCount |
| 204 | + """) |
| 205 | + display(df) |
| 206 | + df.write.mode("overwrite").saveAsTable("nyctaxi.passengercountstats") |
| 207 | + ``` |
| 208 | +
|
| 209 | + * In the cell results, select **Chart** to see the data visualized |
| 210 | + |
| 211 | +## Customize data visualization with Spark and notebooks |
| 212 | +
|
| 213 | +With Spark notebooks you can control exactly how charts are rendered. The following |
| 214 | +code shows a simple example using the popular libraries matplotlib and seaborn. It will |
| 215 | +render the same chart you saw when running the SQL queries earlier. |
| 216 | +
|
| 217 | +```py |
| 218 | +%%pyspark |
| 219 | +import matplotlib.pyplot |
| 220 | +import seaborn |
| 221 | +
|
| 222 | +seaborn.set(style = "whitegrid") |
| 223 | +df = spark.sql("SELECT * FROM nyctaxi.passengercountstats") |
| 224 | +df = df.toPandas() |
| 225 | +seaborn.lineplot(x="PassengerCount", y="SumTripDistance" , data = df) |
| 226 | +seaborn.lineplot(x="PassengerCount", y="AvgTripDistance" , data = df) |
| 227 | +matplotlib.pyplot.show() |
| 228 | +``` |
| 229 | + |
| 230 | +## Load data from a Spark table into a SQL pool table |
| 231 | +
|
| 232 | +Earlier we copied data from a SQL pool database into a Spark database. Using |
| 233 | +Spark, we aggregated the data into the `nyctaxi.passengercountstats` table. |
| 234 | +Now run the cell below in a notebook to copy the aggregated table back into |
| 235 | +the SQL pool database. |
| 236 | +
|
| 237 | +```scala |
| 238 | +%%spark |
| 239 | +val df = spark.sql("SELECT * FROM nyctaxi.passengercountstats") |
| 240 | +df.write.sqlanalytics("SQLDB1.dbo.PassengerCountStats", Constants.INTERNAL ) |
| 241 | +``` |
| 242 | +
|
| 243 | +## Analyze NYC taxi data in Spark databases using SQL on-demand |
| 244 | +
|
| 245 | +* Tables in Spark databases are automatically visible and queryable by SQL on-demand |
| 246 | +* In Synapse Studio navigate to the Develop hub and create a new SQL script |
| 247 | +* Set **Connect to** to **SQL on-demand** |
| 248 | +* Paste the following text into the script: |
| 249 | +
|
| 250 | + ```sql |
| 251 | + SELECT * |
| 252 | + FROM nyctaxi.dbo.passengercountstats |
| 253 | + ``` |
| 254 | +
|
| 255 | +* Select **Run** |
| 256 | +* NOTE: The first time you run this, it will take about 10 seconds for SQL on-demand to gather the SQL resources needed to run your queries. Subsequent queries will not require this time. |
| 257 | + |
| 258 | +## Use pipelines to orchestrate activities |
| 259 | +
|
| 260 | +You can orchestrate a wide variety of tasks in Azure Synapse. In this section, you'll see how easy it is. |
| 261 | +
|
| 262 | +* In Synapse Studio, navigate to the Orchestrate hub. |
| 263 | +* Select **+** then select **Pipeline**. A new pipeline will be created. |
| 264 | +* Navigate to the Develop hub and find any of the notebooks you previously created. |
| 265 | +* Drag that notebook into the pipeline. |
| 266 | +* In the pipeline select **Add trigger > New/edit**. |
| 267 | +* In **Choose trigger** select **New**, and then in recurrence set the trigger to run every 1 hour. |
| 268 | +* Select **OK**. |
| 269 | +* Select **Publish All** and the pipeline will run every hour. |
| 270 | +* If you want to make the pipeline run now without waiting for the next hour, select **Add trigger > Trigger now**. |
| 271 | +
|
| 272 | +## Working with data in a storage account |
| 273 | +
|
| 274 | +So far, we've covered scenarios where data resided in databases. Now we'll show how Azure Synapse can analyze simple files in a storage account. In this scenario we'll use the storage account and container that we linked the workspace to. |
| 275 | +
|
| 276 | +The name of the storage account: contosolake |
| 277 | +The name of the container in the storage account: users |
| 278 | +
|
| 279 | +### Creating CSV and Parquet files in your Storage account |
| 280 | +
|
| 281 | +Run the following code in a notebook. It creates CSV and Parquet data in the storage account. |
| 282 | +
|
| 283 | +```py |
| 284 | +%%pyspark |
| 285 | +df = spark.sql("SELECT * FROM nyctaxi.passengercountstats") |
| 286 | +df = df.repartition(1) # This ensures we'll get a single file during write() |
| 287 | +df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats.csv") |
| 288 | +df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet") |
| 289 | +``` |
| 290 | +
|
| 291 | +### Analyzing data in a storage account |
| 292 | +
|
| 293 | +* In Synapse Studio, navigate to the **Data** hub |
| 294 | +* Select **Linked** |
| 295 | +* Navigate to **Storage accounts > workspacename (Primary - contosolake)** |
| 296 | +* Select **users (Primary)** |
| 297 | +* You should see a folder called `NYCTaxi`. Inside you should see two folders, `PassengerCountStats.csv` and `PassengerCountStats.parquet`. |
| 298 | +* Navigate into the `PassengerCountStats.parquet` folder. |
| 299 | +* Right-click on the parquet file inside and select **New notebook**. It will create a notebook with a cell like this: |
| 300 | +
|
| 301 | + ```py |
| 302 | + %%pyspark |
| 303 | + data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet') |
| 304 | + data_path.show(100) |
| 305 | + ``` |
| 306 | +
|
| 307 | +* Run the cell to analyze the parquet file with spark. |
| 308 | +* Right-click on the parquet file inside, and select **New SQL script > SELECT TOP 100 rows**. It will create a SQL script like this: |
| 309 | +
|
| 310 | + ```sql |
| 311 | + SELECT TOP 100 * |
| 312 | + FROM OPENROWSET( |
| 313 | + BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', |
| 314 | + FORMAT='PARQUET' |
| 315 | + ) AS [r]; |
| 316 | + ``` |
| 317 | + |
| 318 | +* The script will be attached to **SQL on-demand**. Run the script. Notice that it infers the schema from the parquet file. |
| 319 | +
|
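The two generated snippets above address the same file in two ways: the Spark cell uses an `abfss://` URI and the SQL on-demand script uses an `https://` URL. The following is a small sketch of how those strings relate to the storage account and container names used in this tutorial. The helper names `abfss_uri`, `https_url`, and `openrowset_query` are hypothetical and exist only for illustration; they are not part of any Synapse or Azure SDK.

```python
def abfss_uri(account, container, path):
    """Build the abfss:// form used in the generated Spark notebook cell."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def https_url(account, container, path):
    """Build the https:// form used in the generated OPENROWSET script."""
    return f"https://{account}.dfs.core.windows.net/{container}/{path.lstrip('/')}"


def openrowset_query(url, top=100):
    """Assemble a SELECT TOP query shaped like the generated SQL script."""
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT='PARQUET'\n"
        ") AS [r];"
    )


if __name__ == "__main__":
    # Names taken from this tutorial: account `contosolake`, container `users`.
    folder = "/NYCTaxi/PassengerCountStats.parquet"
    print(abfss_uri("contosolake", "users", folder))
    print(openrowset_query(https_url("contosolake", "users", folder)))
```

The part file name inside the `PassengerCountStats.parquet` folder is generated by Spark, so the name in your workspace will differ from the one shown in the snippets above.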
| 320 | +## Visualize data with Power BI |
| 321 | +
|
| 322 | +Your data can now be easily analyzed and visualized in Power BI. Synapse offers a unique integration that allows you to link a Power BI workspace to your Synapse workspace. Before starting, first follow the steps in this [quickstart](quickstart-power-bi.md) to link your Power BI workspace. |
| 323 | +
|
| 324 | +### Create a Power BI Workspace and link it to your Synapse Workspace |
| 325 | +
|
| 326 | +* Log into [powerbi.microsoft.com](https://powerbi.microsoft.com/). |
| 327 | +* Create a new Power BI workspace called `NYCTaxiWorkspace1`. |
| 328 | +* In Synapse Studio, navigate to **Manage > Linked Services**. |
| 329 | +* Select **+ New** and select **Connect to Power BI** and set these fields: |
| 330 | +
|
| 331 | + |Setting | Suggested value | |
| 332 | + |---|---| |
| 333 | + |**Name**|`NYCTaxiWorkspace1`| |
| 334 | + |**Workspace name**|`NYCTaxiWorkspace1`| |
| 335 | + ||| |
| 336 | + |
| 337 | +* Select **Create**. |
| 338 | +
|
| 339 | +### Create a Power BI dataset that uses data in your Synapse workspace |
| 340 | +
|
| 341 | +* In Synapse Studio, navigate to **Develop > Power BI**. |
| 342 | +* Navigate to **NYCTaxiWorkspace1 > Power BI datasets** and select **New Power BI dataset**. |
| 343 | +* Hover over the SQLDB1 database and select **Download .pbids file**. |
| 344 | +* Open the downloaded `.pbids` file. This will launch Power BI Desktop and automatically connect it to SQLDB1 in your Synapse workspace. |
| 345 | +* If you see a dialog appear called **SQL server database**: |
| 346 | + * Select **Microsoft account**. |
| 347 | + * Select **Sign in** and log in. |
| 348 | + * Select **Connect**. |
| 349 | +* The **Navigator** dialog will open. When it does, check the **PassengerCountStats** table and select **Load**. |
| 350 | +* A **Connection settings** dialog will appear. Select **DirectQuery** and select **OK** |
| 351 | +* Select the **Report** button on the left. |
| 352 | +* Add **Line chart** to your report. |
| 353 | + * Drag the **PassengerCount** column to **Visualizations > Axis** |
| 354 | + * Drag the **SumTripDistance** and **AvgTripDistance** columns to **Visualizations > Values**. |
| 355 | +* In the **Home** tab, select **Publish**. |
| 356 | +* It will ask you if you want to save your changes. Select **Save**. |
| 357 | +* It will ask you to pick a filename. Choose `PassengerAnalysis.pbix` and select **Save**. |
| 358 | +* It will ask you to **Select a destination**. Select `NYCTaxiWorkspace1` and then select **Select**. |
| 359 | +* Wait for publishing to finish. |
| 360 | +
|
| 361 | +### Configure authentication for your dataset |
| 362 | +
|
| 363 | +* Open [powerbi.microsoft.com](https://powerbi.microsoft.com/) and **Sign in** |
| 364 | +* At the left, under **Workspaces** select the `NYCTaxiWorkspace1` workspace that you published to. |
| 365 | +* Inside that workspace you should see a dataset called `PassengerAnalysis` and a report called `PassengerAnalysis`. |
| 366 | +* Hover over the `PassengerAnalysis` dataset and select the icon with the three dots and select **Settings**. |
| 367 | +* In **Data source credentials** set the Authentication method to **OAuth2** and select **Sign in**. |
| 368 | +
|
| 369 | +### Edit a report in Synapse Studio |
| 370 | +
|
| 371 | +* Go back to Synapse Studio and select **Close and refresh**. Now you should see: |
| 372 | + * Under **Power BI datasets**, a new dataset called **PassengerAnalysis**. |
| 373 | + * Under **Power BI reports**, a new report called **PassengerAnalysis**. |
| 374 | +* Click on the **PassengerAnalysis** report. |
| 375 | + * It won't show anything because you still need to configure authentication for the dataset. |
| 376 | +* In Synapse Studio, navigate to **Develop > Power BI > Your workspace name > Power BI reports**. |
| 377 | +* Close any windows showing the Power BI report. |
| 378 | +* Refresh the **Power BI reports** node. |
| 379 | +* Select the report and now you can edit the report directly within Synapse Studio. |
| 380 | + |
| 381 | +## Monitor activities |
| 382 | + |
| 383 | +* In Synapse Studio, navigate to the **Monitor** hub. |
| 384 | +* In this location you can see a history of all the activities taking place in the workspace and which ones are active now. |
| 385 | +* Explore the **Pipeline runs**, **Apache Spark applications**, and **SQL requests** and you can see what you've already done in the workspace. |
| 386 | +
|
| 387 | +## Next steps |
| 388 | +
|
| 389 | +Learn more about [Azure Synapse Analytics (preview)](overview-what-is.md) |
| 390 | +
|