---
title: Getting started with Azure Synapse Analytics
description: Step-by-step guide to quickly understand the basic concepts in Azure Synapse
services: synapse-analytics
author: saveenr
ms.author: jrasnick
manager: julieMSFT
ms.reviewer: jrasnick
ms.service: synapse-analytics
ms.topic: quickstart
ms.date: 05/19/2020
---

# Getting Started with Azure Synapse Analytics
This tutorial guides you through all the basic steps needed to use Azure Synapse Analytics.

## Prepare a storage account for use with a Synapse workspace
* Open the [Azure portal](https://portal.azure.com)
* Create a new storage account with the following settings:
  * In the **Basics** tab
    * **Storage account name** - you can give it any name. In this document we'll refer to it as `contosolake`
    * **Account kind** - must be set to `StorageV2`
    * **Location** - you can pick any location, but it's recommended that your Synapse workspace and ADLS Gen2 account be in the same region
  * In the **Advanced** tab
    * **Data Lake Storage Gen2** - set to `Enabled`. Azure Synapse only works with storage accounts where this setting is enabled.
* Click **Review + create**. Click **Create**.
* Once the storage account is created, perform these role assignments, or ensure they are already assigned:
  * Assign yourself to the **Owner** role on the storage account
  * Assign yourself to the **Storage Blob Data Owner** role on the storage account
* Create a container. You can give it any name. In this document we will use the name `users`. If you prefer to script this step, see the sketch after this list.
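
Here's a minimal sketch of creating the container programmatically with the `azure-identity` and `azure-storage-file-datalake` Python packages. The account and container names are the ones used in this document; the credential assumes you're already signed in locally (for example, via the Azure CLI).

```
# Minimal sketch: create the 'users' container (file system) in the ADLS Gen2 account.
# Assumes: pip install azure-identity azure-storage-file-datalake, and that you are
# signed in with an identity that holds Storage Blob Data Owner on the account.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
service.create_file_system(file_system="users")  # raises ResourceExistsError if it already exists
```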

## Create a Synapse workspace
* Open the [Azure portal](https://portal.azure.com) and at the top search for `Synapse`.
* In the search results under **Services**, click **Azure Synapse Analytics (workspaces preview)**
* Click **+ Add**
* Key settings in the **Basics** tab:
  * **Workspace name** - you can call it anything. In this document we will use `myworkspace`
  * **Region** - match the region of the storage account
  * Under **Select Data Lake Storage Gen 2**, select the account and container you previously created
    * NOTE: The storage account chosen here will be referred to as the "primary" storage account of the Synapse workspace
* Click **Review + create**. Click **Create**. Your workspace will be ready in a few minutes.

## Verify the Synapse workspace MSI has access to the storage account
This may have already been done for you. In any case, you should verify.

* Open the [Azure portal](https://portal.azure.com) and open the primary storage account chosen for your workspace
* Ensure that the following role assignment exists, or create it if it doesn't:
  * Assign `myworkspace` (the workspace's managed identity always has the same name as the workspace) to the **Storage Blob Data Contributor** role on the storage account

## Launch Synapse Studio
Once your Synapse workspace is created, you have two ways to open Synapse Studio:
* Open your Synapse workspace in the [Azure portal](https://portal.azure.com) and at the top of the **Overview** section click **Launch Synapse Studio**
* Go directly to https://web.azuresynapse.net and log in to your workspace.

## Create a SQL pool
* In Synapse Studio, on the left side navigate to **Manage > SQL pools**
* NOTE: All Synapse workspaces come with a pre-created pool called **SQL on-demand**.
* Click **+New** and enter these settings:
  * For **SQL pool name** enter `SQLDB1`
  * For **Performance level** use `DW100C`
* Click **Review+create** and then click **Create**
* Your pool will be ready in a few minutes

NOTE:
* A Synapse SQL pool corresponds to what used to be called an "Azure SQL Data Warehouse"
* A SQL pool consumes billable resources as long as it's running, so you can pause the pool when needed to reduce costs (see the sketch after this list)
* When your SQL pool is created, it will be associated with a SQL pool database also called **SQLDB1**.

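Pausing can also be scripted. The following is a hedged sketch using the `azure-mgmt-synapse` management SDK; the subscription ID and resource group name are placeholders you'd replace with your own.

```
# Hedged sketch: pause the SQLDB1 pool with the azure-mgmt-synapse package.
# Assumes: pip install azure-identity azure-mgmt-synapse; "<subscription-id>" and
# "myresourcegroup" are placeholders for your own values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = client.sql_pools.begin_pause("myresourcegroup", "myworkspace", "SQLDB1")
poller.result()  # blocks until the pause operation completes
```
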
## Create an Apache Spark pool

* In Synapse Studio, on the left side click **Manage > Apache Spark pools**
* Click **+New** and enter these settings:
  * For **Apache Spark pool name** enter `Spark1`
  * For **Node size** select `Small`
  * For **Number of nodes** set the minimum to 3 and the maximum to 3
* Click **Review+create** and then click **Create**
* Your Spark pool will be ready in a few seconds

NOTE:
* Despite the name, a Spark pool is not like a SQL pool. It's just some basic metadata that tells the Synapse workspace how to interact with Spark.
* Because they are metadata, Spark pools cannot be started or stopped.
* When you do any Spark activity in Synapse, you specify a Spark pool to use. The pool tells Synapse how many Spark resources to use. You pay only for the resources that are used. When you stop actively using the pool, the resources automatically time out and are recycled.
* Spark databases are created independently of Spark pools. A workspace always has a Spark database called **default**, and you can create additional Spark databases (see the sketch after this list).

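As a quick illustration, once you have a notebook attached to a Spark pool (the notebook steps appear later in this tutorial), a single cell can list the workspace's Spark databases, including **default**:

```
%%pyspark
# List the Spark databases in the workspace; 'default' is always present.
spark.sql("SHOW DATABASES").show()
```
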
## SQL on-demand pools
SQL on-demand is a special kind of SQL pool that is always available with a Synapse workspace. It allows you to work with SQL without having to create, or think about managing, a Synapse SQL pool.

NOTE:
* Unlike the other kinds of pools, billing for SQL on-demand is based on the amount of data scanned to run the query, not the number of resources used to execute the query.
* SQL on-demand also has its own kind of SQL on-demand databases that exist independently from any SQL on-demand pool.
* Currently a workspace always has exactly one SQL on-demand pool, named **SQL on-demand**. A connection sketch follows this list.

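You can connect to SQL on-demand from any standard SQL client, not just Synapse Studio. The following is a hedged sketch using `pyodbc`; it assumes the on-demand endpoint follows the usual `<workspacename>-ondemand.sql.azuresynapse.net` pattern and that ODBC Driver 17 for SQL Server is installed.

```
# Hedged sketch: query the SQL on-demand endpoint with pyodbc.
# Assumes: pip install pyodbc; ODBC Driver 17 for SQL Server is installed;
# user@contoso.com is a placeholder for your Azure AD account.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;"
)
for row in conn.execute("SELECT name FROM sys.databases"):
    print(row.name)
```
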
## Load the NYC Taxi sample data into the SQLDB1 database

* In Synapse Studio, in the top-most blue menu, click the **?** icon.
* Select **Getting started > Getting started hub**
* In the card labelled **Query sample data**, select the SQL pool named `SQLDB1`
* Click **Query data**. A notification saying "Loading sample data" will appear and then disappear.
* You'll see a light-blue notification bar near the top of Synapse Studio indicating that data is being loaded into SQLDB1. Wait until it turns green, then dismiss it.

## Explore the NYC taxi data in the SQL pool

* In Synapse Studio, navigate to the **Data** hub
* Navigate to **SQLDB1 > Tables**. You'll see several tables have been loaded.
* Right-click the **dbo.Trip** table and select **New SQL script > Select TOP 100 Rows**
* A new SQL script will be created and automatically run
* Notice that at the top of the SQL script, **Connect to** is automatically set to the SQL pool called SQLDB1
* Replace the text of the SQL script with this code and run it:
  ```
  SELECT PassengerCount,
      SUM(TripDistanceMiles) as SumTripDistance,
      AVG(TripDistanceMiles) as AvgTripDistance
  FROM dbo.Trip
  WHERE TripDistanceMiles > 0 AND PassengerCount > 0
  GROUP BY PassengerCount
  ORDER BY PassengerCount
  ```
* This query shows how the total trip distance and average trip distance relate to the number of passengers
* In the SQL script result window, change the **View** to **Chart** to see a visualization of the results as a line chart

## Create a Spark database and load the NYC taxi data into it
We have data available in a SQL pool database. Now we load it into a Spark database.

* In Synapse Studio, navigate to the **Develop** hub
* Click **+** and select **Notebook**
* At the top of the notebook, set the **Attach to** value to `Spark1`
* Click **Add code** to add a notebook code cell and paste the text below:
  ```
  %%spark
  spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
  val df = spark.read.sqlanalytics("SQLDB1.dbo.Trip")
  df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
  ```
* Navigate to the **Data** hub, click **Databases**, and select **Refresh**
* Now you should see these databases:
  * SQLDB1 (SQL pool)
  * nyctaxi (Spark)

## Analyze the NYC Taxi data using Spark and notebooks
* Return to your notebook
* Create a new code cell, enter the text below, and run the cell:
  ```
  %%pyspark
  df = spark.sql("SELECT * FROM nyctaxi.trip")
  display(df)
  ```
* Run this code to perform the same analysis we did earlier with the SQL pool:
  ```
  %%pyspark
  df = spark.sql("""
     SELECT PassengerCount,
         SUM(TripDistanceMiles) as SumTripDistance,
         AVG(TripDistanceMiles) as AvgTripDistance
     FROM nyctaxi.trip
     WHERE TripDistanceMiles > 0 AND PassengerCount > 0
     GROUP BY PassengerCount
     ORDER BY PassengerCount
  """)
  display(df)
  df.write.saveAsTable("nyctaxi.passengercountstats")
  ```
* In the cell results, click **Chart** to see the data visualized

## Customize data visualization with Spark and notebooks

With Spark notebooks, you can control exactly how your charts render. The following
code shows a simple example using the popular libraries matplotlib and seaborn. It will
render the same kind of line chart you saw when running the SQL queries earlier.

```
%%pyspark
import matplotlib.pyplot
import seaborn

seaborn.set(style = "whitegrid")
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.toPandas()
seaborn.lineplot(x="PassengerCount", y="SumTripDistance", data = df)
seaborn.lineplot(x="PassengerCount", y="AvgTripDistance", data = df)
matplotlib.pyplot.show()
```

## Load data from a Spark table into a SQL pool table

Earlier we copied data from a SQL pool database into a Spark database, and using
Spark we aggregated the data into the nyctaxi.passengercountstats table.
Now run the cell below in a notebook to copy the aggregated table back into
the SQL pool database:

```
%%spark
val df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df.write.sqlanalytics("SQLDB1.dbo.PassengerCountStats", Constants.INTERNAL)
```

## Analyze NYC taxi data in Spark databases using SQL on-demand

* Tables in Spark databases are automatically visible to, and queryable by, SQL on-demand
* In Synapse Studio, navigate to the **Develop** hub and create a new SQL script
* Set **Connect to** to **SQL on-demand**
* Paste the following text into the script:
  ```
  SELECT *
  FROM nyctaxi.dbo.passengercountstats
  ```
* Click **Run**
* NOTE: The first time you run this, it will take about 10 seconds for SQL on-demand to gather the SQL resources needed to run your queries. Subsequent queries will not require this time.

## Use pipelines to orchestrate activities

You can orchestrate a wide variety of tasks in Azure Synapse. In this section, you'll see how easy it is.

* In Synapse Studio, navigate to the **Orchestrate** hub
* Click **+**, then select **Pipeline**. A new pipeline will be created.
* Navigate to the **Develop** hub and find any of the notebooks you previously created
* Drag that notebook into the pipeline
* In the pipeline, click **Add trigger > New/edit**
* In **Choose trigger** click **New**, and then under **Recurrence** set the trigger to run every 1 hour
* Click **OK**
* Click **Publish All**, and the pipeline will run every hour
* If you want to make the pipeline run now, without waiting for the next hour, click **Add trigger > Trigger now**

## Working with data in a storage account

So far, we've covered scenarios where data resided in databases. Now we'll show how Synapse Analytics can analyze simple files in a storage account. In this scenario, we'll use the storage account and container that we linked the workspace to:

* The name of the storage account: contosolake
* The name of the container in the storage account: users

### Creating CSV and Parquet files in your storage account
Run the following code in a notebook. It creates a CSV file and a Parquet file in the storage account (a read-back sketch follows the cell):

```
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we'll get a single file during write()
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats.csv")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet")
```

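To confirm the write succeeded, you can read the CSV output back in the same notebook. A minimal sketch (the CSV was written without a header, so Spark assigns generic column names such as `_c0`):

```
%%pyspark
# Read the CSV folder back; Spark treats the whole folder as one dataset.
df_csv = spark.read.csv("/NYCTaxi/PassengerCountStats.csv")
df_csv.show(10)
```
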
### Analyzing data in a storage account

* In Synapse Studio, navigate to the **Data** hub
* Select **Linked**
* Navigate to **Storage accounts > myworkspace (Primary - contosolake)**
* Click **users (Primary)**
* You should see a folder called `NYCTaxi`. Inside, you should see two folders, `PassengerCountStats.csv` and `PassengerCountStats.parquet`
* Navigate into the `PassengerCountStats.parquet` folder
* Right-click the Parquet file inside, and select **New notebook**. It will create a notebook with a cell like this:
  ```
  %%pyspark
  data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet')
  data_path.show(100)
  ```
* Run the cell to analyze the Parquet file with Spark
* Right-click the Parquet file inside, and select **New SQL script > SELECT TOP 100 rows**. It will create a SQL script like this:
  ```
  SELECT TOP 100 *
  FROM OPENROWSET(
      BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
      FORMAT='PARQUET'
  ) AS [r];
  ```
* The script will be attached to **SQL on-demand**. Run the script, and notice that it infers the schema from the Parquet file.

## Visualize data with Power BI

Your data can now be easily analyzed and visualized in Power BI. Synapse offers a unique integration that allows you to link a Power BI workspace to your Synapse workspace. Before starting, first follow the steps in this [quickstart](quickstart-power-bi.md) to link your Power BI workspace.

### Create a Power BI workspace and link it to your Synapse workspace
* Log in to powerbi.microsoft.com
* Create a new Power BI workspace called `NYCTaxiWorkspace1`
* In Synapse Studio, navigate to **Manage > Linked services**
* Click **+ New**, click **Connect to Power BI**, and set these fields:
  * Set **Name** to `NYCTaxiWorkspace1`
  * Set **Workspace name** to `NYCTaxiWorkspace1`
* Click **Create**

### Create a Power BI dataset that uses data in your Synapse workspace
* In Synapse Studio, navigate to **Develop > Power BI**
* Navigate to **NYCTaxiWorkspace1 > Power BI datasets** and click **New Power BI dataset**
* Hover over the SQLDB1 database and select **Download .pbids file**
* Open the downloaded `.pbids` file. This launches Power BI Desktop and automatically connects it to SQLDB1 in your Synapse workspace.
* If a dialog called **SQL Server database** appears:
  * Select **Microsoft account**
  * Click **Sign in** and log in
  * Click **Connect**
* The **Navigator** dialog will open. When it does, check the **PassengerCountStats** table and click **Load**
* A **Connection settings** dialog will appear. Select **DirectQuery** and click **OK**
* Click the **Report** button on the left
* Add a **Line chart** to your report
  * Drag the **PassengerCount** column to **Visualizations > Axis**
  * Drag the **SumTripDistance** and **AvgTripDistance** columns to **Visualizations > Values**
* On the **Home** tab, click **Publish**
* It will ask you if you want to save your changes. Click **Save**.
* It will ask you to pick a filename. Choose `PassengerAnalysis.pbix` and click **Save**
* It will ask you to **Select a destination**. Select `NYCTaxiWorkspace1` and click **Select**
* Wait for publishing to finish

### Configure authentication for your dataset
* Open https://powerbi.microsoft.com and **Sign in**
* At the left, under **Workspaces**, select the `NYCTaxiWorkspace1` workspace that you published to
* Inside that workspace you should see a dataset called `PassengerAnalysis` and a report called `PassengerAnalysis`
* Hover over the `PassengerAnalysis` dataset, click the icon with the three dots, and select **Settings**
* In **Data source credentials**, set the authentication method to **OAuth2** and click **Sign in**

### Edit a report in Synapse Studio
* Go back to Synapse Studio and click **Close and refresh**. Now you should see:
  * Under **Power BI datasets**, a new dataset called **PassengerAnalysis**
  * Under **Power BI reports**, a new report called **PassengerAnalysis**
* Click the **PassengerAnalysis** report.
  * It won't show anything because you still need to configure authentication for the dataset
* In Synapse Studio, navigate to **Develop > Power BI > Your workspace name > Power BI reports**
* Close any windows showing the Power BI report
* Refresh the **Power BI reports** node
* Click the report, and now you can edit the report directly within Synapse Studio

## Monitor activities

* In Synapse Studio, navigate to the **Monitor** hub.
* Here you can see a history of all the activities taking place in the workspace, and which ones are active now.
* Explore **Pipeline runs**, **Apache Spark applications**, and **SQL requests** to see what you've already done in the workspace.