Commit 4df5b18: Update get-started.md
Author: Saveen Reddy
1 parent f448889

articles/synapse-analytics/get-started.md

Lines changed: 45 additions & 0 deletions
@@ -209,6 +209,51 @@ You can orchestrate a wide variety of tasks in Azure Synapse. In this section, y
* Click **Publish All** and the pipeline will run every hour
* If you want to make the pipeline run now without waiting for the next hour, click **Add trigger > New/edit**

## Working with data in a storage account

So far, we've covered scenarios where data resided in databases. Now we'll show how Synapse Analytics can analyze simple files in a storage account. In this scenario, we'll use the storage account and container that we linked the workspace to:

* The name of the storage account: `contosolake`
* The name of the container in the storage account: `users`
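
As a quick sanity check that the linked container is reachable, you can list its contents from a notebook. This is a minimal sketch, assuming the `mssparkutils` file-system helper available in Synapse Spark notebooks; the URI is assembled from the account and container names above:

```
%%pyspark
# mssparkutils is Synapse's built-in file-system utility for notebooks.
from notebookutils import mssparkutils

# List the root of the linked container (account: contosolake, container: users).
files = mssparkutils.fs.ls("abfss://users@contosolake.dfs.core.windows.net/")
for f in files:
    print(f.name, f.isDir)
```
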
### Creating CSV and Parquet files in your storage account

Run the following code in a notebook. It creates CSV and Parquet data in the storage account.

```
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we get a single file during write()
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats.csv")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet")
```
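
To verify the write, you can read both outputs back in the same notebook. A minimal check (note that the relative paths resolve against the workspace's primary storage, i.e. the `users` container in `contosolake`):

```
%%pyspark
# Read the two outputs back. Spark treats each path as a folder of part
# files, so this works whatever the generated file names are.
csv_df = spark.read.csv("/NYCTaxi/PassengerCountStats.csv")
parquet_df = spark.read.parquet("/NYCTaxi/PassengerCountStats.parquet")
print(csv_df.count(), parquet_df.count())
```
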
### Analyzing data in a storage account

* In Synapse Studio, navigate to the **Data** hub
* Select **Linked**
* Navigate to **Storage accounts > workspacename (Primary - contosolake)**
* Click on **users (Primary)**
* You should see a folder called `NYCTaxi`. Inside, you should see two folders: `PassengerCountStats.csv` and `PassengerCountStats.parquet`
* Navigate into the `PassengerCountStats.parquet` folder
* Right-click on the Parquet file inside, and select **New notebook**. It will create a notebook with a cell like this:

```
%%pyspark
data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet')
data_path.show(100)
```
* Run the cell to analyze the Parquet file with Spark (a folder-based variation is sketched after this list)
* Right-click on the Parquet file inside, and select **New SQL script > SELECT TOP 100 rows**. It will create a SQL script like this:

```
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
        FORMAT='PARQUET'
    ) AS [r];
```
* The script will be attached to **SQL on-demand**. Run the script. Notice that it infers the schema from the Parquet file.
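
The part file name used in the paths above (`part-00000-...`) is generated by Spark and changes on every write. A more durable variation, offered here as a sketch rather than a required step, is to point Spark at the folder instead of an individual file:

```
%%pyspark
# Reading the folder picks up whichever part files the last write produced,
# so the cell keeps working after the data is rewritten.
df = spark.read.parquet('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet')
df.show(100)
```
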
## Visualize data with Power BI
