You can orchestrate a wide variety of tasks in Azure Synapse.

* Click **Publish All**; the pipeline will then run every hour
* If you want to run the pipeline now, without waiting for the next hour, click **Add trigger > New/edit**

## Working with data in a storage account

So far, we've covered scenarios where data resided in databases. Now we'll show how Synapse Analytics can analyze simple files in a storage account. In this scenario, we'll use the storage account and container that we linked the workspace to:

* Name of the storage account: `contosolake`
* Name of the container in the storage account: `users`
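The steps below reference these files through ADLS Gen2 `abfss://` URIs. As a quick orientation, here's a minimal sketch of how such a URI is composed from the two names above (the `account` and `container` variable names are just for illustration):

```
%%pyspark
# ADLS Gen2 URI format used for the linked storage:
#   abfss://<container>@<account>.dfs.core.windows.net/<path>
account = 'contosolake'
container = 'users'
adls_path = f'abfss://{container}@{account}.dfs.core.windows.net/NYCTaxi'
print(adls_path)  # abfss://users@contosolake.dfs.core.windows.net/NYCTaxi
```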
### Creating CSV and Parquet files in your storage account

Run the following code in a notebook. It creates CSV and Parquet files in the storage account:

```
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.passengercountstats")
df = df.repartition(1) # This ensures we'll get a single file during write()
df.write.mode("overwrite").csv("/NYCTaxi/PassengerCountStats.csv")
df.write.mode("overwrite").parquet("/NYCTaxi/PassengerCountStats.parquet")
```
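To confirm the files landed where expected, you can list the output folder from the same notebook. A minimal sketch, assuming the Synapse-provided `mssparkutils` file-system helpers are available in your Spark pool:

```
%%pyspark
# mssparkutils ships with Synapse Spark pools; the explicit import makes the
# dependency visible even though most pools pre-import it.
from notebookutils import mssparkutils

# List the folders created by the write() calls above.
for entry in mssparkutils.fs.ls('/NYCTaxi'):
    print(entry.name, entry.size)
```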
### Analyzing data in a storage account

* In Synapse Studio, navigate to the **Data** hub
* Select **Linked**
* Navigate to **Storage accounts > workspacename (Primary - contosolake)**
* Click on **users (Primary)**
* You should see a folder called `NYCTaxi`. Inside it you should see two folders: `PassengerCountStats.csv` and `PassengerCountStats.parquet`
* Navigate into the `PassengerCountStats.parquet` folder
* Right-click on the Parquet file inside, and select **New notebook**; it will create a notebook with a cell like this:
```
%%pyspark
data_path = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet', format='parquet')
data_path.show(100)
```
* Run the cell to analyze the Parquet file with Spark (a variant that reads the whole folder is sketched below)
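The generated cell points at one specific `part-*` file, whose name will differ in your workspace. A minimal sketch that points Spark at the whole folder instead, since Spark writes its output as a folder of part files:

```
%%pyspark
# Reading the folder picks up whichever part file(s) the earlier write() produced,
# so no hard-coded part file name is needed.
df = spark.read.parquet('abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet')
df.show(100)
```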
* Right-click on the Parquet file inside, and select **New SQL script > SELECT TOP 100 rows**; it will create a SQL script like this:
```
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTaxi/PassengerCountStats.parquet/part-00000-1f251a58-d8ac-4972-9215-8d528d490690-c000.snappy.parquet',
        FORMAT='PARQUET'
    ) AS [r];
```
* The script will be attached to **SQL on-demand**; run the script. Notice that it infers the schema from the Parquet file.
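The CSV file written earlier can be explored the same way from a notebook. A minimal sketch; note that the `df.write.csv` call above wrote no header row, so Spark assigns default column names unless you supply a schema:

```
%%pyspark
# inferSchema asks Spark to guess column types from the data.
csv_df = spark.read.load(
    'abfss://users@contosolake.dfs.core.windows.net/NYCTaxi/PassengerCountStats.csv',
    format='csv',
    inferSchema=True
)
csv_df.show(100)
```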
## Visualize data with Power BI