Commit ccd2d1b: Acrolinx

1 parent 131be2d commit ccd2d1b

File tree

1 file changed: +16 -15 lines changed

articles/data-factory/tutorial-data-flow-delta-lake.md

Lines changed: 16 additions & 15 deletions
```diff
@@ -1,12 +1,12 @@
 ---
 title: Delta lake ETL with data flows
-description: This tutorial provides step-by-step instructions for using data flows to transform and analyze data in delta lake
+description: This tutorial provides step-by-step instructions for using data flows to transform and analyze data in delta lake
 author: kromerm
 ms.author: makromer
 ms.service: data-factory
 ms.subservice: data-flows
 ms.topic: conceptual
-ms.date: 05/15/2024
+ms.date: 06/24/2024
 ---
 
 # Transform data in delta lake using mapping data flows
```
```diff
@@ -15,13 +15,13 @@ ms.date: 05/15/2024
 
 If you're new to Azure Data Factory, see [Introduction to Azure Data Factory](introduction.md).
 
-In this tutorial, you'll use the data flow canvas to create data flows that allow you to analyze and transform data in Azure Data Lake Storage (ADLS) Gen2 and store it in Delta Lake.
+In this tutorial, you use the data flow canvas to create data flows that analyze and transform data in Azure Data Lake Storage (ADLS) Gen2 and store it in Delta Lake.
 
 ## Prerequisites
 * **Azure subscription**. If you don't have an Azure subscription, create a [free Azure account](https://azure.microsoft.com/free/) before you begin.
 * **Azure storage account**. You use ADLS storage as the *source* and *sink* data stores. If you don't have a storage account, see [Create an Azure storage account](../storage/common/storage-account-create.md) for steps to create one.
 
-The file that we are transforming in this tutorial is MoviesDB.csv, which can be found [here](https://github.com/kromerm/adfdataflowdocs/blob/master/sampledata/moviesDB2.csv). To retrieve the file from GitHub, copy the contents to a text editor of your choice to save locally as a .csv file. To upload the file to your storage account, see [Upload blobs with the Azure portal](../storage/blobs/storage-quickstart-blobs-portal.md). The examples will be referencing a container named 'sample-data'.
+The file that we're transforming in this tutorial is MoviesDB.csv, which you can find [here](https://github.com/kromerm/adfdataflowdocs/blob/master/sampledata/moviesDB2.csv). To retrieve the file from GitHub, copy the contents to a text editor of your choice and save it locally as a .csv file. To upload the file to your storage account, see [Upload blobs with the Azure portal](../storage/blobs/storage-quickstart-blobs-portal.md). The examples reference a container named 'sample-data'.
 
 ## Create a data factory
 
```
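The tutorial uploads the sample file through the portal. If you'd rather script that step, here's a minimal sketch using the `requests` and `azure-storage-blob` Python packages. The raw-file URL is derived from the GitHub link above and the connection string is a placeholder; both are assumptions, not part of the tutorial.

```python
import requests
from azure.storage.blob import BlobServiceClient

# Raw-download form of the GitHub link above (assumed; verify before use).
RAW_URL = ("https://raw.githubusercontent.com/kromerm/adfdataflowdocs/"
           "master/sampledata/moviesDB2.csv")
CONNECTION_STRING = "<your-storage-connection-string>"  # placeholder

resp = requests.get(RAW_URL, timeout=30)
resp.raise_for_status()

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client("sample-data")
if not container.exists():
    container.create_container()  # create 'sample-data' on first run

# Upload the CSV, overwriting any earlier copy.
container.upload_blob("moviesDB2.csv", resp.content, overwrite=True)
```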
```diff
@@ -46,7 +46,7 @@ In this step, you create a data factory and open the Data Factory UX to create a
 
 ## Create a pipeline with a data flow activity
 
-In this step, you'll create a pipeline that contains a data flow activity.
+In this step, you create a pipeline that contains a data flow activity.
 
 1. On the home page, select **Orchestrate**.
 
```
```diff
@@ -56,7 +56,7 @@ In this step, you'll create a pipeline that contains a data flow activity.
 1. In the **Activities** pane, expand the **Move and Transform** accordion. Drag and drop the **Data Flow** activity from the pane to the pipeline canvas.
 
 :::image type="content" source="media/tutorial-data-flow/activity1.png" alt-text="Screenshot that shows the pipeline canvas where you can drop the Data Flow activity.":::
-1. In the **Adding Data Flow** pop-up, select **Create new Data Flow** and then name your data flow **DeltaLake**. Click Finish when done.
+1. In the **Adding Data Flow** pop-up, select **Create new Data Flow** and then name your data flow **DeltaLake**. Select **Finish** when done.
 
 :::image type="content" source="media/tutorial-data-flow/activity2.png" alt-text="Screenshot that shows where you name your data flow when you create a new data flow.":::
 1. In the top bar of the pipeline canvas, slide the **Data Flow debug** slider on. Debug mode allows for interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, so we recommend turning on debug first if you plan to do Data Flow development. For more information, see [Debug Mode](concepts-data-flow-debug-mode.md).
```
```diff
@@ -65,13 +65,13 @@ In this step, you'll create a pipeline that contains a data flow activity.
 
 ## Build transformation logic in the data flow canvas
 
-You will generate two data flows in this tutorial. The first data flow is a simple source to sink to generate a new Delta Lake from the movies CSV file from above. Lastly, you'll create this flow design below to update data in Delta Lake.
+You generate two data flows in this tutorial. The first is a simple source-to-sink flow that generates a new Delta Lake from the movies CSV file. The second uses the flow design that follows to update data in Delta Lake.
 
 :::image type="content" source="media/data-flow/data-flow-tutorial-6.png" alt-text="Final flow":::
 
 ### Tutorial objectives
 
-1. Take the MoviesCSV dataset source from above, and form a new Delta Lake from it.
+1. Use the MoviesCSV dataset source from the prerequisites, and form a new Delta Lake from it.
 1. Build the logic to update ratings for 1988 movies to '1'.
 1. Delete all movies from 1950.
 1. Insert new movies for 2021 by duplicating the movies from 1960.
```
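The objectives above correspond to standard Delta Lake operations. As a point of reference (the tutorial itself performs these steps with Alter Row transformations in the data flow canvas, not code), here's a rough sketch using the open-source `delta` Python API on Spark. The storage path and the column names `year` and `Rating` are assumptions based on the MoviesDB dataset, not values taken from the tutorial.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

# Assumes a Spark session with the Delta Lake extensions enabled.
spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the Delta Lake created in objective 1.
delta_path = "abfss://sample-data@<account>.dfs.core.windows.net/delta/movies"
movies = DeltaTable.forPath(spark, delta_path)

# Objective 2: set the rating of every 1988 movie to 1.
movies.update(condition=F.col("year") == 1988, set={"Rating": F.lit(1)})

# Objective 3: delete all movies from 1950.
movies.delete(F.col("year") == 1950)

# Objective 4: insert 2021 movies by duplicating the 1960 rows.
movies_2021 = (movies.toDF()
               .where(F.col("year") == 1960)
               .withColumn("year", F.lit(2021)))
movies_2021.write.format("delta").mode("append").save(delta_path)
```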
```diff
@@ -103,7 +103,7 @@ You will generate two data flows in this tutorial. The first data flow is a simp
 :::image type="content" source="media/tutorial-data-flow-delta-lake/select-sink-details.png" alt-text="Screenshot showing the Sink details for an inline delta dataset.":::
 
 1. Choose a folder name in your storage container where you would like the service to create the Delta Lake.
-1. Finally, navigate back the pipeline designer and select **Debug** to execute the pipeline in debug mode with just this data flow activity on the canvas. This will generate your new Delta Lake in Azure Data Lake Storage Gen2.
+1. Finally, navigate back to the pipeline designer and select **Debug** to execute the pipeline in debug mode with just this data flow activity on the canvas. This generates your new Delta Lake in Azure Data Lake Storage Gen2.
 1. Now, from the Factory Resources menu on the left of the screen, select **+** to add a new resource, and then select **Data flow**.
 
 :::image type="content" source="media/concepts-data-flow-overview/new-data-flow.png" alt-text="Screenshot showing where to create a new data flow in the data factory.":::
```
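After the debug run completes, you can sanity-check the generated Delta Lake from any Delta-enabled Spark session. A minimal sketch, assuming the same hypothetical folder path as the earlier sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path: the folder you chose for the Delta Lake sink.
delta_path = "abfss://sample-data@<account>.dfs.core.windows.net/delta/movies"

df = spark.read.format("delta").load(delta_path)
df.printSchema()           # confirm the MoviesDB columns arrived intact
print(df.count(), "rows")  # should match the row count of moviesDB2.csv
```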
```diff
@@ -125,16 +125,17 @@ You will generate two data flows in this tutorial. The first data flow is a simp
 1. Your alter row policies should look like this.
 
 :::image type="content" source="media/data-flow/data-flow-tutorial-3.png" alt-text="Alter row":::
-
-1. Now that you’ve set the proper policy for each alter row type, check that the proper update rules have been set on the sink transformation
+
+1. Now that you set the proper policy for each alter row type, check that the proper update rules are set on the sink transformation.
 
 :::image type="content" source="media/data-flow/data-flow-tutorial-4.png" alt-text="Sink":::
-
-1. Here we are using the Delta Lake sink to your ADLS Gen2 data lake and allowing inserts, updates, deletes.
-1. Note that the Key Columns are a composite key made up of the Movie primary key column and year column. This is because we created fake 2021 movies by duplicating the 1960 rows. This avoids collisions when looking up the existing rows by providing uniqueness.
+
+1. Here we're using the Delta Lake sink to your Azure Data Lake Storage Gen2 data lake and allowing inserts, updates, and deletes.
+1. The key columns are a composite key made up of the Movie primary key column and the year column. Because we created fake 2021 movies by duplicating the 1960 rows, the composite key provides uniqueness and avoids collisions when looking up existing rows.
 
 ### Download completed sample
-[Here is a sample solution for the Delta pipeline with a data flow for update/delete rows in the lake:](https://github.com/kromerm/adfdataflowdocs/blob/master/sampledata/DeltaPipeline.zip)
+
+Here's a [sample solution for the Delta pipeline](https://github.com/kromerm/adfdataflowdocs/blob/master/sampledata/DeltaPipeline.zip) with a data flow for update/delete rows in the lake.
 
 ## Related content
```
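The sink configuration above (alter row policies plus a composite key of Movie and year) behaves like a Delta Lake merge. For readers who want the same behavior outside the service, here's a minimal sketch with the open-source `delta` Python API; the paths and the column names `Movie` and `year` are assumptions based on the dataset described in this tutorial, and the `changes` DataFrame stands in for whatever the data flow computed upstream.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; 'changes' stands in for the flow's transformed stream.
target = DeltaTable.forPath(
    spark, "abfss://sample-data@<account>.dfs.core.windows.net/delta/movies")
changes = spark.read.format("csv").option("header", True).load(
    "abfss://sample-data@<account>.dfs.core.windows.net/moviesDB2.csv")

# Match on the composite key (Movie + year), mirroring the sink's key columns.
(target.alias("t")
 .merge(changes.alias("s"), "t.Movie = s.Movie AND t.year = s.year")
 .whenMatchedDelete(condition="s.year = 1950")   # delete policy
 .whenMatchedUpdateAll()                         # update policy
 .whenNotMatchedInsertAll()                      # insert policy
 .execute())
```

Because the match condition includes `year`, the 2021 rows duplicated from 1960 insert as new rows instead of colliding with their 1960 originals.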