Commit f4b4900 (1 parent: 167733b)

tutorial-incremental-copy-lastmodified-copy-data-tool

1 file changed: 58 additions, 59 deletions
@@ -1,6 +1,5 @@
 ---
-title: Data tool to copy new and updated files incrementally
-titleSuffix: Azure Data Factory
+title: Data tool to copy new and updated files incrementally
 description: Create an Azure data factory and then use the Copy Data tool to incrementally load new files based on LastModifiedDate.
 services: data-factory
 author: dearandyxu
@@ -17,14 +16,14 @@ ms.date: 3/18/2020
 
 # Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool
 
-In this tutorial, you'll use the Azure portal to create a data factory. Then, you'll use the Copy Data tool to create a pipeline that incrementally copies new and changed files only, based on their **LastModifiedDate** from Azure Blob storage to Azure Blob storage.
+In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a pipeline that incrementally copies new and changed files only, from Azure Blob storage to Azure Blob storage. It uses `LastModifiedDate` to determine which files to copy.
 
-By doing so, ADF will scan all the files from the source store, apply the file filter by their LastModifiedDate, and copy the new and updated file only since last time to the destination store. Please note that if you let ADF scan huge amounts of files but only copy a few files to destination, you would still expect the long duration due to file scanning is time consuming as well.
+After you complete the steps here, Azure Data Factory will scan all the files in the source store, apply the file filter by `LastModifiedDate`, and copy to the destination store only files that are new or have been updated since last time. Note that if Data Factory scans large numbers of files, you should still expect long durations because file scanning is time consuming, even when the amount of data copied is reduced.
 
 > [!NOTE]
-> If you're new to Azure Data Factory, see [Introduction to Azure Data Factory](introduction.md).
+> If you're new to Data Factory, see [Introduction to Azure Data Factory](introduction.md).
 
-In this tutorial, you will perform the following tasks:
+In this tutorial, you'll complete these tasks:
 
 > [!div class="checklist"]
 > * Create a data factory.
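The incremental filter described in the revised intro can be sketched in plain Python. This is only an illustration of the idea, not the Copy Data tool's actual implementation; `files_to_copy` and the in-memory blob listing are hypothetical stand-ins:

```python
from datetime import datetime, timezone

def files_to_copy(blobs, window_start, window_end):
    """Return only the blobs whose LastModifiedDate falls inside the
    current trigger window -- that is, files that are new or updated
    since the previous run."""
    return [name for name, last_modified in blobs
            if window_start <= last_modified < window_end]

# Simulated listing of the 'source' container: (name, LastModifiedDate).
blobs = [
    ("file1.txt", datetime(2020, 3, 18, 9, 5, tzinfo=timezone.utc)),
    ("file2.txt", datetime(2020, 3, 18, 9, 20, tzinfo=timezone.utc)),
]

# A 15-minute window starting at 09:15 picks up only file2.txt.
start = datetime(2020, 3, 18, 9, 15, tzinfo=timezone.utc)
end = datetime(2020, 3, 18, 9, 30, tzinfo=timezone.utc)
print(files_to_copy(blobs, start, end))  # ['file2.txt']
```

Note that the service still has to enumerate every blob to evaluate the filter, which is why scanning large stores stays slow even when few files are copied.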
@@ -34,101 +33,101 @@ In this tutorial, you will perform the following tasks:
 ## Prerequisites
 
 * **Azure subscription**: If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/) before you begin.
-* **Azure storage account**: Use Blob storage as the _source_ and _sink_ data store. If you don't have an Azure storage account, see the instructions in [Create a storage account](../storage/common/storage-account-create.md).
+* **Azure Storage account**: Use Blob storage for the source and sink data stores. If you don't have an Azure Storage account, follow the instructions in [Create a storage account](../storage/common/storage-account-create.md).
 
-### Create two containers in Blob storage
+## Create two containers in Blob storage
 
-Prepare your Blob storage for the tutorial by performing these steps.
+Prepare your Blob storage for the tutorial by completing these steps:
 
-1. Create a container named **source**. You can use various tools to perform this task, such as [Azure Storage Explorer](https://storageexplorer.com/).
+1. Create a container named **source**. You can use various tools to perform this task, like [Azure Storage Explorer](https://storageexplorer.com/).
 
 2. Create a container named **destination**.
 
 ## Create a data factory
 
-1. On the left menu, select **Create a resource** > **Data + Analytics** > **Data Factory**:
+1. In the left pane, select **Create a resource**. Select **Analytics** > **Data Factory**:
 
-![Data Factory selection in the "New" pane](./media/doc-common-process/new-azure-data-factory-menu.png)
+![Select Data Factory](./media/doc-common-process/new-azure-data-factory-menu.png)
 
 2. On the **New data factory** page, under **Name**, enter **ADFTutorialDataFactory**.
 
-The name for your data factory must be _globally unique_. You might receive the following error message:
+The name for your data factory must be globally unique. You might receive this error message:
 
-![New data factory error message](./media/doc-common-process/name-not-available-error.png)
+![Name not available error message](./media/doc-common-process/name-not-available-error.png)
 
 If you receive an error message about the name value, enter a different name for the data factory. For example, use the name _**yourname**_**ADFTutorialDataFactory**. For the naming rules for Data Factory artifacts, see [Data Factory naming rules](naming-rules.md).
-3. Select the Azure **subscription** in which you'll create the new data factory.
-4. For **Resource Group**, take one of the following steps:
+3. Under **Subscription**, select the Azure subscription in which you'll create the new data factory.
+4. Under **Resource Group**, take one of these steps:
 
-* Select **Use existing** and select an existing resource group from the drop-down list.
+* Select **Use existing** and then select an existing resource group in the list.
 
-* Select **Create new** and enter the name of a resource group.
+* Select **Create new** and then enter a name for the resource group.
 
 To learn about resource groups, see [Use resource groups to manage your Azure resources](../azure-resource-manager/management/overview.md).
 
-5. Under **version**, select **V2**.
-6. Under **location**, select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.
+5. Under **Version**, select **V2**.
+6. Under **Location**, select the location for the data factory. Only supported locations appear in the list. The data stores (for example, Azure Storage and Azure SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.
 8. Select **Create**.
-9. After creation is finished, the **Data Factory** home page is displayed.
-10. To open the Azure Data Factory user interface (UI) on a separate tab, select the **Author & Monitor** tile.
+9. After the data factory is created, the data factory home page appears.
+10. To open the Azure Data Factory user interface (UI) on a separate tab, select the **Author & Monitor** tile:
 
 ![Data factory home page](./media/doc-common-process/data-factory-home-page.png)
 
 ## Use the Copy Data tool to create a pipeline
 
-1. On the **Let's get started** page, select the **Copy Data** title to open the Copy Data tool.
+1. On the **Let's get started** page, select the **Copy Data** tile to open the Copy Data tool:
 
-![Copy Data tool tile](./media/doc-common-process/get-started-page.png)
+![Copy Data tile](./media/doc-common-process/get-started-page.png)
 
 2. On the **Properties** page, take the following steps:
 
 a. Under **Task name**, enter **DeltaCopyFromBlobPipeline**.
 
-b. Under **Task cadence** or **Task schedule**, select **Run regularly on schedule**.
+b. Under **Task cadence or Task schedule**, select **Run regularly on schedule**.
 
-c. Under **Trigger Type**, select **Tumbling Window**.
+c. Under **Trigger type**, select **Tumbling window**.
 
 d. Under **Recurrence**, enter **15 Minute(s)**.
 
 e. Select **Next**.
 
-The Data Factory UI creates a pipeline with the specified task name.
+Data Factory creates a pipeline with the specified task name.
 
-![Properties page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/copy-data-tool-properties-page.png)
+![Copy data properties page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/copy-data-tool-properties-page.png)
 
-3. On the **Source data store** page, complete the following steps:
+3. On the **Source data store** page, complete these steps:
 
-a. Select **+ Create new connection**, to add a connection.
+a. Select **Create new connection** to add a connection.
 
-b. Select **Azure Blob Storage** from the gallery, and then select **Continue**.
+b. Select **Azure Blob Storage** from the gallery, and then select **Continue**:
 
-![Source data store page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/source-data-store-page-select-blob.png)
+![Select Azure Blob Storage](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/source-data-store-page-select-blob.png)
 
-c. On the **New Linked Service(Azure Blob Storage)** page, select your storage account from the **Storage account name** list. Test connection and then select **Create**.
+c. On the **New Linked Service (Azure Blob Storage)** page, select your storage account from the **Storage account name** list. Test the connection and then select **Create**.
 
-d. Select the newly created linked service and then select **Next**.
+d. Select the new linked service and then select **Next**:
 
-![Source data store page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/source-data-store-page-select-linkedservice.png)
+![Select the new linked service](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/source-data-store-page-select-linkedservice.png)
 
 4. On the **Choose the input file or folder** page, complete the following steps:
 
-a. Browse and select the **source** folder, and then select **Choose**.
+a. Browse for and select the **source** folder, and then select **Choose**.
 
 ![Choose the input file or folder](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/choose-input-file-folder.png)
 
 b. Under **File loading behavior**, select **Incremental load: LastModifiedDate**.
 
-c. Check **Binary copy** and select **Next**.
+c. Select **Binary copy** and then select **Next**:
 
-![Choose the input file or folder](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/check-binary-copy.png)
+![Choose the input file or folder page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/check-binary-copy.png)
 
-5. On the **Destination data store** page, select the **AzureBlobStorage** that you created. This is the same storage account as the source data store. Then select **Next**.
+5. On the **Destination data store** page, select the **AzureBlobStorage** service that you created. This is the same storage account as the source data store. Then select **Next**.
 
 6. On the **Choose the output file or folder** page, complete the following steps:
 
-a. Browse and select the **destination** folder, and then select **Choose**.
+a. Browse for and select the **destination** folder, and then select **Choose**:
 
-![Choose the output file or folder](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/choose-output-file-folder.png)
+![Choose the output file or folder page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/choose-output-file-folder.png)
 
 b. Select **Next**.
 
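The 15-minute tumbling-window trigger configured in step 2 fires once per contiguous, non-overlapping time window, and the generated pipeline uses each window's start and end as the modified-date filter bounds for the copy. A rough sketch of the window arithmetic, assuming `tumbling_windows` is a hypothetical helper (not an ADF API):

```python
from datetime import datetime, timedelta, timezone

def tumbling_windows(trigger_start, interval_minutes, count):
    """Yield (windowStartTime, windowEndTime) pairs for the first
    `count` runs of a tumbling window trigger: back-to-back,
    non-overlapping intervals, one pipeline run per window."""
    step = timedelta(minutes=interval_minutes)
    for i in range(count):
        start = trigger_start + i * step
        yield start, start + step

t0 = datetime(2020, 3, 18, 9, 0, tzinfo=timezone.utc)
for start, end in tumbling_windows(t0, 15, 3):
    # In the generated pipeline, these bounds become the copy
    # source's modified-date filter for that run.
    print(start.isoformat(), "->", end.isoformat())
```

Because each window's end is the next window's start, every `LastModifiedDate` value is checked by exactly one run, which is what lets the pipeline copy each new or changed file exactly once.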
@@ -142,43 +141,43 @@ Prepare your Blob storage for the tutorial by performing these steps.
 
 ![Deployment page](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/deployment-page.png)
 
-10. Notice that the **Monitor** tab on the left is automatically selected. The application switches to the **Monitor** tab. You see the status of the pipeline. Select **Refresh** to refresh the list. Click the link under **PIPELINE NAME** to view activity run details or rerun the pipeline.
+10. Notice that the **Monitor** tab on the left is automatically selected. The application switches to the **Monitor** tab. You see the status of the pipeline. Select **Refresh** to refresh the list. Select the link under **PIPELINE NAME** to view activity run details or to run the pipeline again.
 
-![Refresh list and select View Activity Runs](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs1.png)
+![Refresh the list and view activity run details](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs1.png)
 
-11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, select the **Details** link (eyeglasses icon) under the **ACTIVITY NAME** column. For details about the properties, see [Copy Activity overview](copy-activity-overview.md).
+11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, select the **Details** link (the eyeglasses icon) in the **ACTIVITY NAME** column. For details about the properties, see [Copy activity overview](copy-activity-overview.md).
 
-![Copy activity is in pipeline](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs2.png)
+![Copy activity in the pipeline](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs2.png)
 
-Because there is no file in the **source** container in your Blob storage account, you will not see any file copied to the **destination** container in your Blob storage account.
+Because there are no files in the source container in your Blob storage account, you won't see any files copied to the destination container in the account:
 
-![No file in source container or destination container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs3.png)
+![No files in source container or destination container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs3.png)
 
-12. Create an empty text file and name it **file1.txt**. Upload this text file to the **source** container in your storage account. You can use various tools to perform these tasks, such as [Azure Storage Explorer](https://storageexplorer.com/).
+12. Create an empty text file and name it **file1.txt**. Upload this text file to the source container in your storage account. You can use various tools to perform these tasks, like [Azure Storage Explorer](https://storageexplorer.com/).
 
-![Create file1.txt and upload to source container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs3-1.png)
+![Create file1.txt and upload it to the source container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs3-1.png)
 
-13. To go back to the **Pipeline Runs** view, select **All pipeline runs**, and wait for the same pipeline to be triggered again automatically.
+13. To go back to the **Pipeline runs** view, select **All pipeline runs**, and wait for the same pipeline to be automatically triggered again.
 
-![Select All Pipeline Runs](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs4.png)
+![Select All pipeline runs](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs4.png)
 
-14. When the second pipeline run completes, follow the same steps mentioned above to review the activity run details.
+14. When the second pipeline run completes, follow the same steps mentioned previously to review the activity run details.
 
-You will see that one file (file1.txt) has been copied from the **source** container to the **destination** container of your Blob storage account.
+You'll see that one file (file1.txt) has been copied from the source container to the destination container of your Blob storage account:
 
-![File1.txt has been copied from source container to destination container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs6.png)
+![file1.txt has been copied from the source container to the destination container](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs6.png)
 
-15. Create another empty text file and name it **file2.txt**. Upload this text file to the **source** container in your Blob storage account.
+15. Create another empty text file and name it **file2.txt**. Upload this text file to the source container in your Blob storage account.
 
-16. Repeat steps 13 and 14 for this second text file. You will see that only the new file (file2.txt) has been copied from the **source** container to the **destination** container of your storage account in the next pipeline run.
+16. Repeat steps 13 and 14 for the second text file. You'll see that only the new file (file2.txt) has been copied from the source container to the destination container of your storage account during this pipeline run.
 
-You can also verify this by using [Azure Storage Explorer](https://storageexplorer.com/) to scan the files.
+You can also verify that only one file has been copied by using [Azure Storage Explorer](https://storageexplorer.com/) to scan the files:
 
-![Scan files using Azure Storage Explorer](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs8.png)
+![Scan files by using Azure Storage Explorer](./media/tutorial-incremental-copy-lastmodified-copy-data-tool/monitor-pipeline-runs8.png)
 
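The behavior observed in steps 12 through 16 — the first window copies file1.txt, the next window copies only file2.txt — can be simulated end to end. This is a sketch with in-memory dicts standing in for the source and destination containers, not the service's actual logic:

```python
from datetime import datetime, timezone

def run_window(source, destination, window_start, window_end):
    """Copy into `destination` only the source files whose
    LastModifiedDate falls inside the current trigger window."""
    copied = []
    for name, modified in source.items():
        if window_start <= modified < window_end:
            destination[name] = modified
            copied.append(name)
    return copied

utc = timezone.utc
source, destination = {}, {}

# Window 1 (09:00-09:15): file1.txt was uploaded at 09:05.
source["file1.txt"] = datetime(2020, 3, 18, 9, 5, tzinfo=utc)
first = run_window(source, destination,
                   datetime(2020, 3, 18, 9, 0, tzinfo=utc),
                   datetime(2020, 3, 18, 9, 15, tzinfo=utc))

# Window 2 (09:15-09:30): file2.txt was uploaded at 09:20;
# file1.txt is unchanged, so only the new file is copied.
source["file2.txt"] = datetime(2020, 3, 18, 9, 20, tzinfo=utc)
second = run_window(source, destination,
                    datetime(2020, 3, 18, 9, 15, tzinfo=utc),
                    datetime(2020, 3, 18, 9, 30, tzinfo=utc))
print(first, second)  # ['file1.txt'] ['file2.txt']
```

Each run still iterates over every source file to evaluate the filter, which mirrors the scan-cost caveat in the tutorial's introduction.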
 
 ## Next steps
-Advance to the following tutorial to learn about transforming data by using an Apache Spark cluster on Azure:
+Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure:
 
 > [!div class="nextstepaction"]
 >[Transform data in the cloud by using an Apache Spark cluster](tutorial-transform-data-spark-portal.md)

0 commit comments