title: Data tool to copy new and updated files incrementally
description: Create an Azure data factory and then use the Copy Data tool to incrementally load new files based on LastModifiedDate.
services: data-factory
author: dearandyxu
ms.date: 3/18/2020

# Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool
In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a pipeline that incrementally copies new and changed files only, from Azure Blob storage to Azure Blob storage. It uses `LastModifiedDate` to determine which files to copy.
After you complete the steps here, Azure Data Factory will scan all the files in the source store, apply the file filter by `LastModifiedDate`, and copy to the destination store only files that are new or have been updated since last time. Note that if Data Factory scans large numbers of files, you should still expect long durations because file scanning is time consuming, even when the amount of data copied is reduced.
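The filtering that the pipeline applies can be illustrated with a short sketch. This is plain Python, not Data Factory code, and the file names and timestamps are hypothetical: each run copies only the files whose `LastModifiedDate` falls inside the current trigger window.

```python
from datetime import datetime, timezone

def files_to_copy(files, window_start, window_end):
    """Return only the files whose last-modified time falls inside the
    current trigger window [window_start, window_end).

    `files` maps file name -> last-modified timestamp, mimicking the
    metadata a scan of the source store would return."""
    return sorted(
        name
        for name, last_modified in files.items()
        if window_start <= last_modified < window_end
    )

source_files = {
    "file1.txt": datetime(2020, 3, 18, 9, 5, tzinfo=timezone.utc),
    "file2.txt": datetime(2020, 3, 18, 9, 20, tzinfo=timezone.utc),
}

# The first 15-minute window picks up only file1.txt; the next window
# picks up only file2.txt, so nothing is copied twice.
first = files_to_copy(source_files,
                      datetime(2020, 3, 18, 9, 0, tzinfo=timezone.utc),
                      datetime(2020, 3, 18, 9, 15, tzinfo=timezone.utc))
second = files_to_copy(source_files,
                       datetime(2020, 3, 18, 9, 15, tzinfo=timezone.utc),
                       datetime(2020, 3, 18, 9, 30, tzinfo=timezone.utc))
print(first, second)  # ['file1.txt'] ['file2.txt']
```

Note that the scan itself still touches every file in the source store; only the copy step is reduced to the filtered set.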
> [!NOTE]
> If you're new to Data Factory, see [Introduction to Azure Data Factory](introduction.md).
In this tutorial, you'll complete these tasks:
> [!div class="checklist"]
> * Create a data factory.
## Prerequisites
* **Azure subscription**: If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/) before you begin.
* **Azure Storage account**: Use Blob storage for the source and sink data stores. If you don't have an Azure Storage account, follow the instructions in [Create a storage account](../storage/common/storage-account-create.md).
## Create two containers in Blob storage
Prepare your Blob storage for the tutorial by completing these steps:
1. Create a container named **source**. You can use various tools to perform this task, like [Azure Storage Explorer](https://storageexplorer.com/).
2. Create a container named **destination**.
## Create a data factory
1. In the left pane, select **Create a resource**. Select **Analytics** > **Data Factory**:

2. On the **New data factory** page, under **Name**, enter **ADFTutorialDataFactory**.
The name for your data factory must be globally unique. You might receive this error message:

If you receive an error message about the name value, enter a different name for the data factory. For example, use the name _**yourname**_**ADFTutorialDataFactory**. For the naming rules for Data Factory artifacts, see [Data Factory naming rules](naming-rules.md).
3. Under **Subscription**, select the Azure subscription in which you'll create the new data factory.

4. Under **Resource Group**, take one of these steps:
* Select **Use existing** and then select an existing resource group in the list.
* Select **Create new** and then enter a name for the resource group.
To learn about resource groups, see [Use resource groups to manage your Azure resources](../azure-resource-manager/management/overview.md).
5. Under **Version**, select **V2**.
6. Under **Location**, select the location for the data factory. Only supported locations appear in the list. The data stores (for example, Azure Storage and Azure SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.
8. Select **Create**.
9. After the data factory is created, the data factory home page appears.
10. To open the Azure Data Factory user interface (UI) on a separate tab, select the **Author & Monitor** tile:

## Use the Copy Data tool to create a pipeline
1. On the **Let's get started** page, select the **Copy Data** tile to open the Copy Data tool:

2. On the **Properties** page, take the following steps:
a. Under **Task name**, enter **DeltaCopyFromBlobPipeline**.
b. Under **Task cadence or Task schedule**, select **Run regularly on schedule**.
c. Under **Trigger type**, select **Tumbling window**.
d. Under **Recurrence**, enter **15 Minute(s)**.
e. Select **Next**.
Data Factory creates a pipeline with the specified task name.
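A tumbling window trigger divides time into contiguous, fixed-size, non-overlapping intervals. With the 15-minute recurrence you just configured, the window boundaries behave like this sketch (plain Python for illustration; the start time is hypothetical):

```python
from datetime import datetime, timedelta, timezone

def tumbling_windows(start, recurrence, count):
    """Yield (window_start, window_end) pairs for `count` contiguous,
    non-overlapping windows of length `recurrence`, beginning at `start`."""
    window_start = start
    for _ in range(count):
        window_end = window_start + recurrence
        yield window_start, window_end
        window_start = window_end  # windows are back to back, with no gaps

windows = list(tumbling_windows(
    datetime(2020, 3, 18, 9, 0, tzinfo=timezone.utc),
    timedelta(minutes=15),
    3,
))
for ws, we in windows:
    print(ws.strftime("%H:%M"), "->", we.strftime("%H:%M"))
# 09:00 -> 09:15
# 09:15 -> 09:30
# 09:30 -> 09:45
```

Because every point in time belongs to exactly one window, each file's `LastModifiedDate` is picked up by exactly one pipeline run.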

3. On the **Source data store** page, complete these steps:
a. Select **Create new connection** to add a connection.
b. Select **Azure Blob Storage** from the gallery, and then select **Continue**:

c. On the **New Linked Service (Azure Blob Storage)** page, select your storage account from the **Storage account name** list. Test the connection and then select **Create**.
d. Select the new linked service and then select **Next**:

4. On the **Choose the input file or folder** page, complete the following steps:
a. Browse for and select the **source** folder, and then select **Choose**.

b. Under **File loading behavior**, select **Incremental load: LastModifiedDate**.
c. Select **Binary copy** and then select **Next**:

5. On the **Destination data store** page, select the **AzureBlobStorage** service that you created. This is the same storage account as the source data store. Then select **Next**.
6. On the **Choose the output file or folder** page, complete the following steps:
a. Browse for and select the **destination** folder, and then select **Choose**:

b. Select **Next**.
10. Notice that the **Monitor** tab on the left is automatically selected. The application switches to the **Monitor** tab. You see the status of the pipeline. Select **Refresh** to refresh the list. Select the link under **PIPELINE NAME** to view activity run details or to run the pipeline again.

11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, select the **Details** link (the eyeglasses icon) in the **ACTIVITY NAME** column. For details about the properties, see [Copy activity overview](copy-activity-overview.md).

Because there are no files in the source container in your Blob storage account, you won't see any files copied to the destination container in the account:

12. Create an empty text file and name it **file1.txt**. Upload this text file to the source container in your storage account. You can use various tools to perform these tasks, like [Azure Storage Explorer](https://storageexplorer.com/).

13. To go back to the **Pipeline runs** view, select **All pipeline runs**, and wait for the same pipeline to be automatically triggered again.

14. When the second pipeline run completes, follow the same steps mentioned previously to review the activity run details.
You'll see that one file (file1.txt) has been copied from the source container to the destination container of your Blob storage account:

15. Create another empty text file and name it **file2.txt**. Upload this text file to the source container in your Blob storage account.
16. Repeat steps 13 and 14 for the second text file. You'll see that only the new file (file2.txt) has been copied from the source container to the destination container of your storage account during this pipeline run.
You can also verify that only one file has been copied by using [Azure Storage Explorer](https://storageexplorer.com/) to scan the files:

## Next steps
Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure:
> [!div class="nextstepaction"]
> [Transform data in the cloud by using an Apache Spark cluster](tutorial-transform-data-spark-portal.md)