Data partitioning is recommended, especially when migrating more than 10 TB of data. To partition the data, use the ‘prefix’ setting to filter the folders and files on Amazon S3 by name, and then each ADF copy job can copy one partition at a time. You can run multiple ADF copy jobs concurrently for better throughput.
Data migration normally requires a one-time historical data migration plus periodic synchronization of changes from AWS S3 to Azure. The two templates below cover these scenarios: one handles the one-time historical data migration, and the other synchronizes changes from AWS S3 to Azure.
### For the template to migrate historical data from Amazon S3 to Azure Data Lake Storage Gen2
This template (*template name: migrate historical data from AWS S3 to Azure Data Lake Storage Gen2*) assumes that you have written a partition list in an external control table in Azure SQL Database. It uses a *Lookup* activity to retrieve the partition list from the external control table, iterates over each partition, and makes each ADF copy job copy one partition at a time. Once a copy job completes, it uses a *Stored Procedure* activity to update the status of that partition in the control table.
The template contains five activities:
- **Lookup** retrieves the partitions which have not been copied to Azure Data Lake Storage Gen2 from an external control table. The table name is *s3_partition_control_table* and the query to load data from the table is *"SELECT PartitionPrefix FROM s3_partition_control_table WHERE SuccessOrFailure = 0"*.
- **ForEach** gets the partition list from the *Lookup* activity and iterates each partition to the *TriggerCopy* activity. You can set the *batchCount* to run multiple ADF copy jobs concurrently; this template sets it to 2.
- **ExecutePipeline** executes the *CopyFolderPartitionFromS3* pipeline. We create a separate pipeline so that each copy job copies one partition; this makes it easy to rerun a failed copy job and reload that specific partition from AWS S3 without impacting the copy jobs that load other partitions.
- **Copy** copies each partition from AWS S3 to Azure Data Lake Storage Gen2.
- **SqlServerStoredProcedure** updates the status of copying each partition in the control table.
The template contains two parameters:
- **AWS_S3_bucketName** is your bucket name on AWS S3 where you want to migrate data from. If you want to migrate data from multiple buckets on AWS S3, you can add one more column in your external control table to store the bucket name for each partition, and update your pipeline to retrieve data from that column accordingly (see the sketch after this list).
- **Azure_Storage_fileSystem** is your fileSystem name on Azure Data Lake Storage Gen2 where you want to migrate data to.
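For example, a minimal sketch of that multi-bucket extension could look like the following. The *BucketName* column and the sample bucket names are assumptions for illustration; they aren't part of the template.

```sql
-- Hypothetical extension: record the source bucket for each partition.
ALTER TABLE s3_partition_control_table ADD BucketName varchar(255) NULL;

-- Example values only; substitute your own bucket names.
UPDATE s3_partition_control_table SET BucketName = 'my-first-bucket'  WHERE PartitionPrefix IN ('a', 'b');
UPDATE s3_partition_control_table SET BucketName = 'my-second-bucket' WHERE PartitionPrefix IN ('c', 'd', 'e');
```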
### For the template to periodically copy delta data from Amazon S3 to Azure Data Lake Storage Gen2
This template (*template name: copy delta data from AWS S3 to Azure Data Lake Storage Gen2*) uses the LastModifiedTime of each file to copy only the new or updated files from AWS S3 to Azure. Be aware that if your files or folders are already time partitioned with timeslice information as part of the file or folder name on AWS S3 (for example, /yyyy/mm/dd/file.csv), you can go to this [tutorial](tutorial-incremental-copy-partitioned-file-name-copy-data-tool.md) for a more performant approach to incrementally load new files.
This template assumes that you have written a partition list in an external control table in Azure SQL Database. It uses a *Lookup* activity to retrieve the partition list from the external control table, iterates over each partition, and makes each ADF copy job copy one partition at a time. When each copy job starts to copy the files from AWS S3, it relies on the LastModifiedTime property to identify and copy only the new or updated files. Once a copy job completes, it uses a *Stored Procedure* activity to update the status of that partition in the control table.
The template contains seven activities:
- **Lookup** retrieves the partitions from an external control table. The table name is *s3_partition_delta_control_table* and the query to load data from the table is *"select distinct PartitionPrefix from s3_partition_delta_control_table"*.
- **ForEach** gets the partition list from the *Lookup* activity and iterates each partition to the *TriggerDeltaCopy* activity. You can set the *batchCount* to run multiple ADF copy jobs concurrently; this template sets it to 2.
- **ExecutePipeline** executes the *DeltaCopyFolderPartitionFromS3* pipeline. We create a separate pipeline so that each copy job copies one partition; this makes it easy to rerun a failed copy job and reload that specific partition from AWS S3 without impacting the copy jobs that load other partitions.
- **Lookup** retrieves the last copy job run time from the external control table so that the new or updated files can be identified via LastModifiedTime. The table name is *s3_partition_delta_control_table* and the query to load data from the table is *"select max(JobRunTime) as LastModifiedTime from s3_partition_delta_control_table where PartitionPrefix = '@{pipeline().parameters.prefixStr}' and SuccessOrFailure = 1"*.
- **Copy** copies only the new or changed files for each partition from AWS S3 to Azure Data Lake Storage Gen2. The *modifiedDatetimeStart* property is set to the last copy job run time, and the *modifiedDatetimeEnd* property is set to the current copy job run time. Be aware that the time is applied in the UTC time zone.
- **SqlServerStoredProcedure** updates the status of copying each partition and the copy run time in the control table when the copy succeeds. The SuccessOrFailure column is set to 1.
- **SqlServerStoredProcedure** updates the status of copying each partition and the copy run time in the control table when the copy fails. The SuccessOrFailure column is set to 0.
The template contains two parameters:
- **AWS_S3_bucketName** is your bucket name on AWS S3 where you want to migrate data from. If you want to migrate data from multiple buckets on AWS S3, you can add one more column in your external control table to store the bucket name for each partition, and update your pipeline to retrieve data from that column accordingly (see the sketch after this list).
- **Azure_Storage_fileSystem** is your fileSystem name on Azure Data Lake Storage Gen2 where you want to migrate data to.
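If you extend the delta control table with a bucket-name column as described above, the *Lookup* query could be adapted along these lines. The *BucketName* column is hypothetical and not part of the template.

```sql
-- Hypothetical variant of the Lookup query when each partition records its own source bucket.
SELECT DISTINCT PartitionPrefix, BucketName
FROM s3_partition_delta_control_table;
```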
## How to use these two solution templates
### For the template to migrate historical data from Amazon S3 to Azure Data Lake Storage Gen2
1. Create a control table in Azure SQL Database to store the partition list of AWS S3.

   > [!NOTE]
   > The table name is s3_partition_control_table.
   > The schema of the control table is PartitionPrefix and SuccessOrFailure, where PartitionPrefix is the prefix setting in S3 to filter the folders and files in Amazon S3 by name, and SuccessOrFailure is the status of copying each partition: 0 means this partition hasn't been copied to Azure, and 1 means this partition has been copied to Azure successfully.
   > There are five partitions defined in the control table, and the default status of copying each partition is 0.

   ```sql
   CREATE TABLE [dbo].[s3_partition_control_table](
       [PartitionPrefix] [varchar](255) NULL,
       [SuccessOrFailure] [bit] NULL
   )

   INSERT INTO s3_partition_control_table (PartitionPrefix, SuccessOrFailure)
   VALUES
   ('a', 0),
   ('b', 0),
   ('c', 0),
   ('d', 0),
   ('e', 0);
   ```

2. Create a Stored Procedure on the same Azure SQL Database for the control table.

   > [!NOTE]
   > The name of the Stored Procedure is sp_update_partition_success. It will be invoked by the SqlServerStoredProcedure activity in your ADF pipeline.

   ```sql
   CREATE PROCEDURE [dbo].[sp_update_partition_success] @PartPrefix varchar(255)
   AS
   BEGIN
       UPDATE s3_partition_control_table
       SET [SuccessOrFailure] = 1 WHERE [PartitionPrefix] = @PartPrefix
   END
   GO
   ```

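   To sanity check the Stored Procedure outside of ADF, you could run a quick test call such as the following. The prefix value 'a' is only an example; the pipeline passes the current partition prefix automatically.

   ```sql
   -- Example only: mark partition 'a' as copied, then inspect the control table.
   EXEC [dbo].[sp_update_partition_success] @PartPrefix = 'a';
   SELECT PartitionPrefix, SuccessOrFailure FROM s3_partition_control_table;
   ```
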
3. Go to the **Migrate historical data from AWS S3 to Azure Data Lake Storage Gen2** template. Input the connections to your external control table, AWS S3 as the data source store, and Azure Data Lake Storage Gen2 as the destination store. Be aware that the external control table and the stored procedure reference the same connection.

### For the template to periodically copy delta data from Amazon S3 to Azure Data Lake Storage Gen2

1. Create a control table in Azure SQL Database to store the partition list of AWS S3.

   > [!NOTE]
   > The table name is s3_partition_delta_control_table.
   > The schema of the control table is PartitionPrefix, JobRunTime, and SuccessOrFailure, where PartitionPrefix is the prefix setting in S3 to filter the folders and files in Amazon S3 by name, JobRunTime is the datetime value when copy jobs run, and SuccessOrFailure is the status of copying each partition: 0 means this partition hasn't been copied to Azure, and 1 means this partition has been copied to Azure successfully.
   > There are five partitions defined in the control table. The default value for JobRunTime can be the time when the one-time historical data migration starts. The ADF copy activity will copy the files on AWS S3 that were last modified after that time. The default status of copying each partition is 1.

   ```sql
   CREATE TABLE [dbo].[s3_partition_delta_control_table](
       [PartitionPrefix] [varchar](255) NULL,
       [JobRunTime] [datetime] NULL,
       [SuccessOrFailure] [bit] NULL
   )

   INSERT INTO s3_partition_delta_control_table (PartitionPrefix, JobRunTime, SuccessOrFailure)
   VALUES
   ('a','1/1/2019 12:00:00 AM',1),
   ('b','1/1/2019 12:00:00 AM',1),
   ('c','1/1/2019 12:00:00 AM',1),
   ('d','1/1/2019 12:00:00 AM',1),
   ('e','1/1/2019 12:00:00 AM',1);
   ```

2. Create a Stored Procedure on the same Azure SQL Database for the control table.

   > [!NOTE]
   > The name of the Stored Procedure is sp_insert_partition_JobRunTime_success. It will be invoked by the SqlServerStoredProcedure activity in your ADF pipeline.

   ```sql
   CREATE PROCEDURE [dbo].[sp_insert_partition_JobRunTime_success] @PartPrefix varchar(255), @JobRunTime datetime, @SuccessOrFailure bit
   AS
   BEGIN
       INSERT INTO s3_partition_delta_control_table (PartitionPrefix, JobRunTime, SuccessOrFailure)
       VALUES
       (@PartPrefix,@JobRunTime,@SuccessOrFailure)
   END
   GO
   ```

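   As a rough illustration of what the pipeline records, a manual call could look like the following. The prefix value and the use of GETUTCDATE() are only examples; in the template, the pipeline passes its own run time, which is in UTC.

   ```sql
   -- Example only: append a successful run record for partition 'a' using the current UTC time.
   DECLARE @now datetime = GETUTCDATE();

   EXEC [dbo].[sp_insert_partition_JobRunTime_success]
       @PartPrefix = 'a',
       @JobRunTime = @now,
       @SuccessOrFailure = 1;
   ```
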
3. Go to the **Copy delta data from AWS S3 to Azure Data Lake Storage Gen2** template. Input the connections to your external control table, AWS S3 as the data source store, and Azure Data Lake Storage Gen2 as the destination store. Be aware that the external control table and the stored procedure reference the same connection.
8. You can also check the results from the control table with the query *"select * from s3_partition_delta_control_table"*; you'll see output similar to the following example:
