
Commit 80a5af2

Merge pull request #114666 from dearandyxu/master
create data consistency doc
2 parents cbbb582 + 887f216 commit 80a5af2

File tree

3 files changed (+348, -13 lines)


articles/data-factory/TOC.yml

Lines changed: 2 additions & 0 deletions
    @@ -416,6 +416,8 @@
          href: copy-activity-schema-and-type-mapping.md
        - name: Fault tolerance
          href: copy-activity-fault-tolerance.md
    +   - name: Data consistency verification
    +     href: copy-activity-data-consistency.md
        - name: Format and compression support (legacy)
          href: supported-file-formats-and-compression-codecs-legacy.md
        - name: Transform data
articles/data-factory/copy-activity-data-consistency.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
---
title: Data consistency verification in copy activity
description: 'Learn about how to enable data consistency verification in copy activity in Azure Data Factory.'
services: data-factory
documentationcenter: ''
author: dearandyxu
manager:
ms.reviewer:
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 3/27/2020
ms.author: yexu
---

# Data consistency verification in copy activity (Preview)

[!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)]

When you move data from a source to a destination store, the Azure Data Factory copy activity provides an option to perform additional data consistency verification, so that the data is not only copied successfully but also verified to be consistent between the source and destination stores. If inconsistent data is found during the data movement, you can either abort the copy activity or continue copying the rest by enabling the fault tolerance setting to skip the inconsistent data. You can get the names of the skipped objects by enabling the session log setting in the copy activity.

> [!IMPORTANT]
> This feature is currently in preview, with the following limitations that we are actively working on:
>- Data consistency verification is available only when copying binary files between file-based stores with the 'PreserveHierarchy' behavior in the copy activity. It is not yet available for copying tabular data.
>- When you enable the session log setting in the copy activity to log the inconsistent files being skipped, the completeness of the log file cannot be fully guaranteed if the copy activity fails.
>- The session log contains inconsistent files only; successfully copied files are not logged.

## Supported data stores

### Source data stores

- [Azure Blob storage](connector-azure-blob-storage.md)
- [Azure Data Lake Storage Gen1](connector-azure-data-lake-store.md)
- [Azure Data Lake Storage Gen2](connector-azure-data-lake-storage.md)
- [Azure File Storage](connector-azure-file-storage.md)
- [Amazon S3](connector-amazon-simple-storage-service.md)
- [File System](connector-file-system.md)
- [HDFS](connector-hdfs.md)

### Destination data stores

- [Azure Blob storage](connector-azure-blob-storage.md)
- [Azure Data Lake Storage Gen1](connector-azure-data-lake-store.md)
- [Azure Data Lake Storage Gen2](connector-azure-data-lake-storage.md)
- [Azure File Storage](connector-azure-file-storage.md)
- [File System](connector-file-system.md)

## Configuration
The following example provides a JSON definition to enable data consistency verification in the copy activity:

```json
"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureDataLakeStoreReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureDataLakeStoreWriteSettings"
        }
    },
    "validateDataConsistency": true,
    "skipErrorFile": {
        "dataInconsistency": true
    },
    "logStorageSettings": {
        "linkedServiceName": {
            "referenceName": "ADLSGen2_storage",
            "type": "LinkedServiceReference"
        },
        "path": "/sessionlog/"
    }
}
```

Property | Description | Allowed values | Required
-------- | ----------- | -------------- | --------
validateDataConsistency | If you set this property to true, the copy activity checks the file size, lastModifiedDate, and MD5 checksum of each object copied from the source to the destination store to ensure data consistency between the source and destination stores. Be aware that copy performance is affected when you enable this option. | True<br/>False (default) | No
dataInconsistency | One of the key-value pairs within the skipErrorFile property bag that determines whether to skip inconsistent data.<br/>- True: copy the rest of the data by skipping the inconsistent data.<br/>- False: abort the copy activity once inconsistent data is found.<br/>Be aware that this property is only valid when you set validateDataConsistency to True. | True<br/>False (default) | No
logStorageSettings | A group of properties that can be specified to enable the session log, which records the skipped objects. | | No
linkedServiceName | The linked service of [Azure Blob Storage](connector-azure-blob-storage.md#linked-service-properties) or [Azure Data Lake Storage Gen2](connector-azure-data-lake-storage.md#linked-service-properties) used to store the session log files. | The name of an `AzureBlobStorage` or `AzureBlobFS` type linked service, which refers to the instance that you use to store the log files. | No
path | The path of the log files. | Specify the path where you want to store the log files. If you do not provide a path, the service creates a container for you. | No
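
If you assemble copy activity definitions programmatically before submitting them through your deployment tooling, the consistency-related settings are just additional keys under `typeProperties`. The following is a minimal sketch in Python that rebuilds the fragment shown above as a dictionary; the function name `build_type_properties` is purely illustrative, and the linked service name and log path are the sample values from the JSON example.

```python
import json

# Sketch only (not an official SDK call): assemble the consistency-related
# portion of a copy activity's typeProperties, mirroring the JSON example above.
def build_type_properties(validate_consistency: bool = True,
                          skip_inconsistent_files: bool = True,
                          log_linked_service: str = "ADLSGen2_storage",
                          log_path: str = "/sessionlog/") -> dict:
    type_properties = {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AzureDataLakeStoreReadSettings",
                "recursive": True,
            },
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {"type": "AzureDataLakeStoreWriteSettings"},
        },
        # Turn on the verification itself.
        "validateDataConsistency": validate_consistency,
    }
    if skip_inconsistent_files:
        # Only meaningful when validateDataConsistency is true.
        type_properties["skipErrorFile"] = {"dataInconsistency": True}
    # Session log settings so that skipped objects are recorded.
    type_properties["logStorageSettings"] = {
        "linkedServiceName": {
            "referenceName": log_linked_service,
            "type": "LinkedServiceReference",
        },
        "path": log_path,
    }
    return type_properties

print(json.dumps(build_type_properties(), indent=2))
```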

>[!NOTE]
>- Data consistency verification is not supported in the staging copy scenario.
>- When copying binary files from any storage store to Azure Blob Storage or Azure Data Lake Storage Gen2, the copy activity does file size and MD5 checksum verification to ensure data consistency between the source and destination stores.
>- When copying binary files from any storage store to any storage store other than Azure Blob Storage or Azure Data Lake Storage Gen2, the copy activity does file size verification to ensure data consistency between the source and destination stores.
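
These checks are performed by the service during the copy; the sketch below is not ADF's internal implementation, but it illustrates, on a local file system, what a size-plus-MD5 comparison between a source file and its copied counterpart conceptually involves:

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Compute an MD5 digest by streaming the file in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def looks_consistent(source: Path, destination: Path) -> bool:
    """Conceptual stand-in for the verification described above:
    compare file sizes first (cheap), then MD5 checksums (thorough)."""
    if source.stat().st_size != destination.stat().st_size:
        return False
    return md5_of_file(source) == md5_of_file(destination)

# Hypothetical usage with local copies of the source and destination files:
# print(looks_consistent(Path("source/sample1.csv"), Path("dest/sample1.csv")))
```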

## Monitoring

### Output from copy activity
After the copy activity run completes, you can see the result of the data consistency verification in the output of each copy activity run:
```json
"output": {
    "dataRead": 695,
    "dataWritten": 186,
    "filesRead": 3,
    "filesWritten": 1,
    "filesSkipped": 2,
    "throughput": 297,
    "logPath": "https://myblobstorage.blob.core.windows.net//myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
    "dataConsistencyVerification": {
        "VerificationResult": "Verified",
        "InconsistentData": "Skipped"
    }
}
```
You can see the details of the data consistency verification in the `dataConsistencyVerification` property.

Value of **VerificationResult**:
- **Verified**: Your copied data has been verified to be consistent between the source and destination store.
- **NotVerified**: Your copied data has not been verified to be consistent because you did not enable validateDataConsistency in the copy activity.
- **Unsupported**: Your copied data has not been verified to be consistent because data consistency verification is not supported for this particular copy pair.

Value of **InconsistentData**:
- **Found**: The ADF copy activity found inconsistent data.
- **Skipped**: The ADF copy activity found and skipped inconsistent data.
- **None**: The ADF copy activity did not find any inconsistent data, either because your data has been verified to be consistent between the source and destination store or because you disabled validateDataConsistency in the copy activity.
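
If you inspect copy activity runs programmatically, for example from run output that you retrieved through your monitoring tooling and saved as JSON, you can branch on these two fields. A minimal sketch, assuming the output shown earlier has already been loaded into a Python dictionary:

```python
import json

# Assume this JSON was captured from the copy activity run output shown above.
output = json.loads("""
{
  "filesRead": 3,
  "filesWritten": 1,
  "filesSkipped": 2,
  "dataConsistencyVerification": {
    "VerificationResult": "Verified",
    "InconsistentData": "Skipped"
  }
}
""")

verification = output.get("dataConsistencyVerification", {})
result = verification.get("VerificationResult", "NotVerified")
inconsistent = verification.get("InconsistentData", "None")

if inconsistent == "Skipped":
    print(f"{output.get('filesSkipped', 0)} inconsistent file(s) skipped; check the session log.")
elif inconsistent == "Found":
    print("Inconsistent data found; the copy activity did not skip it.")
elif result == "Verified":
    print("All copied files were verified to be consistent.")
else:
    print(f"Verification result: {result}")
```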

### Session log from copy activity

If you configure the copy activity to log the inconsistent files, you can find the log file at this path: `https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-name]/[copy-activity-run-id]/[auto-generated-GUID].csv`. The log files are CSV files.

The schema of a log file is as follows:

Column | Description
-------- | -----------
Timestamp | The time when ADF skipped the inconsistent file.
Level | The log level of this item. It is 'Warning' for items that record file skipping.
OperationName | The ADF copy activity operational behavior on each file. It is 'FileSkip' when the file is skipped.
OperationItem | The name of the file that was skipped.
Message | More information explaining why the file was skipped.

An example of a log file is as follows:
```
Timestamp, Level, OperationName, OperationItem, Message
2020-02-26 06:22:56.3190846, Warning, FileSkip, "sample1.csv", "File is skipped after read 548000000 bytes: ErrorCode=DataConsistencySourceDataChanged,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Source file 'sample1.csv' is changed by other clients during the copy activity run.,Source=,'."
```
From the log file above, you can see that sample1.csv was skipped because it could not be verified to be consistent between the source and destination store. It became inconsistent because it was being changed by other applications while the ADF copy activity was copying it.
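
Because the session log is a plain CSV file, you can post-process it to collect the skipped files and the reasons they were skipped. A minimal sketch, assuming the log has been downloaded to a local file (the file name below is hypothetical):

```python
import csv

# "copy-session-log.csv" is a hypothetical local copy of the CSV that the
# copy activity wrote to the configured log path.
skipped = []
with open("copy-session-log.csv", newline="") as f:
    # skipinitialspace handles the space after each comma in the log format.
    reader = csv.DictReader(f, skipinitialspace=True)
    for row in reader:
        if row.get("OperationName") == "FileSkip":
            skipped.append((row.get("OperationItem"), row.get("Message")))

for name, message in skipped:
    print(f"Skipped {name}: {message}")
```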

## Next steps
See the other Copy Activity articles:

- [Copy activity overview](copy-activity-overview.md)
- [Copy activity fault tolerance](copy-activity-fault-tolerance.md)