|
1 | 1 | ---
|
2 | 2 | title: Copy data from/to a file system
|
3 | 3 | titleSuffix: Azure Data Factory & Azure Synapse
|
4 |
| -description: Learn how to copy data from file system to supported sink data stores (or) from supported source data stores to file system using an Azure Data Factory or Azure Synapse Analytics pipelines. |
| 4 | +description: Learn how to copy data from file system to supported sink data stores, or from supported source data stores to file system, using an Azure Data Factory or Azure Synapse Analytics pipelines. |
5 | 5 | author: jianleishen
|
6 | 6 | ms.service: data-factory
|
7 | 7 | ms.subservice: data-movement
|
@@ -179,9 +179,9 @@ The following properties are supported for file system under `storeSettings` set
|
179 | 179 | | OPTION 3: a list of files<br>- fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset.<br/>When using this option, don't specify file name in dataset. See more examples in [File list examples](#file-list-examples). |No |
|
180 | 180 | | ***Additional settings:*** | | |
|
181 | 181 | | recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. <br>Allowed values are **true** (default) and **false**.<br>This property doesn't apply when you configure `fileListPath`. |No |
|
182 |
| -| deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from source store after successfully moving to the destination store. The file deletion is per file, so when copy activity fails, you'll see some files have already been copied to the destination and deleted from source, while others are still remaining on source store. <br/>This property is only valid in binary files copy scenario. The default value: false. |No | |
183 |
| -| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. <br>The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.<br/>This property doesn't apply when you configure `fileListPath`. | No | |
184 |
| -| modifiedDatetimeEnd | Same as above. | No | |
| 182 | +| deleteFilesAfterCompletion | Indicates whether the binary files are deleted from source store after successfully moving to the destination store. The file deletion is per file. This means that when the activity fails, you see some files already copied to the destination and deleted from source, while others still remain on source store. <br/>This property is only valid in binary files copy scenario. The default value: false. |No | |
| 183 | +| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. <br>The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of _YYYY-MM-DDTHH:mm:ssZ_. <br> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.<br/>This property doesn't apply when you configure `fileListPath`. | No | |
| 184 | +| modifiedDatetimeEnd | Same as modifiedDateTimeStart. | No | |
185 | 185 | | enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as extra source columns.<br/>Allowed values are **false** (default) and **true**. | No |
|
186 | 186 | | partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns.<br/><br/>If it isn't specified, by default,<br/>- When you use file path in dataset or list of files on source, partition root path is the path configured in dataset.<br/>- When you use wildcard folder filter, partition root path is the subpath before the first wildcard.<br/><br/>For example, assuming you configure the path in dataset as "root/folder/year=2020/month=08/day=27":<br/>- If you specify partition root path as "root/folder/year=2020", copy activity generates two more columns `month` and `day` with value "08" and "27" respectively, in addition to the columns inside the files.<br/>- If partition root path isn't specified, no extra column is generated. | No |
|
187 | 187 | | maxConcurrentConnections |The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.| No |
|
@@ -231,6 +231,9 @@ The following properties are supported for file system under `storeSettings` set
|
231 | 231 |
|
232 | 232 | [!INCLUDE [data-factory-v2-file-sink-formats](includes/data-factory-v2-file-sink-formats.md)]
|
233 | 233 |
|
| 234 | +> [!NOTE] |
| 235 | +> The _MergeFiles_ **copyBehavior** option is only available in Azure Data Factory pipelines and not Synapse Analytics pipelines. |
| 236 | +
|
234 | 237 | The following properties are supported for file system under `storeSettings` settings in format-based copy sink:
|
235 | 238 |
|
236 | 239 | | Property | Description | Required |
|
@@ -293,7 +296,7 @@ Assuming you have the following source folder structure and want to copy the fil
|
293 | 296 |
|
294 | 297 | | Sample source structure | Content in FileListToCopy.txt | Pipeline configuration |
|
295 | 298 | | ------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------------------ |
|
296 |
| -| root<br/> FolderA<br/> **File1.csv**<br/> File2.json<br/> Subfolder1<br/> **File3.csv**<br/> File4.json<br/> **File5.csv**<br/> Metadata<br/> FileListToCopy.txt | File1.csv<br>Subfolder1/File3.csv<br>Subfolder1/File5.csv | **In dataset:**<br>- Folder path: `root/FolderA`<br><br>**In copy activity source:**<br>- File list path: `root/Metadata/FileListToCopy.txt` <br><br>The file list path points to a text file in the same data store that includes a list of files you want to copy, one file per line with the relative path to the path configured in the dataset. | |
| 299 | +| root<br/> FolderA<br/> **File1.csv**<br/> File2.json<br/> Subfolder1<br/> **File3.csv**<br/> File4.json<br/> **File5.csv**<br/> Metadata<br/> FileListToCopy.txt | File1.csv<br>Subfolder1/File3.csv<br>Subfolder1/File5.csv | **In dataset:**<br>- Folder path: `root/FolderA`<br><br>**In copy activity source:**<br>- File list path: `root/Metadata/FileListToCopy.txt` <br><br>The file list path points to a text file in the same data store. It includes a list of files you want to copy. Each line contains the relative path to the file based on the root path configured in the dataset. | |
297 | 300 |
|
298 | 301 | ### recursive and copyBehavior examples
|
299 | 302 |
|
@@ -330,10 +333,10 @@ To learn details about the properties, check [Delete activity.](delete-activity.
|
330 | 333 | | Property | Description | Required |
|
331 | 334 | |:--- |:--- |:--- |
|
332 | 335 | | type | The type property of the dataset must be set to: **FileShare** |Yes |
|
333 |
| -| folderPath | Path to the folder. Wildcard filter is supported, allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br/><br/>Examples: rootfolder/subfolder/, see more examples in [Sample linked service and dataset definitions](#sample-linked-service-and-dataset-definitions) and [Folder and file filter examples](#folder-and-file-filter-examples). |No | |
| 336 | +| folderPath | Path to the folder. Wildcard filter is supported. Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br/><br/>Examples: rootfolder/subfolder/, see more examples in [Sample linked service and dataset definitions](#sample-linked-service-and-dataset-definitions) and [Folder and file filter examples](#folder-and-file-filter-examples). |No | |
334 | 337 | | fileName | **Name or wildcard filter** for the files under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. <br/><br/>For filter, allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character).<br/>- Example 1: `"fileName": "*.csv"`<br/>- Example 2: `"fileName": "???20180427.txt"`<br/>Use `^` to escape if your actual file name has wildcard or this escape char inside.<br/><br/>When fileName isn't specified for an output dataset and **preserveHierarchy** isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "*Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]*", for example "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz"; if you copy from tabular source using table name instead of query, the name pattern is "*[table name].[format].[compression if configured]*", for example "MyTable.csv". |No |
|
335 |
| -| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement are impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.| No | |
336 |
| -| modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement are impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.| No | |
| 338 | +| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of _YYYY-MM-DDTHH:mm:ssZ_. <br/><br/> Be aware the overall performance of data movement are impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.| No | |
| 339 | +| modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement are impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.| No | |
337 | 340 | | format | If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions.<br/><br/>If you want to parse or generate files with a specific format, the following file format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](supported-file-formats-and-compression-codecs-legacy.md#text-format), [Json Format](supported-file-formats-and-compression-codecs-legacy.md#json-format), [Avro Format](supported-file-formats-and-compression-codecs-legacy.md#avro-format), [Orc Format](supported-file-formats-and-compression-codecs-legacy.md#orc-format), and [Parquet Format](supported-file-formats-and-compression-codecs-legacy.md#parquet-format) sections. |No (only for binary copy scenario) |
|
338 | 341 | | compression | Specify the type and level of compression for the data. For more information, see [Supported file formats and compression codecs](supported-file-formats-and-compression-codecs-legacy.md#compression-support).<br/>Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**.<br/>Supported levels are: **Optimal** and **Fastest**. |No |
|
339 | 342 |
|
@@ -378,7 +381,7 @@ To learn details about the properties, check [Delete activity.](delete-activity.
|
378 | 381 | | Property | Description | Required |
|
379 | 382 | |:--- |:--- |:--- |
|
380 | 383 | | type | The type property of the copy activity source must be set to: **FileSystemSource** |Yes |
|
381 |
| -| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note when recursive is set to true and sink is file-based store, empty folder/sub-folder won't be copied/created at sink.<br/>Allowed values are: **true** (default), **false** | No | |
| 384 | +| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note when recursive is set to true and sink is file-based store, empty folder/sub-folder aren't be copied/created at sink.<br/>Allowed values are: **true** (default), **false** | No | |
382 | 385 | | maxConcurrentConnections |The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.| No |
|
383 | 386 |
|
384 | 387 | **Example:**
|
@@ -418,7 +421,7 @@ To learn details about the properties, check [Delete activity.](delete-activity.
|
418 | 421 | | Property | Description | Required |
|
419 | 422 | |:--- |:--- |:--- |
|
420 | 423 | | type | The type property of the copy activity sink must be set to: **FileSystemSink** |Yes |
|
421 |
| -| copyBehavior | Defines the copy behavior when the source is files from file-based data store.<br/><br/>Allowed values are:<br/><b>- PreserveHierarchy (default)</b>: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><b>- FlattenHierarchy</b>: all files from the source folder are in the first level of target folder. The target files have autogenerated name. <br/><b>- MergeFiles</b>: merges all files from the source folder to one file. No record deduplication is performed during the merge. If the File Name is specified, the merged file name would be the specified name; otherwise, would be autogenerated file name. | No | |
| 424 | +| copyBehavior | Defines the copy behavior when the source is files from file-based data store.<br/><br/>Allowed values are:<br/><b>- PreserveHierarchy (default)</b>: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><b>- FlattenHierarchy</b>: all files from the source folder are in the first level of target folder. The target file names are autogenerated. <br/><b>- MergeFiles</b>: merges all files from the source folder to one file. No record deduplication is performed during the merge. If the File Name is specified, the merged file name would be the specified name; otherwise, would be autogenerated file name. | No | |
422 | 425 | | maxConcurrentConnections |The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.| No |
|
423 | 426 |
|
424 | 427 | **Example:**
|
|
0 commit comments