This article outlines how to copy data from Google Cloud Storage (GCS). To learn about Azure Data Factory, read the [introductory article](introduction.md).
## Supported capabilities
This Google Cloud Storage connector is supported for the following activities:

- Copy activity
- Lookup activity
- GetMetadata activity
- Delete activity
Specifically, this Google Cloud Storage connector supports copying files as-is or parsing files with the [supported file formats and compression codecs](supported-file-formats-and-compression-codecs.md). It leverages GCS's S3-compatible interoperability.
## Prerequisites
The following setup is required on your Google Cloud Storage account:

1. Enable interoperability for your Google Cloud Storage account.
2. Set the default project that contains the data you want to copy from the target GCS bucket.
3. Create a service account and define the right levels of permissions by using Cloud IAM on GCP.
4. Generate the access keys for this service account.
## Required permissions
To copy data from Google Cloud Storage, make sure the service account has been granted the necessary permissions. The permissions defined for the service account can include `storage.buckets.get` and `storage.buckets.list` for bucket operations, and `storage.objects.get` for object operations.
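If you manage these permissions through a custom Cloud IAM role, the role definition is essentially the list of permissions above. The following is a sketch of a hypothetical custom role payload for the Cloud IAM API, assuming you grant exactly the permissions listed above; the title and description are placeholders:

```json
{
    "title": "ADF GCS Reader",
    "description": "Hypothetical custom role granting the permissions listed above",
    "includedPermissions": [
        "storage.buckets.get",
        "storage.buckets.list",
        "storage.objects.get"
    ],
    "stage": "GA"
}
```

Whether you use a predefined role or a custom role is up to you; the connector only needs the resulting permissions on the target bucket and objects.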
## Getting started
## Linked service properties

The following properties are supported for the Google Cloud Storage linked service:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to **GoogleCloudStorage**. | Yes |
| accessKeyId | ID of the secret access key. To find the access key and secret, see [Prerequisites](#prerequisites). |Yes |
| secretAccessKey | The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or [reference a secret stored in Azure Key Vault](store-credentials-in-key-vault.md). |Yes |
| serviceUrl | Specify the custom GCS endpoint as **`https://storage.googleapis.com`**. | Yes |
| connectVia | The [Integration Runtime](concepts-integration-runtime.md) to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No |
Here is an example:
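A minimal sketch, based on the property table above; the linked service name, key values, and integration runtime reference are placeholders:

```json
{
    "name": "GoogleCloudStorageLinkedService",
    "properties": {
        "type": "GoogleCloudStorage",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            },
            "serviceUrl": "https://storage.googleapis.com"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```

Store the secret access key as a SecureString, as shown, or reference a secret in Azure Key Vault instead, as described in the table above.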
## Dataset properties

The following properties are supported for Google Cloud Storage under `location` settings in a format-based dataset:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property under `location` in dataset must be set to **GoogleCloudStorageLocation**. | Yes |
| bucketName | The GCS bucket name. | Yes |
| folderPath | The path to the folder under the given bucket. If you want to use a wildcard to filter the folder, skip this setting and specify it in the activity source settings. | No |
| fileName | The file name under the given bucket + folderPath. If you want to use a wildcard to filter files, skip this setting and specify it in the activity source settings. | No |
**Example:**

```json
{
    "name": "GoogleCloudStorageDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Google Cloud Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "GoogleCloudStorageLocation",
                "bucketName": "bucketname",
                "folderPath": "folder/subfolder"
            }
        }
    }
}
```
## Copy activity properties

The following properties are supported for Google Cloud Storage under `storeSettings` settings in a format-based copy source:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property under `storeSettings` must be set to **GoogleCloudStorageReadSettings**. | Yes |
|***Locate the files to copy:***|||
| OPTION 1: static path<br> | Copy from the given bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket/folder, additionally specify `wildcardFileName` as `*`. ||
| OPTION 2: GCS prefix<br>- prefix | Prefix for the GCS key name under the given bucket configured in the dataset, used to filter source GCS files. GCS keys whose names start with `bucket_in_dataset/this_prefix` are selected. It utilizes GCS's service-side filter, which provides better performance than a wildcard filter. | No |
| OPTION 3: wildcard<br>- wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in dataset to filter source folders. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | No |
| OPTION 3: wildcard<br>- wildcardFileName | The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath to filter source files. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | Yes |
| OPTION 4: a list of files<br>- fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, each being the relative path to the path configured in the dataset.<br/>When using this option, do not specify the file name in the dataset. See more examples in [File list examples](#file-list-examples). | No |
|***Additional settings:***|||
| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. <br>Allowed values are **true** (default) and **false**.<br>This property doesn't apply when you configure `fileListPath`. |No |
**Example:**

```json
"activities":[
    {
        "name": "CopyFromGoogleCloudStorage",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "GoogleCloudStorageReadSettings",
                    "recursive": true,
                    "wildcardFolderPath": "myfolder*A",
                    "wildcardFileName": "*.csv"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
```
## Delete activity properties

To learn details about the properties, check [Delete activity](delete-activity.md).
## Legacy models
If you were using the Amazon S3 connector to copy data from Google Cloud Storage, it is still supported as-is for backward compatibility. However, we suggest that you use the new model described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
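A backward-compatible setup of that kind points the Amazon S3 connector at the GCS endpoint. The following is a sketch only, assuming the Amazon S3 linked service shape with a custom `serviceUrl`; the name and key values are placeholders:

```json
{
    "name": "GoogleCloudStorageViaS3LinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            },
            "serviceUrl": "https://storage.googleapis.com"
        }
    }
}
```

For new pipelines, use the **GoogleCloudStorage** linked service type shown earlier instead.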
## Next steps
For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see [supported data stores](copy-activity-overview.md#supported-data-stores-and-formats).