
Commit eb825a6

Merge pull request #115770 from linda33wj/master
Update ADF GCP connector article
2 parents d597e59 + cd13625 commit eb825a6


1 file changed (+17, -109 lines)


articles/data-factory/connector-google-cloud-storage.md

Lines changed: 17 additions & 109 deletions
@@ -9,14 +9,14 @@ ms.reviewer: douglasl
 ms.service: data-factory
 ms.workload: data-services
 ms.topic: conceptual
-ms.date: 05/15/2020
+ms.date: 05/19/2020
 ms.author: jingwang
 
 ---
 # Copy data from Google Cloud Storage using Azure Data Factory
 [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)]
 
-This article outlines how to copy data from Google Cloud Storage. To learn about Azure Data Factory, read the [introductory article](introduction.md).
+This article outlines how to copy data from Google Cloud Storage (GCS). To learn about Azure Data Factory, read the [introductory article](introduction.md).
 
 ## Supported capabilities
 
@@ -27,27 +27,22 @@ This Google Cloud Storage connector is supported for the following activities:
 - [GetMetadata activity](control-flow-get-metadata-activity.md)
 - [Delete activity](delete-activity.md)
 
-Specifically, this Google Cloud Storage connector supports copying files as-is or parsing files with the [supported file formats and compression codecs](supported-file-formats-and-compression-codecs.md).
-
->[!NOTE]
->Copying data from Google Cloud Storage leverages the [Amazon S3 connector](connector-amazon-simple-storage-service.md) with corresponding custom S3 endpoint, as Google Cloud Storage provides S3-compatible interoperability.
+Specifically, this Google Cloud Storage connector supports copying files as-is or parsing files with the [supported file formats and compression codecs](supported-file-formats-and-compression-codecs.md). It leverages GCS's S3-compatible interoperability.
 
 ## Prerequisites
 
 The following set-up is required on your Google Cloud Storage account:
 
 1. Enable interoperability for your Google Cloud Storage account
-2. Set the default project which contains the data you want to copy
-3. Create an access key.
+2. Set the default project that contains the data you want to copy from the target GCS bucket
+3. Create a service account and define the right levels of permissions using Cloud IAM on GCP
+4. Generate the access keys for this service account
 
 ![Retrieve access key for Google Cloud Storage](media/connector-google-cloud-storage/google-storage-cloud-settings.png)
 
 ## Required permissions
 
-To copy data from Google Cloud Storage, make sure you have been granted the following permissions:
-
-- **For copy activity execution:**: `s3:GetObject` and `s3:GetObjectVersion` for Object Operations.
-- **For Data Factory GUI authoring**: `s3:ListAllMyBuckets` and `s3:ListBucket`/`s3:GetBucketLocation` for Bucket Operations permissions are additionally required for operations like test connection and browse/navigate file paths. If you don't want to grant these permission, skip test connection in linked service creation page and specify the path directly in dataset settings.
+To copy data from Google Cloud Storage, make sure you have granted the needed permissions. The permissions defined for the service account should include `storage.buckets.get`, `storage.buckets.list`, and `storage.objects.get` for object operations.
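To illustrate the shape of such a grant, here is a hedged sketch of a custom Cloud IAM role that bundles only those three permissions. The role title and description are hypothetical, and the field names assume the Cloud IAM `Role` resource (`title`, `description`, `includedPermissions`, `stage`); a predefined role with equivalent permissions may work just as well.

```json
{
    "title": "ADF GCS Read-Only (hypothetical custom role)",
    "description": "Permissions assumed sufficient for Data Factory to read objects from GCS",
    "includedPermissions": [
        "storage.buckets.get",
        "storage.buckets.list",
        "storage.objects.get"
    ],
    "stage": "GA"
}
```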
 
 ## Getting started
 
@@ -64,7 +59,7 @@ The following properties are supported for Google Cloud Storage linked service:
 | type | The type property must be set to **GoogleCloudStorage**. | Yes |
 | accessKeyId | ID of the secret access key. To find the access key and secret, see [Prerequisites](#prerequisites). |Yes |
 | secretAccessKey | The secret access key itself. Mark this field as a SecureString to store it securely in Data Factory, or [reference a secret stored in Azure Key Vault](store-credentials-in-key-vault.md). |Yes |
-| serviceUrl | Specify the custom S3 endpoint as **`https://storage.googleapis.com`**. | Yes |
+| serviceUrl | Specify the custom GCS endpoint as **`https://storage.googleapis.com`**. | Yes |
 | connectVia | The [Integration Runtime](concepts-integration-runtime.md) to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in private network). If not specified, it uses the default Azure Integration Runtime. |No |
 
 Here is an example:
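The linked service example itself falls outside the lines changed in this diff. For reference, a minimal sketch assembled from the properties table above might look like the following; the service name and key values are placeholders.

```json
{
    "name": "GoogleCloudStorageLinkedService",
    "properties": {
        "type": "GoogleCloudStorage",
        "typeProperties": {
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            },
            "serviceUrl": "https://storage.googleapis.com"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```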
@@ -98,8 +93,8 @@ The following properties are supported for Google Cloud Storage under `location`
 
 | Property | Description | Required |
 | ---------- | ------------------------------------------------------------ | -------- |
-| type | The type property under `location` in dataset must be set to **AmazonS3Location**. | Yes |
-| bucketName | The S3 bucket name. | Yes |
+| type | The type property under `location` in dataset must be set to **GoogleCloudStorageLocation**. | Yes |
+| bucketName | The GCS bucket name. | Yes |
 | folderPath | The path to folder under the given bucket. If you want to use wildcard to filter folder, skip this setting and specify in activity source settings. | No |
 | fileName | The file name under the given bucket + folderPath. If you want to use wildcard to filter files, skip this setting and specify in activity source settings. | No |
 
@@ -117,7 +112,7 @@ The following properties are supported for Google Cloud Storage under `location`
         "schema": [ < physical schema, optional, auto retrieved during authoring > ],
         "typeProperties": {
             "location": {
-                "type": "AmazonS3Location",
+                "type": "GoogleCloudStorageLocation",
                 "bucketName": "bucketname",
                 "folderPath": "folder/subfolder"
             },
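The snippet above shows only the `location` block. For orientation, a sketch of the surrounding dataset definition, assuming a DelimitedText dataset with placeholder names (the wrapper itself is not part of this diff), could look like:

```json
{
    "name": "GoogleCloudStorageDelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Google Cloud Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "location": {
                "type": "GoogleCloudStorageLocation",
                "bucketName": "bucketname",
                "folderPath": "folder/subfolder"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```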
@@ -142,11 +137,12 @@ The following properties are supported for Google Cloud Storage under `storeSett
 
 | Property | Description | Required |
 | ------------------------ | ------------------------------------------------------------ | ----------------------------------------------------------- |
-| type | The type property under `storeSettings` must be set to **AmazonS3ReadSettings**. | Yes |
+| type | The type property under `storeSettings` must be set to **GoogleCloudStorageReadSettings**. | Yes |
 | ***Locate the files to copy:*** | | |
 | OPTION 1: static path<br> | Copy from the given bucket or folder/file path specified in the dataset. If you want to copy all files from a bucket/folder, additionally specify `wildcardFileName` as `*`. | |
-| OPTION 2: wildcard<br>- wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in dataset to filter source folders. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | No |
-| OPTION 2: wildcard<br>- wildcardFileName | The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath to filter source files. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | Yes |
+| OPTION 2: GCS prefix<br>- prefix | Prefix for the GCS key name under the given bucket configured in the dataset to filter source GCS files. GCS keys whose names start with `bucket_in_dataset/this_prefix` are selected. It utilizes GCS's service-side filter, which provides better performance than a wildcard filter. | No |
+| OPTION 3: wildcard<br>- wildcardFolderPath | The folder path with wildcard characters under the given bucket configured in dataset to filter source folders. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | No |
+| OPTION 3: wildcard<br>- wildcardFileName | The file name with wildcard characters under the given bucket + folderPath/wildcardFolderPath to filter source files. <br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | Yes |
 | OPTION 3: a list of files<br>- fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line which is the relative path to the path configured in the dataset.<br/>When using this option, do not specify file name in dataset. See more examples in [File list examples](#file-list-examples). |No |
 | ***Additional settings:*** | | |
 | recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. <br>Allowed values are **true** (default) and **false**.<br>This property doesn't apply when you configure `fileListPath`. |No |
@@ -181,7 +177,7 @@ The following properties are supported for Google Cloud Storage under `storeSett
                 "skipLineCount": 10
             },
             "storeSettings":{
-                "type": "AmazonS3ReadSettings",
+                "type": "GoogleCloudStorageReadSettings",
                 "recursive": true,
                 "wildcardFolderPath": "myfolder*A",
                 "wildcardFileName": "*.csv"
@@ -230,95 +226,7 @@ To learn details about the properties, check [Delete activity](delete-activity.m
 
 ## Legacy models
 
->[!NOTE]
->The following models are still supported as-is for backward compatibility. You are suggested to use the new model mentioned in above sections going forward, and the ADF authoring UI has switched to generating the new model.
-
-### Legacy dataset model
-
-| Property | Description | Required |
-|:--- |:--- |:--- |
-| type | The type property of the dataset must be set to: **AmazonS3Object** |Yes |
-| bucketName | The S3 bucket name. Wildcard filter is not supported. |Yes for Copy/Lookup activity, No for GetMetadata activity |
-| key | The **name or wildcard filter** of S3 object key under the specified bucket. Applies only when "prefix" property is not specified. <br/><br/>The wildcard filter is supported for both folder part and file name part. Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character).<br/>- Example 1: `"key": "rootfolder/subfolder/*.csv"`<br/>- Example 2: `"key": "rootfolder/subfolder/???20180427.txt"`<br/>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). Use `^` to escape if your actual folder/file name has wildcard or this escape char inside. |No |
-| prefix | Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when "key" property is not specified. |No |
-| version | The version of the S3 object, if S3 versioning is enabled. |No |
-| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time are within the time range between `modifiedDatetimeStart` and `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> The properties can be NULL which mean no file attribute filter will be applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value will be selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.| No |
-| modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time are within the time range between `modifiedDatetimeStart` and `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> The properties can be NULL which mean no file attribute filter will be applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value will be selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.| No |
-| format | If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions.<br/><br/>If you want to parse or generate files with a specific format, the following file format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](supported-file-formats-and-compression-codecs-legacy.md#text-format), [Json Format](supported-file-formats-and-compression-codecs-legacy.md#json-format), [Avro Format](supported-file-formats-and-compression-codecs-legacy.md#avro-format), [Orc Format](supported-file-formats-and-compression-codecs-legacy.md#orc-format), and [Parquet Format](supported-file-formats-and-compression-codecs-legacy.md#parquet-format) sections. |No (only for binary copy scenario) |
-| compression | Specify the type and level of compression for the data. For more information, see [Supported file formats and compression codecs](supported-file-formats-and-compression-codecs-legacy.md#compression-support).<br/>Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**.<br/>Supported levels are: **Optimal** and **Fastest**. |No |
-
->[!TIP]
->To copy all files under a folder, specify **bucketName** for bucket and **prefix** for folder part.<br>To copy a single file with a given name, specify **bucketName** for bucket and **key** for folder part plus file name.<br>To copy a subset of files under a folder, specify **bucketName** for bucket and **key** for folder part plus wildcard filter.
-
-**Example: using prefix**
-
-```json
-{
-    "name": "GoogleCloudStorageDataset",
-    "properties": {
-        "type": "AmazonS3Object",
-        "linkedServiceName": {
-            "referenceName": "<linked service name>",
-            "type": "LinkedServiceReference"
-        },
-        "typeProperties": {
-            "bucketName": "testbucket",
-            "prefix": "testFolder/test",
-            "modifiedDatetimeStart": "2018-12-01T05:00:00Z",
-            "modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
-            "format": {
-                "type": "TextFormat",
-                "columnDelimiter": ",",
-                "rowDelimiter": "\n"
-            },
-            "compression": {
-                "type": "GZip",
-                "level": "Optimal"
-            }
-        }
-    }
-}
-```
-
-### Legacy copy activity source model
-
-| Property | Description | Required |
-|:--- |:--- |:--- |
-| type | The type property of the copy activity source must be set to: **FileSystemSource** |Yes |
-| recursive | Indicates whether the data is read recursively from the sub folders or only from the specified folder. Note when recursive is set to true and sink is file-based store, empty folder/sub-folder will not be copied/created at sink.<br/>Allowed values are: **true** (default), **false** | No |
-| maxConcurrentConnections | The number of the connections to connect to storage store concurrently. Specify only when you want to limit the concurrent connection to the data store. | No |
-
-**Example:**
-
-```json
-"activities":[
-    {
-        "name": "CopyFromGoogleCloudStorage",
-        "type": "Copy",
-        "inputs": [
-            {
-                "referenceName": "<input dataset name>",
-                "type": "DatasetReference"
-            }
-        ],
-        "outputs": [
-            {
-                "referenceName": "<output dataset name>",
-                "type": "DatasetReference"
-            }
-        ],
-        "typeProperties": {
-            "source": {
-                "type": "FileSystemSource",
-                "recursive": true
-            },
-            "sink": {
-                "type": "<sink type>"
-            }
-        }
-    }
-]
-```
+If you were using the Amazon S3 connector to copy data from Google Cloud Storage, it is still supported as-is for backward compatibility. However, you are encouraged to use the new model described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
 
 ## Next steps
 For a list of data stores that are supported as sources and sinks by the copy activity in Azure Data Factory, see [supported data stores](copy-activity-overview.md#supported-data-stores-and-formats).
