Description
-----------
This source is used whenever you need to read from Amazon S3.
For example, you may want to read in log files from S3 every hour and then store
the logs in a TimePartitionedFileSet.

Properties
----------
**Reference Name:** Name used to uniquely identify this source for lineage, annotating metadata, etc.

**Path:** Path to read from. For example, s3a://<bucket>/path/to/input

**Format:** Format of the data to read.
The format must be one of 'avro', 'blob', 'csv', 'delimited', 'json', 'parquet', 'text', or 'tsv'.
If the format is 'blob', every input file will be read into a separate record.
The 'blob' format also requires a schema that contains a field named 'body' of type 'bytes'.
If the format is 'text', the schema must contain a field named 'body' of type 'string'.

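As a sketch, a minimal output schema for the 'text' format could be expressed as the following Avro-style JSON (the record name is arbitrary); for the 'blob' format, the 'body' field would instead be declared with type 'bytes':

    {
        "type": "record",
        "name": "textRecord",
        "fields": [
            { "name": "body", "type": "string" }
        ]
    }
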
**Delimiter:** Delimiter to use when the format is 'delimited'. This will be ignored for other formats.

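For instance, with the delimiter set to '|', a hypothetical input line such as 2018-01-01|paris|3 would be parsed into the three fields '2018-01-01', 'paris', and '3'.
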
**Authentication Method:** Authentication method to access S3. The default value is Access Credentials.
IAM can only be used if the plugin is run in an AWS environment, such as on EMR.

**Access ID:** Amazon access ID required for authentication.

**Access Key:** Amazon access key required for authentication.

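As a minimal sketch, a pipeline configuration using Access Credentials might set these properties as shown below; the JSON property names (authenticationMethod, accessID, accessKey, path, format) are assumptions based on the property names above, and all values are placeholders:

    {
        "name": "S3",
        "type": "batchsource",
        "properties": {
            "authenticationMethod": "Access Credentials",
            "accessID": "my-access-id",
            "accessKey": "my-access-key",
            "path": "s3a://my-bucket/path/to/input",
            "format": "text"
        }
    }
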
**Maximum Split Size:** Maximum size in bytes for each input partition.
Smaller partitions will increase the level of parallelism, but will require more resources and overhead.
The default value is 128MB.

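For example, with the default 128MB split size, a splittable 1GB input file would be divided into roughly eight partitions that can be processed in parallel.
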
**Path Field:** Output field in which to place the path of the file that the record was read from.
If not specified, the file path will not be included in output records.
If specified, the field must exist in the output schema as a string.

**Path Filename Only:** Whether to use only the filename instead of the full URI of the file path when a path field is given.
The default value is false.

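For example, for a record read from a hypothetical object s3a://my-bucket/logs/2018-01-01.txt, the path field would contain that full URI; with Path Filename Only set to true, it would contain just 2018-01-01.txt.
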
**Read Files Recursively:** Whether files are to be read recursively from the path. The default value is false.

**Allow Empty Input:** Whether to allow an input path that contains no data. When set to false, the plugin
will fail when there is no data to read. When set to true, no error will be thrown and zero records will be read.

**File System Properties:** Additional properties to use with the InputFormat when reading the data.

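For example, these could be used to point the S3A filesystem at a specific region endpoint. A minimal sketch, assuming the value is given as a JSON map of Hadoop property names to values and using the S3A connector's fs.s3a.endpoint property, might look like:

    {
        "fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com"
    }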