
Commit 463e6f7

Merge branch 'perHeaderDeduplication' into s3Integration
2 parents: 9e2c708 + b4131fa

44 files changed: 705 additions and 617 deletions. Only a subset of the changed files is shown below.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -98,3 +98,4 @@ fabric.properties
 ## Pebble 2
 .lock*
 /data/
+/output/

README.md

Lines changed: 37 additions & 27 deletions
@@ -12,22 +12,6 @@ docker run --rm -t --network hadoop -v "$PWD/output:/output" radarbase/radar-hdf
 ```
 if your docker cluster is running in the `hadoop` network and your output directory should be `./output`.

-## Local build
-
-This package requires at least Java JDK 8. Build the distribution with
-
-```shell
-./gradlew build
-```
-
-and install the package into `/usr/local` with for example
-```shell
-sudo mkdir -p /usr/local
-sudo tar -xzf build/distributions/radar-hdfs-restructure-0.5.7.tar.gz -C /usr/local --strip-components=1
-```
-
-Now the `radar-hdfs-restructure` command should be available.
-
 ## Command line usage

 When the application is installed, it can be used as follows:
@@ -46,32 +30,58 @@ radar-hdfs-restructure --help
 ```
 Note that the options preceded by the `*` in the above output are required to run the app. Also note that there can be multiple input paths from which to read the files. Eg - `/topicAndroidNew/topic1 /topicAndroidNew/topic2 ...`. At least one input path is required.

+Each argument, as well as many more options, can be supplied in a config file. The default name of the config file is `restructure.yml`. Please refer to `restructure.yml` in the current directory for all available options. An alternative file can be specified with the `-F` flag.
+
+### File Format
+
 By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:
 ```shell
 radar-hdfs-restructure --format json --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```

+By default, file records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. This is set to false by default because of an issue with Biovotion data; please see [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it. Deduplication can also be enabled or disabled per topic using the config file. If lines should be deduplicated using a subset of fields, e.g. only `sourceId` and `time` define a unique record and only the last record with duplicate values should be kept, then specify `topics: <topicName>: deduplicateFields: [sourceId, time]`.
+
+### Compression
+
 Another option is to output the data in compressed form. All files will get the `gz` suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size.
 ```
 radar-hdfs-restructure --compression gzip --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```

-By default, files records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. This set to false by default because of an issue with Biovotion data. Please see - [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it.
+### Storage
+
+When using local storage, to set the output user ID and group ID, specify the `-p local-uid=123` and `-p local-gid=12` properties.
+
+### Service
+
+To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval, or use the corresponding configuration file parameters.
+
+## Local build
+
+This package requires at least Java JDK 8. Build the distribution with
+
+```shell
+./gradlew build
+```

-To set the output user ID and group ID, specify the `-p local-uid=123` and `-p local-gid=12` properties.
+and install the package into `/usr/local` with for example
+```shell
+sudo mkdir -p /usr/local
+sudo tar -xzf build/distributions/radar-hdfs-restructure-0.5.7.tar.gz -C /usr/local --strip-components=1
+```

-To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval.
+Now the `radar-hdfs-restructure` command should be available.

-## Extending the connector
+### Extending the connector

 To implement alternative storage paths, storage drivers or storage formats, put your custom JAR in
 `$APP_DIR/lib/radar-hdfs-plugins`. To load them, use the following options:

-| Option | Class | Behaviour | Default |
-| ----------------------- | ------------------------------------------- | ------------------------------------------ | ------------------------- |
-| `--path-factory` | `org.radarbase.hdfs.RecordPathFactory` | Factory to create output path names with. | ObservationKeyPathFactory |
-| `--storage-driver` | `org.radarbase.hdfs.data.StorageDriver` | Storage driver to use for storing data. | LocalStorageDriver |
-| `--format-factory` | `org.radarbase.hdfs.data.FormatFactory` | Factory for output formats. | FormatFactory |
-| `--compression-factory` | `org.radarbase.hdfs.data.CompressionFactory` | Factory class to use for data compression. | CompressionFactory |
+| Parameter | Base class | Behaviour | Default |
+| --------------------------- | --------------------------------------------------- | ------------------------------------------ | ------------------------- |
+| `paths: factory: ...` | `org.radarbase.hdfs.path.RecordPathFactory` | Factory to create output path names with. | ObservationKeyPathFactory |
+| `storage: factory: ...` | `org.radarbase.hdfs.storage.StorageDriver` | Storage driver to use for storing data. | LocalStorageDriver |
+| `format: factory: ...` | `org.radarbase.hdfs.format.FormatFactory` | Factory for output formats. | FormatFactory |
+| `compression: factory: ...` | `org.radarbase.hdfs.compression.CompressionFactory` | Factory class to use for data compression. | CompressionFactory |

-To pass arguments to self-assigned plugins, use `-p arg1=value1 -p arg2=value2` command-line flags and read those arguments in the `Plugin#init(Map<String, String>)` method.
+The respective `<type>: properties: {}` configuration parameters can be used to provide custom configuration of the factory. This configuration will be passed to the `Plugin#init(Map<String, String>)` method.
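
The command-line flags documented in the updated README can be combined in a single invocation. The following sketch is illustrative only: the HDFS name node, output directory, and topic paths are placeholders, and all flags used (`--format`, `--compression`, `--deduplicate`, `--nameservice`, `--output-directory`) come from the README text above.

```shell
# Illustrative invocation; replace the name node, output directory and
# topic paths with values for your own cluster.
radar-hdfs-restructure \
  --format json \
  --compression gzip \
  --deduplicate \
  --nameservice hdfs-namenode \
  --output-directory ./output \
  /topicAndroidNew/topic1 /topicAndroidNew/topic2
```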

build.gradle

Lines changed: 4 additions & 0 deletions
@@ -36,11 +36,15 @@ repositories {
 dependencies {
     api group: 'org.apache.avro', name: 'avro', version: avroVersion
     implementation group: 'com.fasterxml.jackson.core' , name: 'jackson-databind', version: jacksonVersion
+    implementation group: 'com.fasterxml.jackson.dataformat' , name: 'jackson-dataformat-yaml', version: jacksonVersion
     implementation group: 'com.fasterxml.jackson.dataformat' , name: 'jackson-dataformat-csv', version: jacksonVersion
+    implementation("com.fasterxml.jackson.module:jackson-module-kotlin:$jacksonVersion")
+
     implementation group: 'com.beust', name: 'jcommander', version: jCommanderVersion
     implementation group: 'com.almworks.integers', name: 'integers', version: almworksVersion

     implementation 'software.amazon.awssdk:s3:2.10.3'
+    implementation 'com.opencsv:opencsv:5.0'

     implementation group: 'org.apache.avro', name: 'avro-mapred', version: avroVersion
     implementation group: 'org.apache.hadoop', name: 'hadoop-common', version: hadoopVersion

restructure.yml

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+service:
+  # Whether to run the application as a polling service.
+  enable: false
+  # Polling interval in seconds.
+  interval: 30
+
+# Compression characteristics
+compression:
+  # Compression type: none, zip or gzip
+  type: gzip
+  # Compression Factory class
+  # factory: org.radarbase.hdfs.compression.CompressionFactory
+  # Additional compression properties
+  # properties: {}
+
+# File format
+format:
+  # Format type: CSV or JSON
+  type: csv
+  # Whether to deduplicate the files in each topic by default
+  deduplication:
+    enable: true
+    # Use specific fields to consider records distinct. Disregarded if empty.
+    # distinctFields: []
+    # Ignore specific fields to consider records distinct. Disregarded if empty.
+    # ignoreFields: []
+  # Format factory class
+  # factory: org.radarbase.hdfs.format.FormatFactory
+  # Additional format properties
+  # properties: {}
+
+# Worker settings
+worker:
+  # Maximum number of files and converters to keep open while processing
+  cacheSize: 300
+  # Number of threads to do processing on
+  numThreads: 2
+  # Maximum number of files to process in any given topic.
+  maxFilesPerTopic: null
+
+# Path settings
+paths:
+  # Input directories in HDFS
+  inputs:
+    - /topicAndroidNew
+  # Root temporary directory for local file processing.
+  temp: ./output/+tmp
+  # Output directory
+  output: ./output
+  # Output path construction factory
+  factory: org.radarbase.hdfs.path.MonthlyObservationKeyPathFactory
+  # Additional properties
+  # properties: {}
+
+# Individual topic configuration
+topics:
+  # topic name
+  connect_fitbit_source:
+    # deduplicate this topic, regardless of the format settings
+    deduplication:
+      enable: true
+      # deduplicate this topic only using given fields.
+      distinctFields: [key.sourceId, value.time]
+  # topic name
+  connect_fitbit_source2:
+    # deduplicate this topic, regardless of the format settings
+    deduplication:
+      enable: true
+      # deduplicate this topic without regard of given fields.
+      ignoreFields: [value.timeReceived]
+  connect_fitbit_bad:
+    # Do not process this topic
+    exclude: true
+  biovotion_acceleration:
+    # Disable deduplication
+    deduplication:
+      enable: false
+
+# HDFS settings
+hdfs:
+  # HDFS name node in case of a single name node, or HDFS cluster ID in case of high availability.
+  name: hdfs-namenode
+  # High availability settings:
+  # nameNodes:
+  #   - name: hdfs1
+  #     hostname: hdfs-namenode-1
+  #   - name: hdfs2
+  #     hostname: hdfs-namenode-2
+  # Where files will be locked. This value should be the same for all restructure processes.
+  lockPath: /logs/org.radarbase.hdfs/lock
+  # Additional raw HDFS configuration properties
+  # properties: {}
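
Per the README changes above, a file named `restructure.yml` in the working directory is picked up by default, and an alternative path can be selected with `-F`. A sketch of both invocations, assuming the `radar-hdfs-restructure` command is on the PATH; the `/etc/radar/restructure.yml` path is only an example:

```shell
# Uses ./restructure.yml from the working directory by default
radar-hdfs-restructure

# Explicitly select a config file; with service.enable set to true in the
# config, the process keeps polling HDFS at the configured interval.
radar-hdfs-restructure -F /etc/radar/restructure.yml
```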
