Commit 779d34c: Merge pull request #56 from RADAR-base/release-0.6.0

Release 0.6.0

2 parents: 5fa5b2a + 35c4acd

107 files changed: +4793 additions, -5561 deletions (large commit; only a subset of the changed files is shown below)

.gitignore

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,6 @@ obj/
 *.ap_

 # Generated files
-bin/
 gen/
 out/
 build/
@@ -98,3 +97,4 @@ fabric.properties
 ## Pebble 2
 .lock*
 /data/
+/output/

Dockerfile

Lines changed: 3 additions & 2 deletions

@@ -29,13 +29,14 @@ COPY ./src /code/src

 RUN ./gradlew jar

-FROM smizy/hadoop-base:3.0.3-alpine
+FROM gradiant/hadoop-base:3.1.2

 MAINTAINER Joris Borgdorff <[email protected]>, Yatharth Ranjan<[email protected]>

 LABEL description="RADAR-base HDFS data restructuring"

-ENV JAVA_OPTS -Djava.library.path=${HADOOP_HOME}/lib/native
+ENV JAVA_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native -Djava.security.egd=file:/dev/./urandom -XX:+UseG1GC -XX:MaxHeapFreeRatio=10 -XX:MinHeapFreeRatio=10" \
+    LD_LIBRARY_PATH=/lib64

 RUN apk add --no-cache libc6-compat

README.md

Lines changed: 67 additions & 30 deletions

@@ -2,32 +2,20 @@

 [![Build Status](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic.svg?branch=master)](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic)

-Data streamed to HDFS using the [RADAR HDFS sink connector](https://github.com/RADAR-CNS/RADAR-HDFS-Sink-Connector) is streamed to files based on sensor only. This package can transform that output to a local directory structure as follows: `userId/topic/date_hour.csv`. The date and hour is extracted from the `time` field of each record, and is formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-CNS/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh` script.
+Data streamed to HDFS using the [RADAR HDFS sink connector](https://github.com/RADAR-base/RADAR-HDFS-Sink-Connector) is streamed to files based on sensor only. This package can transform that output to a local directory structure as follows: `userId/topic/date_hour.csv`. The date and hour is extracted from the `time` field of each record, and is formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-base/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh` script.
+
+_Note_: when upgrading to version 0.6.0, please follow the following instructions:
+- Write configuration file `restructure.yml` to match settings used with 0.5.x.
+- If needed, move all entries of `offsets.csv` to their per-topic file in `offsets/<topic>.csv`. First go to the output directory, then run the `bin/migrate-offsets-to-0.6.0.sh` script.

 ## Docker usage

 This package is available as docker image [`radarbase/radar-hdfs-restructure`](https://hub.docker.com/r/radarbase/radar-hdfs-restructure). The entrypoint of the image is the current application. So in all of the commands listed in usage, replace `radar-hdfs-restructure` with for example:
 ```shell
-docker run --rm -t --network hadoop -v "$PWD/output:/output" radarbase/radar-hdfs-restructure:0.5.7 -n hdfs-namenode -o /output /myTopic
+docker run --rm -t --network hadoop -v "$PWD/output:/output" radarbase/radar-hdfs-restructure:0.6.0 -n hdfs-namenode -o /output /myTopic
 ```
 if your docker cluster is running in the `hadoop` network and your output directory should be `./output`.

-## Local build
-
-This package requires at least Java JDK 8. Build the distribution with
-
-```shell
-./gradlew build
-```
-
-and install the package into `/usr/local` with for example
-```shell
-sudo mkdir -p /usr/local
-sudo tar -xzf build/distributions/radar-hdfs-restructure-0.5.7.tar.gz -C /usr/local --strip-components=1
-```
-
-Now the `radar-hdfs-restructure` command should be available.
-
 ## Command line usage

 When the application is installed, it can be used as follows:
@@ -46,32 +34,81 @@ radar-hdfs-restructure --help
 ```
 Note that the options preceded by the `*` in the above output are required to run the app. Also note that there can be multiple input paths from which to read the files. Eg - `/topicAndroidNew/topic1 /topicAndroidNew/topic2 ...`. At least one input path is required.

+Each argument, as well as much more, can be supplied in a config file. The default name of the config file is `restructure.yml`. Please refer to `restructure.yml` in the current directory for all available options. An alternative file can be specified with the `-F` flag.
+
+### File Format
+
 By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:
 ```shell
 radar-hdfs-restructure --format json --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```

-Another option is to output the data in compressed form. All files will get the `gz` suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size.
+By default, files records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. This set to false by default because of an issue with Biovotion data. Please see - [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it. Deduplication can also be enabled or disabled per topic using the config file. If lines should be deduplicated using a subset of fields, e.g. only `sourceId` and `time` define a unique record and only the last record with duplicate values should be kept, then specify `topics: <topicName>: deduplication: distinctFields: [key.sourceId, value.time]`.
+
+### Compression
+
+Another option is to output the data in compressed form. All files will get the `gz` suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size. Zip compression is also available.
 ```
 radar-hdfs-restructure --compression gzip --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```

-By default, files records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. This set to false by default because of an issue with Biovotion data. Please see - [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it.
+### Storage

-To set the output user ID and group ID, specify the `-p local-uid=123` and `-p local-gid=12` properties.
+There are two storage drivers implemented: `org.radarbase.hdfs.storage.LocalStorageDriver` for an output directory on the local file system or `org.radarbase.hdfs.storage.S3StorageDriver` for storage on an object store.

-To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval.
+`LocalStorageDriver` takes the following properties:
+```yaml
+storage:
+  factory: org.radarbase.hdfs.storage.LocalStorageDriver
+  properties:
+    # User ID to write data as
+    localUid: 123
+    # Group ID to write data as
+    localGid: 123
+```
+
+With the `S3StorageDriver`, use the following configuration instead:
+```yaml
+storage:
+  factory: org.radarbase.hdfs.storage.S3StorageDriver
+  properties:
+    # Object store URL
+    s3EndpointUrl: s3://my-region.s3.aws.amazon.com
+    # Bucket to use
+    s3Bucket: myBucketName
+```
+Ensure that the environment variables contain the authorized AWS keys that allow the service to list, download and upload files to the respective bucket.
+
+### Service
+
+To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval or use the corresponding configuration file parameters.
+
+## Local build
+
+This package requires at least Java JDK 8. Build the distribution with
+
+```shell
+./gradlew build
+```
+
+and install the package into `/usr/local` with for example
+```shell
+sudo mkdir -p /usr/local
+sudo tar -xzf build/distributions/radar-hdfs-restructure-0.6.0.tar.gz -C /usr/local --strip-components=1
+```
+
+Now the `radar-hdfs-restructure` command should be available.

-## Extending the connector
+### Extending the connector

 To implement alternative storage paths, storage drivers or storage formats, put your custom JAR in
 `$APP_DIR/lib/radar-hdfs-plugins`. To load them, use the following options:

-| Option | Class | Behaviour | Default |
-| ----------------------- | ------------------------------------------- | ------------------------------------------ | ------------------------- |
-| `--path-factory` | `org.radarcns.hdfs.RecordPathFactory` | Factory to create output path names with. | ObservationKeyPathFactory |
-| `--storage-driver` | `org.radarcns.hdfs.data.StorageDriver` | Storage driver to use for storing data. | LocalStorageDriver |
-| `--format-factory` | `org.radarcns.hdfs.data.FormatFactory` | Factory for output formats. | FormatFactory |
-| `--compression-factory` | `org.radarcns.hdfs.data.CompressionFactory` | Factory class to use for data compression. | CompressionFactory |
+| Parameter | Base class | Behaviour | Default |
+| --------------------------- | --------------------------------------------------- | ------------------------------------------ | ------------------------- |
+| `paths: factory: ...` | `org.radarbase.hdfs.path.RecordPathFactory` | Factory to create output path names with. | ObservationKeyPathFactory |
+| `storage: factory: ...` | `org.radarbase.hdfs.storage.StorageDriver` | Storage driver to use for storing data. | LocalStorageDriver |
+| `format: factory: ...` | `org.radarbase.hdfs.format.FormatFactory` | Factory for output formats. | FormatFactory |
+| `compression: factory: ...` | `org.radarbase.hdfs.compression.CompressionFactory` | Factory class to use for data compression. | CompressionFactory |

-To pass arguments to self-assigned plugins, use `-p arg1=value1 -p arg2=value2` command-line flags and read those arguments in the `Plugin#init(Map<String, String>)` method.
+The respective `<type>: properties: {}` configuration parameters can be used to provide custom configuration of the factory. This configuration will be passed to the `Plugin#init(Map<String, String>)` method.
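Taken together, the per-driver and per-topic options introduced in this README revision can be combined in a single `restructure.yml`. A minimal sketch using only keys named in the commit; the topic name `test_topic` is an illustrative placeholder, not taken from the repository:

```yaml
# Sketch of a restructure.yml; keys follow the README sections above.
storage:
  factory: org.radarbase.hdfs.storage.LocalStorageDriver
  properties:
    localUid: 123   # user ID to write data as
    localGid: 123   # group ID to write data as
topics:
  test_topic:       # placeholder topic name
    deduplication:
      distinctFields: [key.sourceId, value.time]
```

An alternative file name can be passed with the `-F` flag, as described above.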

bin/migrate-offsets-to-0.6.0.sh

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
+#!/bin/sh
+
+set -e
+if [ ! -f offsets.csv ]; then
+  echo "Can only migrate offsets if the current directory contains offsets.csv"
+  exit 1
+fi
+
+mkdir -p offsets
+TOPICS=$(tail -n+2 offsets.csv | cut -d , -f 4 | sort -u)
+for topic in $TOPICS; do
+  target="offsets/$topic.csv"
+  echo "Updating $target"
+  if [ ! -f "$target" ]; then
+    head -n 1 offsets.csv > "$target"
+  fi
+  grep ",$topic\$" offsets.csv >> "$target"
+done
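The migration can be tried end to end on a synthetic `offsets.csv`. The header row below is an illustrative assumption (the commit does not show the real column names); the script only relies on the topic name being the fourth comma-separated field:

```shell
#!/bin/sh
# Demo of the per-topic split performed by bin/migrate-offsets-to-0.6.0.sh,
# run against a synthetic offsets.csv (column names are assumed for illustration).
set -e
workdir=$(mktemp -d)
cd "$workdir"

printf 'offsetFrom,offsetTo,partition,topic\n' >  offsets.csv
printf '0,10,0,topic1\n'                       >> offsets.csv
printf '0,20,1,topic2\n'                       >> offsets.csv
printf '11,30,0,topic1\n'                      >> offsets.csv

# Same logic as the script: one offsets/<topic>.csv per topic,
# each seeded with a copy of the original header row.
mkdir -p offsets
for topic in $(tail -n+2 offsets.csv | cut -d , -f 4 | sort -u); do
  target="offsets/$topic.csv"
  [ -f "$target" ] || head -n 1 offsets.csv > "$target"
  grep ",$topic\$" offsets.csv >> "$target"
done

ls offsets
```

Afterwards `offsets/topic1.csv` holds the header plus the two `topic1` rows, and `offsets/topic2.csv` the header plus one row.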

build.gradle

Lines changed: 31 additions & 9 deletions

@@ -4,11 +4,12 @@ plugins {
     id 'application'
     id 'com.jfrog.bintray' version '1.8.4'
     id 'maven-publish'
+    id 'org.jetbrains.kotlin.jvm' version '1.3.50'
 }

-group 'org.radarcns'
-version '0.5.7'
-mainClassName = 'org.radarcns.hdfs.Application'
+group 'org.radarbase'
+version '0.6.0'
+mainClassName = 'org.radarbase.hdfs.Application'

 sourceCompatibility = '1.8'
 targetCompatibility = '1.8'
@@ -21,11 +22,11 @@ ext {
     issueUrl = "${githubUrl}/issues"

     avroVersion = '1.8.2'
-    jacksonVersion = '2.9.6'
-    hadoopVersion = '3.0.3'
-    jCommanderVersion = '1.72'
+    jacksonVersion = '2.10.0'
+    hadoopVersion = '3.1.2'
+    jCommanderVersion = '1.78'
     almworksVersion = '1.1.1'
-    junitVersion = '5.4.0-M1'
+    junitVersion = '5.5.2'
 }

 repositories {
@@ -35,20 +36,39 @@ repositories {
 dependencies {
     api group: 'org.apache.avro', name: 'avro', version: avroVersion
     implementation group: 'com.fasterxml.jackson.core' , name: 'jackson-databind', version: jacksonVersion
+    implementation group: 'com.fasterxml.jackson.dataformat' , name: 'jackson-dataformat-yaml', version: jacksonVersion
     implementation group: 'com.fasterxml.jackson.dataformat' , name: 'jackson-dataformat-csv', version: jacksonVersion
+    implementation("com.fasterxml.jackson.module:jackson-module-kotlin:$jacksonVersion")
+
     implementation group: 'com.beust', name: 'jcommander', version: jCommanderVersion
     implementation group: 'com.almworks.integers', name: 'integers', version: almworksVersion

+    implementation 'software.amazon.awssdk:s3:2.10.3'
+    implementation 'com.opencsv:opencsv:5.0'
+
     implementation group: 'org.apache.avro', name: 'avro-mapred', version: avroVersion
     implementation group: 'org.apache.hadoop', name: 'hadoop-common', version: hadoopVersion

+    implementation "org.jetbrains.kotlin:kotlin-stdlib-jdk8"
+
     runtimeOnly group: 'org.apache.hadoop', name: 'hadoop-hdfs-client', version: hadoopVersion

     testCompile group: 'org.junit.jupiter', name: 'junit-jupiter-api', version: junitVersion
     testCompile group: 'org.junit.jupiter', name: 'junit-jupiter-params', version: junitVersion
     testRuntime group: 'org.junit.jupiter', name: 'junit-jupiter-engine', version: junitVersion
 }

+compileKotlin {
+    kotlinOptions {
+        jvmTarget = "1.8"
+    }
+}
+compileTestKotlin {
+    kotlinOptions {
+        jvmTarget = "1.8"
+    }
+}
+
 ext.sharedManifest = manifest {
     attributes(
         "Implementation-Title": rootProject.name,
@@ -87,6 +107,8 @@ test {
     useJUnitPlatform()
     testLogging {
         events "passed", "skipped", "failed"
+        showStandardStreams = true
+        setExceptionFormat("full")
     }
 }

@@ -172,7 +194,7 @@ bintray {
     pkg {
         repo = project.group
         name = rootProject.name
-        userOrg = 'radar-cns'
+        userOrg = 'radar-base'
         desc = moduleDescription
         licenses = ['Apache-2.0']
         websiteUrl = website
@@ -190,5 +212,5 @@ bintray {
 }

 wrapper {
-    gradleVersion '5.4.1'
+    gradleVersion '5.6.3'
 }
gradle/wrapper/gradle-wrapper.properties

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 distributionBase=GRADLE_USER_HOME
 distributionPath=wrapper/dists
-distributionUrl=https\://services.gradle.org/distributions/gradle-5.4.1-bin.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-5.6.3-bin.zip
 zipStoreBase=GRADLE_USER_HOME
 zipStorePath=wrapper/dists

gradlew

Lines changed: 3 additions & 3 deletions

@@ -7,7 +7,7 @@
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
-#      http://www.apache.org/licenses/LICENSE-2.0
+#      https://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
@@ -125,8 +125,8 @@ if $darwin; then
     GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
 fi

-# For Cygwin, switch paths to Windows format before running java
-if $cygwin ; then
+# For Cygwin or MSYS, switch paths to Windows format before running java
+if [ "$cygwin" = "true" -o "$msys" = "true" ] ; then
     APP_HOME=`cygpath --path --mixed "$APP_HOME"`
     CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`
     JAVACMD=`cygpath --unix "$JAVACMD"`

gradlew.bat

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@
 @rem you may not use this file except in compliance with the License.
 @rem You may obtain a copy of the License at
 @rem
-@rem     http://www.apache.org/licenses/LICENSE-2.0
+@rem     https://www.apache.org/licenses/LICENSE-2.0
 @rem
 @rem Unless required by applicable law or agreed to in writing, software
 @rem distributed under the License is distributed on an "AS IS" BASIS,
