Commit 642e01e

Merge pull request #214 from RADAR-base/release-2.0.0
Release 2.0.0
2 parents e18f31a + d55e5f3

13 files changed (+207 −314 lines)

README.md

Lines changed: 14 additions & 24 deletions
@@ -1,12 +1,12 @@
 # Restructure Kafka connector output files
 
-[![Build Status](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic.svg?branch=master)](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic)
-
 Data streamed by a Kafka Connector will be converted to a RADAR-base oriented output directory, by organizing it by project, user and collection date.
-It supports data written by [RADAR HDFS sink connector](https://github.com/RADAR-base/RADAR-HDFS-Sink-Connector) is streamed to files based on topic name only. This package transforms that output to a local directory structure as follows: `projectId/userId/topic/date_hour.csv`. The date and hour are extracted from the `time` field of each record, and is formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-base/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/bin/hdfs-restructure` script.
+It supports data written by the [RADAR S3 sink connector](https://github.com/RADAR-base/RADAR-S3-Connector), which is streamed to files based on topic name only. This package transforms that output to a local directory structure as follows: `projectId/userId/topic/date_hour.csv`. The date and hour are extracted from the `time` field of each record and are formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-base/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/bin/hdfs-restructure` script.
 
 ## Upgrade instructions
 
+Since version 2.0.0, HDFS is no longer supported; the only supported storage types are AWS S3, Azure Blob Storage, and the local file system. If HDFS is still needed, implement an HDFS source storage factory with constructor `org.radarbase.output.source.HdfsSourceStorageFactory(resourceConfig: ResourceConfig, tempPath: Path)` and method `createSourceStorage(): SourceStorage`. This implementation may be added as a separate JAR in the `lib/radar-output-plugins/` directory of the installed distribution.
+
 When upgrading to version 1.2.0, please follow these instructions:
 
 - When using local target storage, ensure that:
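The upgrade note in the hunk above names only a constructor and a method. Below is a minimal Kotlin sketch of such a plugin, assuming `SourceStorage` lives in the same package and `ResourceConfig` under `org.radarbase.output.config`; the class body is hypothetical and would still need a real HDFS client:

```kotlin
package org.radarbase.output.source

import java.nio.file.Path
import org.radarbase.output.config.ResourceConfig // import path assumed

// Hypothetical sketch: only the constructor signature and method name come
// from the upgrade instructions; the body must adapt an actual HDFS client.
class HdfsSourceStorageFactory(
    private val resourceConfig: ResourceConfig,
    private val tempPath: Path
) {
    fun createSourceStorage(): SourceStorage =
        TODO("Wrap an HDFS client in the SourceStorage interface")
}
```

Built as a JAR and placed in `lib/radar-output-plugins/`, it would be picked up through the start-script classpath configured in `build.gradle.kts` further down.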
@@ -70,21 +70,11 @@ When upgrading to version 0.6.0 from version 0.5.x or earlier, please follow the
 This package is available as docker image [`radarbase/radar-output-restructure`](https://hub.docker.com/r/radarbase/radar-output-restructure). The entrypoint of the image is the current application. So in all the commands listed in usage, replace `radar-output-restructure` with for example:
 
 ```shell
-docker run --rm -t --network hadoop -v "$PWD/output:/output" radarbase/radar-output-restructure:1.2.1 -n hdfs-namenode -o /output /myTopic
+docker run --rm -t --network s3 -v "$PWD/output:/output" radarbase/radar-output-restructure:2.0.0 -o /output /myTopic
 ```
 
 ## Command line usage
 
-When the application is installed, it can be used as follows:
-
-```shell
-radar-output-restructure --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
-```
-or you can use the short form as well:
-```shell
-radar-output-restructure -n <hdfs_node> -o <output_folder> <input_path_1> [<input_path_2> ...]
-```
-
 To display the usage and all available options you can use the help option as follows:
 ```shell
 radar-output-restructure --help
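The old usage examples are removed above without a replacement; a sketch of the equivalent 2.0.0 invocation, using only the flags that remain in this README (the HDFS `--nameservice`/`-n` flag is gone):

```shell
radar-output-restructure --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
# or, with the short form of the flag:
radar-output-restructure -o <output_folder> <input_path_1> [<input_path_2> ...]
```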
@@ -97,7 +87,7 @@ Each argument, as well as much more, can be supplied in a config file. The defau
 
 By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:
 ```shell
-radar-output-restructure --format json --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
+radar-output-restructure --format json --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```
 
 By default, file records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. This is set to false by default because of an issue with Biovotion data. Please see [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it. Deduplication can also be enabled or disabled per topic using the config file. If lines should be deduplicated using a subset of fields, e.g. only `sourceId` and `time` define a unique record and only the last record with duplicate values should be kept, then specify `topics: <topicName>: deduplication: distinctFields: [key.sourceId, value.time]`.
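The inline `topics: <topicName>: deduplication: distinctFields: [...]` notation above unpacks into a config-file fragment like this sketch (the topic name is a hypothetical placeholder):

```yaml
topics:
  my_topic:  # hypothetical topic name
    deduplication:
      distinctFields: [key.sourceId, value.time]
```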
@@ -106,7 +96,7 @@ By default, files records are not deduplicated after writing. To enable this beh
 
 Another option is to output the data in compressed form. All files will get the `gz` suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size. Zip compression is also available.
 ```
-radar-output-restructure --compression gzip --nameservice <hdfs_node> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
+radar-output-restructure --compression gzip --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
 ```
 
 ### Redis
@@ -115,26 +105,26 @@ This package assumes a Redis service running. See the example `restructure.yml`
 
 ### Source and target
 
-The `source` and `target` properties contain resource descriptions. The source can have two types, `hdfs` and `s3`:
+The `source` and `target` properties contain resource descriptions. The source can have two types, `azure` and `s3`:
 
 ```yaml
 source:
-  type: s3  # hdfs or s3
+  type: s3  # azure or s3
   s3:
     endpoint: http://localhost:9000  # using AWS S3 endpoint is also possible.
     bucket: radar
    accessToken: minioadmin
     secretKey: minioadmin
   # only actually needed if source type is hdfs
-  hdfs:
-    nameNodes: [hdfs-namenode-1, hdfs-namenode-2]
+  azure:
+    # azure options
 ```
 
-The target is similar, but it does not support HDFS, but the local file system (`local`) or `s3`.
+The target is similar, and in addition supports the local file system (`local`).
 
 ```yaml
 target:
-  type: s3  # s3 or local
+  type: s3  # s3, local or azure
   s3:
     endpoint: http://localhost:9000
     bucket: out
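Read together, a complete post-2.0.0 `source`/`target` pair might look like the sketch below; the `s3` keys are the ones shown above, while the `local` target type comes from the prose and its sub-options, not shown in this hunk, are omitted:

```yaml
source:
  type: s3
  s3:
    endpoint: http://localhost:9000
    bucket: radar
    accessToken: minioadmin
    secretKey: minioadmin

target:
  type: local
  # local-specific options omitted here
```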
@@ -179,7 +169,7 @@ The cleaner can also be enabled with the `--cleaner` command-line flag. To run t
 
 ### Service
 
-To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval or use the corresponding configuration file parameters.
+To run the output generator as a service that will regularly poll the source directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval, or use the corresponding configuration file parameters.
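A sketch of a service invocation using only the flags named above (no `--interval` value is shown, since this diff does not state its units or default):

```shell
radar-output-restructure --service --output-directory <output_folder> <input_path_1>
```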
 
 ## Local build
 
@@ -192,7 +182,7 @@ This package requires at least Java JDK 8. Build the distribution with
 and install the package into `/usr/local` with for example
 ```shell
 sudo mkdir -p /usr/local
-sudo tar -xzf build/distributions/radar-output-restructure-1.2.1.tar.gz -C /usr/local --strip-components=1
+sudo tar -xzf build/distributions/radar-output-restructure-2.0.0.tar.gz -C /usr/local --strip-components=1
 ```
 
 Now the `radar-output-restructure` command should be available.

build.gradle.kts

Lines changed: 83 additions & 86 deletions
@@ -1,6 +1,5 @@
 import com.github.benmanes.gradle.versions.updates.DependencyUpdatesTask
 import org.gradle.api.tasks.testing.logging.TestExceptionFormat.FULL
-import org.jetbrains.dokka.gradle.DokkaTask
 import org.jetbrains.kotlin.gradle.tasks.KotlinCompile
 import java.time.Duration
 

@@ -16,22 +15,18 @@ plugins {
 }
 
 group = "org.radarbase"
-version = "1.2.1"
+version = "2.0.0"
+
+repositories {
+    mavenCentral()
+}
 
 description = "RADAR-base output restructuring"
 val website = "https://radar-base.org"
 val githubRepoName = "RADAR-base/radar-output-restructure"
 val githubUrl = "https://github.com/${githubRepoName}"
 val issueUrl = "${githubUrl}/issues"
 
-application {
-    mainClass.set("org.radarbase.output.Application")
-}
-
-repositories {
-    mavenCentral()
-}
-
 sourceSets {
     create("integrationTest") {
         compileClasspath += sourceSets.main.get().output
@@ -51,13 +46,18 @@ configurations["integrationTestRuntimeOnly"].extendsFrom(
 dependencies {
     val avroVersion: String by project
     api("org.apache.avro:avro:$avroVersion")
+    val snappyVersion: String by project
+    runtimeOnly("org.xerial.snappy:snappy-java:$snappyVersion")
+
+    implementation(kotlin("reflect"))
 
     val jacksonVersion: String by project
-    implementation("com.fasterxml.jackson.core:jackson-databind:$jacksonVersion")
-    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:$jacksonVersion")
-    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-csv:$jacksonVersion")
-    implementation("com.fasterxml.jackson.module:jackson-module-kotlin:$jacksonVersion")
-    implementation("com.fasterxml.jackson.datatype:jackson-datatype-jsr310:$jacksonVersion")
+    implementation("com.fasterxml.jackson:jackson-bom:$jacksonVersion")
+    implementation("com.fasterxml.jackson.core:jackson-databind")
+    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-yaml")
+    implementation("com.fasterxml.jackson.dataformat:jackson-dataformat-csv")
+    implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
+    implementation("com.fasterxml.jackson.datatype:jackson-datatype-jsr310")
 
     val jedisVersion: String by project
     implementation("redis.clients:jedis:$jedisVersion")
@@ -75,16 +75,11 @@ dependencies {
     implementation("com.azure:azure-storage-blob:$azureStorageVersion")
     implementation("com.opencsv:opencsv:5.4")
 
-    implementation("org.apache.avro:avro-mapred:$avroVersion")
-
-    val hadoopVersion: String by project
-    implementation("org.apache.hadoop:hadoop-common:$hadoopVersion")
-
     val slf4jVersion: String by project
     implementation("org.slf4j:slf4j-api:$slf4jVersion")
 
-    runtimeOnly("org.slf4j:slf4j-log4j12:$slf4jVersion")
-    runtimeOnly("org.apache.hadoop:hadoop-hdfs-client:$hadoopVersion")
+    val log4jVersion: String by project
+    runtimeOnly("org.apache.logging.log4j:log4j-slf4j-impl:$log4jVersion")
 
     val radarSchemasVersion: String by project
     testImplementation("org.radarbase:radar-schemas-commons:$radarSchemasVersion")
@@ -100,12 +95,8 @@ dependencies {
     dokkaHtmlPlugin("org.jetbrains.dokka:kotlin-as-java-plugin:$dokkaVersion")
 }
 
-tasks.withType<KotlinCompile> {
-    kotlinOptions {
-        jvmTarget = "11"
-        apiVersion = "1.4"
-        languageVersion = "1.4"
-    }
+application {
+    mainClass.set("org.radarbase.output.Application")
 }
 
 distributions {
@@ -118,63 +109,12 @@ distributions {
     }
 }
 
-tasks.startScripts {
-    classpath = classpath?.let { it + files("lib/PlaceHolderForPluginPath") }
-
-    doLast {
-        val windowsScriptFile = file(getWindowsScript())
-        val unixScriptFile = file(getUnixScript())
-        windowsScriptFile.writeText(windowsScriptFile.readText().replace("PlaceHolderForPluginPath", "radar-output-plugins\\*"))
-        unixScriptFile.writeText(unixScriptFile.readText().replace("PlaceHolderForPluginPath", "radar-output-plugins/*"))
-    }
-}
-
-val integrationTest by tasks.registering(Test::class) {
-    description = "Runs integration tests."
-    group = "verification"
-
-    testClassesDirs = sourceSets["integrationTest"].output.classesDirs
-    classpath = sourceSets["integrationTest"].runtimeClasspath
-    outputs.upToDateWhen { false }
-    shouldRunAfter("test")
-}
-
-tasks.withType<Test> {
-    useJUnitPlatform()
-    testLogging {
-        events("passed", "skipped", "failed")
-        showStandardStreams = true
-        setExceptionFormat(FULL)
-    }
-}
-
-dockerCompose {
-    waitForTcpPortsTimeout = Duration.ofSeconds(30)
-    environment["SERVICES_HOST"] = "localhost"
-    captureContainersOutputToFiles = project.file("build/container-logs")
-    isRequiredBy(integrationTest)
-}
-
-val check by tasks
-check.dependsOn(integrationTest)
-
-tasks.withType<Tar> {
-    compression = Compression.GZIP
-    archiveExtension.set("tar.gz")
-}
-
-tasks.register("downloadDependencies") {
-    doLast {
-        description = "Pre-downloads dependencies"
-        configurations.compileClasspath.get().files
-        configurations.runtimeClasspath.get().files
+tasks.withType<KotlinCompile> {
+    kotlinOptions {
+        jvmTarget = "11"
+        apiVersion = "1.4"
+        languageVersion = "1.4"
     }
-    outputs.upToDateWhen { false }
-}
-
-tasks.register<Copy>("copyDependencies") {
-    from(configurations.runtimeClasspath.get().files)
-    into("$buildDir/third-party/")
 }
 
 // custom tasks for creating source/javadoc jars
@@ -190,6 +130,11 @@ val dokkaJar by tasks.registering(Jar::class) {
     dependsOn(tasks.dokkaJavadoc)
 }
 
+tasks.withType<Tar> {
+    compression = Compression.GZIP
+    archiveExtension.set("tar.gz")
+}
+
 tasks.withType<Jar> {
     manifest {
         attributes(
@@ -199,6 +144,15 @@ tasks.withType<Jar> {
         )
     }
 
+tasks.startScripts {
+    classpath = classpath?.let { it + files("lib/PlaceHolderForPluginPath") }
+
+    doLast {
+        windowsScript.writeText(windowsScript.readText().replace("PlaceHolderForPluginPath", "radar-output-plugins\\*"))
+        unixScript.writeText(unixScript.readText().replace("PlaceHolderForPluginPath", "radar-output-plugins/*"))
+    }
+}
+
 publishing {
     publications {
         create<MavenPublication>("mavenJar") {
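The relocated `startScripts` block above is what splices `lib/radar-output-plugins/*` into the generated launch scripts' classpath. Installing a plugin such as the HDFS factory sketched in the README section is then a single copy, assuming the `/usr/local` install location from the README (JAR name hypothetical):

```shell
# hypothetical plugin JAR; any JAR in this directory lands on the start-script classpath
cp radar-output-hdfs-plugin.jar /usr/local/lib/radar-output-plugins/
```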
@@ -213,7 +167,7 @@ publishing {
             licenses {
                 license {
                     name.set("The Apache Software License, Version 2.0")
-                    url.set("http://www.apache.org/licenses/LICENSE-2.0.txt")
+                    url.set("https://www.apache.org/licenses/LICENSE-2.0.txt")
                     distribution.set("repo")
                 }
             }
@@ -271,6 +225,49 @@ nexusPublishing {
     }
 }
 
+val integrationTest by tasks.registering(Test::class) {
+    description = "Runs integration tests."
+    group = "verification"
+
+    testClassesDirs = sourceSets["integrationTest"].output.classesDirs
+    classpath = sourceSets["integrationTest"].runtimeClasspath
+    outputs.upToDateWhen { false }
+    shouldRunAfter("test")
+}
+
+dockerCompose {
+    waitForTcpPortsTimeout = Duration.ofSeconds(30)
+    environment["SERVICES_HOST"] = "localhost"
+    captureContainersOutputToFiles = project.file("build/container-logs")
+    isRequiredBy(integrationTest)
+}
+
+val check by tasks
+check.dependsOn(integrationTest)
+
+tasks.withType<Test> {
+    useJUnitPlatform()
+    testLogging {
+        events("passed", "skipped", "failed")
+        showStandardStreams = true
+        exceptionFormat = FULL
+    }
+}
+
+tasks.register("downloadDependencies") {
+    doLast {
+        description = "Pre-downloads dependencies"
+        configurations.compileClasspath.get().files
+        configurations.runtimeClasspath.get().files
+    }
+    outputs.upToDateWhen { false }
+}
+
+tasks.register<Copy>("copyDependencies") {
+    from(configurations.runtimeClasspath.get().files)
+    into("$buildDir/third-party/")
+}
+
 fun isNonStable(version: String): Boolean {
     val stableKeyword = listOf("RELEASE", "FINAL", "GA").any { version.toUpperCase().contains(it) }
     val regex = "^[0-9,.v-]+(-r)?$".toRegex()
@@ -285,5 +282,5 @@ tasks.named<DependencyUpdatesTask>("dependencyUpdates").configure {
 }
 
 tasks.wrapper {
-    gradleVersion = "7.0"
+    gradleVersion = "7.0.2"
 }

gradle.properties

Lines changed: 5 additions & 3 deletions
@@ -1,16 +1,18 @@
-kotlinVersion=1.4.32
+kotlinVersion=1.5.0
 kotlin.code.style=official
 
 avroVersion=1.10.2
+snappyVersion=1.1.8.4
 jacksonVersion=2.12.3
 hadoopVersion=3.3.0
 jCommanderVersion=1.81
 almworksVersion=1.1.2
-junitVersion=5.7.1
+junitVersion=5.7.2
 minioVersion=8.2.1
 jedisVersion=3.6.0
 slf4jVersion=1.7.30
-azureStorageVersion=12.11.0
+log4jVersion=2.14.1
+azureStorageVersion=12.11.1
 radarSchemasVersion=0.6.0
 mockitoVersion=3.8.0
 dokkaVersion=1.4.30
gradle/wrapper/gradle-wrapper.properties

Lines changed: 1 addition & 1 deletion
11
distributionBase=GRADLE_USER_HOME
22
distributionPath=wrapper/dists
3-
distributionUrl=https\://services.gradle.org/distributions/gradle-7.0-bin.zip
3+
distributionUrl=https\://services.gradle.org/distributions/gradle-7.0.2-bin.zip
44
zipStoreBase=GRADLE_USER_HOME
55
zipStorePath=wrapper/dists
