Data streamed by a Kafka Connector is converted to a RADAR-base oriented output directory, organized by project, user, and collection date.

It supports data written by the [RADAR HDFS sink connector](https://github.com/RADAR-base/RADAR-HDFS-Sink-Connector), which writes files based on topic name only. This package transforms that output into a local directory structure as follows: `projectId/userId/topic/date_hour.csv`. The date and hour are extracted from the `time` field of each record and formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-base/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/bin/hdfs-restructure` script.
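For instance, a record from a hypothetical project `radar-test`, user `user-1`, on topic `android_phone_battery_level`, with a `time` falling in the 13:00 UTC hour of 1 January 2020, would land in a path shaped like the following (all names and the exact `date_hour` rendering are illustrative):

```
radar-test/user-1/android_phone_battery_level/20200101_13.csv
```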
## Upgrade instructions
When upgrading to version 1.0.0 from version 0.6.0, please follow these instructions:
- This package now relies on Redis for locking and offset management. Please install Redis or use the docker-compose.yml file to start it.
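A minimal sketch of such a Redis service with Docker Compose (the image tag and file layout are assumptions, not taken from this repository's docker-compose.yml):

```yaml
# assumption: any recent Redis image works; pin the tag as appropriate
version: "3"
services:
  redis:
    image: redis:5-alpine
    ports:
      - "6379:6379"  # matches the redis://localhost:6379 URI used below
```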
- Write a configuration file `restructure.yml` to match the settings used with 0.6.0.
- HDFS settings have moved to `source`. Specify all name nodes in the `nameNodes` property. The `name` property is no longer used.
```yaml
source:
  type: hdfs
  hdfs:
    nameNodes: [hdfs-namenode]
```
- Add a `redis` block:
```yaml
redis:
  uri: redis://localhost:6379
```
- Offset accounting will automatically be migrated from file-based storage to a Redis entry as radar-output processes each topic. Please do not remove the offsets directory until it is empty.
- Storage settings have moved to the `target` block. Using a local output directory:
```yaml
target:
  type: local
  local:
    # User ID to write data as
    userId: 123
    # Group ID to write data as
    groupId: 123
```
With the `S3StorageDriver`, use the following configuration instead:
```yaml
target:
  type: s3
  s3:
    endpoint: https://my-region.s3.aws.amazon.com  # or http://localhost:9000 for local minio
    accessToken: ABA...
    secretKey: CSD...
    bucket: myBucketName
```
When upgrading to version 0.6.0 from version 0.5.x or earlier, please follow these instructions:
- Write configuration file `restructure.yml` to match command-line settings used with 0.5.x.
- If needed, move all entries of `offsets.csv` to their per-topic file in `offsets/<topic>.csv`. First go to the output directory, then run the `bin/migrate-offsets-to-0.6.0.sh` script.
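A sketch of those two steps (both paths are assumptions; use your actual output directory and installation prefix):

```shell
# assumption: output directory and install prefix; adjust to your setup
cd /var/lib/radar-output
/usr/local/bin/migrate-offsets-to-0.6.0.sh
```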
## Docker usage
This package is available as the Docker image [`radarbase/radar-output-restructure`](https://hub.docker.com/r/radarbase/radar-output-restructure). The entrypoint of the image is the current application, so in all the commands listed in this usage section, replace `radar-output-restructure` with a `docker run` invocation along these lines (the mount point and version tag are illustrative):
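```shell
# sketch: mount point and version tag are illustrative
docker run --rm -t \
  -v "$PWD/output:/output" \
  radarbase/radar-output-restructure:1.0.0 \
  --help
```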
To display the usage and all available options, use the help option as follows:
```shell
radar-output-restructure --help
```
Note that the options preceded by `*` in the above output are required to run the app. Also note that there can be multiple input paths from which to read files, e.g. `/topicAndroidNew/topic1 /topicAndroidNew/topic2 ...`. Provide at least one input path.
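For instance, a run over two topics might look like the following (the output option name is an assumption; verify the exact flags with `--help`):

```shell
# assumption: an output-directory option exists; check --help for its exact name
radar-output-restructure --output-directory /output \
  /topicAndroidNew/topic1 /topicAndroidNew/topic2
```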
Each argument, as well as much more, can be supplied in a config file. The default name of the config file is `restructure.yml`. Please refer to `restructure.yml` in the current directory for all available options. An alternative file can be specified with the `-F` flag.
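For example (the config file path is hypothetical):

```shell
# hypothetical config path, passed with the -F flag described above
radar-output-restructure -F /etc/radar-output/restructure.yml
```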
### File Format
By default, this will output the data in CSV format. If JSON format is preferred, it can be selected in the configuration file instead.
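A sketch of that setting; the `format` block layout is an assumption, mirroring the `type` convention of the `source` and `target` blocks:

```yaml
# assumption: the format block follows the same type convention as source/target
format:
  type: json
```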
By default, records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. It is disabled by default because of an issue with Biovotion data; please see [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it. Deduplication can also be enabled or disabled per topic using the config file. If lines should be deduplicated using a subset of fields, e.g. only `sourceId` and `time` define a unique record and only the last record with duplicate values should be kept, then specify `topics: <topicName>: deduplication: distinctFields: [key.sourceId, value.time]`, as in the sketch below.
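Spelled out as a config block, that per-topic setting looks like this (the topic name is hypothetical):

```yaml
topics:
  android_phone_battery_level:  # hypothetical topic name
    deduplication:
      distinctFields: [key.sourceId, value.time]
```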
Another option is to output the data in compressed form. All files will get the `gz` suffix and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size. Zip compression is also available.
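In the config file, compression would be selected with a block like the following (a sketch; the `type` values are assumed from the gzip and zip options described above):

```yaml
# assumption: type values correspond to the gzip/zip options named above
compression:
  type: gzip  # or: zip
```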
This package assumes a running Redis service. See the example `restructure.yml` for configuration options.
### Source and target
The `source` and `target` properties contain resource descriptions. The source can have two types, `hdfs` and `s3`:
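A sketch of the two variants (the `hdfs` block repeats the upgrade example above; the `s3` keys are an assumption, mirroring the `target` s3 block):

```yaml
# HDFS source, as in the upgrade example above
source:
  type: hdfs
  hdfs:
    nameNodes: [hdfs-namenode]

# S3 source; keys assumed to mirror the target s3 block
# source:
#   type: s3
#   s3:
#     endpoint: http://localhost:9000
#     accessToken: ABA...
#     secretKey: CSD...
#     bucket: myBucketName
```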
The `local` target takes the following properties:
```yaml
target:
  type: local
  local:
    userId: 1000  # write as regular user, use -1 to use current user (default).
    groupId: 100  # write as regular group, use -1 to use current group (default).
```
### Service
This package requires at least Java JDK 8. Build the distribution with Gradle.
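A sketch of the build step, assuming the standard Gradle application plugin (its `distTar` task produces the distribution tarball referenced below; the task name is an assumption):

```shell
# assumption: the Gradle application plugin's distTar task builds
# build/distributions/radar-output-restructure-1.0.0.tar.gz
./gradlew distTar
```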
Then install the package into `/usr/local` with, for example:
```shell
sudo mkdir -p /usr/local
sudo tar -xzf build/distributions/radar-output-restructure-1.0.0.tar.gz -C /usr/local --strip-components=1
```
Now the `radar-output-restructure` command should be available.
### Extending the connector
To implement alternative storage paths, storage drivers or storage formats, put your custom JAR in `$APP_DIR/lib/radar-output-plugins`. To load them, use the following options:
| Parameter | Base class | Description | Default |
| --------- | ---------- | ----------- | ------- |
| `compression: factory: ...` | `org.radarbase.output.compression.CompressionFactory` | Factory class to use for data compression. | `CompressionFactory` |
The respective `<type>: properties: {}` configuration parameters can be used to provide custom configuration of the factory. This configuration will be passed to the `Plugin#init(Map<String, String>)` method.
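For example, a custom compression factory placed in the plugin directory could be configured as follows (the class name and property are hypothetical):

```yaml
compression:
  # hypothetical plugin class loaded from $APP_DIR/lib/radar-output-plugins
  factory: org.example.MyCompressionFactory
  # passed verbatim to Plugin#init(Map<String, String>)
  properties:
    level: "9"
```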