Data streamed to HDFS using the [RADAR HDFS sink connector](https://github.com/RADAR-base/RADAR-HDFS-Sink-Connector) is written to files based on sensor only. This package can transform that output to a local directory structure as follows: `userId/topic/date_hour.csv`. The date and hour are extracted from the `time` field of each record and are formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-base/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh` script.
_Note_: when upgrading to version 0.6.0, please follow these instructions:
- Write a configuration file `restructure.yml` that matches the settings used with 0.5.x.
- If needed, move all entries of `offsets.csv` to their per-topic file in `offsets/<topic>.csv`. First go to the output directory, then run the `bin/migrate-offsets-to-0.6.0.sh` script.
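A sketch of that migration, assuming the package was installed into `/usr/local` as described under Local build; the output path here is illustrative:

```shell
# Illustrative output directory; use your actual restructure output path.
cd /path/to/output
# Run the migration script from the installed package.
/usr/local/bin/migrate-offsets-to-0.6.0.sh
```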
## Docker usage
This package is available as the docker image [`radarbase/radar-hdfs-restructure`](https://hub.docker.com/r/radarbase/radar-hdfs-restructure). The entrypoint of the image is the application itself, so in all of the commands listed under usage, `radar-hdfs-restructure` can be replaced with a `docker run` invocation of this image, for example when your docker cluster is running in the `hadoop` network and your output directory should be `./output`.
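A sketch of such an invocation for that setup; the image tag and any application flags are assumptions, so check `radar-hdfs-restructure --help` for the exact options:

```shell
# Run the image on the hadoop network, mounting ./output as the output directory.
docker run --rm --network hadoop -v "$PWD/output:/output" \
    radarbase/radar-hdfs-restructure:0.6.0
```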
## Command line usage
When the application is installed, it can be used as follows:
```shell
radar-hdfs-restructure --help
```
Note that the options preceded by `*` in the above output are required to run the app. Also note that multiple input paths can be given, e.g. `/topicAndroidNew/topic1 /topicAndroidNew/topic2 ...`. At least one input path is required.
Each argument, as well as many more options, can be supplied in a config file. The default name of the config file is `restructure.yml`; please refer to `restructure.yml` in the current directory for all available options. An alternative file can be specified with the `-F` flag.
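For example, to run with an alternative configuration file (the file name here is illustrative):

```shell
# Load options from a custom config file instead of the default restructure.yml.
radar-hdfs-restructure -F my-restructure.yml
```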
### File Format
By default, the data is output in CSV format. If JSON format is preferred, use the following instead:
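A sketch of selecting JSON output; the `--format` flag name is an assumption, so verify it against `radar-hdfs-restructure --help`:

```shell
# Write output files as JSON rather than the default CSV.
radar-hdfs-restructure --format json /topicAndroidNew/topic1
```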
By default, records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` or `-d`. It is disabled by default because of an issue with Biovotion data; please see [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it. Deduplication can also be enabled or disabled per topic using the config file. If lines should be deduplicated using a subset of fields, e.g. if only `sourceId` and `time` define a unique record and only the last record with duplicate values should be kept, then specify `topics: <topicName>: deduplication: distinctFields: [key.sourceId, value.time]`.
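In `restructure.yml`, that per-topic setting maps to nested YAML; the topic name below is illustrative:

```yaml
topics:
  android_phone_acceleration:   # illustrative topic name
    deduplication:
      # Only these fields determine record uniqueness;
      # the last record with duplicate values is kept.
      distinctFields: [key.sourceId, value.time]
```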
### Compression
Another option is to output the data in compressed form. All files will get the `gz` suffix, and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size. Zip compression is also available.
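A sketch of enabling compression from the command line; the `--compression` flag name and its values are assumptions, so check `radar-hdfs-restructure --help` for the exact option:

```shell
# Write gzip-compressed output files (each file gets a .gz suffix).
radar-hdfs-restructure --compression gzip /topicAndroidNew/topic1
```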
### Storage
There are two storage drivers implemented: `org.radarbase.hdfs.storage.LocalStorageDriver` for an output directory on the local file system, and `org.radarbase.hdfs.storage.S3StorageDriver` for storage on an object store.

With `LocalStorageDriver`, the output user ID and group ID can be set with the `-p local-uid=123` and `-p local-gid=12` properties.

With `S3StorageDriver`, ensure that the environment variables contain the authorized AWS keys that allow the service to list, download and upload files to the respective bucket.
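For example, the standard AWS environment variables can be exported before starting the service; the values below are placeholders:

```shell
# Placeholder credentials; substitute the real keys for your bucket.
export AWS_ACCESS_KEY_ID="AKIA_EXAMPLE_KEY_ID"
export AWS_SECRET_ACCESS_KEY="example-secret-access-key"
```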
### Service
To run the output generator as a service that will regularly poll the HDFS directory, add the `--service` flag and optionally the `--interval` flag to adjust the polling interval or use the corresponding configuration file parameters.
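For instance, a service invocation might look like the following; the interval value and its unit are illustrative, so consult `radar-hdfs-restructure --help` for the exact semantics:

```shell
# Poll the HDFS input path repeatedly instead of running once.
radar-hdfs-restructure --service --interval 3600 /topicAndroidNew/topic1
```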
## Local build
This package requires at least Java JDK 8. Build the distribution with
```shell
./gradlew build
```
and install the package into `/usr/local` with for example
```shell
sudo mkdir -p /usr/local
sudo tar -xzf build/distributions/radar-hdfs-restructure-0.6.0.tar.gz -C /usr/local --strip-components=1
```
Now the `radar-hdfs-restructure` command should be available.
### Extending the connector
To implement alternative storage paths, storage drivers or storage formats, put your custom JAR in `$APP_DIR/lib/radar-hdfs-plugins`. To load them, use the following options:

| Parameter | Base class | Description | Default |
|-----------|------------|-------------|---------|
| `compression: factory: ...` | `org.radarbase.hdfs.compression.CompressionFactory` | Factory class to use for data compression. | `CompressionFactory` |
The respective `<type>: properties: {}` configuration parameters can be used to provide custom configuration of the factory. This configuration will be passed to the `Plugin#init(Map<String, String>)` method.
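As a sketch, a custom compression factory and its properties might be configured as follows; the class and property names are illustrative, not part of the package:

```yaml
compression:
  factory: org.example.MyCompressionFactory   # illustrative custom plugin class
  properties:
    # Passed to the factory's Plugin#init(Map<String, String>) method.
    arg1: value1
```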