# Restructure-HDFS-topic

[![Build Status](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic.svg?branch=master)](https://travis-ci.org/RADAR-base/Restructure-HDFS-topic)
Data streamed to HDFS using the [RADAR HDFS sink connector](https://github.com/RADAR-CNS/RADAR-HDFS-Sink-Connector) is written to files based on sensor only. This package can transform that output into a local directory structure of the form `userId/topic/date_hour.csv`. The date and hour are extracted from the `time` field of each record and formatted in UTC time. This package is included in the [RADAR-Docker](https://github.com/RADAR-CNS/RADAR-Docker) repository, in the `dcompose/radar-cp-hadoop-stack/hdfs_restructure.sh` script.
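As an illustration, the resulting layout groups files per user and topic; the user ID, topic name, and timestamps below are made up:

```shell
# Hypothetical example of the userId/topic/date_hour.csv output layout
mkdir -p output/user1/android_phone_acceleration
touch output/user1/android_phone_acceleration/20180101_12.csv
touch output/user1/android_phone_acceleration/20180101_13.csv
find output -type f | sort
# output/user1/android_phone_acceleration/20180101_12.csv
# output/user1/android_phone_acceleration/20180101_13.csv
```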

## Docker usage

This package is available as the Docker image [`radarbase/radar-hdfs-restructure`](https://hub.docker.com/r/radarbase/radar-hdfs-restructure). The entrypoint of the image is this application, so in all of the commands listed below, replace `radar-hdfs-restructure` with, for example:
```shell
docker run --rm -t --network hadoop -v "$PWD/output:/output" radarbase/radar-hdfs-restructure:0.4.0 -u hdfs://hdfs -o /output /myTopic
```
This assumes that your Docker cluster is running in the `hadoop` network and that your output directory should be `./output`.

## Local build

This package requires at least Java JDK 8. Build the distribution with

```shell
./gradlew build
```

and install the package into `/usr/local` with, for example,
```shell
sudo mkdir -p /usr/local
sudo tar -xzf build/distributions/radar-hdfs-restructure-0.4.0.tar.gz -C /usr/local --strip-components=1
```

Now the `radar-hdfs-restructure` command should be available.

## Command line usage

When the application is installed, it can be used as follows:

```shell
radar-hdfs-restructure --hdfs-uri <webhdfs_url> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
```
or you can use the short form:
```shell
radar-hdfs-restructure -u <webhdfs_url> -o <output_folder> <input_path_1> [<input_path_2> ...]
```

To display the usage and all available options, use the help option:
```shell
radar-hdfs-restructure --help
```
Note that the options preceded by `*` in the help output are required to run the app. Also note that multiple input paths can be given, e.g. `/topicAndroidNew/topic1 /topicAndroidNew/topic2 ...`. At least one input path is required.

By default, this will output the data in CSV format. If JSON format is preferred, use the following instead:
```shell
radar-hdfs-restructure --format json --hdfs-uri <webhdfs_url> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
```

Another option is to output the data in compressed form. All files will get the `gz` suffix and can be decompressed with a GZIP decoder. Note that for a very small number of records, this may actually increase the file size.
```shell
radar-hdfs-restructure --compression gzip --hdfs-uri <webhdfs_url> --output-directory <output_folder> <input_path_1> [<input_path_2> ...]
```

By default, records are not deduplicated after writing. To enable this behaviour, specify the option `--deduplicate` (or `-d`). This is disabled by default because of an issue with Biovotion data; please see [issue #16](https://github.com/RADAR-base/Restructure-HDFS-topic/issues/16) before enabling it.