Commit ff3f989

Update EXECUTE document

1 parent 1274311 commit ff3f989

File tree: 3 files changed (+111, -41 lines)


BUILD.md (6 additions, 4 deletions)

@@ -1,17 +1,19 @@
 ### How to Build
 
-To build Data Cooker ETL, you need Java 11 and Apache Maven. Minimum supported version of Maven is enforced in the [project file](./pom.xml), so please look into enforcer plugin section. For Java, [Amazon's Corretto](https://corretto.aws/) is the preferred distribution.
+To build the Data Cooker ETL executable Fat JAR artifact, you need Java 11 and Apache Maven.
 
-There are two profiles to target AWS EMR production environment (`EMR` — selected by default) and for local testing of ETL processes (`local`), so you have to call
+The minimum supported version of Maven is enforced in the [project file](./pom.xml), so please look into the enforcer plugin section. For Java, [Amazon's Corretto](https://corretto.aws/) is the preferred distribution.
+
+There are two profiles, one targeting the AWS EMR production environment (`EMR` — selected by default) and one for local debugging of ETL processes (`local`), so you have to call
 ```bash
 mvn clean package
 ```
 or
 ```bash
 mvn -Plocal clean package
 ```
-to build a shaded executable 'Fat JAR' artifact, [datacooker-etl-cli.jar](./cli/target/datacooker-etl-cli.jar).
+to build the desired flavor of [datacooker-etl-cli.jar](./cli/target/datacooker-etl-cli.jar).
 
-Currently supported version of EMR is 6.9. For local testing, Ubuntu 22.04 is recommended (either native or inside WSL).
+The currently supported version of EMR is 6.9. For local debugging, Ubuntu 22.04 is recommended (either native or inside WSL).
 
 As well as executable artifact, modular documentation is automatically built from the modules' metadata at [docs](./cli/docs/) directory, in both HTML ([single-file](./cli/docs/merged.html) and [linked files](./cli/docs/index.html)) and [PDF](./cli/docs/merged.pdf) formats.

EXECUTE.md (103 additions, 35 deletions)

@@ -1,70 +1,138 @@
-### Local Execution
+### Execution Modes
+
+**Data Cooker ETL** provides a number of different execution modes, batch and interactive, local and remote, in
+different combinations.
+
+Refer to the following matrix:
+
+Execution Mode | Batch Script \[Dry\] | Interactive... | ...with AutoExec Script \[Dry\]
+------------------------------|----------------------|----------------|---------------------------------
+On Spark Cluster | -s \[-d\] | |
+Local | -l -s \[-d\] | -R | -R -s \[-d\]
+REPL Server On Spark Cluster | | -e | -e -s \[-d\]
+REPL Server Local | | -l -e | -l -e -s \[-d\]
+REPL Client | | -r | -r -s \[-d\]
 
-To locally test Data Cooker ETL, you need an executable artifact [built](BUILD.md) with `local` profile.
+Cells with command line keys indicate which keys to use to run Data Cooker ETL in the desired execution mode. Empty
+cells indicate unsupported modes.
+
+### Command Line in General
+
+To familiarize yourself with the CLI command line, just invoke the artifact with `-h` as the lone argument:
 
-First, invoke it with `--help` argument to get a list of options:
 ```bash
-java -jar datacooker-etl-cli.jar --help
+java -jar datacooker-etl-cli.jar -h
 ```
 
 If its output is similar to
+
 ```
-usage: Data Cooker ETL
- -h,--help                Print a list of command line options and exit
- -s,--script <arg>        TDL4 script file
- -d,--dry                 Dry run: only check script syntax and print
-                          errors to console, if found
+usage: Data Cooker ETL (ver. 3.8.0)
+ -h,--help                Print full list of command line options and
+                          exit
+ -s,--script <arg>        TDL4 script file. Mandatory for batch modes
  -v,--variablesFile <arg> Path to variables file, name=value pairs per
                           each line
  -V,--variables <arg>     Pass contents of variables file encoded as
                           Base64
- -l,--local               Run in local mode (its options have no effect
+ -l,--local               Run in local batch mode (cluster batch mode
                           otherwise)
+ -d,--dry                 -l: Dry run (only check script syntax and
+                          print errors to console, if found)
 -m,--driverMemory <arg>  -l: Driver memory, by default Spark uses 1g
 -u,--sparkUI             -l: Enable Spark UI, by default it is disabled
 -L,--localCores <arg>    -l: Set cores #, by default * (all cores)
 -R,--repl                Run in local mode with interactive REPL
-                          interface. -s is optional
- -i,--history <arg>       -R: Set history file location
-```
-then everything is OK, working as intended, and you could proceed to test your ETLs (you may safely ignore Spark warnings, if there are any).
+                          interface. Implies -l. -s is optional
+ -r,--remoteRepl          Connect to a remote REPL server. -s is optional
+ -t,--history <arg>       -R, -r: Set history file location
+ -i,--host <arg>          Use specified network address:
+                          -e: to listen at (default is all)
+                          -r: to connect to (in this case, mandatory
+                          parameter)
+ -e,--serveRepl           Start REPL server in local or cluster mode. -s
+                          is optional
+ -p,--port <arg>          -e, -r: Use specified port to listen at or
+                          connect to. Default is 9595
+```
+
+then everything is OK, working as intended, and you can proceed to building your ETL processes.
+
+To specify an ETL Script, use the `-s <path/to/script.tdl>` argument. To check just the ETL Script's syntax without performing
+the actual process, use the `-d` switch for a Dry Run in any mode that supports `-s`. If any syntax error is encountered,
+it'll be reported to the console.
+
+To specify values for script variables, either use `-v <path/to/vars.properties>` to point to a file in Java
+properties format, or encode that file's contents as Base64 and pass them via the `-V <Base64string>` argument.
+
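The Base64 form of the variables file can be produced with standard tools. A minimal sketch follows; the variable names and file name are made up for illustration, not taken from this repository:

```bash
# Create a variables file in Java properties format (names are hypothetical)
cat > vars.properties <<'EOF'
source_path=/data/input
target_path=/data/output
EOF

# Encode the file contents as a single Base64 line, suitable for -V
VARS=$(base64 < vars.properties | tr -d '\n')

# It would then be passed as: java -jar datacooker-etl-cli.jar ... -V "$VARS"
echo "$VARS"
```

Stripping the newlines keeps the encoded value safe to pass as a single command line argument.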
+### Local Execution
+
+To run Data Cooker ETL in any of the Local modes, you need an executable artifact [built](BUILD.md) with the `local`
+profile.
 
-To specify ETL script, use `--script <path/to/script.tdl>` argument. To check just ETL script syntax without performing the actual process, use `--dry` switch.
+You must also use the `-l` switch. If you want to limit the number of CPU cores available to Spark, use the `-L`
+argument. If you want to change the default memory limit of `1G`, use the `-m` argument. For example, `-l -L 4 -m 8G`.
 
-To specify values for script variables, use either `--variablesFile <path/to/vars.properties>` to point to file in Java properties format, or encode that file contents as Base64, and specify it to `--variables <Base64string>` argument.
+If you want to watch the execution of a lengthy process in the Spark UI, use the `-u` switch to start it up. Otherwise, no
+Spark UI will be started.
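Putting those switches together, a local batch run might look like this; the script and variables file names are placeholders, not files from this repository:

```bash
# Local batch run: 4 cores, an 8G driver, Spark UI enabled
java -jar datacooker-etl-cli.jar -l -L 4 -m 8G -u -s script.tdl -v vars.properties

# The same script, checked for syntax only (Dry Run)
java -jar datacooker-etl-cli.jar -l -s script.tdl -d
```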
 
-You must also use `--local` switch. If you want to limit number of CPU cores available to Spark, use `--localCores` argument. If you want to change default memory limit of 1G, use `--driverMemory` argument. For example, `-l -L 4 -m 8G`.
+### On-Cluster Execution
 
-### REPL Mode
+If your environment matches the `EMR` profile (which targets EMR 6.9 with Java 11), you may take the
+artifact [built](BUILD.md) with that profile, use your favorite Spark submitter to pass it to the cluster, and invoke it
+with the `-s` and `-v` or `-V` command line switches. The entry class name is `io.github.pastorgl.datacooker.cli.Main`.
 
-In addition to standard local mode, which just executes a single TDL4 script and then exits, there is a local REPL mode, useful if you want to interactively debug some scripts.
+Otherwise, you may first need to tinker with the [commons](./commons/pom.xml) and [cli](./cli/pom.xml) project manifests and
+adjust library versions to match your environment. Because no two production Spark setups are exactly alike,
+that will be necessary in most cases.
 
-To run in REPL mode, use `--repl` switch. By default, it stores command history in your home directory, but if you want to redirect some session history to a different location, use `--history <path/to/file>` switch.
+We recommend wrapping submitter calls with some scripting and automating execution with a CI/CD service.
+
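Such scripting can be as simple as a function that assembles the submitter command line. A sketch assuming `spark-submit` is the submitter of choice; the JAR path and file names are hypothetical, and the function only builds and prints the command rather than running it:

```bash
# Build (but do not run) a spark-submit command line for the ETL artifact.
# The JAR path and file names are placeholder assumptions.
build_submit_cmd() {
  script="$1"
  vars="$2"
  printf 'spark-submit --class io.github.pastorgl.datacooker.cli.Main datacooker-etl-cli.jar -s %s -v %s\n' "$script" "$vars"
}

# A CI/CD job would eval or exec the produced command:
build_submit_cmd process.tdl prod.properties
```

Separating command construction from execution makes the wrapper easy to test and to log from a CI/CD pipeline.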
+### REPL Modes
+
+In addition to the standard batch modes, which just execute a single TDL4 Script and then exit, there are interactive modes
+with a REPL, useful if you want to interactively debug your processes.
+
+To run in the Local REPL mode, use the `-R` switch. If `-s` is specified, that Script becomes the AutoExec Script, which will be
+executed before the REPL prompt is displayed.
+
+To run just a REPL Server (either Local with `-l`, or On-Cluster otherwise), use the `-e` switch. If `-s` is specified,
+that Script becomes the AutoExec Script, which will be executed before the REST service starts. `-i` and `-p` control which
+interface and port to use for REST. By default, the configuration is `0.0.0.0:9595`.
+
+To run a REPL Client only, use `-r`. You need to specify which Server to connect to using `-i` and `-p`. By default,
+the configuration is `localhost:9595`.
+
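For instance, a Local REPL Server and a Client connecting to it from another terminal could be started like this, spelling out the defaults explicitly:

```bash
# Terminal 1: local REPL server on the default interface and port (0.0.0.0:9595)
java -jar datacooker-etl-cli.jar -l -e

# Terminal 2: client; for -r the host (-i) is a mandatory parameter
java -jar datacooker-etl-cli.jar -r -i localhost -p 9595
```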
+Please note that currently the protocol between Server and Client is plain HTTP without security or authentication. If you
+intend to use it in a production environment, you should wrap it in a secure tunnel and use some sort of
+authenticating proxy.
+
+By default, the REPL shell stores command history in your home directory, but if you want to redirect a session's history
+to a different location, use the `-t <path/to/file>` switch.
+
+After starting up, you may see some Spark logs, and then the following prompt:
 
-After starting up, you'll see some Spark logs, and then the following prompt
 ```
-================================
-Data Cooker ETL REPL interactive
+=============================================
+Data Cooker ETL REPL interactive (ver. 3.8.0)
 Type TDL4 statements to be executed in the REPL context in order of input, or a command.
 Statement must always end with a semicolon. If not, it'll be continued on a next line.
 If you want to type several statements at once on several lines, end each line with \
-Type \QUIT; to end session and \HELP; for list of all REPL commands
+Type \QUIT; to end session and \HELP; for list of all REPL commands and shortcuts
 datacooker> _
 ```
 
-Follow the instructions and explore `\COMMANDs` with `\HELP \COMMAND;` command.
+Follow the instructions and explore the available `\COMMAND`s with the `\HELP COMMAND;` command.
+
+You may freely execute any valid TDL4 statements, view your data, load scripts from files, and even record them
+directly in the REPL.
 
-After that, you may freely execute any TDL4 statements, load scripts from files, and even record them directly in REPL.
+Also, you may use some familiar shell shortcuts (like reverse search with `Ctrl+R` and automatic last-command expansion
+with `!n`), as well as contextual auto-completion of TDL4 statements with the `TAB` key.
+
+Regarding Spark logs: in the REPL shell they're automatically set to the `WARN` level. If you want to switch back to the default
+`INFO` level, use
 
-Regarding Spark logs, they're automatically set to `WARN` level. If you want to switch to default `INFO`, use
 ```sql
 OPTIONS @log_level='INFO';
 ```
-
-### Cluster Execution
-
-If your environment matches with `EMR` profile (which is targeted to EMR 6.9 with Java 11), you may take artifact [built](BUILD.md) with that profile, and use your favorite Spark submitter to pass it to cluster, and invoke with `--script` and `-v` or `-V` command line switches. Entry class name is `io.github.pastorgl.datacooker.cli.Main`.
-
-Otherwise, you may first need to tinker with [commons](./commons/pom.xml) and [cli](./cli/pom.xml) project manifests and adjust library versions to match your environment. Because there are no exactly same Spark setups in the production, that would be necessary in most cases.
-
-We recommend to wrap submitter calls with some scripting and automate execution with CI/CD service.

cli/src/main/java/io/github/pastorgl/datacooker/cli/Configuration.java (2 additions, 2 deletions)

@@ -28,9 +28,9 @@ public Configuration() {
         addOption("u", "sparkUI", false, "-l: Enable Spark UI, by default it is disabled");
         addOption("L", "localCores", true, "-l: Set cores #, by default * (all cores)");
         addOption("R", "repl", false, "Run in local mode with interactive REPL interface. Implies -l. -s is optional");
-        addOption("i", "history", true, "-R, -r: Set history file location");
+        addOption("r", "remoteRepl", false, "Connect to a remote REPL server. -s is optional");
+        addOption("t", "history", true, "-R, -r: Set history file location");
         addOption("e", "serveRepl", false, "Start REPL server in local or cluster mode. -s is optional");
-        addOption("r", "remoteRepl", false, "Connect to a remote REPL server");
         addOption("i", "host", true, "Use specified network address:\n" +
                 "-e: to listen at (default is all)\n" +
                 "-r: to connect to (in this case, mandatory parameter)");
