### Execution Modes

**Data Cooker ETL** provides a handful of different execution modes, batch and interactive, local and remote, in
different combinations.

Refer to the following matrix:

 Execution Mode               | Batch Script \[Dry\] | Interactive... | ...with AutoExec Script \[Dry\]
------------------------------|----------------------|----------------|---------------------------------
 On Spark Cluster             | -s \[-d\]            |                |
 Local                        | -l -s \[-d\]         | -R             | -R -s \[-d\]
 REPL Server On Spark Cluster |                      | -e             | -e -s \[-d\]
 REPL Server Local            |                      | -l -e          | -l -e -s \[-d\]
 REPL Client                  |                      | -r             | -r -s \[-d\]

Cells with command line keys indicate which keys to use to run Data Cooker ETL in the desired execution mode. Empty
cells indicate unsupported modes.
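
For example, a few invocations corresponding to rows of the matrix (the script file name is a placeholder, not taken
from this document):

```bash
# Local batch mode, dry run first to check syntax, then the real run
java -jar datacooker-etl-cli.jar -l -s script.tdl -d
java -jar datacooker-etl-cli.jar -l -s script.tdl

# Local interactive REPL with the same Script as AutoExec
java -jar datacooker-etl-cli.jar -R -s script.tdl
```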

### Command Line in General

To familiarize yourself with the CLI, invoke the artifact with `-h` as the lone argument:

```bash
java -jar datacooker-etl-cli.jar -h
```

If its output is similar to

```
usage: Data Cooker ETL (ver. 3.8.0)
 -h,--help                 Print full list of command line options and
                           exit
 -s,--script <arg>         TDL4 script file. Mandatory for batch modes
 -v,--variablesFile <arg>  Path to variables file, name=value pairs per
                           each line
 -V,--variables <arg>      Pass contents of variables file encoded as
                           Base64
 -l,--local                Run in local batch mode (cluster batch mode
                           otherwise)
 -d,--dry                  -l: Dry run (only check script syntax and
                           print errors to console, if found)
 -m,--driverMemory <arg>   -l: Driver memory, by default Spark uses 1g
 -u,--sparkUI              -l: Enable Spark UI, by default it is disabled
 -L,--localCores <arg>     -l: Set cores #, by default * (all cores)
 -R,--repl                 Run in local mode with interactive REPL
                           interface. Implies -l. -s is optional
 -r,--remoteRepl           Connect to a remote REPL server. -s is
                           optional
 -t,--history              -R, -r: Set history file location
 -i,--host <arg>           Use specified network address:
                           -e: to listen at (default is all)
                           -r: to connect to (in this case, mandatory
                           parameter)
 -e,--serveRepl            Start REPL server in local or cluster mode. -s
                           is optional
 -p,--port <arg>           -e, -r: Use specified port to listen at or
                           connect to. Default is 9595
```

then everything is OK, working as intended, and you can proceed to building your ETL processes.

To specify an ETL Script, use the `-s <path/to/script.tdl>` argument. To check just the ETL Script's syntax without
performing the actual process, use the `-d` switch for a Dry Run in any mode that supports `-s`. Any syntax errors
encountered will be reported to the console.

To specify values for Script variables, use either `-v <path/to/vars.properties>` to point to a file in Java
properties format, or encode that file's contents as Base64 and pass it via the `-V <Base64string>` argument.
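
A minimal sketch of the `-V` workflow, assuming GNU coreutils for the encoding (the variable names in the file are
illustrative, not taken from this document):

```shell
# Hypothetical variables file in Java properties format: name=value per line
cat > vars.properties <<'EOF'
input_path=hdfs:///data/events
run_date=2024-01-01
EOF

# Encode the file contents as a single-line Base64 string
VARS_B64=$(base64 < vars.properties | tr -d '\n')
echo "$VARS_B64"

# Either form passes the same variables to the CLI:
#   java -jar datacooker-etl-cli.jar -l -s script.tdl -v vars.properties
#   java -jar datacooker-etl-cli.jar -l -s script.tdl -V "$VARS_B64"
```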

### Local Execution

To run Data Cooker ETL in any of the Local modes, you need an executable artifact [built](BUILD.md) with the `local`
profile.

You must also use the `-l` switch. If you want to limit the number of CPU cores available to Spark, use the `-L`
argument. If you want to change the default memory limit of `1G`, use the `-m` argument. For example, `-l -L 4 -m 8G`.

If you want to watch the execution of a lengthy process in the Spark UI, use the `-u` switch to start it up. Otherwise,
no Spark UI will be started.

### On-Cluster Execution

If your environment matches the `EMR` profile (which targets EMR 6.9 with Java 11), you may take the
artifact [built](BUILD.md) with that profile, pass it to the cluster with your favorite Spark submitter, and invoke it
with the `-s` and `-v` or `-V` command line switches. The entry class name is `io.github.pastorgl.datacooker.cli.Main`.
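
For illustration, a hypothetical `spark-submit` invocation (the deploy mode, jar location, and S3 paths are
assumptions, not taken from this document):

```bash
spark-submit \
  --class io.github.pastorgl.datacooker.cli.Main \
  --deploy-mode cluster \
  s3://my-bucket/jars/datacooker-etl-cli.jar \
  -s s3://my-bucket/scripts/process.tdl \
  -v vars.properties
```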

Otherwise, you may first need to tinker with the [commons](./commons/pom.xml) and [cli](./cli/pom.xml) project
manifests and adjust library versions to match your environment. Because no two production Spark setups are exactly
the same, that will be necessary in most cases.

We recommend wrapping submitter calls with some scripting and automating execution with a CI/CD service.
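
As a sketch of such a wrapper, a small shell function that assembles the submitter call (the `spark-submit` defaults
and jar name are assumptions; the sketch only prints the command instead of executing it):

```shell
# Build the submitter command line for a given TDL4 script and an optional
# variables file. SUBMIT_CMD can be overridden by the CI/CD environment.
submit_etl() {
    script="$1"
    vars="${2:-}"
    cmd="${SUBMIT_CMD:-spark-submit --class io.github.pastorgl.datacooker.cli.Main datacooker-etl-cli.jar}"
    set -- -s "$script"
    if [ -n "$vars" ]; then
        set -- "$@" -v "$vars"
    fi
    # A real wrapper would exec the command here; the sketch just prints it
    echo "$cmd $*"
}

submit_etl process.tdl vars.properties
```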

### REPL Modes

In addition to the standard batch modes, which just execute a single TDL4 Script and then exit, there are interactive
modes with a REPL, useful if you want to interactively debug your processes.

To run in the Local REPL mode, use the `-R` switch. If `-s` is specified, that Script becomes the AutoExec Script,
which is executed before the REPL prompt is displayed.

To run just a REPL Server (either Local with `-l`, or On-Cluster otherwise), use the `-e` switch. If `-s` is specified,
that Script becomes the AutoExec Script, which is executed before the REST service starts. `-i` and `-p` control which
interface and port to use for REST. The default configuration is `0.0.0.0:9595`.

To run a REPL Client only, use `-r`. You need to specify which Server to connect to using `-i` and `-p`. The default
configuration is `localhost:9595`.
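
For example, a hypothetical Server and Client pair (the host name is an assumption):

```bash
# On the server machine: local REPL Server on the default port 9595
java -jar datacooker-etl-cli.jar -l -e

# On the client machine: connect to that Server
java -jar datacooker-etl-cli.jar -r -i repl-server.example.com -p 9595
```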

Please note that currently the protocol between Server and Client is plain HTTP without security and authentication.
If you intend to use it in a production environment, you should wrap it in a secure tunnel and use some sort of
authenticating proxy.

By default, the REPL shell stores command history in your home directory, but if you want to redirect a session's
history to a different location, use the `-t <path/to/file>` switch.

After starting up, you may see some Spark logs, and then the following prompt:

```
=============================================
Data Cooker ETL REPL interactive (ver. 3.8.0)
Type TDL4 statements to be executed in the REPL context in order of input, or a command.
Statement must always end with a semicolon. If not, it'll be continued on a next line.
If you want to type several statements at once on several lines, end each line with \
Type \QUIT; to end session and \HELP; for list of all REPL commands and shortcuts
datacooker> _
```

Follow the instructions and explore the available `\COMMAND`s with the `\HELP COMMAND;` command.

You may freely execute any valid TDL4 statements, view your data, load scripts from files, and even record them
directly in the REPL.

You may also use some familiar shell shortcuts (like reverse search with `Ctrl+R` and automatic last-command expansion
with `!n`), as well as contextual auto-completion of TDL4 statements with the `TAB` key.

Regarding Spark logs, in the REPL shell they're automatically set to `WARN` level. If you want to switch back to the
default `INFO` level, use

```sql
OPTIONS @log_level='INFO';
```