@@ -46,7 +46,7 @@ E.g. with [pyenv](https://github.com/pyenv/pyenv) and [pyenv-virtualenv](https:/
pyenv install 3.7.13
pyenv virtualenv 3.7.13 ldbc_datagen_tools
pyenv local ldbc_datagen_tools
- pip install -U pip
+ pip install -U pip
pip install ./tools
```
### Running locally
@@ -80,7 +80,8 @@ Once you have Spark in place and built the JAR file, run the generator as follow
```bash
export PLATFORM_VERSION=2.12_spark3.1
export DATAGEN_VERSION=0.5.0-SNAPSHOT
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar <runtime configuration arguments> -- <generator configuration arguments>
+ export LDBC_SNB_DATAGEN_JAR=./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar
+ ./tools/run.py <runtime configuration arguments> -- <generator configuration arguments>
```

#### Runtime configuration arguments
@@ -94,7 +95,7 @@ The runtime configuration arguments determine the amount of memory, number of th
To generate a single `part-*.csv` file, reduce the parallelism (number of Spark partitions) to 1.

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
+ ./tools/run.py --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
```

#### Generator configuration arguments
@@ -103,49 +104,49 @@ The generator configuration arguments allow the configuration of the output dire
To get a complete list of the arguments, pass `--help` to the JAR file:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --help
+ ./tools/run.py -- --help
```

* Generating `CsvBasic` files in **Interactive mode**:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --explode-edges --explode-attrs --mode interactive
+ ./tools/run.py -- --format csv --scale-factor 0.003 --explode-edges --explode-attrs --mode interactive
```

* Generating `CsvCompositeMergeForeign` files in **BI mode** resulting in compressed `.csv.gz` files:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip
+ ./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --format-options compression=gzip
```

* Generating `CsvCompositeMergeForeign` files in **BI mode** and generating factors:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --generate-factors
+ ./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --generate-factors
```

* Generating CSVs in **raw mode**:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw
+ ./tools/run.py -- --format csv --scale-factor 0.003 --mode raw --output-dir sf0.003-raw
```

* Generating Parquet files:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format parquet --scale-factor 0.003 --mode bi
+ ./tools/run.py -- --format parquet --scale-factor 0.003 --mode bi
```

* Use epoch milliseconds encoded as longs (née `LongDateFormatter`) for serializing date and datetime values:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode bi --epoch-millis
+ ./tools/run.py -- --format csv --scale-factor 0.003 --mode bi --epoch-millis
```
* For the `interactive` and `bi` formats, the `--format-options` argument allows passing formatting options such as timestamp/date formats, the presence/absence of headers (see the [Spark formatting options](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.sql.DataFrameWriter) for details), and whether quoting of the fields in the CSV is required:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --scale-factor 0.003 --mode interactive --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y,header=false,quoteAll=true
+ ./tools/run.py -- --format csv --scale-factor 0.003 --mode interactive --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y,header=false,quoteAll=true
```

To change the Spark configuration directory, adjust the `SPARK_CONF_DIR` environment variable.
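For reference, here is a minimal sketch of what such a configuration directory might hold; the property names below are standard Spark settings, while the layout and values are illustrative assumptions rather than anything prescribed by this repository:

```bash
# Illustrative only: create a Spark conf directory and point SPARK_CONF_DIR at it.
# spark.driver.memory and spark.local.dir are standard Spark properties; the values are examples.
mkdir -p ./conf
cat > ./conf/spark-defaults.conf <<'EOF'
spark.driver.memory  8g
spark.local.dir      /tmp/spark-scratch
EOF
export SPARK_CONF_DIR=./conf
```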
@@ -154,31 +155,53 @@ A complex example:
```bash
export SPARK_CONF_DIR=./conf
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode interactive --scale-factor 0.003
+ ./tools/run.py --parallelism 4 --memory 8G -- --format csv --format-options timestampFormat=MM/dd/y\ HH:mm:ss,dateFormat=MM/dd/y --explode-edges --explode-attrs --mode interactive --scale-factor 0.003
```

It is also possible to pass a parameter file:

```bash
- ./tools/run.py ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar -- --format csv --param-file params.ini
+ ./tools/run.py -- --format csv --param-file params.ini
```

- ### Docker image
+ ### Docker images

<!-- SNB Datagen images are available via [Docker Hub](https://hub.docker.com/r/ldbc/datagen/) (currently outdated). -->

- The Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:
+ The image tags follow the pattern `${DATAGEN_VERSION}-${PLATFORM_VERSION}`, e.g. `ldbc/datagen-standalone:0.5.0-2.12_spark3.1`.

+ #### Standalone Docker image
+
+ The standalone image bundles Spark with the JAR and the Python helpers, so you can run a workload in a container much like a local run, as shown
+ in this example:
```bash
- ./tools/docker-build.sh
+ mkdir -p out_sf0.003_interactive # create output directory
+ docker run \
+   --mount type=bind,source="$(pwd)"/out_sf0.003_interactive,target=/out \
+   --mount type=bind,source="$(pwd)"/conf,target=/conf,readonly \
+   -e SPARK_CONF_DIR=/conf \
+   ldbc/datagen-standalone:latest --parallelism 1 -- --format csv --scale-factor 0.003 --mode interactive
```

- See [Build the JAR](#build-the-jar) to build the library (e.g. by invoking `./tools/build.sh`). Then, run the following:
+ The standalone Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:

```bash
- ./tools/docker-run.sh
+ docker buildx build . --target=standalone -t ldbc/datagen-standalone:latest
+ ```
+
+ #### JAR-only image
+ The `ldbc/datagen-jar` image contains the assembly JAR, so it can be bundled in your custom container:
+
+ ```docker
+ FROM my-spark-image
+ COPY --from=ldbc/datagen-jar:latest /jar /lib/ldbc-datagen.jar
```

+ The JAR-only Docker image can be built with the provided Dockerfile. To build, execute the following command from the repository directory:
+
+ ```bash
+ docker buildx build . --target=jar -t ldbc/datagen-jar:latest
+ ```
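As a follow-up, here is a hedged sketch of invoking the bundled JAR from such a custom image. It assumes `spark-submit` is available in the base image and that the assembly's main class is `ldbc.snb.datagen.LdbcDatagen`; verify both against your image and the JAR manifest before relying on them:

```bash
# Sketch only: run the copied assembly JAR with spark-submit inside the custom image.
# The main class name is an assumption -- confirm it via the JAR manifest or the repository docs.
spark-submit \
  --class ldbc.snb.datagen.LdbcDatagen \
  /lib/ldbc-datagen.jar \
  --format csv --scale-factor 0.003 --mode bi
```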
### Elastic MapReduce

We provide scripts to run Datagen on AWS EMR. See the README in the [`./tools/emr`](tools/emr) directory for details.