Commit 6257b6c

Merge pull request #262 from marklogic/release/1.1.0
Merge release/1.1.0 into main
2 parents 9f6a264 + 6378608 commit 6257b6c


53 files changed: +464 additions, −132 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ flux/conf
 flux/export
 export
 flux-version.properties
+docker/sonarqube

CONTRIBUTING.md

Lines changed: 15 additions & 16 deletions
@@ -1,10 +1,11 @@
 To contribute to this project, complete these steps to set up a MarkLogic instance via Docker with a test
 application installed:
 
-1. Clone this repository if you have not already.
-2. From the root directory of the project, run `docker-compose up -d --build`.
-3. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
-4. Run `./gradlew -i mlDeploy` to deploy this project's test application (note that Java 11 or Java 17 is required).
+1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the SonarQube support described below.
+2. Clone this repository if you have not already.
+3. From the root directory of the project, run `docker-compose up -d --build`.
+4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
+5. Run `./gradlew -i mlDeploy` to deploy this project's test application.
 
 Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
 into it:
@@ -57,10 +58,7 @@ publishing a local snapshot of our Spark connector. Then just run:
 
     ./gradlew clean test
 
-You can run the tests using either Java 11 or Java 17.
-
-In Intellij, the tests will run with Java 11. In order to run the tests in Intellij using Java 17,
-perform the following steps:
+If you are running the tests in Intellij with Java 17, you will need to perform the following steps:
 
 1. Go to Run -> Edit Configurations in the Intellij toolbar.
 2. Click on "Edit configuration templates".
@@ -81,7 +79,7 @@ delete that configuration first via the "Run -> Edit Configurations" panel.
 ## Generating code quality reports with SonarQube
 
 In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
-have the services in that file running.
+have the services in that file running. You must also use Java 17 to run the `sonar` Gradle task.
 
 To configure the SonarQube service, perform the following steps:
 
@@ -97,8 +95,8 @@ To configure the SonarQube service, perform the following steps:
 10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
 that file if it does not exist yet.
 
-To run SonarQube, run the following Gradle tasks, which will run all the tests with code coverage and then generate
-a quality report with SonarQube:
+To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
+coverage and then generate a quality report with SonarQube:
 
     ./gradlew test sonar
 
@@ -116,7 +114,8 @@ before, then SonarQube will show "New Code" by default. That's handy, as you can
 you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.
 
 Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
-without having to re-run the tests.
+without having to re-run the tests. If you get an error from Sonar about Java sources, you just need to compile the
+Java code, so run `./gradlew compileTestJava sonar`.
 
 ## Testing the documentation locally
 
@@ -229,7 +228,7 @@ Set `SPARK_HOME` to the location of Spark - e.g. `/Users/myname/.sdkman/candidat
 
 Next, start a Spark master node:
 
-    cd $SPARK_HOME/bin
+    cd $SPARK_HOME/sbin
     start-master.sh
 
 You will need the address at which the Spark master node can be reached. To find it, open the log file that Spark
@@ -257,15 +256,15 @@ are all synonyms):
 
     ./gradlew shadowJar
 
-This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.0.0-all.jar`.
+This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1.0-all.jar`.
 
 You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
 of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
 cluster:
 
 ```
 $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
-  --master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
+  --master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
   import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
   --connection-string "admin:admin@localhost:8000" \
   --preview 5 --preview-drop content
@@ -282,7 +281,7 @@ to something you can access):
 $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
   --packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
   --master spark://NYWHYC3G0W:7077 \
-  flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
+  flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
   import-files --path "s3a://changeme/" \
   --connection-string "admin:admin@localhost:8000" \
   --s3-add-credentials \

NOTICE.txt

Lines changed: 12 additions & 12 deletions
@@ -6,13 +6,13 @@ To the extent required by the applicable open-source license, a complete machine
 
 Third Party Notices
 
-aws-java-sdk-s3 1.12.367 (Apache-2.0)
-hadoop-aws 3.3.6 (Apache-2.0)
-hadoop-client 3.3.6 (Apache-2.0)
-marklogic-spark-connector 2.3.0 (Apache-2.0)
+aws-java-sdk-s3 1.12.262 (Apache-2.0)
+hadoop-aws 3.3.4 (Apache-2.0)
+hadoop-client 3.3.4 (Apache-2.0)
+marklogic-spark-connector 2.4.0 (Apache-2.0)
 picocli 4.7.6 (Apache-2.0)
-spark-avro_2.12 3.4.3 (Apache-2.0)
-spark-sql_2.12 3.4.3 (Apache-2.0)
+spark-avro_2.12 3.5.3 (Apache-2.0)
+spark-sql_2.12 3.5.3 (Apache-2.0)
 
 Common Licenses
 
@@ -22,32 +22,32 @@ Third-Party Components
 
 The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):
 
-aws-java-sdk-s3 1.12.367 (Apache-2.0)
+aws-java-sdk-s3 1.12.262 (Apache-2.0)
 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
 
-hadoop-aws 3.3.6 (Apache-2.0)
+hadoop-aws 3.3.4 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-hadoop-client 3.3.6 (Apache-2.0)
+hadoop-client 3.3.4 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-marklogic-spark-connector 2.3 (Apache-2.0)
+marklogic-spark-connector 2.4.0 (Apache-2.0)
 https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
 picocli 4.7.6 (Apache-2.0)
 https://repo1.maven.org/maven2/info/picocli/picocli
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-spark-avro_2.12 3.4.3 (Apache-2.0)
+spark-avro_2.12 3.5.3 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-spark-sql_2.12 3.4.3 (Apache-2.0)
+spark-sql_2.12 3.5.3 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.12
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 

build.gradle

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ task gettingStartedZip(type: Zip) {
     description = "Creates a zip of the getting-started project that is intended to be included as a downloadable file " +
         "on the GitHub release page."
     from "examples/getting-started"
-    exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore"
+    exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore", "marklogic-flux"
     into "marklogic-flux-getting-started-${version}"
     archiveFileName = "marklogic-flux-getting-started-${version}.zip"
     destinationDirectory = file("build")

docker-compose.yml

Lines changed: 2 additions & 4 deletions
@@ -1,5 +1,3 @@
-version: '3.8'
-
 name: flux
 
 services:
@@ -17,7 +15,7 @@ services:
       - 8007:8007
 
   marklogic:
-    image: "marklogicdb/marklogic-db:11.2.0-centos-1.1.2"
+    image: "progressofficial/marklogic-db:latest"
     platform: linux/amd64
     environment:
       - MARKLOGIC_INIT=true
@@ -55,7 +53,7 @@ services:
 
   # Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
   sonarqube:
-    image: sonarqube:10.3.0-community
+    image: sonarqube:10.6.0-community
     depends_on:
       - postgres
     environment:

docs/api.md

Lines changed: 4 additions & 4 deletions
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
 <dependency>
     <groupId>com.marklogic</groupId>
     <artifactId>flux-api</artifactId>
-    <version>1.0.0</version>
+    <version>1.1.0</version>
 </dependency>
 ```
 
 Or if you are using Gradle, add the following to your `build.gradle` file:
 
 ```
 dependencies {
-    implementation "com.marklogic:flux-api:1.0.0"
+    implementation "com.marklogic:flux-api:1.1.0"
 }
 ```
 
@@ -97,7 +97,7 @@ buildscript {
         mavenCentral()
     }
     dependencies {
-        classpath "com.marklogic:flux-api:1.0.0"
+        classpath "com.marklogic:flux-api:1.1.0"
     }
 }
 ```
@@ -139,7 +139,7 @@ buildscript {
         mavenCentral()
     }
    dependencies {
-        classpath "com.marklogic:flux-api:1.0.0"
+        classpath "com.marklogic:flux-api:1.1.0"
         classpath("com.marklogic:ml-gradle:4.8.0") {
             exclude group: "com.fasterxml.jackson.databind"
             exclude group: "com.fasterxml.jackson.core"

docs/export/export-archives.md

Lines changed: 25 additions & 6 deletions
@@ -57,21 +57,21 @@ combination of those options as well, with the exception that `--query` will be
 
 You must then use the `--path` option to specify a directory to write archive files to.
 
-### Windows-specific issues with zip files
+### Windows-specific issues with ZIP files
 
-In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
+In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
 with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
-zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
+ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
 7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
 effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
-zip file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
+ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
 command, you will get the expected results.
 
 
 ## Controlling document metadata
 
 Each exported document will have all of its associated metadata - collections, permissions, quality, properties, and
-metadata values - included in an XML document in the archive zip file. You can control which types of metadata are
+metadata values - included in an XML document in the archive ZIP file. You can control which types of metadata are
 included with the `--categories` option. This option accepts a comma-delimited sequence of the following metadata types:
 
 - `collections`
@@ -120,4 +120,23 @@ bin\flux export-archives ^
 {% endtabs %}
 
 
-The encoding will be used for both document and metadata entries in each archive zip file.
+The encoding will be used for both document and metadata entries in each archive ZIP file.
+
+## Exporting large binary files
+
+Similar to [exporting large binary documents as files](export-documents.md), you can include large binary documents
+in archives by including the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream
+each document from MarkLogic directly to a ZIP file, thereby avoiding reading the contents of a file into memory.
+
+As streaming to an archive requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
+when exporting smaller documents that can easily fit into the memory available to Flux.
+
+When using `--streaming`, the following options will behave in a different fashion:
+
+- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
+the number of documents retrieved from MarkLogic in a single request, which will always be 1.
+- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
+- `--pretty-print` will have no effect as the contents of a document will never be read into memory.
+
+You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
+large binary document may exhaust the amount of memory available to MarkLogic.
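The `--streaming` behavior added above changes the request arithmetic: `--batch-size` still governs how many URIs are fetched per request, while documents are fetched one per request. A rough sketch of that trade-off follows; this models only the documented description, not Flux's actual implementation, and the function name and batch size are hypothetical:

```python
import math

def request_counts(num_documents, batch_size, streaming):
    """Approximate MarkLogic request counts per the --streaming docs:
    URIs are fetched batch_size at a time; with --streaming, each
    document is then retrieved in its own request."""
    uri_requests = math.ceil(num_documents / batch_size)
    doc_requests = num_documents if streaming else uri_requests
    return uri_requests, doc_requests

# 10,000 documents with a hypothetical batch size of 200:
print(request_counts(10_000, 200, streaming=False))  # (50, 50)
print(request_counts(10_000, 200, streaming=True))   # (50, 10000)
```

This is why the docs discourage `--streaming` for small documents: the per-document requests dominate once the documents would have fit in memory anyway.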

docs/export/export-documents.md

Lines changed: 32 additions & 11 deletions
@@ -157,22 +157,22 @@ To use the above transform, verify that your user has been granted the MarkLogic
 
 ## Compressing content
 
-The `--compression` option is used to write files either to Gzip or ZIP files.
+The `--compression` option is used to write files either to gzip or ZIP files.
 
-To Gzip each file, include `--compression GZIP`.
+To gzip each file, include `--compression GZIP`.
 
-To write multiple files to one or more ZIP files, include `--compression ZIP`. A zip file will be created for each
+To write multiple files to one or more ZIP files, include `--compression ZIP`. A ZIP file will be created for each
 partition that was created when reading data via Optic. You can include `--zip-file-count 1` to force all documents to be
 written to a single ZIP file. See the below section on "Understanding partitions" for more information.
 
-### Windows-specific issues with zip files
+### Windows-specific issues with ZIP files
 
-In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
+In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
 with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
-zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
+ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
 7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
 effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
-zip file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
+ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
 command, you will get the expected results.
 
 ## Specifying an encoding
@@ -202,6 +202,27 @@ bin\flux export-files ^
 {% endtabs %}
 
 
+## Exporting large binary documents
+
+MarkLogic's [support for large binary documents](https://docs.marklogic.com/guide/app-dev/binaries#id_93203) allows
+for storing binary files of any size. To ensure that large binary documents can be exported to a file path, consider
+using the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream each document
+from MarkLogic directly to the file path, thereby avoiding reading the contents of a file into memory. This option
+can be used when exporting documents to gzip or ZIP files as well via the `--compression zip` option.
+
+As streaming to a file requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
+when exporting smaller documents that can easily fit into the memory available to Flux.
+
+When using `--streaming`, the following options will behave in a different fashion:
+
+- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
+the number of documents retrieved from MarkLogic in a single request, which will always be 1.
+- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
+- `--pretty-print` will have no effect as the contents of a document will never be read into memory.
+
+You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
+large binary document may exhaust the amount of memory available to MarkLogic.
+
 ## Understanding partitions
 
 As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
@@ -237,9 +258,9 @@ bin\flux export-files ^
 {% endtab %}
 {% endtabs %}
 
-The `./export` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
+The `./export` directory will have 12 ZIP files in it. This count is due to how Flux reads data from MarkLogic,
 which involves creating 4 partitions by default per forest in the MarkLogic database. The example application has 3
-forests in its content database, and thus 12 partitions are created, resulting in 12 separate zip files.
+forests in its content database, and thus 12 partitions are created, resulting in 12 separate ZIP files.
 
 You can use the `--partitions-per-forest` option to control how many partitions - and thus workers - read documents
 from each forest in your database:
@@ -272,7 +293,7 @@ bin\flux export-files ^
 {% endtabs %}
 
 
-This approach will produce 3 zip files - one per forest.
+This approach will produce 3 ZIP files - one per forest.
 
 You can also use the `--repartition` option, available on every command, to force the number of partitions used when
 writing data, regardless of how many were used to read the data:
@@ -303,7 +324,7 @@ bin\flux export-files ^
 {% endtabs %}
 
 
-This approach will produce a single zip file due to the use of a single partition when writing files.
+This approach will produce a single ZIP file due to the use of a single partition when writing files.
 The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
 `--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
 be written.
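The partition arithmetic in the "Understanding partitions" section above (4 partitions per forest by default, overridable with `--partitions-per-forest`, `--repartition`, or its alias `--zip-file-count`) can be sketched as follows. The helper is an illustration of the documented counts, not Flux code:

```python
def expected_zip_files(forests, partitions_per_forest=4, repartition=None):
    """Number of ZIP files produced by export-files --compression ZIP:
    one file per partition, unless --repartition (or its alias
    --zip-file-count) forces the write-side partition count."""
    if repartition is not None:
        return repartition
    return forests * partitions_per_forest

print(expected_zip_files(3))                           # 12: the documented default
print(expected_zip_files(3, partitions_per_forest=1))  # 3: one per forest
print(expected_zip_files(3, repartition=1))            # 1: single file
```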

docs/export/export-rdf.md

Lines changed: 2 additions & 2 deletions
@@ -86,6 +86,6 @@ For some use cases involving exporting triples with their graphs to files contai
 reference the graph that each triple belongs to in MarkLogic. You can use `--graph-override` to specify an alternative
 graph value that will then be associated with every triple that Flux writes to a file.
 
-## GZIP compression
+## gzip compression
 
-To compress each file written by Flux using GZIP, simply include `--gzip` as an option.
+To compress each file written by Flux using gzip, simply include `--gzip` as an option.

docs/export/export-rows.md

Lines changed: 4 additions & 0 deletions
@@ -311,6 +311,10 @@ location where data already exists. This option supports the following values:
 
 For convenience, the above values are case-insensitive so that you can ignore casing when choosing a value.
 
+As of the 1.1.0 release of Flux, `--mode` defaults to `Append` for commands that write to a filesystem. In the 1.0.0
+release, these commands defaulted to `Overwrite`. The `export-jdbc` command defaults to `ErrorIfExists` to avoid altering
+an existing table in any way.
+
 For further information on each mode, please see
 [the Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes).
 
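The `--mode` defaults described in the added paragraph vary by command and release. A small sketch of those documented defaults, using a hypothetical helper for illustration only:

```python
def default_save_mode(command, flux_version):
    """Default Spark save mode per the export-rows docs: export-jdbc
    always defaults to ErrorIfExists; filesystem-writing commands
    defaulted to Overwrite in 1.0.0 and to Append as of 1.1.0."""
    if command == "export-jdbc":
        return "ErrorIfExists"
    major, minor = (int(n) for n in flux_version.split(".")[:2])
    return "Append" if (major, minor) >= (1, 1) else "Overwrite"

print(default_save_mode("export-files", "1.1.0"))  # Append
print(default_save_mode("export-files", "1.0.0"))  # Overwrite
print(default_save_mode("export-jdbc", "1.1.0"))   # ErrorIfExists
```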
