Commit d61de9b

Merge pull request #542 from marklogic/release/2.7.0
Merge 2.7.0 into master
2 parents 5859248 + baa5012, commit d61de9b

65 files changed: 1186 additions, 296 deletions

.env

Lines changed: 5 additions & 2 deletions
@@ -1,3 +1,6 @@
 # Defines environment variables for docker-compose.
-# Can be overridden via e.g. `MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build`.
-MARKLOGIC_TAG=ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12
+# Can be overridden via e.g. `MARKLOGIC_IMAGE=latest-10.0 docker-compose up -d --build`.
+MARKLOGIC_IMAGE=ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12
+
+# Defaults to a useful value for local development.
+MARKLOGIC_LOGS_VOLUME=./docker/marklogic/logs

.github/workflows/pr-workflow.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+name: 🏷️ JIRA ID Validator
+
+on:
+  # Using pull_request_target instead of pull_request to handle PRs from forks
+  pull_request_target:
+    types: [opened, edited, reopened, synchronize]
+    # No branch filtering - will run on all PRs
+
+jobs:
+  jira-pr-check:
+    name: 🏷️ Validate JIRA ticket ID
+    # Use the reusable workflow from the central repository
+    uses: marklogic/pr-workflows/.github/workflows/jira-id-check.yml@main
+    with:
+      # Pass the PR title from the event context
+      pr-title: ${{ github.event.pull_request.title }}

CONTRIBUTING.md

Lines changed: 6 additions & 6 deletions
@@ -90,7 +90,7 @@ This will produce a single jar file for the connector in the `./build/libs` dire
 
 You can then launch PySpark with the connector available via:
 
-pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 
 The below command is an example of loading data from the test application deployed via the instructions at the top of
 this page.
@@ -150,8 +150,8 @@ spark.read.option("header", True).csv("marklogic-spark-connector/src/test/resour
 When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
 that still runs on your local machine, perform the following steps:
 
-1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.5.5` since we are currently
-building against Spark 3.5.5.
+1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.5.6` since we are currently
+building against Spark 3.5.6.
 2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
 3. Run `./start-master.sh` to start a master Spark node.
 4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
@@ -166,7 +166,7 @@ The Spark master GUI is at <http://localhost:8080>. You can use this to view det
 
 Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:
 
-pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 
 You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
 examine details of each of the commands that you run.
@@ -185,15 +185,15 @@ You will need the connector jar available, so run `./gradlew clean shadowJar` if
 You can then run a test Python program in this repository via the following (again, change the master address as
 needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark:
 
-spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
+spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
 
 You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
 to `src/main/java`. Then run the following:
 
 ```
 ./gradlew clean shadowJar
 cd marklogic-spark-connector
-spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 ```
 
 Be sure to move `TestProgram` back to `src/test/java` when you are done.

Jenkinsfile

Lines changed: 16 additions & 11 deletions
@@ -1,18 +1,21 @@
 @Library('shared-libraries') _
 
 def runtests(String javaVersion){
+// 'set -e' causes the script to fail if any command fails.
 sh label:'test', script: '''#!/bin/bash
+set -e
 export JAVA_HOME=$'''+javaVersion+'''
 export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
 export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
 cd marklogic-spark-connector
 echo "Waiting for MarkLogic server to initialize."
-sleep 30s
+sleep 60s
 ./gradlew clean
-./gradlew -i mlDeploy
-echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
-./gradlew -i mlLoadData
-./gradlew clean testCodeCoverageReport || true
+./gradlew mlTestConnections
+./gradlew -i mlDeploy
+echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
+./gradlew -i mlLoadData
+./gradlew clean testCodeCoverageReport || true
 '''
 junit '**/build/**/*.xml'
 }
@@ -52,14 +55,14 @@
 }
 agent {label 'devExpLinuxPool'}
 steps{
+cleanupDocker()
 sh label:'mlsetup', script: '''#!/bin/bash
 echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
 sudo /usr/local/sbin/mladmin remove
 docker-compose down -v || true
 sudo /usr/local/sbin/mladmin cleandata
 cd marklogic-spark-connector
-mkdir -p docker/marklogic/logs
-docker-compose up -d --build
+MARKLOGIC_LOGS_VOLUME=/tmp docker-compose up -d --build
 '''
 runtests('JAVA17_HOME_DIR')
 withSonarQubeEnv('SONAR_Progress') {
@@ -68,11 +71,12 @@
 }
 post{
 always{
+updateWorkspacePermissions()
 sh label:'mlcleanup', script: '''#!/bin/bash
 cd marklogic-spark-connector
 docker-compose down -v || true
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
 '''
+cleanupDocker()
 }
 }
 }
@@ -102,25 +106,26 @@
 }
 }
 steps{
+cleanupDocker()
 sh label:'mlsetup', script: '''#!/bin/bash
 echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
 sudo /usr/local/sbin/mladmin remove
 sudo /usr/local/sbin/mladmin cleandata
 cd marklogic-spark-connector
 mkdir -p docker/marklogic/logs
 docker-compose down -v || true
-MARKLOGIC_TAG=progressofficial/marklogic-db:latest-11 docker-compose up -d --build
+MARKLOGIC_LOGS_VOLUME=/tmp docker-compose up -d --build
 '''
 runtests('JAVA17_HOME_DIR')
 }
 post{
 always{
+updateWorkspacePermissions()
 sh label:'mlcleanup', script: '''#!/bin/bash
 cd marklogic-spark-connector
 docker-compose down -v || true
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/caddy/
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
 '''
+cleanupDocker()
 }
 }
 
NOTICE.txt

Lines changed: 9 additions & 13 deletions
@@ -1,6 +1,6 @@
 MarkLogic® Connector for Spark
 
-Copyright © 2025 MarkLogic Corporation. All Rights Reserved.
+Copyright (c) 2023-2025 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved.
 
 To the extent required by the applicable open-source license, a complete machine-readable copy of the source code
 corresponding to such code is available upon request. This offer is valid to anyone in receipt of this information and
@@ -13,20 +13,16 @@ Third Party Notices
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 jdom2 2.0.6.1 (Apache-2.0)
 jena-arq 4.10.0 (Apache-2.0)
-langchain4j 1.0.0-beta2 (Apache-2.0)
-marklogic-client-api 7.1.0 (Apache-2.0)
+langchain4j 1.2.0 (Apache-2.0)
+marklogic-client-api 7.2.0 (Apache-2.0)
 okhttp 4.12.0 (Apache-2.0)
 Semaphore-CS-Client 5.6.1 (Apache-2.0)
 Semaphore-Cloud-Client 5.6.1 (Apache-2.0)
-tika-core 3.1.0 (Apache-2.0)
-
-Common Licenses
-
-Apache License 2.0 (Apache-2.0)
+tika-core 3.2.1 (Apache-2.0)
 
 Third-Party Components
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
+The following is a list of the third-party components used by the MarkLogic® Spark connector 2.7.0 (last updated July 31, 2025):
 
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-xml/
@@ -40,11 +36,11 @@ jena-arq 4.10.0 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/jena/jena-arq/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-langchain4j 1.0.0-beta2 (Apache-2.0)
+langchain4j 1.2.0 (Apache-2.0)
 https://repo1.maven.org/maven2/dev/langchain4j/langchain4j/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-marklogic-client-api 7.1.0 (Apache-2.0)
+marklogic-client-api 7.2.0 (Apache-2.0)
 https://repo1.maven.org/maven2/com/marklogic/marklogic-client-api/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
@@ -60,13 +56,13 @@ Semaphore-CS-Client 5.6.1 (Apache-2.0)
 https://repo1.maven.org/maven2/com/smartlogic/cloud/Semaphore-Cloud-Client/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-tika-core 3.1.0 (Apache-2.0)
+tika-core 3.2.1 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/tika/tika-core/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
 Common Licenses
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
+The following is a list of the common licenses used by the MarkLogic® Spark connector 2.7.0 (last updated July 31, 2025):
 
 Apache License 2.0 (Apache-2.0)
 https://spdx.org/licenses/Apache-2.0.html
build.gradle

Lines changed: 3 additions & 2 deletions
@@ -18,7 +18,6 @@ subprojects {
 apply plugin: "jacoco"
 
 group = "com.marklogic"
-version "2.6.0"
 
 // Defaults to the Java 11 toolchain. Overridden by subprojects that require Java 17.
 // See https://docs.gradle.org/current/userguide/toolchains.html .
@@ -36,7 +35,6 @@ subprojects {
 
 repositories {
 mavenCentral()
-mavenLocal()
 maven {
 url "https://bed-artifactory.bedford.progress.com:443/artifactory/ml-maven-snapshots/"
 }
@@ -65,6 +63,9 @@ subprojects {
 // Avoids a classpath conflict between Spark and the tika-parser-microsoft-module. Tika needs a
 // more recent version and Spark (and Jena as well) both seems fine with this (as they should be per semver).
 force "org.apache.commons:commons-compress:1.27.1"
+
+// Avoids CVEs in earlier minor versions.
+force "org.apache.commons:commons-lang3:3.18.0"
 }
 
 // Excluded from Flux for size reasons, so excluded here as well to ensure we don't need it when running tests.

docker-compose.yaml

Lines changed: 2 additions & 2 deletions
@@ -15,14 +15,14 @@ services:
       - "8116:8116"
 
   marklogic:
-    image: "${MARKLOGIC_TAG}"
+    image: "${MARKLOGIC_IMAGE}"
     platform: linux/amd64
     environment:
       - MARKLOGIC_INIT=true
      - MARKLOGIC_ADMIN_USERNAME=admin
      - MARKLOGIC_ADMIN_PASSWORD=admin
     volumes:
-      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
+      - ${MARKLOGIC_LOGS_VOLUME}:/var/opt/MarkLogic/Logs
     ports:
      - "8000-8002:8000-8002"
      - "8015:8015"

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ from any data source that Spark supports and then writing it to MarkLogic.
 
 The connector has the following system requirements:
 
-* Apache Spark 3.5.5 is recommended, but earlier versions of Spark 3.5.x, 3.4.x, and 3.3.x should work as well. When choosing
+* Apache Spark 3.5.6 is recommended, but earlier versions of Spark 3.5.x, 3.4.x, and 3.3.x should work as well. When choosing
 [a Spark distribution](https://spark.apache.org/downloads.html), you must select a distribution that uses Scala 2.12 and not Scala 2.13.
 * For writing data, MarkLogic 9.0-9 or higher.
 * For reading data, MarkLogic 10.0-9 or higher.

docs/reading-data/documents.md

Lines changed: 100 additions & 0 deletions
@@ -222,6 +222,106 @@ the cost of a filtered search may be outweighed by the connector having to retur
 a filtered search will both return accurate results and may be faster. Ideally though, you can configure indexes on your
 database to allow for an unfiltered search, which will return accurate results and be faster than a filtered search.
 
+## Using secondary URIs queries
+
+As of version 2.7.0, the connector supports executing secondary queries to retrieve additional URIs beyond those specified in your initial document query. This feature is useful when you need to read documents that are related to your initial set of documents through shared data values or other relationships.
+
+When using secondary URIs queries, the connector will first retrieve the URIs from your primary query (via `spark.marklogic.read.documents.uris` or other document query options), then execute your secondary query code with access to those URIs, and finally return documents for both the original URIs and any additional URIs returned by the secondary query.
+
+### Basic usage
+
+You can execute a secondary query using JavaScript via the `spark.marklogic.read.secondaryUris.javascript` option:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.javascript", """
+        var URIs;
+        const citationIds = cts.elementValues(xs.QName("CitationID"), null, null, cts.documentQuery(URIs));
+        cts.uris(null, null, cts.andQuery([
+            cts.notQuery(cts.documentQuery(URIs)),
+            cts.collectionQuery('author'),
+            cts.jsonPropertyValueQuery('CitationID', citationIds)
+        ]))
+    """) \
+    .load()
+df.show()
+```
+
+Or using XQuery via the `spark.marklogic.read.secondaryUris.xquery` option:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.xquery", """
+        declare namespace json = "http://marklogic.com/xdmp/json";
+        declare variable $URIs external;
+        let $values := json:array-values($URIs)
+        let $citationIds := cts:element-values(xs:QName("CitationID"), (), (), cts:document-query($values))
+        return cts:uris((), (), cts:and-query((
+            cts:not-query(cts:document-query($values)),
+            cts:collection-query('author'),
+            cts:json-property-value-query('CitationID', $citationIds)
+        )))
+    """) \
+    .load()
+df.show()
+```
+
+### Available URIs in secondary queries
+
+Your secondary query code has access to the URIs from your primary query through:
+
+- **JavaScript**: An external variable named `URIs` containing the array of URIs
+- **XQuery**: An external variable named `$URIs` containing a JSON array of the URIs
+
+The examples above show how to use these URIs to find related documents - in this case, finding other author documents that share the same CitationID values as the original documents.
+
+### Using module invocation
+
+You can invoke a JavaScript or XQuery module from your application's modules database via the `spark.marklogic.read.secondaryUris.invoke` option:
+
+```python
+option("spark.marklogic.read.secondaryUris.invoke", "/findRelatedAuthors.xqy")
+```
+
+### Using local files
+
+You can specify local file paths containing either JavaScript or XQuery code via the `spark.marklogic.read.secondaryUris.javascriptFile` and `spark.marklogic.read.secondaryUris.xqueryFile` options:
+
+```python
+.option("spark.marklogic.read.secondaryUris.javascriptFile", "/path/to/findRelatedAuthors.js") \
+```
+
+### Custom external variables
+
+You can pass external variables to your secondary query code by configuring options with names starting with `spark.marklogic.read.secondaryUris.vars.`. The remainder of the option name will be used as the external variable name:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json") \
+    .option("spark.marklogic.read.secondaryUris.javascript", "Sequence.from([URI1, URI2])") \
+    .option("spark.marklogic.read.secondaryUris.vars.URI1", "/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.vars.URI2", "/author/author3.json") \
+    .load()
+df.show()
+```
+
+### Use cases
+
+Secondary URIs queries are particularly useful for:
+
+- **Document relationships**: Finding documents that reference or are referenced by your initial set
+- **Hierarchical data**: Retrieving parent or child documents in a hierarchy
+- **Cross-references**: Finding documents that share common property values
+- **Graph traversal**: Following relationships between documents to expand your result set
+- **Data enrichment**: Adding related documents to provide fuller context for analysis
+
+The secondary query is executed after the primary document selection, allowing you to build complex multi-step queries that would be difficult to express in a single MarkLogic search operation.
+
 ## Tuning performance
 
 The connector mimics the behavior of the [MarkLogic Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement)
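
For orientation, the module-invocation option shown in the diff above can be combined with the same primary query as the earlier examples. A minimal sketch, assuming the `/findRelatedAuthors.xqy` module from the snippet above is installed in the application's modules database and returns a sequence of additional URIs; connection details simply mirror the earlier examples:

```python
# Minimal sketch: module invocation for secondary URIs. Assumes
# /findRelatedAuthors.xqy (the illustrative module name from the docs above)
# exists in the app's modules database and returns additional document URIs.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
    .option("spark.marklogic.read.secondaryUris.invoke", "/findRelatedAuthors.xqy") \
    .load()
df.show()
```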

docs/writing.md

Lines changed: 2 additions & 0 deletions
@@ -244,6 +244,8 @@ The options controlling the embedder feature are:
 | spark.marklogic.write.embedder.embedding.name | Allows for the embedding name to be customized when the embedding is added to a JSON or XML chunk. |
 | spark.marklogic.write.embedder.embedding.namespace | Allows for an optional namespace to be assigned to the embedding element in an XML chunk. |
 | spark.marklogic.write.embedder.batchSize | Defines the number of chunks to send to the embedding model in a single call. Defaults to 1. |
+| spark.marklogic.write.embedder.prompt | New in 2.7.0 - optional prompt to prepend to the text sent to the embedding model. Useful for providing context or other information to the embedding model. |
+| spark.marklogic.write.embedder.base64encode | New in 2.7.0 - encodes each vector produced by the embedding model using a format compliant with the new vector encoding functions in MarkLogic 12. Useful for reducing the size of vectors in your documents. |
 
 ### Streaming support
 
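The two new embedder options documented above drop into an ordinary connector write. A minimal sketch, assuming the splitter and embedder are already configured as described elsewhere on the writing page, that `df` holds the rows to write, and that the prompt text and batch size are illustrative values rather than defaults:

```python
# Minimal sketch: where the new 2.7.0 embedder options fit in a write.
# Assumes the splitter/embedder configuration described elsewhere on the
# writing page is in place; prompt text and batch size are illustrative.
df.write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.embedder.prompt", "Represent this passage for semantic search:") \
    .option("spark.marklogic.write.embedder.base64encode", "true") \
    .option("spark.marklogic.write.embedder.batchSize", "10") \
    .mode("append") \
    .save()
```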