Commit d61de9b

Merge pull request #542 from marklogic/release/2.7.0
Merge 2.7.0 into master
2 parents 5859248 + baa5012, commit d61de9b

65 files changed: 1186 additions, 296 deletions

.env

Lines changed: 5 additions & 2 deletions
@@ -1,3 +1,6 @@
 # Defines environment variables for docker-compose.
-# Can be overridden via e.g. `MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build`.
-MARKLOGIC_TAG=ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12
+# Can be overridden via e.g. `MARKLOGIC_IMAGE=latest-10.0 docker-compose up -d --build`.
+MARKLOGIC_IMAGE=ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12
+
+# Defaults to a useful value for local development.
+MARKLOGIC_LOGS_VOLUME=./docker/marklogic/logs

.github/workflows/pr-workflow.yaml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+name: 🏷️ JIRA ID Validator
+
+on:
+  # Using pull_request_target instead of pull_request to handle PRs from forks
+  pull_request_target:
+    types: [opened, edited, reopened, synchronize]
+    # No branch filtering - will run on all PRs
+
+jobs:
+  jira-pr-check:
+    name: 🏷️ Validate JIRA ticket ID
+    # Use the reusable workflow from the central repository
+    uses: marklogic/pr-workflows/.github/workflows/jira-id-check.yml@main
+    with:
+      # Pass the PR title from the event context
+      pr-title: ${{ github.event.pull_request.title }}

CONTRIBUTING.md

Lines changed: 6 additions & 6 deletions
@@ -90,7 +90,7 @@ This will produce a single jar file for the connector in the `./build/libs` dire
 
 You can then launch PySpark with the connector available via:
 
-pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 
 The below command is an example of loading data from the test application deployed via the instructions at the top of
 this page.
@@ -150,8 +150,8 @@ spark.read.option("header", True).csv("marklogic-spark-connector/src/test/resour
 When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
 that still runs on your local machine, perform the following steps:
 
-1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.5.5` since we are currently
-building against Spark 3.5.5.
+1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.5.6` since we are currently
+building against Spark 3.5.6.
 2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
 3. Run `./start-master.sh` to start a master Spark node.
 4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
@@ -166,7 +166,7 @@ The Spark master GUI is at <http://localhost:8080>. You can use this to view det
 
 Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:
 
-pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 
 You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
 examine details of each of the commands that you run.
@@ -185,15 +185,15 @@ You will need the connector jar available, so run `./gradlew clean shadowJar` if
 You can then run a test Python program in this repository via the following (again, change the master address as
 needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark:
 
-spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
+spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
 
 You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
 to `src/main/java`. Then run the following:
 
 ```
 ./gradlew clean shadowJar
 cd marklogic-spark-connector
-spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
+spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.7-SNAPSHOT.jar
 ```
 
 Be sure to move `TestProgram` back to `src/test/java` when you are done.

Jenkinsfile

Lines changed: 16 additions & 11 deletions
@@ -1,18 +1,21 @@
 @Library('shared-libraries') _
 
 def runtests(String javaVersion){
+// 'set -e' causes the script to fail if any command fails.
 sh label:'test', script: '''#!/bin/bash
+set -e
 export JAVA_HOME=$'''+javaVersion+'''
 export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
 export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
 cd marklogic-spark-connector
 echo "Waiting for MarkLogic server to initialize."
-sleep 30s
+sleep 60s
 ./gradlew clean
-./gradlew -i mlDeploy
-echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
-./gradlew -i mlLoadData
-./gradlew clean testCodeCoverageReport || true
+./gradlew mlTestConnections
+./gradlew -i mlDeploy
+echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
+./gradlew -i mlLoadData
+./gradlew clean testCodeCoverageReport || true
 '''
 junit '**/build/**/*.xml'
 }
@@ -52,14 +55,14 @@
 }
 agent {label 'devExpLinuxPool'}
 steps{
+cleanupDocker()
 sh label:'mlsetup', script: '''#!/bin/bash
 echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
 sudo /usr/local/sbin/mladmin remove
 docker-compose down -v || true
 sudo /usr/local/sbin/mladmin cleandata
 cd marklogic-spark-connector
-mkdir -p docker/marklogic/logs
-docker-compose up -d --build
+MARKLOGIC_LOGS_VOLUME=/tmp docker-compose up -d --build
 '''
 runtests('JAVA17_HOME_DIR')
 withSonarQubeEnv('SONAR_Progress') {
@@ -68,11 +71,12 @@
 }
 post{
 always{
+updateWorkspacePermissions()
 sh label:'mlcleanup', script: '''#!/bin/bash
 cd marklogic-spark-connector
 docker-compose down -v || true
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
 '''
+cleanupDocker()
 }
 }
 }
@@ -102,25 +106,26 @@
 }
 }
 steps{
+cleanupDocker()
 sh label:'mlsetup', script: '''#!/bin/bash
 echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
 sudo /usr/local/sbin/mladmin remove
 sudo /usr/local/sbin/mladmin cleandata
 cd marklogic-spark-connector
 mkdir -p docker/marklogic/logs
 docker-compose down -v || true
-MARKLOGIC_TAG=progressofficial/marklogic-db:latest-11 docker-compose up -d --build
+MARKLOGIC_LOGS_VOLUME=/tmp docker-compose up -d --build
 '''
 runtests('JAVA17_HOME_DIR')
 }
 post{
 always{
+updateWorkspacePermissions()
 sh label:'mlcleanup', script: '''#!/bin/bash
 cd marklogic-spark-connector
 docker-compose down -v || true
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/caddy/
-sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
 '''
+cleanupDocker()
 }
 }
 
NOTICE.txt

Lines changed: 9 additions & 13 deletions
@@ -1,6 +1,6 @@
 MarkLogic® Connector for Spark
 
-Copyright © 2025 MarkLogic Corporation. All Rights Reserved.
+Copyright (c) 2023-2025 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved.
 
 To the extent required by the applicable open-source license, a complete machine-readable copy of the source code
 corresponding to such code is available upon request. This offer is valid to anyone in receipt of this information and
@@ -13,20 +13,16 @@ Third Party Notices
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 jdom2 2.0.6.1 (Apache-2.0)
 jena-arq 4.10.0 (Apache-2.0)
-langchain4j 1.0.0-beta2 (Apache-2.0)
-marklogic-client-api 7.1.0 (Apache-2.0)
+langchain4j 1.2.0 (Apache-2.0)
+marklogic-client-api 7.2.0 (Apache-2.0)
 okhttp 4.12.0 (Apache-2.0)
 Semaphore-CS-Client 5.6.1 (Apache-2.0)
 Semaphore-Cloud-Client 5.6.1 (Apache-2.0)
-tika-core 3.1.0 (Apache-2.0)
-
-Common Licenses
-
-Apache License 2.0 (Apache-2.0)
+tika-core 3.2.1 (Apache-2.0)
 
 Third-Party Components
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
+The following is a list of the third-party components used by the MarkLogic® Spark connector 2.7.0 (last updated July 31, 2025):
 
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-xml/
@@ -40,11 +36,11 @@ jena-arq 4.10.0 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/jena/jena-arq/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-langchain4j 1.0.0-beta2 (Apache-2.0)
+langchain4j 1.2.0 (Apache-2.0)
 https://repo1.maven.org/maven2/dev/langchain4j/langchain4j/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-marklogic-client-api 7.1.0 (Apache-2.0)
+marklogic-client-api 7.2.0 (Apache-2.0)
 https://repo1.maven.org/maven2/com/marklogic/marklogic-client-api/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
@@ -60,13 +56,13 @@ Semaphore-CS-Client 5.6.1 (Apache-2.0)
 https://repo1.maven.org/maven2/com/smartlogic/cloud/Semaphore-Cloud-Client/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-tika-core 3.1.0 (Apache-2.0)
+tika-core 3.2.1 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/tika/tika-core/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
 Common Licenses
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
+The following is a list of the common licenses used by the MarkLogic® Spark connector 2.7.0 (last updated July 31, 2025):
 
 Apache License 2.0 (Apache-2.0)
 https://spdx.org/licenses/Apache-2.0.html
build.gradle

Lines changed: 3 additions & 2 deletions
@@ -18,7 +18,6 @@ subprojects {
 apply plugin: "jacoco"
 
 group = "com.marklogic"
-version "2.6.0"
 
 // Defaults to the Java 11 toolchain. Overridden by subprojects that require Java 17.
 // See https://docs.gradle.org/current/userguide/toolchains.html .
@@ -36,7 +35,6 @@ subprojects {
 
 repositories {
 mavenCentral()
-mavenLocal()
 maven {
 url "https://bed-artifactory.bedford.progress.com:443/artifactory/ml-maven-snapshots/"
 }
@@ -65,6 +63,9 @@ subprojects {
 // Avoids a classpath conflict between Spark and the tika-parser-microsoft-module. Tika needs a
 // more recent version and Spark (and Jena as well) both seems fine with this (as they should be per semver).
 force "org.apache.commons:commons-compress:1.27.1"
+
+// Avoids CVEs in earlier minor versions.
+force "org.apache.commons:commons-lang3:3.18.0"
 }
 
 // Excluded from Flux for size reasons, so excluded here as well to ensure we don't need it when running tests.

docker-compose.yaml

Lines changed: 2 additions & 2 deletions
@@ -15,14 +15,14 @@ services:
       - "8116:8116"
 
   marklogic:
-    image: "${MARKLOGIC_TAG}"
+    image: "${MARKLOGIC_IMAGE}"
     platform: linux/amd64
     environment:
       - MARKLOGIC_INIT=true
      - MARKLOGIC_ADMIN_USERNAME=admin
      - MARKLOGIC_ADMIN_PASSWORD=admin
     volumes:
-      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
+      - ${MARKLOGIC_LOGS_VOLUME}:/var/opt/MarkLogic/Logs
     ports:
      - "8000-8002:8000-8002"
      - "8015:8015"

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ from any data source that Spark supports and then writing it to MarkLogic.
 
 The connector has the following system requirements:
 
-* Apache Spark 3.5.5 is recommended, but earlier versions of Spark 3.5.x, 3.4.x, and 3.3.x should work as well. When choosing
+* Apache Spark 3.5.6 is recommended, but earlier versions of Spark 3.5.x, 3.4.x, and 3.3.x should work as well. When choosing
 [a Spark distribution](https://spark.apache.org/downloads.html), you must select a distribution that uses Scala 2.12 and not Scala 2.13.
 * For writing data, MarkLogic 9.0-9 or higher.
 * For reading data, MarkLogic 10.0-9 or higher.

docs/reading-data/documents.md

Lines changed: 100 additions & 0 deletions
@@ -222,6 +222,106 @@ the cost of a filtered search may be outweighed by the connector having to retur
 a filtered search will both return accurate results and may be faster. Ideally though, you can configure indexes on your
 database to allow for an unfiltered search, which will return accurate results and be faster than a filtered search.
 
+## Using secondary URIs queries
+
+As of version 2.7.0, the connector supports executing secondary queries to retrieve additional URIs beyond those specified in your initial document query. This feature is useful when you need to read documents that are related to your initial set of documents through shared data values or other relationships.
+
+When using secondary URIs queries, the connector will first retrieve the URIs from your primary query (via `spark.marklogic.read.documents.uris` or other document query options), then execute your secondary query code with access to those URIs, and finally return documents for both the original URIs and any additional URIs returned by the secondary query.
+
+### Basic usage
+
+You can execute a secondary query using JavaScript via the `spark.marklogic.read.secondaryUris.javascript` option:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.javascript", """
+        var URIs;
+        const citationIds = cts.elementValues(xs.QName("CitationID"), null, null, cts.documentQuery(URIs));
+        cts.uris(null, null, cts.andQuery([
+            cts.notQuery(cts.documentQuery(URIs)),
+            cts.collectionQuery('author'),
+            cts.jsonPropertyValueQuery('CitationID', citationIds)
+        ]))
+    """) \
+    .load()
+df.show()
+```
+
+Or using XQuery via the `spark.marklogic.read.secondaryUris.xquery` option:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.xquery", """
+        declare namespace json = "http://marklogic.com/xdmp/json";
+        declare variable $URIs external;
+        let $values := json:array-values($URIs)
+        let $citationIds := cts:element-values(xs:QName("CitationID"), (), (), cts:document-query($values))
+        return cts:uris((), (), cts:and-query((
+            cts:not-query(cts:document-query($values)),
+            cts:collection-query('author'),
+            cts:json-property-value-query('CitationID', $citationIds)
+        )))
+    """) \
+    .load()
+df.show()
+```
+
+### Available URIs in secondary queries
+
+Your secondary query code has access to the URIs from your primary query through:
+
+- **JavaScript**: An external variable named `URIs` containing the array of URIs
+- **XQuery**: An external variable named `$URIs` containing a JSON array of the URIs
+
+The examples above show how to use these URIs to find related documents - in this case, finding other author documents that share the same CitationID values as the original documents.
+
+### Using module invocation
+
+You can invoke a JavaScript or XQuery module from your application's modules database via the `spark.marklogic.read.secondaryUris.invoke` option:
+
+```python
+option("spark.marklogic.read.secondaryUris.invoke", "/findRelatedAuthors.xqy")
+```
+
+### Using local files
+
+You can specify local file paths containing either JavaScript or XQuery code via the `spark.marklogic.read.secondaryUris.javascriptFile` and `spark.marklogic.read.secondaryUris.xqueryFile` options:
+
+```python
+.option("spark.marklogic.read.secondaryUris.javascriptFile", "/path/to/findRelatedAuthors.js") \
+```
+
+### Custom external variables
+
+You can pass external variables to your secondary query code by configuring options with names starting with `spark.marklogic.read.secondaryUris.vars.`. The remainder of the option name will be used as the external variable name:
+
+```python
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.read.documents.uris", "/author/author1.json") \
+    .option("spark.marklogic.read.secondaryUris.javascript", "Sequence.from([URI1, URI2])") \
+    .option("spark.marklogic.read.secondaryUris.vars.URI1", "/author/author2.json") \
+    .option("spark.marklogic.read.secondaryUris.vars.URI2", "/author/author3.json") \
+    .load()
+df.show()
+```
+
+### Use cases
+
+Secondary URIs queries are particularly useful for:
+
+- **Document relationships**: Finding documents that reference or are referenced by your initial set
+- **Hierarchical data**: Retrieving parent or child documents in a hierarchy
+- **Cross-references**: Finding documents that share common property values
+- **Graph traversal**: Following relationships between documents to expand your result set
+- **Data enrichment**: Adding related documents to provide fuller context for analysis
+
+The secondary query is executed after the primary document selection, allowing you to build complex multi-step queries that would be difficult to express in a single MarkLogic search operation.
+
 ## Tuning performance
 
 The connector mimics the behavior of the [MarkLogic Data Movement SDK](https://docs.marklogic.com/guide/java/data-movement)
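
For orientation, the module-invocation option shown in the diff above can be combined with the same primary query as the earlier examples. A minimal sketch, assuming the `/findRelatedAuthors.xqy` module from the snippet above is installed in the application's modules database and returns a sequence of additional URIs; connection details simply mirror the earlier examples:

```python
# Minimal sketch: module invocation for secondary URIs. Assumes
# /findRelatedAuthors.xqy (the illustrative module name from the docs above)
# exists in the app's modules database and returns additional document URIs.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.read.documents.uris", "/author/author1.json\n/author/author2.json") \
    .option("spark.marklogic.read.secondaryUris.invoke", "/findRelatedAuthors.xqy") \
    .load()
df.show()
```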

docs/writing.md

Lines changed: 2 additions & 0 deletions
@@ -244,6 +244,8 @@ The options controlling the embedder feature are:
 | spark.marklogic.write.embedder.embedding.name | Allows for the embedding name to be customized when the embedding is added to a JSON or XML chunk. |
 | spark.marklogic.write.embedder.embedding.namespace | Allows for an optional namespace to be assigned to the embedding element in an XML chunk. |
 | spark.marklogic.write.embedder.batchSize | Defines the number of chunks to send to the embedding model in a single call. Defaults to 1. |
+| spark.marklogic.write.embedder.prompt | New in 2.7.0 - optional prompt to prepend to the text sent to the embedding model. Useful for providing context or other information to the embedding model. |
+| spark.marklogic.write.embedder.base64encode | New in 2.7.0 - encodes each vector produced by the embedding model using a format compliant with the new vector encoding functions in MarkLogic 12. Useful for reducing the size of vectors in your documents. |
 
 ### Streaming support
 
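The two new embedder options documented above drop into an ordinary connector write. A minimal sketch, assuming the splitter and embedder are already configured as described elsewhere on the writing page, that `df` holds the rows to write, and that the prompt text and batch size are illustrative values rather than defaults:

```python
# Minimal sketch: where the new 2.7.0 embedder options fit in a write.
# Assumes the splitter/embedder configuration described elsewhere on the
# writing page is in place; prompt text and batch size are illustrative.
df.write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.embedder.prompt", "Represent this passage for semantic search:") \
    .option("spark.marklogic.write.embedder.base64encode", "true") \
    .option("spark.marklogic.write.embedder.batchSize", "10") \
    .mode("append") \
    .save()
```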