MLE-26966 Now supports NUA for processing #607
Conversation
Copyright Validation Results: ✅ All files have valid copyright headers!
Force-pushed from a4741bf to c1ee3d5.
Pull request overview
This PR adds Nuclia Understanding API (NUA) support for document processing in the MarkLogic Spark connector. Nuclia provides an external service for text extraction, chunking, and embedding generation, offering an alternative to the existing local processing pipeline.
Changes:
- Added Nuclia client implementation with SSE event streaming for processing results
- Integrated Nuclia as a processing option in DocumentPipeline with priority over standard pipeline
- Added configuration options for Nuclia API key, KB ID, region, and timeout (a usage sketch follows this list)
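For illustration, a hedged sketch of supplying these options when writing through the connector. The apikey, kbid, and region keys appear in this PR's properties excerpt later in this review; the format name and connection option follow the connector's documented usage, while the values, the binary-file read, and any other required write options are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch only: values are placeholders; additional connector options may be required.
public class NucliaWriteExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("nuclia-example").getOrCreate();
        Dataset<Row> files = spark.read().format("binaryFile").load("/path/to/docs");
        files.write()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "user:password@host:8000")
            .option("spark.marklogic.write.nuclia.apikey", "YOUR_NUA_API_KEY")
            .option("spark.marklogic.write.nuclia.kbid", "YOUR_KB_ID")
            .option("spark.marklogic.write.nuclia.region", "YOUR_REGION")
            .mode("append")
            .save();
    }
}
```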
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| NucliaAdHocTest.java | Manual test for Nuclia integration requiring environment variables |
| AddEmbeddingsToJsonTest.java | Removed unused import |
| DocumentPipelineFactoryTest.java | Comprehensive tests for Nuclia client creation and pipeline configuration |
| marklogic-spark-messages.properties | Added property keys for Nuclia configuration options |
| NucliaEventCollector.java | SSE event listener for collecting Nuclia processing results |
| NucliaDocumentProcessor.java | Processes documents through Nuclia API and extracts chunks/embeddings |
| NucliaClient.java | HTTP client for Nuclia API with file upload and processing workflow |
| DocumentPipelineFactory.java | Factory method for creating Nuclia client and prioritizing Nuclia pipeline |
| DocumentPipeline.java | Added Nuclia processing path with client lifecycle management |
| DocumentInputs.java | New method for adding chunks with embeddings and model names |
| Options.java | Defined configuration constants for Nuclia options |
| build.gradle | Added okhttp-sse dependency for SSE streaming |
```java
String resourceId = response.body().string();
return resourceId;
```

The variable assignment and return can be simplified by directly returning the value without the intermediate variable.

Suggested change:

```java
return response.body().string();
```
```java
// Process each embedding in the array
if (node.has("embeddings") && node.get("embeddings").isArray()) {
    var embeddingsArray = node.get("embeddings");
```

Using 'var' reduces code clarity when the type isn't immediately obvious from the right side. Consider using the explicit type 'JsonNode' for better readability and maintainability.
```java
for (int i = 0; i < embeddingsArray.size(); i++) {
    var embeddingObj = embeddingsArray.get(i);
```

Using 'var' reduces code clarity when the type isn't immediately obvious from the right side. Consider using the explicit type 'JsonNode' for better readability and maintainability.
```java
String modelName = null;

if (embeddingObj.has("embedding") && embeddingObj.get("embedding").isArray()) {
    var embeddingArray = embeddingObj.get("embedding");
```

Using 'var' reduces code clarity when the type isn't immediately obvious from the right side. Consider using the explicit type 'JsonNode' for better readability and maintainability.
```diff
@@ -73,24 +103,7 @@ private static EmbeddingProducer newEmbeddingProducer(Context context) {
     }
```
The removal of comprehensive error handling and logging for Langchain4j instantiation may impact debuggability. Consider documenting why direct instantiation is now safe and what changed to make the previous exception handling unnecessary.
Suggested change:

```java
/**
 * Creates a new Langchain4jFactory instance.
 * <p>
 * Earlier versions of this method wrapped factory construction in additional
 * error handling and logging. That logic was removed once Langchain4jFactory
 * construction became side-effect free and stopped performing configuration
 * work that could fail at instantiation time. Any errors related to
 * misconfiguration or downstream processing are now expected to surface when
 * the factory is actually used (for example, when creating a TextSplitter or
 * EmbeddingProducer), where more contextual information is available.
 * <p>
 * As a result, direct instantiation here is considered safe and avoids
 * redundant, low-value exception handling that would only rethrow generic
 * runtime failures.
 */
```
```properties
spark.marklogic.write.nuclia.apikey=
spark.marklogic.write.nuclia.kbid=
spark.marklogic.write.nuclia.region=
```
Properties file entries are empty without any default values or documentation. Consider adding comments explaining the purpose and expected values for each Nuclia property, or default placeholder values to guide users.
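One way to act on this is sketched below; the descriptions are invented here to show the idea, and whether comments or filled-in values fit best depends on how this messages file is consumed:

```properties
# Nuclia Understanding API key used to authenticate processing requests.
spark.marklogic.write.nuclia.apikey=
# Identifier of the Nuclia Knowledge Box that documents are submitted to.
spark.marklogic.write.nuclia.kbid=
# Nuclia region hosting the Knowledge Box; used to construct the API base URL.
spark.marklogic.write.nuclia.region=
```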
```diff
@@ -60,6 +60,9 @@ dependencies {
     // Only needs compileOnly, as the Java Client brings this as an implementation dependency.
     compileOnly 'com.squareup.okhttp3:okhttp:5.2.0'
```
The okhttp-sse dependency version (5.2.1) differs from the okhttp version (5.2.0) specified earlier in the file. Consider aligning these versions to avoid potential compatibility issues.
Suggested change:

```diff
- compileOnly 'com.squareup.okhttp3:okhttp:5.2.0'
+ compileOnly 'com.squareup.okhttp3:okhttp:5.2.1'
```
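For completeness, a sketch with both OkHttp artifacts on one version; the okhttp-sse coordinates are the standard Square artifact, but its dependency scope in this PR is an assumption:

```groovy
// Keep both OkHttp artifacts on the same release to avoid mixing versions.
compileOnly 'com.squareup.okhttp3:okhttp:5.2.1'
compileOnly 'com.squareup.okhttp3:okhttp-sse:5.2.1'
```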
Force-pushed from 2cae573 to ba77977.
vangheem left a comment:
I wonder about all the manual serialization/deserialization, but otherwise looks fine.
```java
 */
private Stream<ObjectNode> getProcessingResults(String processingId) throws IOException {
    final String endpoint = baseUrl + "/processing/requests/" + processingId + "/results";
    if (Util.MAIN_LOGGER.isDebugEnabled()) {
```
Are these types of checks necessary? Isn't it going to only output the log if the DEBUG log level is configured anyways?
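SLF4J will indeed suppress the message when DEBUG is off, and the parameterized form already defers string formatting; the guard mainly pays off when computing an argument is itself expensive. A minimal sketch of that distinction (the class and methods here are illustrative, not from the PR):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugGuardExample {
    private static final Logger LOGGER = LoggerFactory.getLogger(DebugGuardExample.class);

    void logSubmission(String filename, String endpoint) {
        // Without a guard: SLF4J skips output when DEBUG is off, and the {} placeholders
        // defer formatting, so the cost is essentially just a varargs call.
        LOGGER.debug("Submitting file: filename={}, endpoint={}", filename, endpoint);

        // With a guard: worthwhile only when building an argument is expensive.
        if (LOGGER.isDebugEnabled()) {
            LOGGER.debug("Expensive detail: {}", buildExpensiveDescription());
        }
    }

    private String buildExpensiveDescription() {
        return "..."; // stand-in for something costly to compute
    }
}
```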
```java
    .newEventSource(request, collector);

try {
    Stream<ObjectNode> results = collector.awaitCompletion(120).stream();
```
120 should maybe be configurable? In reality it should be fast, but I guess for insanely complex and big docs with files it could be huge.
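The PR summary does list a timeout among the Nuclia options, though its key isn't shown in this excerpt. A sketch of resolving a configurable timeout with a 120-second default; the option key and helper below are hypothetical:

```java
import java.util.Map;

// Hypothetical sketch: the real constant would live in Options.java per this PR.
final class NucliaTimeoutExample {
    static int resolveTimeoutSeconds(Map<String, String> options) {
        // "spark.marklogic.write.nuclia.timeout" is an assumed key name, not confirmed by the diff.
        String value = options.get("spark.marklogic.write.nuclia.timeout");
        if (value == null || value.trim().isEmpty()) {
            return 120;
        }
        return Integer.parseInt(value.trim());
    }
}
```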
```java
    Util.MAIN_LOGGER.debug("Submitting file for processing: filename={}, endpoint={}", filename, endpoint);
}

final String requestBody = String.format("""
```
Is it normal to do string interpolation like this for json structures? Maybe use a map or record? (I'm not a java person)
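Building the body with Jackson is one common alternative to format strings, since it escapes values automatically, and Jackson is already used elsewhere in this PR for JsonNode handling. A sketch with placeholder field names rather than Nuclia's actual payload:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Sketch: build the JSON programmatically so values are escaped for us.
// The field names below are placeholders, not the real Nuclia request fields.
final class RequestBodyExample {
    static String buildRequestBody(ObjectMapper mapper, String filename, String resourceId) {
        ObjectNode body = mapper.createObjectNode();
        body.put("filename", filename);
        body.put("resource", resourceId);
        return body.toString();
    }
}
```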
Force-pushed from ba77977 to ece681b.
Not a lot can be tested here without a valid Nuclia connection, which we may end up doing in Jenkins. Will largely depend on manual testing for now.