MLE-26966 Added Nuclia metadata by rjrudin · Pull Request #610 · marklogic/marklogic-spark-connector

rjrudin · 2026-02-04T21:05:25Z

In the next PR, going to turn NucliaClient into an interface so we can mock it in tests. That will allow us to verify how the metadata is handled without having to connect to Nuclia.
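As a rough illustration of the interface-plus-mock idea described above, here is a minimal sketch. The real NucliaClient API is not shown in this PR, so the method name `processText`, its return type, and the `StubNucliaClient` class are all hypothetical.

```java
import java.util.List;
import java.util.Map;

public class NucliaClientSketch {

    // Hypothetical interface; the actual NucliaClient methods are not shown in this PR.
    interface NucliaClient {
        List<Map<String, String>> processText(String text);
    }

    // Test double that returns one canned chunk with metadata and makes no network call.
    static class StubNucliaClient implements NucliaClient {
        @Override
        public List<Map<String, String>> processText(String text) {
            return List.of(Map.of("text", text, "metadata", "{\"language\":\"en\"}"));
        }
    }

    // Code under test depends only on the interface, so a stub can be swapped in.
    static String firstChunkMetadata(NucliaClient client, String text) {
        return client.processText(text).get(0).get("metadata");
    }

    public static void main(String[] args) {
        System.out.println(firstChunkMetadata(new StubNucliaClient(), "some text"));
    }
}
```

Because the caller only sees the interface, a test can assert on metadata handling without a Nuclia connection.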

github-actions · 2026-02-04T21:05:39Z

⏭️ Skipped (Excluded) Files

marklogic-spark-connector/src/main/resources/marklogic-spark-messages.properties

✅ Valid Files

marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java
marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java
marklogic-spark-connector/src/main/java/com/marklogic/spark/core/nuclia/NucliaDocumentProcessor.java
marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java
marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java

✅ All files have valid copyright headers!

Copilot

Pull request overview

This PR adds support for Nuclia metadata in chunk documents, allowing metadata extracted from Nuclia responses to be stored alongside chunk text and embeddings.

Changes:

- Added metadata field to ChunkInputs class to store chunk-level metadata as JsonNode
- Updated chunk processing logic to handle and persist metadata in both JSON and XML chunk documents
- Removed Nuclia configuration properties from the properties file

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java | Added `metadata` field with getter/setter to store JsonNode metadata |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java | Updated `addChunk` method to accept metadata parameter |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/nuclia/NucliaDocumentProcessor.java | Extracts metadata from Nuclia response and passes it when adding chunks |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java | Adds "chunk-metadata" field to JSON chunks when metadata is present |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java | Adds "chunk-metadata" element to XML chunks with JSON string representation |
| marklogic-spark-connector/src/main/resources/marklogic-spark-messages.properties | Removed Nuclia-specific configuration properties |

Copilot · 2026-02-04T21:05:56Z

...park-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java

+            // "correct" way for converting JSON to XML, particularly in regard to arrays. If the user wants XML
+            // documents, they can always e.g. use a REST transform to determine how they want to represent the JSON
+            // as XML.
+            metadataElement.setTextContent(chunkInputs.getMetadata().toString());


Calling toString() on a JsonNode may not produce valid JSON in all cases. Consider using ObjectMapper.writeValueAsString() to ensure proper JSON serialization, especially for complex nested structures.
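For context on the line under review, the following is a minimal sketch of storing a JSON string as the text content of a "chunk-metadata" element using only the JDK's DOM and transformer APIs. The class name, the `chunk` and `text` element names, and the helper methods are assumptions for illustration, and a plain string literal stands in for the output of `chunkInputs.getMetadata().toString()`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlChunkMetadataSketch {

    // Builds a hypothetical <chunk> element; the JSON is stored as plain text, not converted to XML.
    static Document buildChunk(String text, String metadataJson) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element chunk = doc.createElement("chunk");
        doc.appendChild(chunk);

        Element textElement = doc.createElement("text");
        textElement.setTextContent(text);
        chunk.appendChild(textElement);

        // Only add the element when metadata is present.
        if (metadataJson != null) {
            Element metadataElement = doc.createElement("chunk-metadata");
            metadataElement.setTextContent(metadataJson);
            chunk.appendChild(metadataElement);
        }
        return doc;
    }

    // Serializes the document; XML-special characters in the JSON text are escaped on output.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildChunk("Hello world", "{\"labels\":[\"a\",\"b\"],\"note\":\"x < y\"}");
        System.out.println(serialize(doc));
    }
}
```

Reading the element back with `getTextContent()` returns the original JSON string, so a consumer can parse it with whatever JSON library it prefers.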

Copilot · 2026-02-04T21:05:56Z

marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java

@@ -173,6 +174,7 @@ public void addChunk(String text, float[] embedding, String modelName) {
            chunkInputs.setEmbedding(embedding);
            chunkInputs.setModelName(modelName);
        }


The metadata is set unconditionally outside the null check that applies to embedding and modelName. This inconsistency could be confusing. Consider moving this line inside the if block or adding a comment explaining why metadata is handled differently.

Suggested change

-        }
+        }
+        // Metadata is associated with the chunk regardless of whether an embedding/model name is provided.
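To make the structure being discussed concrete, here is a simplified, self-contained sketch of an `addChunk` method that sets metadata outside the embedding check. It is not the PR's actual code: the `embedding != null` condition, the four-argument signature, and the use of a `String` in place of Jackson's `JsonNode` are assumptions made to keep the example runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class AddChunkSketch {

    // Simplified stand-in for ChunkInputs; metadata is a String here instead of a JsonNode.
    static class ChunkInputs {
        private final String text;
        private float[] embedding;
        private String modelName;
        private String metadata;

        ChunkInputs(String text) { this.text = text; }

        String getText() { return text; }
        float[] getEmbedding() { return embedding; }
        void setEmbedding(float[] embedding) { this.embedding = embedding; }
        String getModelName() { return modelName; }
        void setModelName(String modelName) { this.modelName = modelName; }
        String getMetadata() { return metadata; }
        void setMetadata(String metadata) { this.metadata = metadata; }
    }

    private final List<ChunkInputs> chunks = new ArrayList<>();

    public void addChunk(String text, float[] embedding, String modelName, String metadata) {
        ChunkInputs chunkInputs = new ChunkInputs(text);
        // Assumed condition: embedding and model name are only set when an embedding exists.
        if (embedding != null) {
            chunkInputs.setEmbedding(embedding);
            chunkInputs.setModelName(modelName);
        }
        // Metadata is associated with the chunk regardless of whether an embedding/model name is provided.
        chunkInputs.setMetadata(metadata);
        chunks.add(chunkInputs);
    }

    public List<ChunkInputs> getChunks() { return chunks; }
}
```

With this layout a chunk that has no embedding still keeps its metadata, which is the behavior the suggested comment documents.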

MLE-26966 Added Nuclia metadata

9cfd02e

In the next PR, going to turn NucliaClient into an interface so we can mock it in tests. That will allow us to verify how the metadata is handled without having to connect to Nuclia.

Copilot AI review requested due to automatic review settings February 4, 2026 21:05

rjrudin requested review from BillFarber and stevebio as code owners February 4, 2026 21:05

Copilot AI reviewed Feb 4, 2026

BillFarber approved these changes Feb 4, 2026

rjrudin merged commit 9234a41 into develop Feb 4, 2026
4 checks passed

rjrudin deleted the feature/26966-nuclia-metadata branch February 4, 2026 21:22
