MLE-26966 Added Nuclia metadata #610

Merged
rjrudin merged 1 commit into develop from feature/26966-nuclia-metadata on Feb 4, 2026

@rjrudin (Contributor) commented Feb 4, 2026

In the next PR, I'm going to turn NucliaClient into an interface so we can mock it in tests. That will allow us to verify how the metadata is handled without having to connect to Nuclia.

Copilot AI review requested due to automatic review settings February 4, 2026 21:05

github-actions bot commented Feb 4, 2026

Copyright Validation Results
Total: 6 | Passed: 5 | Failed: 0 | Skipped: 1 | at: 2026-02-04 21:05:38 UTC | commit: 9cfd02e

⏭️ Skipped (Excluded) Files

  • marklogic-spark-connector/src/main/resources/marklogic-spark-messages.properties

✅ Valid Files

  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/nuclia/NucliaDocumentProcessor.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java

✅ All files have valid copyright headers!

Copilot AI left a comment

Pull request overview

This PR adds support for Nuclia metadata in chunk documents, allowing metadata extracted from Nuclia responses to be stored alongside chunk text and embeddings.

Changes:

  • Added metadata field to ChunkInputs class to store chunk-level metadata as JsonNode
  • Updated chunk processing logic to handle and persist metadata in both JSON and XML chunk documents
  • Removed Nuclia configuration properties from the properties file

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java | Added metadata field with getter/setter to store JsonNode metadata |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java | Updated addChunk method to accept metadata parameter |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/nuclia/NucliaDocumentProcessor.java | Extracts metadata from Nuclia response and passes it when adding chunks |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java | Adds "chunk-metadata" field to JSON chunks when metadata is present |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java | Adds "chunk-metadata" element to XML chunks with JSON string representation |
| marklogic-spark-connector/src/main/resources/marklogic-spark-messages.properties | Removed Nuclia-specific configuration properties |

// "correct" way for converting JSON to XML, particularly in regard to arrays. If the user wants XML
// documents, they can always e.g. use a REST transform to determine how they want to represent the JSON
// as XML.
metadataElement.setTextContent(chunkInputs.getMetadata().toString());

Copilot AI Feb 4, 2026

Calling toString() on a JsonNode may not produce valid JSON in all cases. Consider using ObjectMapper.writeValueAsString() to ensure proper JSON serialization, especially for complex nested structures.
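
As a minimal sketch of the XML handling discussed in this thread, the following uses only the JDK's DOM and transformer APIs to store a JSON string as the text of a `chunk-metadata` element. The class name, the `chunk` root element, and passing the metadata as a plain `String` (rather than a Jackson `JsonNode`) are assumptions made for illustration; this is not the connector's actual code.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the connector.
final class ChunkMetadataXmlSketch {

    // Builds <chunk><chunk-metadata>...</chunk-metadata></chunk>, where the
    // metadata JSON is stored as text rather than converted to XML elements.
    // The "chunk" root element name is an assumption for this sketch.
    static String toChunkXml(String metadataJson) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
            Element chunk = doc.createElement("chunk");
            doc.appendChild(chunk);

            // Only add the element when metadata is present.
            if (metadataJson != null) {
                Element metadataElement = doc.createElement("chunk-metadata");
                metadataElement.setTextContent(metadataJson);
                chunk.appendChild(metadataElement);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Unable to build chunk XML", e);
        }
    }
}
```

Because the JSON is stored as text content, characters such as `<` are escaped when the document is serialized, so the element stays well-formed whatever the metadata contains.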

@@ -173,6 +174,7 @@ public void addChunk(String text, float[] embedding, String modelName) {
chunkInputs.setEmbedding(embedding);
chunkInputs.setModelName(modelName);
}

Copilot AI Feb 4, 2026

The metadata is set unconditionally outside the null check that applies to embedding and modelName. This inconsistency could be confusing. Consider moving this line inside the if block or adding a comment explaining why metadata is handled differently.

Suggested change
}
}
// Metadata is associated with the chunk regardless of whether an embedding/model name is provided.
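
The behavior under discussion can be sketched as follows. This is a simplified stand-in, not the connector's `DocumentInputs` or `ChunkInputs`: the class and field names are hypothetical, the `embedding != null` condition is an assumption about the null check mentioned in the comment, and the metadata is a plain `String` here rather than a `JsonNode`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the addChunk logic.
final class AddChunkSketch {

    static final class Chunk {
        final String text;
        float[] embedding;
        String modelName;
        String metadata;

        Chunk(String text) {
            this.text = text;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();

    Chunk addChunk(String text, float[] embedding, String modelName, String metadata) {
        Chunk chunk = new Chunk(text);
        // Assumed null check: embedding and model name are only set together.
        if (embedding != null) {
            chunk.embedding = embedding;
            chunk.modelName = modelName;
        }
        // Metadata is associated with the chunk regardless of whether an
        // embedding/model name is provided.
        chunk.metadata = metadata;
        chunks.add(chunk);
        return chunk;
    }
}
```

With this shape, a chunk that has no embedding still carries its metadata, which is the case the suggested comment is meant to document.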

rjrudin merged commit 9234a41 into develop on Feb 4, 2026
4 checks passed
rjrudin deleted the feature/26966-nuclia-metadata branch on February 4, 2026 21:22