In the next PR, I'm going to turn NucliaClient into an interface so we can mock it in tests. That will allow us to verify how the metadata is handled without having to connect to Nuclia.
Copyright Validation Results
⏭️ Skipped (Excluded) Files
✅ Valid Files
✅ All files have valid copyright headers!
Pull request overview
This PR adds support for Nuclia metadata in chunk documents, allowing metadata extracted from Nuclia responses to be stored alongside chunk text and embeddings.
Changes:
- Added a metadata field to the ChunkInputs class to store chunk-level metadata as a JsonNode
- Updated chunk processing logic to handle and persist metadata in both JSON and XML chunk documents
- Removed Nuclia configuration properties from the properties file
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java | Added metadata field with getter/setter to store JsonNode metadata |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java | Updated addChunk method to accept metadata parameter |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/nuclia/NucliaDocumentProcessor.java | Extracts metadata from Nuclia response and passes it when adding chunks |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java | Adds "chunk-metadata" field to JSON chunks when metadata is present |
| marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java | Adds "chunk-metadata" element to XML chunks with JSON string representation |
| marklogic-spark-connector/src/main/resources/marklogic-spark-messages.properties | Removed Nuclia-specific configuration properties |
// "correct" way for converting JSON to XML, particularly in regard to arrays. If the user wants XML
// documents, they can always e.g. use a REST transform to determine how they want to represent the JSON
// as XML.
metadataElement.setTextContent(chunkInputs.getMetadata().toString());
Calling toString() on a JsonNode may not produce valid JSON in all cases. Consider using ObjectMapper.writeValueAsString() to ensure proper JSON serialization, especially for complex nested structures.
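For context on this thread: Jackson's Javadoc for JsonNode.toString() notes that, as of 2.10, it produces valid JSON using default databind settings, so how much the suggestion matters likely depends on the Jackson version in use. Either way, the XML side stores the serialized JSON as plain text content. The sketch below illustrates only that part, using the JDK's DOM API. It assumes a root element named "chunk" and a hard-coded JSON string in place of the real JsonNode; the class and method names are hypothetical and not taken from the connector.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch; not the connector's XmlChunkDocumentProducer.
class ChunkMetadataXmlSketch {

    // Builds <chunk><chunk-metadata>...</chunk-metadata></chunk>, keeping the
    // JSON as text content instead of converting it into XML elements.
    // The "chunk" root element name is an assumption for illustration.
    static Document buildChunk(String metadataJson) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element chunk = doc.createElement("chunk");
            doc.appendChild(chunk);
            Element metadataElement = doc.createElement("chunk-metadata");
            metadataElement.setTextContent(metadataJson);
            chunk.appendChild(metadataElement);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the document; markup characters in the JSON text are escaped
    // by the serializer, so the JSON does not need to be XML-safe beforehand.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string produced from the chunk's JsonNode metadata.
        String json = "{\"tags\":[\"a<b\",\"c&d\"]}";
        System.out.println(serialize(buildChunk(json)));
    }
}
```

Reading the element back with getTextContent() returns the original JSON string unchanged, which a consumer can then parse as JSON.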
@@ -173,6 +174,7 @@ public void addChunk(String text, float[] embedding, String modelName) {
        chunkInputs.setEmbedding(embedding);
        chunkInputs.setModelName(modelName);
    }
The metadata is set unconditionally outside the null check that applies to embedding and modelName. This inconsistency could be confusing. Consider moving this line inside the if block or adding a comment explaining why metadata is handled differently.
    }
}
// Metadata is associated with the chunk regardless of whether an embedding/model name is provided.
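The behavior discussed in this thread can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: the class names are hypothetical, the guard is assumed to be a null check on the embedding, and metadata is typed as a String to avoid a Jackson dependency (the PR stores a JsonNode).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for ChunkInputs.
class ChunkInputsSketch {
    final String text;
    float[] embedding;
    String modelName;
    String metadata; // A JsonNode in the PR; a String here to stay dependency-free.

    ChunkInputsSketch(String text) {
        this.text = text;
    }
}

// Hypothetical, simplified stand-in for DocumentInputs.
class DocumentInputsSketch {
    final List<ChunkInputsSketch> chunks = new ArrayList<>();

    void addChunk(String text, float[] embedding, String modelName, String metadata) {
        ChunkInputsSketch chunk = new ChunkInputsSketch(text);
        // Assumed guard: embedding and model name are only set together.
        if (embedding != null) {
            chunk.embedding = embedding;
            chunk.modelName = modelName;
        }
        // Metadata is associated with the chunk regardless of whether an
        // embedding/model name is provided.
        chunk.metadata = metadata;
        chunks.add(chunk);
    }
}
```

Keeping the metadata assignment outside the guard means a chunk without an embedding still carries its metadata, which appears to be the intent stated in the added comment.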
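As a rough idea of what the planned interface could enable, here is a minimal sketch of a hand-written test double. The method fetchChunkMetadata, the FakeNucliaClient class, and the MetadataCollector class are all hypothetical; the real NucliaClient methods are not shown on this page, so treat the shape below as an assumption rather than the planned design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface shape; the real NucliaClient API may differ.
interface NucliaClient {
    // Assumed to return one metadata JSON string per chunk (null if absent).
    List<String> fetchChunkMetadata(String text);
}

// Test double that returns canned metadata and never touches the network.
class FakeNucliaClient implements NucliaClient {
    private final List<String> canned;
    final List<String> requests = new ArrayList<>();

    FakeNucliaClient(String... canned) {
        this.canned = Arrays.asList(canned);
    }

    @Override
    public List<String> fetchChunkMetadata(String text) {
        requests.add(text); // Recorded so a test can verify what was sent.
        return canned;
    }
}

// Hypothetical stand-in for code that depends on the client.
class MetadataCollector {
    private final NucliaClient client;

    MetadataCollector(NucliaClient client) {
        this.client = client;
    }

    // Collects the metadata of every chunk that has some.
    List<String> collect(String text) {
        List<String> result = new ArrayList<>();
        for (String metadata : client.fetchChunkMetadata(text)) {
            if (metadata != null) {
                result.add(metadata);
            }
        }
        return result;
    }
}
```

Because the collector depends only on the interface, a test can pass in the fake and assert on the metadata handling without connecting to Nuclia. A mocking library would work equally well here; the hand-written fake just keeps the sketch dependency-free.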