MLE-26953 Chunks now capture model name #606

Merged: rjrudin merged 1 commit into develop from feature/26953-add-model-name on Feb 3, 2026
@rjrudin (Contributor) commented Feb 3, 2026

This uses embeddingModel.getModelName() from the LangChain4j API so that each chunk records the name of the model that produced its embedding. It will also let the Nuclia integration easily add the model name found in each chunk response.

Copilot AI review requested due to automatic review settings February 3, 2026 16:40
github-actions bot commented Feb 3, 2026

Copyright Validation Results
Total: 14 | Passed: 14 | Failed: 0 | Skipped: 0 | at: 2026-02-03 17:38:14 UTC | commit: 293d991

✅ Valid Files

  • marklogic-spark-connector/src/main/java/com/marklogic/langchain4j/embedding/EmbeddingGenerator.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/ChunkInputs.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentInputs.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/DocumentPipeline.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/embedding/Chunk.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/embedding/DOMChunk.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/embedding/JsonChunk.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/JsonChunkDocumentProducer.java
  • marklogic-spark-connector/src/main/java/com/marklogic/spark/core/splitter/XmlChunkDocumentProducer.java
  • marklogic-spark-connector/src/test/java/com/marklogic/spark/writer/embedding/AbstractEmbeddingTest.java
  • marklogic-spark-connector/src/test/java/com/marklogic/spark/writer/embedding/AddEmbeddingsFromTextTest.java
  • marklogic-spark-connector/src/test/java/com/marklogic/spark/writer/embedding/AddEmbeddingsToJsonTest.java
  • marklogic-spark-connector/src/test/java/com/marklogic/spark/writer/embedding/AddEmbeddingsToXmlTest.java
  • marklogic-spark-connector/src/test/java/com/marklogic/spark/writer/embedding/TestEmbeddingModel.java

✅ All files have valid copyright headers!

Copilot AI left a comment

Pull request overview

This PR adds support for capturing and storing the model name in chunks when generating embeddings. The change leverages the LangChain4j API's getModelName() method to retrieve the model name from the embedding model and stores it alongside the embedding vector in both JSON and XML chunk documents.

Changes:

  • Modified the Chunk interface and its implementations to accept a modelName parameter in the addEmbedding method
  • Updated EmbeddingGenerator to retrieve and pass the model name when adding embeddings to chunks
  • Added comprehensive test coverage to verify model name storage across different document formats and configurations
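The first two changes can be illustrated with a minimal sketch. It is not the connector's actual code: `NamedEmbeddingModel`, `InMemoryChunk`, `getEmbeddingText()`, and `addEmbeddings` are hypothetical stand-ins, and only the two-argument `addEmbedding(embedding, modelName)` shape and the `getModelName()` call are taken from this PR's description.

```java
import java.util.List;

public class ModelNameSketch {

    /** Hypothetical stand-in for the embedding model; only the calls used here. */
    interface NamedEmbeddingModel {
        float[] embed(String text);
        String getModelName();
    }

    /** Hypothetical stand-in for the Chunk interface described in this PR. */
    interface Chunk {
        String getEmbeddingText();

        /**
         * @param embedding the vector embedding data associated with this chunk
         * @param modelName the name of the model used to generate the embedding
         */
        void addEmbedding(float[] embedding, String modelName);
    }

    /** Simple in-memory chunk used only for this illustration. */
    static final class InMemoryChunk implements Chunk {
        private final String text;
        float[] embedding;
        String modelName;

        InMemoryChunk(String text) {
            this.text = text;
        }

        public String getEmbeddingText() {
            return text;
        }

        public void addEmbedding(float[] embedding, String modelName) {
            this.embedding = embedding;
            this.modelName = modelName;
        }
    }

    /** Reads the model name once and passes it along with every embedding. */
    static void addEmbeddings(NamedEmbeddingModel model, List<? extends Chunk> chunks) {
        String modelName = model.getModelName();
        for (Chunk chunk : chunks) {
            chunk.addEmbedding(model.embed(chunk.getEmbeddingText()), modelName);
        }
    }

    public static void main(String[] args) {
        // Fake model: a two-element vector derived from the text length.
        NamedEmbeddingModel model = new NamedEmbeddingModel() {
            public float[] embed(String text) {
                return new float[] {text.length(), 1.0f};
            }

            public String getModelName() {
                return "test-model";
            }
        };
        List<InMemoryChunk> chunks = List.of(new InMemoryChunk("hello"), new InMemoryChunk("hi"));
        addEmbeddings(model, chunks);
        for (InMemoryChunk chunk : chunks) {
            System.out.println(chunk.getEmbeddingText() + " -> " + chunk.modelName);
        }
    }
}
```

Passing the name as a parameter, rather than having each chunk look it up, keeps the chunk implementations independent of whichever embedding model is in use.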

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| Chunk.java | Updated interface signature to include modelName parameter |
| DOMChunk.java | Added XML element creation for model-name in the embedding namespace |
| JsonChunk.java | Added model-name field to JSON chunk objects |
| ChunkInputs.java | Added modelName field with getter/setter methods |
| DocumentInputs.java | Updated to pass modelName when setting embeddings on chunks |
| DocumentPipeline.java | Updated wrapper to forward modelName parameter |
| EmbeddingGenerator.java | Captures model name from embedding model and passes it to chunks |
| XmlChunkDocumentProducer.java | Updated method calls to include modelName parameter |
| JsonChunkDocumentProducer.java | Updated method calls to include modelName parameter |
| AbstractEmbeddingTest.java | New test base class with shared test constants including expected model name |
| AddEmbeddingsToXmlTest.java | Extended test coverage to verify model name storage in XML documents |
| AddEmbeddingsToJsonTest.java | Extended test coverage to verify model name storage in JSON documents |
| AddEmbeddingsFromTextTest.java | Extended test coverage to verify model name storage in text-based embeddings |
| TestEmbeddingModel.java | Updated test implementation to match new interface signature |
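For the DOMChunk.java entry, here is a rough sketch of what adding a namespaced model-name element to an XML chunk could look like with the standard Java DOM API. The namespace URI, the `embedding` and `chunk` element names, and the class and method names are placeholders assumed for illustration; only the `model-name` element name comes from the summary above.

```java
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlModelNameSketch {

    // Placeholder namespace; the real embedding namespace URI is not shown on this page.
    static final String EMBEDDING_NS = "http://example.org/embedding";

    /** Appends the embedding and a sibling model-name element to the chunk element. */
    static void addEmbedding(Element chunk, float[] embedding, String modelName) {
        Document doc = chunk.getOwnerDocument();

        Element embeddingElement = doc.createElementNS(EMBEDDING_NS, "embedding");
        embeddingElement.setTextContent(Arrays.toString(embedding));
        chunk.appendChild(embeddingElement);

        Element modelNameElement = doc.createElementNS(EMBEDDING_NS, "model-name");
        modelNameElement.setTextContent(modelName);
        chunk.appendChild(modelNameElement);
    }

    /** Builds an empty document with a single root chunk element. */
    static Element newChunk() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element chunk = doc.createElement("chunk");
            doc.appendChild(chunk);
            return chunk;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Unable to create XML document", e);
        }
    }

    public static void main(String[] args) {
        Element chunk = newChunk();
        addEmbedding(chunk, new float[] {0.1f, 0.2f}, "test-model");
        Element modelName =
            (Element) chunk.getElementsByTagNameNS(EMBEDDING_NS, "model-name").item(0);
        System.out.println(modelName.getTextContent()); // prints "test-model"
    }
}
```

A JSON chunk would presumably carry the same value as a plain model-name field next to the embedding array, as the JsonChunk.java entry describes.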

Comment on lines 19 to +20
* @param embedding
* @param modelName
Copilot AI Feb 3, 2026

The parameter documentation is incomplete. Add descriptions explaining what each parameter represents. For example: '@param embedding the vector embedding data as a float array' and '@param modelName the name of the model used to generate the embedding'.

Suggested change
- * @param embedding
- * @param modelName
+ * @param embedding the vector embedding data associated with this chunk
+ * @param modelName the name of the model used to generate the embedding

rjrudin force-pushed the feature/26953-add-model-name branch 2 times, most recently from 5674a29 to 8b2ed75, on February 3, 2026 17:26
rjrudin force-pushed the feature/26953-add-model-name branch from 8b2ed75 to 293d991 on February 3, 2026 17:38
rjrudin merged commit 5463b41 into develop on Feb 3, 2026 (4 checks passed)
rjrudin deleted the feature/26953-add-model-name branch on February 3, 2026 18:04