Support similarity scores in Document API #1794

ThomasVitale · 2024-11-21T20:39:32Z

Document

Introduced “score” attribute in Document API. It stores the similarity score.
Consolidated “distance” metadata for Documents. It stores the distance measurement.
Adopted prefix-less naming convention in Document.Builder and deprecated old methods.
Deprecated the many overloaded Document constructors in favour of Document.Builder.

Vector Stores

Every vector store implementation now configures a “score” attribute with the similarity score of the Document embedding. It also includes the “distance” metadata with the distance measurement.
Fixed error in Elasticsearch where distance and similarity were mixed up.
Added missing integration tests for SimpleVectorStore.
The Azure Vector Store and HanaDB Vector Store do not include those measurements because the product documentation does not explain how the similarity score is returned, and without access to the cloud products I could not verify that via debugging.
Improved tests to actually assert the result of the similarity search based on the returned score.

ThomasVitale · 2024-11-21T20:41:01Z

spring-ai-core/src/main/java/org/springframework/ai/document/DocumentMetadata.java

+ * @author Thomas Vitale
+ * @since 1.0.0
+ */
+public enum DocumentMetadata {


The idea for this enum is to use it for other common metadata used in Documents, such as the "source file" or "page" when using a DocumentReader, helping the RAG flow traceability.
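A minimal sketch of how such shared keys might be used when building a metadata map. The `MetadataKey` enum below is a simplified stand-in, not the class from this PR: only `DISTANCE("distance")` appears in the diff, while `SOURCE`, `PAGE`, the `value()` accessor and the `describe` helper are hypothetical names added for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataKeysSketch {

    // Simplified stand-in for the enum discussed above. Only DISTANCE comes from
    // the diff; SOURCE and PAGE are hypothetical keys used for illustration.
    enum MetadataKey {
        DISTANCE("distance"), SOURCE("source"), PAGE("page");

        private final String value;

        MetadataKey(String value) {
            this.value = value;
        }

        String value() {
            return this.value;
        }
    }

    // Builds a metadata map using the shared keys instead of ad-hoc strings.
    static Map<String, Object> describe(String source, int page, double distance) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put(MetadataKey.SOURCE.value(), source);
        metadata.put(MetadataKey.PAGE.value(), page);
        metadata.put(MetadataKey.DISTANCE.value(), distance);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(describe("handbook.pdf", 12, 0.18));
    }
}
```

Using one shared set of keys means every reader and vector store writes the same metadata names, which is what makes tracing a retrieved Document back to its source practical.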

ThomasVitale · 2024-11-21T20:41:53Z

spring-ai-core/src/main/java/org/springframework/ai/document/DocumentMetadata.java

+	 * The lower the distance, the more they are similar.
+	 * It's the opposite of the similarity score.
+	 */
+	DISTANCE("distance");


I kept this metadata for backward compatibility, but we might consider removing it completely since we now have the "score" field in each Document (and "distance" is always the opposite value of "score").
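To make "opposite" concrete, here is a small sketch that assumes a score normalized to the range [0, 1] and a distance defined as `1 - score`. That formula is an assumption for illustration; individual vector stores may define distance differently, and `distanceFromScore` is a hypothetical helper name.

```java
public class ScoreDistanceSketch {

    // Assumption: score is normalized to [0, 1] and distance = 1 - score.
    static double distanceFromScore(double score) {
        if (score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("score must be in [0, 1]");
        }
        return 1.0 - score;
    }

    public static void main(String[] args) {
        double score = 0.75;
        // A higher score means a lower distance, and vice versa.
        System.out.println("score=" + score + " distance=" + distanceFromScore(score));
    }
}
```

Under that assumption the two values carry the same information, which is why keeping both is mainly a backward-compatibility concern.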

ThomasVitale · 2024-11-21T20:42:45Z

spring-ai-core/src/main/java/org/springframework/ai/vectorstore/SimpleVectorStore.java

-			.filter(s -> s.score >= request.getSimilarityThreshold())
-			.sorted(Comparator.<Similarity>comparingDouble(s -> s.score).reversed())
+			.peek(document -> document
+				.setScore(EmbeddingMath.cosineSimilarity(userQueryEmbedding, document.getEmbedding())))


If we remove the "embedding" field (see: #1781), the SimpleVectorStore will not work, because it computes the score from each Document's embedding.
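The dependency on the embedding can be seen in a rough, self-contained sketch of an in-memory similarity search: score every entry against the query embedding, drop entries below the threshold, and sort by descending score. The `Entry` and `Scored` records and the `search` method are hypothetical simplifications, not the actual SimpleVectorStore code.

```java
import java.util.Comparator;
import java.util.List;

public class SimilaritySearchSketch {

    // Hypothetical, simplified types standing in for stored documents and results.
    record Entry(String id, float[] embedding) {}

    record Scored(String id, double score) {}

    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Vectors must not be all zeros");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores each entry, keeps those at or above the threshold, best match first.
    static List<Scored> search(List<Entry> entries, float[] query, double threshold, int topK) {
        return entries.stream()
            .map(e -> new Scored(e.id(), cosineSimilarity(query, e.embedding())))
            .filter(s -> s.score() >= threshold)
            .sorted(Comparator.<Scored>comparingDouble(Scored::score).reversed())
            .limit(topK)
            .toList();
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
            new Entry("a", new float[] { 1f, 0f }),
            new Entry("b", new float[] { 0f, 1f }),
            new Entry("c", new float[] { 1f, 1f }));
        System.out.println(search(entries, new float[] { 1f, 0f }, 0.5, 5));
    }
}
```

Without the stored embedding there is nothing to pass to `cosineSimilarity`, so an in-memory store of this shape cannot rank its results.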

ThomasVitale · 2024-11-21T20:45:30Z

...ai-azure-cosmos-db-store/src/test/java/org/springframework/ai/vectorstore/CosmosDbImage.java

+	// It must always be "latest" or else Azure locks the image after a while. See:
+	// https://github.com/Azure/azure-cosmos-db-emulator-docker/issues/60
+	public static final DockerImageName DEFAULT_IMAGE = DockerImageName
+		.parse("mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest");


I hope we'll be able to have integration tests for CosmosDB based on Testcontainers in the future. For now, this image ships with the vector store-specific features disabled and there's no way to enable them, so it cannot be used.

ThomasVitale · 2024-11-21T20:47:23Z

...-ai-pinecone-store/src/main/java/org/springframework/ai/vectorstore/PineconeVectorStore.java


 		private final String contentFieldName;

+		// TODO: Why is this field configurable? Can we remove this after standardizing


Would it be ok to remove this and keep the standard "distance" metadata? Having this configurable means we cannot use the metadata reliably across implementations.

ThomasVitale · 2024-11-21T20:47:57Z

...spring-ai-redis-store/src/main/java/org/springframework/ai/vectorstore/RedisVectorStore.java

 			.map(MetadataField::name)
 			.filter(doc::hasProperty)
 			.collect(Collectors.toMap(Function.identity(), doc::getString));
+		// TODO: this seems wrong. The key is named "vector_store", but the value is the


Would it be ok to remove this and keep the standard "distance" metadata?

ThomasVitale · 2024-11-26T07:03:09Z

PR updated after #1822 was merged

markpollack · 2024-11-26T20:58:18Z

spring-ai-core/src/main/java/org/springframework/ai/document/Document.java

+
+	public void setScore(@Nullable Double score) {
+		this.score = score;
+	}


Should the document object be immutable? If we want to update a Document with a score we should use a builder that takes the existing document and then call the 'score' builder method?

The post-retrieval steps in a RAG flow would all modify the Documents in some way, including the score, the content and the metadata. I guess we could create new instances on each hop to the next step in the flow. But we'd need several builders based on the type of field we change. Should I go with that?

I made score a final field and introduced a mutate() method to build a new Document instance with the possibility to change the score. However, Document was not immutable to begin with (media and metadata are mutable). I have created a separate issue to look into that because it would be a breaking change and it would require lots of refactoring. #1838
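A minimal sketch of the "final score plus mutate()" idea, using a hypothetical, simplified `Doc` class rather than the real Document. The field set, accessor names and builder shape are assumptions for illustration; only the `mutate()` name and the final `score` come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

public class ImmutableDocumentSketch {

    // Hypothetical, simplified stand-in for Document with only final fields.
    static final class Doc {
        private final String id;
        private final String content;
        private final Map<String, Object> metadata;
        private final Double score; // null until a retrieval step assigns one

        private Doc(Builder builder) {
            this.id = builder.id;
            this.content = builder.content;
            this.metadata = Map.copyOf(builder.metadata); // unmodifiable copy
            this.score = builder.score;
        }

        String id() { return this.id; }
        String content() { return this.content; }
        Map<String, Object> metadata() { return this.metadata; }
        Double score() { return this.score; }

        static Builder builder() {
            return new Builder();
        }

        // Returns a builder pre-populated with this instance's values.
        Builder mutate() {
            return new Builder().id(this.id).content(this.content)
                .metadata(this.metadata).score(this.score);
        }
    }

    static final class Builder {
        private String id;
        private String content;
        private Map<String, Object> metadata = new HashMap<>();
        private Double score;

        Builder id(String id) { this.id = id; return this; }
        Builder content(String content) { this.content = content; return this; }
        Builder metadata(Map<String, Object> metadata) {
            this.metadata = new HashMap<>(metadata);
            return this;
        }
        Builder score(Double score) { this.score = score; return this; }
        Doc build() { return new Doc(this); }
    }

    public static void main(String[] args) {
        Doc original = Doc.builder().id("1").content("hello").build();
        Doc scored = original.mutate().score(0.9).build();
        System.out.println(original.score() + " -> " + scored.score());
    }
}
```

Each step in a flow would then receive one instance and hand a new one to the next step, leaving the original untouched.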

markpollack · 2024-11-26T21:06:32Z

spring-ai-core/src/main/java/org/springframework/ai/document/Document.java

+	 * @deprecated Use builder instead: {@link Document#builder()}.
+	 */
+	@Deprecated(since = "1.0.0-M5", forRemoval = true)
 	public Document(String content, Map<String, Object> metadata) {


There are probably many users out there using this constructor and perhaps the one with String id, String content, Map<String, Object> metadata args. Despite having the builder, maybe we keep these ctors?

I thought of deprecating most of the constructors to make the callers more readable. The problem is mostly with the other constructors that take 3 or more arguments. For example, there are 3 constructors that accept 3 arguments, but all different (with media and metadata very easy to mix up).

What if we keep only the two you mentioned? Or should I keep all of them?

I have "undeprecated" the mentioned constructors.

Yes, the two are good. The ambiguity starts once we get to three arguments, so the builder should be preferred there.

markpollack · 2024-12-02T19:20:43Z

spring-ai-core/src/main/java/org/springframework/ai/document/Document.java

-		public Builder withMedia(List<Media> media) {
-			Assert.notNull(media, "media must not be null");
+		public Builder media(List<Media> media) {
 			this.media = media;


I've updated it so that it adds to the existing list instead of replacing it. There are tests that assume it aggregates.
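The aggregating behaviour described here might look roughly like the sketch below. It uses a hypothetical builder in which media items are plain strings; the real builder works with `Media` objects and its details may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class MediaBuilderSketch {

    // Hypothetical builder: media items are modelled as Strings for brevity.
    static final class Builder {
        private final List<String> media = new ArrayList<>();

        // Appends to the media collected so far instead of replacing it.
        Builder media(List<String> media) {
            Objects.requireNonNull(media, "media must not be null");
            this.media.addAll(media);
            return this;
        }

        List<String> build() {
            return List.copyOf(this.media);
        }
    }

    public static void main(String[] args) {
        List<String> media = new Builder()
            .media(List.of("cover.png"))
            .media(List.of("chart.png", "audio.wav"))
            .build();
        System.out.println(media);
    }
}
```

With replace semantics the second `media(...)` call would silently discard the first list, which is the behaviour the existing tests ruled out.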

markpollack · 2024-12-02T19:56:47Z

I removed usage of deprecated methods and some other minor cleanup. Sorry it took a while.
We need another pass at improving Document as you noted in #1838

merged in fe58fd3

iAMSagar44 · 2024-12-04T10:07:18Z

@ThomasVitale - Regarding the scores returned by Azure AI Search, please have a read of my latest comment here - 517.

Depending on the type of search you carry out in Azure AI Search, the score might be different and its usage will be different.

ThomasVitale mentioned this pull request Nov 24, 2024

Modular RAG - Part 2 #1811

Closed

markpollack assigned markpollack and sobychacko Nov 25, 2024

ThomasVitale mentioned this pull request Nov 25, 2024

Remove embedding from Document #1781

Closed

markpollack added this to the 1.0.0-M5 milestone Nov 26, 2024

ThomasVitale force-pushed the similarity-score-documents branch from 467dbc3 to efca3c7 Compare November 26, 2024 07:03

markpollack reviewed Nov 26, 2024

ThomasVitale mentioned this pull request Nov 27, 2024

Are Document objects supposed to be immutable? #1838

Open

ThomasVitale added 2 commits November 27, 2024 22:33

Handle PR comments

3a740ec

Signed-off-by: Thomas Vitale <[email protected]>

ThomasVitale force-pushed the similarity-score-documents branch from 4758e9e to 3a740ec Compare November 27, 2024 21:33

markpollack reviewed Dec 2, 2024

markpollack closed this Dec 2, 2024

ThomasVitale mentioned this pull request Jan 15, 2025

ElasticSearch doSimilaritySearch broken after recent changes in Document #1936

Closed

