Avoid duplicated entries in VectorStore(s) by allowing generation of Document ID based on the hashed document content. #113

@tzolov

Currently, a Document that is not given an explicit ID gets a random UUID each time it is created.
A new ID is generated every time, even if the document content and metadata haven't changed.
As a result, re-adding the same document stores its content in the vector store again as a duplicate.

To prevent this kind of unnecessary duplication, we can allow the Document ID to be generated from a hash of the document content plus metadata.

The following snippet is inspired by the langchain4j vector store implementations.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

....

public static String generateIdFrom(String contentWithMetadata) {
    try {
        // Hash the combined content and metadata with SHA-256.
        byte[] hashBytes = MessageDigest.getInstance("SHA-256")
            .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
        // Render the digest as a lowercase hex string.
        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }
        // Derive a deterministic, name-based UUID from the hex digest.
        return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
    }
    catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException(e);
    }
}
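
The snippet above takes a single contentWithMetadata string but does not say how content and metadata are combined. A minimal sketch of one possible approach follows, assuming metadata is a Map whose values have stable toString() output. The class ContentIdDemo, the method idFor, and the sorted "key=value" serialization are hypothetical and not part of any existing API. Sorting the keys keeps the ID independent of map iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical demo class; the names here are illustrative only.
final class ContentIdDemo {

    // Same scheme as the snippet above: SHA-256 hex digest, then a name-based UUID.
    static String generateIdFrom(String contentWithMetadata) {
        try {
            byte[] hashBytes = MessageDigest.getInstance("SHA-256")
                    .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hashBytes) {
                sb.append(String.format("%02x", b));
            }
            return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Assumption: metadata is appended as "key=value" lines in sorted key order,
    // so two maps with equal entries always serialize to the same string.
    static String idFor(String content, Map<String, Object> metadata) {
        StringBuilder sb = new StringBuilder(content);
        for (Map.Entry<String, Object> e : new TreeMap<>(metadata).entrySet()) {
            sb.append('\n').append(e.getKey()).append('=').append(e.getValue());
        }
        return generateIdFrom(sb.toString());
    }

    public static void main(String[] args) {
        String a = idFor("Hello", Map.of("source", "a.txt", "page", 1));
        String b = idFor("Hello", Map.of("page", 1, "source", "a.txt"));
        String c = idFor("Hello!", Map.of("source", "a.txt", "page", 1));
        System.out.println(a.equals(b)); // same content and metadata: IDs match
        System.out.println(a.equals(c)); // changed content: IDs differ
    }
}
```

With IDs derived this way, re-adding an unchanged document produces the same ID. Whether that ends up as an overwrite or a skipped insert depends on how the particular vector store handles an existing ID.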
