issue creating a vector store from a collection using imported embeddings #369

@jonathan2951

Description

Hi,
I have a question about creating a vector store from a collection whose embeddings I computed beforehand (using OpenAI's text-embedding-3-small).

I created the collection as below:

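from astrapy.constants import VectorMetric
from astrapy.info import CollectionDefinition, CollectionVectorOptions

collection_definition = CollectionDefinition(
    vector=CollectionVectorOptions(
        dimension=1536,  # output size of OpenAI text-embedding-3-small
        metric=VectorMetric.COSINE,
    ),
    indexing={"allow": ["$vector"]},  # index only the vector field
)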
new_collection = database.create_collection(
    "test_collection_chunks_for_demo", 
    definition=collection_definition
)

And I fill the collection as below:

embedding = response.data[0].embedding
to_embed = {
    "chunk_id": f"doc_chunk_{i}_{table_name}",
    "catalog_name": catalog_name,  # metadata
    "schema_name": schema_name,    # metadata
    "table_name": table_name,      # metadata
    "$vector": embedding,          # precomputed embedding of the chunk
    "chunk": chunk,                # raw text
}
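
For reference, the document layout above can be written as a small helper that also checks the vector length against the collection's dimension. This is only a sketch: `build_chunk_document` and `EXPECTED_DIM` are hypothetical names introduced here, not part of astrapy or langchain-astradb, and the value 1536 is taken from the collection definition above.

```python
from typing import Any, Sequence

# Assumption: must equal `dimension` in the collection definition.
EXPECTED_DIM = 1536


def build_chunk_document(
    i: int,
    catalog_name: str,
    schema_name: str,
    table_name: str,
    chunk: str,
    embedding: Sequence[float],
) -> dict[str, Any]:
    """Build one document carrying a precomputed embedding (hypothetical helper)."""
    if len(embedding) != EXPECTED_DIM:
        raise ValueError(
            f"expected a {EXPECTED_DIM}-dimensional vector, got {len(embedding)}"
        )
    return {
        "chunk_id": f"doc_chunk_{i}_{table_name}",
        "catalog_name": catalog_name,
        "schema_name": schema_name,
        "table_name": table_name,
        "$vector": list(embedding),  # the precomputed embedding itself
        "chunk": chunk,              # raw text the embedding was computed from
    }
```

Note that in this layout the raw text sits under "chunk" and the metadata fields sit at the top level of the document.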

Later on during the retrieval, I want to use a vector store to make a retriever like this:

vstore = AstraDBVectorStore(
    embedding=OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.getenv("OpenAI_API_KEY")),
    collection_name="test_collection_chunks_for_demo",
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
    api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
)

but got the following error message:

ValueError: Astra DB collection 'test_collection_chunks_for_demo' is detected as having the following indexing policy: {"allow": ["$vector"]}. This is incompatible with the requested indexing policy for this object. Consider indexing anew on a fresh collection with the requested indexing policy, or alternatively align the requested indexing settings to the collection to keep using it.

I also tried creating the collection without restricting indexing to $vector. In that case I can create a retriever (I'm following this example: https://docs.datastax.com/en/ragstack/default-architecture/retrieval.html), but then I get the following error:

417 logger.warning(invalid_doc_warning)
418 return None
419 return Document(
--> 420 page_content=astra_document[self.content_field],
421 metadata=astra_document[DEFAULT_METADATA_FIELD_NAME],
422 id=astra_document["_id"],
423 )

KeyError: 'content'
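
My reading of the traceback is that the vector store looks up the text under a "content" key and the metadata under a separate metadata key, while my documents keep the text under "chunk" and the metadata at the top level. The sketch below reproduces that mismatch with plain dictionaries and shows one possible reshaping. It is not library code: `decode` and `reshape` are hypothetical helpers, and the name "metadata" for the metadata key is an assumption, since the traceback only shows DEFAULT_METADATA_FIELD_NAME and not its value.

```python
from typing import Any

CONTENT_FIELD = "content"    # the key named in the KeyError above
METADATA_FIELD = "metadata"  # assumption: value of DEFAULT_METADATA_FIELD_NAME


def decode(astra_document: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the lookups shown in the traceback (not library code)."""
    return {
        "page_content": astra_document[CONTENT_FIELD],
        "metadata": astra_document[METADATA_FIELD],
        "id": astra_document["_id"],
    }


def reshape(stored: dict[str, Any]) -> dict[str, Any]:
    """Move the raw text under "content" and all other fields under "metadata"."""
    reserved = {"_id", "$vector", "chunk"}
    return {
        "_id": stored["_id"],
        "$vector": stored["$vector"],
        CONTENT_FIELD: stored["chunk"],
        METADATA_FIELD: {k: v for k, v in stored.items() if k not in reserved},
    }


stored = {
    "_id": "example-id",          # placeholder value
    "chunk_id": "doc_chunk_0_tbl",
    "catalog_name": "cat",
    "schema_name": "sch",
    "table_name": "tbl",
    "$vector": [0.1, 0.2, 0.3],   # shortened for illustration
    "chunk": "raw text",
}

# The layout I inserted has no "content" key, so the lookup fails.
try:
    decode(stored)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# After reshaping, the same lookups succeed.
decoded = decode(reshape(stored))
```

If the store can instead be told which field holds the text, reshaping the stored documents would not be necessary; I have not confirmed which option does that.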

Is this the wrong way to use imported embeddings?
Is there an option I need to pass to AstraDBVectorStore to use this specific indexing policy?

Thanks for any help.

Regards,
Jonathan


python 3.12.10

Package Version

astrapy 2.0.1
langchain 0.3.27
langchain-astradb 0.6.0
langchain-community 0.3.27
langchain-core 0.3.72
langchain-mcp-adapters 0.1.9
langchain-openai 0.3.28
langchain-tavily 0.2.0
langchain-text-splitters 0.3.9
langgraph 0.6.0
langgraph-api 0.2.108
langgraph-checkpoint 2.1.1
langgraph-cli 0.3.6
langgraph-prebuilt 0.6.0
langgraph-runtime-inmem 0.6.3
langgraph-sdk 0.2.0
langsmith 0.4.8
openai 1.97.1
