
[Bug]: Vectorizer runs when non-loading column changes #878

@zhiweit

Description


What happened?

The vectorizer calls the embedding model and re-embeds the loading_column even though the loading_column did not change. I am running the vectorizer worker via Python.

Expected behaviour:
The re-embedding (i.e. the API call to the embedding model) should only happen when the loading_column was updated, not when other columns changed. Re-embedding the loading_column when it has not changed makes redundant calls and wastes API usage.
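For illustration only, this is the kind of change guard I would expect, assuming rows are enqueued for embedding by an UPDATE trigger. This is a sketch of the expected behaviour, not pgai's actual implementation; the trigger and function names are hypothetical.

-- Hypothetical sketch: only enqueue a row for re-embedding when the
-- loading column actually changed (names are illustrative, not pgai's).
CREATE OR REPLACE FUNCTION enqueue_for_embedding() RETURNS trigger AS $$
BEGIN
    -- insert NEW.id into the vectorizer work queue here
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_enqueue_on_contents_change
AFTER UPDATE ON blog
FOR EACH ROW
WHEN (OLD.contents IS DISTINCT FROM NEW.contents)  -- skip updates that do not touch the loading column
EXECUTE FUNCTION enqueue_for_embedding();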

pgai extension affected

0.11.1

pgai library affected

0.12.0

PostgreSQL version used

17.0

What operating system did you use?

Ubuntu 24.04 x64

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

How can we reproduce the bug?

To replicate:
- Follow the quick start, e.g.:

CREATE TABLE blog(
    id        SERIAL PRIMARY KEY,
    title     TEXT,
    authors   TEXT,
    contents  TEXT,
    metadata  JSONB 
);



SELECT ai.create_vectorizer( 
   'blog'::regclass,
   name => 'blog_embeddings',  -- Optional custom name for easier reference
   loading => ai.loading_column('contents'),
   embedding => ai.embedding_ollama('nomic-embed-text', 768),
   destination => ai.destination_table('blog_contents_embeddings')
);


- Update any column other than the `contents` column (for example, as shown below). The vectorizer should not re-embed the `contents` column, but the worker runs and `contents` is re-embedded even though it did not change.
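For example (the row id and new title here are arbitrary; any update that leaves `contents` untouched reproduces the behaviour):

-- Example update that does not touch the loading column
UPDATE blog
SET title = 'updated title'
WHERE id = 1;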

Logs from the vectorizer worker:

2025-10-07 15:52:22 [debug    ] obtained secret 'OPENAI_API_KEY' from environment
2025-10-07 15:52:22 [info     ] running vectorizer             vectorizer_id=1
2025-10-07 15:52:22 [debug    ] Items pulled from queue: 1    
2025-10-07 15:52:22 [debug    ] Chunks produced: 1            
2025-10-07 15:52:22 [debug    ] Batch 1 has 17.25 tokens in 1 chunks
2025-10-07 15:52:22 [debug    ] Batch 1 of 1                  
2025-10-07 15:52:22 [debug    ] Chunks for this batch: 1      
2025-10-07 15:52:22 [debug    ] Request 1 of 1 initiated      
2025-10-07 15:52:22 [debug    ] Request 1 of 1 ended after: 0.9623326590008219 seconds. Tokens usage: Usage(prompt_tokens=13, total_tokens=13)
2025-10-07 15:52:23 [debug    ] Embedding stats                chunks_per_second=0.8606774421141247 total_chunks=5 total_request_time=5.80937730599544 wall_time=4696.198784884
2025-10-07 15:52:23 [debug    ] Processing stats               chunks_per_second=0.0010542820954175822 chunks_per_second_per_thread=0.7508895275378047 task=138297813986048 total_chunks=5 total_processing_time=6.658769121997466 wall_time=4742.563704469998
2025-10-07 15:52:23 [debug    ] Items pulled from queue: 0    
2025-10-07 15:52:23 [info     ] finished processing vectorizer items=1 vectorizer_id=1

Are you going to work on the bugfix?

None
