What happened?
The vectorizer calls the embedding model to re-embed the loading column even though that column did not change. I am running the vectorizer worker via Python.
Expected behaviour:
The re-embedding (i.e. the API call to the embedding model) should only happen when the loading column was actually updated, not when other columns changed. Re-embedding the loading column when it has not changed makes redundant API calls and wastes API quota.
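In other words, the enqueue path would ideally skip updates where the loading column is unchanged. Below is a minimal sketch of that kind of change detection using plain PostgreSQL triggers; the queue table, function, and trigger names are invented for illustration, and pgai's actual internal triggers are different:

-- Hypothetical sketch only; not pgai's real enqueue mechanism.
CREATE TABLE IF NOT EXISTS blog_embedding_queue(id INT);

CREATE OR REPLACE FUNCTION blog_enqueue_embedding() RETURNS trigger AS $$
BEGIN
    INSERT INTO blog_embedding_queue(id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire on every INSERT...
CREATE TRIGGER blog_enqueue_on_insert
    AFTER INSERT ON blog
    FOR EACH ROW EXECUTE FUNCTION blog_enqueue_embedding();

-- ...but on UPDATE only when the loading column actually changed
-- (IS DISTINCT FROM also handles NULL transitions correctly).
CREATE TRIGGER blog_enqueue_on_update
    AFTER UPDATE ON blog
    FOR EACH ROW
    WHEN (OLD.contents IS DISTINCT FROM NEW.contents)
    EXECUTE FUNCTION blog_enqueue_embedding();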
pgai extension affected
0.11.1
pgai library affected
0.12.0
PostgreSQL version used
17.0
What operating system did you use?
Ubuntu 24.04 x64
What installation method did you use?
Docker
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?
To replicate:
- Follow the quick start, e.g.:
CREATE TABLE blog(
    id SERIAL PRIMARY KEY,
    title TEXT,
    authors TEXT,
    contents TEXT,
    metadata JSONB
);

SELECT ai.create_vectorizer(
    'blog'::regclass,
    name => 'blog_embeddings', -- Optional custom name for easier reference
    loading => ai.loading_column('contents'),
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    destination => ai.destination_table('blog_contents_embeddings')
);
- Update any column other than `contents` (for example `title`, as in the statement below). The vectorizer should not re-embed `contents`, yet the worker runs and re-embeds it even though `contents` did not change.
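A minimal statement that triggers the redundant re-embedding (the row id and title value are illustrative):

-- Update a non-loading column only; `contents` is untouched,
-- yet the worker still picks up the row and re-embeds it.
UPDATE blog SET title = 'updated title' WHERE id = 1;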
Logs from the vectorizer worker after such an update:
2025-10-07 15:52:22 [debug ] obtained secret 'OPENAI_API_KEY' from environment
2025-10-07 15:52:22 [info ] running vectorizer vectorizer_id=1
2025-10-07 15:52:22 [debug ] Items pulled from queue: 1
2025-10-07 15:52:22 [debug ] Chunks produced: 1
2025-10-07 15:52:22 [debug ] Batch 1 has 17.25 tokens in 1 chunks
2025-10-07 15:52:22 [debug ] Batch 1 of 1
2025-10-07 15:52:22 [debug ] Chunks for this batch: 1
2025-10-07 15:52:22 [debug ] Request 1 of 1 initiated
2025-10-07 15:52:22 [debug ] Request 1 of 1 ended after: 0.9623326590008219 seconds. Tokens usage: Usage(prompt_tokens=13, total_tokens=13)
2025-10-07 15:52:23 [debug ] Embedding stats chunks_per_second=0.8606774421141247 total_chunks=5 total_request_time=5.80937730599544 wall_time=4696.198784884
2025-10-07 15:52:23 [debug ] Processing stats chunks_per_second=0.0010542820954175822 chunks_per_second_per_thread=0.7508895275378047 task=138297813986048 total_chunks=5 total_processing_time=6.658769121997466 wall_time=4742.563704469998
2025-10-07 15:52:23 [debug ] Items pulled from queue: 0
2025-10-07 15:52:23 [info ] finished processing vectorizer items=1 vectorizer_id=1
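As a cross-check, the pending-queue count can be inspected right after the update and before the worker polls; this assumes pgai's queue-inspection helper ai.vectorizer_queue_pending is available, with vectorizer id 1 as in the logs above:

-- Assumes the ai.vectorizer_queue_pending helper from the pgai extension.
-- After updating only `title`, this reports the row as pending even
-- though `contents` is unchanged.
SELECT ai.vectorizer_queue_pending(1);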
Are you going to work on the bugfix?
None