Description
Environment
- qmd: @tobilu/qmd@1.0.0
- sqlite-vec: 0.1.7-alpha.2 (darwin-x64)
- Bun: 1.3.9
Problem
qmd embed fails with:
SQLiteError: UNIQUE constraint failed on vectors_vec primary key
when indexing collections that contain the same filename in different directory paths (e.g. template files copied to multiple project folders). Example paths (generic):
- collection/project-a/module-x/readme.md
- collection/project-a/module-y/readme.md
- collection/project-b/module-x/readme.md
- collection/project-a/admin/architecture.md
- collection/project-c/admin/architecture.md
Root cause
- `vectors_vec` primary key is `hash_seq = hash_${seq}` where `hash` is the content hash (from the `content` table).
- Multiple documents with identical content (same file copied to different paths) share the same `hash`.
- Embedding is driven by `getHashesForEmbedding()`, which returns one row per unique content hash (`GROUP BY d.hash`). So we only embed each content once and insert `(hash_seq, embedding)` once per `(hash, seq)`.
- The UNIQUE error still occurs when:
  - the vec0 virtual table does not honour `INSERT OR REPLACE` (so a second insert of the same `hash_seq` fails), or
  - `vectors_vec` and `content_vectors` get out of sync (e.g. a previous run wrote to `vectors_vec` but failed before writing `content_vectors`; the next run tries to embed the same hash again and hits UNIQUE).
In either case, the underlying design issue is: the vector key is content-addressable (hash_seq) instead of document-addressable. When the same content appears at different paths, the system treats them as one logical unit for vectors, which leads to duplicate key collisions when sync or REPLACE behaviour is not perfect.
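The out-of-sync case can be reproduced in a few lines. The following is only a minimal in-memory sketch, not qmd's actual code: the two `Set`s stand in for `vectors_vec` and `content_vectors`, the key is assumed to be built as `${hash}_${seq}`, and `insertEmbedding`, `needsEmbedding`, and `runEmbed` are hypothetical helpers written for this illustration.

```typescript
// Hypothetical stand-in for vectors_vec: rejects duplicate primary keys.
const vectorsVec = new Set<string>();
// Hypothetical stand-in for content_vectors: records which (hash, seq) rows exist.
const contentVectors = new Set<string>();

// Assumed key format: content hash plus chunk sequence number.
const hashSeq = (hash: string, seq: number): string => `${hash}_${seq}`;

function insertEmbedding(hash: string, seq: number, crashBeforeContentVectors: boolean): void {
  const key = hashSeq(hash, seq);
  if (vectorsVec.has(key)) {
    throw new Error(`UNIQUE constraint failed on vectors_vec primary key: ${key}`);
  }
  vectorsVec.add(key);
  if (crashBeforeContentVectors) {
    // The vector row is written, but the bookkeeping row never is.
    throw new Error("simulated crash before content_vectors write");
  }
  contentVectors.add(key);
}

// A hash is considered un-embedded while content_vectors has no seq 0 row for it.
function needsEmbedding(hash: string): boolean {
  return !contentVectors.has(hashSeq(hash, 0));
}

function runEmbed(hash: string, crash: boolean): string {
  if (!needsEmbedding(hash)) return "skipped";
  try {
    insertEmbedding(hash, 0, crash);
    return "ok";
  } catch (err) {
    return (err as Error).message;
  }
}

const firstRun = runEmbed("abc123", true);   // interrupted run
const secondRun = runEmbed("abc123", false); // retry hits the leftover key
console.log(firstRun);
console.log(secondRun);
```

The second run fails because `content_vectors` still reports the hash as missing while `vectors_vec` already holds its key.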
Suggested fix
Make the vector primary key document-scoped so that same content in different paths gets distinct keys:
- Schema: Change `content_vectors` from `(hash, seq)` to `(doc_id, seq)` where `doc_id` is `documents.id`, so each document (path) has its own set of chunk rows.
- vectors_vec: Use `hash_seq = doc_id_seq` (e.g. `"123_0"`, `"456_0"`) so that different documents never share the same `hash_seq`.
- getHashesForEmbedding: Return one row per document that needs embedding (no `GROUP BY d.hash`), e.g. `LEFT JOIN content_vectors v ON d.id = v.doc_id AND v.seq = 0 WHERE v.doc_id IS NULL`.
- insertEmbedding: Accept `(doc_id, seq, ...)`, write `hash_seq = doc_id_seq`, and insert into `content_vectors (doc_id, seq, ...)`.
- searchVec / lookup: Join `content_vectors` with `documents` on `doc_id` and match `(cv.doc_id || '_' || cv.seq) IN (hashSeqs)`.
- Migration: If `content_vectors` exists without a `doc_id` column, drop `content_vectors` and `vectors_vec` and let the user re-run `qmd embed` (or implement a one-time migration that duplicates vector rows per document for existing hashes).
This way, "same filename, different full path" always yields distinct doc_id and thus distinct hash_seq, and UNIQUE errors from shared content hash go away. Search results also correctly attribute snippets to the exact path.
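As a rough illustration of the document-scoped key, the sketch below builds keys from `doc_id` instead of the content hash. It is an in-memory approximation under stated assumptions, not a patch against qmd: the `Doc` shape, the ids `123` and `456`, the hash value, and the helpers `docKey`, `docsNeedingEmbedding`, and `insertEmbedding` are all hypothetical, and a `Set` stands in for the `content_vectors` / `vectors_vec` pair.

```typescript
interface Doc {
  id: number;   // assumed to correspond to documents.id
  path: string;
  hash: string; // content hash; identical for copied files
}

// Document-scoped key in the "123_0" format proposed above.
const docKey = (docId: number, seq: number): string => `${docId}_${seq}`;

// Two copies of the same file at different paths share one content hash.
const docs: Doc[] = [
  { id: 123, path: "collection/project-a/module-x/readme.md", hash: "abc123" },
  { id: 456, path: "collection/project-b/module-x/readme.md", hash: "abc123" },
];

// Hypothetical stand-in for the stored vector keys.
const embeddedKeys = new Set<string>();

// Mirrors the LEFT JOIN ... WHERE v.doc_id IS NULL idea:
// one entry per document that has no seq 0 row yet.
function docsNeedingEmbedding(all: Doc[], embedded: Set<string>): Doc[] {
  return all.filter((d) => !embedded.has(docKey(d.id, 0)));
}

function insertEmbedding(docId: number, seq: number, embedded: Set<string>): string {
  const key = docKey(docId, seq);
  if (embedded.has(key)) {
    throw new Error(`UNIQUE constraint failed: ${key}`);
  }
  embedded.add(key);
  return key;
}

// Both documents get embedded under distinct keys despite the shared hash.
const insertedKeys = docsNeedingEmbedding(docs, embeddedKeys).map((d) =>
  insertEmbedding(d.id, 0, embeddedKeys),
);
const remaining = docsNeedingEmbedding(docs, embeddedKeys);
console.log(insertedKeys, remaining.length);
```

The trade-off is that identical content is embedded once per document rather than once per hash, so embedding time and storage grow with the number of copies.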