
UNIQUE constraint on vectors_vec when same content exists at different paths (same filename, different directories) #188

@tongsh6

Description

Environment

  • qmd: @tobilu/qmd@1.0.0
  • sqlite-vec: 0.1.7-alpha.2 (darwin-x64)
  • Bun: 1.3.9

Problem

qmd embed fails with:

SQLiteError: UNIQUE constraint failed on vectors_vec primary key

when indexing a collection that contains the same filename under different directory paths (e.g. template files copied into multiple project folders). Example paths (generic):

  • collection/project-a/module-x/readme.md
  • collection/project-a/module-y/readme.md
  • collection/project-b/module-x/readme.md
  • collection/project-a/admin/architecture.md
  • collection/project-c/admin/architecture.md
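
For reference, here is a quick sketch of why such copies collapse to a single content hash. This is illustrative only: it assumes content is keyed by a digest of the file body (as the content table suggests), and sha256 is just a stand-in for whatever qmd actually uses.

```ts
// Illustrative only: identical template files at different paths produce the
// same content digest, so any content-addressed key cannot tell them apart.
import { createHash } from "node:crypto";

const contentHash = (text: string) =>
  createHash("sha256").update(text).digest("hex").slice(0, 12);

const readme = "# Module readme\n\nBoilerplate copied into every module.\n";

const docs = [
  { path: "collection/project-a/module-x/readme.md", body: readme },
  { path: "collection/project-a/module-y/readme.md", body: readme },
];

for (const d of docs) {
  console.log(d.path, "->", contentHash(d.body));
}
// Both lines print the same digest, i.e. the same `hash` row in the content table.
```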

Root cause

  • The vectors_vec primary key is hash_seq = ${hash}_${seq}, where hash is the content hash (from the content table).
  • Multiple documents with identical content (same file copied to different paths) share the same hash.
  • Embedding is driven by getHashesForEmbedding(), which returns one row per unique content hash (GROUP BY d.hash), so each piece of content is embedded only once and (hash_seq, embedding) is inserted once per (hash, seq).
  • The UNIQUE error still occurs when:
    1. the vec0 virtual table does not honour INSERT OR REPLACE (so a second insert of the same hash_seq fails), or
    2. vectors_vec and content_vectors get out of sync (e.g. a previous run wrote to vectors_vec but failed before writing content_vectors; next run tries to embed the same hash again and hits UNIQUE).

In either case, the underlying design issue is that the vector key is content-addressable (hash_seq) rather than document-addressable. When the same content appears at different paths, the system treats those documents as one logical unit for vectors, which leads to key collisions whenever sync or REPLACE behaviour is not perfect. A minimal sketch of the current write path follows.
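
Roughly, the current write path looks like the following. Only the table/column names (vectors_vec, content_vectors, hash_seq) and the INSERT OR REPLACE detail come from the behaviour described above; the function shape, SQL, and bun:sqlite wiring are assumptions for illustration.

```ts
import { Database } from "bun:sqlite";

const db = new Database("index.db");

// Content-addressed key: two documents with identical bytes derive the same
// key for every chunk sequence number.
const vectorKey = (hash: string, seq: number) => `${hash}_${seq}`;

// `embedding` is assumed to be the raw float32 bytes of the chunk embedding.
function insertEmbedding(hash: string, seq: number, embedding: Uint8Array) {
  // Failure mode 1: if the vec0 virtual table does not honour INSERT OR
  // REPLACE, a second document with the same content hash fails here with
  // "UNIQUE constraint failed on vectors_vec primary key".
  db.query("INSERT OR REPLACE INTO vectors_vec (hash_seq, embedding) VALUES (?, ?)")
    .run(vectorKey(hash, seq), embedding);

  // Failure mode 2: if the process dies between these two writes,
  // vectors_vec and content_vectors drift out of sync, and the next run
  // re-embeds the same hash and re-inserts the same hash_seq.
  db.query("INSERT INTO content_vectors (hash, seq) VALUES (?, ?)")
    .run(hash, seq);
}
```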

Suggested fix

Make the vector primary key document-scoped so that the same content at different paths gets distinct keys:

  1. Schema: Change content_vectors from (hash, seq) to (doc_id, seq) where doc_id is documents.id. So each document (path) has its own set of chunk rows.
  2. vectors_vec: Use hash_seq = doc_id_seq (e.g. "123_0", "456_0") so that different documents never share the same hash_seq.
  3. getHashesForEmbedding: Return one row per document that needs embedding (no GROUP BY d.hash), e.g. LEFT JOIN content_vectors v ON d.id = v.doc_id AND v.seq = 0 WHERE v.doc_id IS NULL.
  4. insertEmbedding: Accept (doc_id, seq, ...), write hash_seq = doc_id_seq, and insert into content_vectors (doc_id, seq, ...); a combined sketch of steps 1–4 follows this list.
  5. searchVec / lookup: Join content_vectors with documents on doc_id and match (cv.doc_id || '_' || cv.seq) IN (hashSeqs); see the lookup sketch after the closing paragraph.
  6. Migration: If content_vectors exists without a doc_id column, drop content_vectors and vectors_vec and let the user re-run qmd embed (or implement a one-time migration that duplicates vector rows per document for existing hashes).
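
Taken together, steps 1–4 could look roughly like this. It is a sketch, not a drop-in patch: column names such as documents.id / documents.hash and the bun:sqlite wiring are assumptions; only the (doc_id, seq) keying and the LEFT JOIN shape come from the list above.

```ts
import { Database } from "bun:sqlite";

const db = new Database("index.db");

// Step 1: content_vectors keyed by (doc_id, seq) instead of (hash, seq).
db.run(`
  CREATE TABLE IF NOT EXISTS content_vectors (
    doc_id INTEGER NOT NULL REFERENCES documents(id),
    seq    INTEGER NOT NULL,
    PRIMARY KEY (doc_id, seq)
  )
`);

// Step 3: one row per document that still needs embedding; no GROUP BY d.hash.
function getDocsForEmbedding() {
  return db.query(`
    SELECT d.id AS doc_id, d.hash
    FROM documents d
    LEFT JOIN content_vectors v ON d.id = v.doc_id AND v.seq = 0
    WHERE v.doc_id IS NULL
  `).all() as { doc_id: number; hash: string }[];
}

// Steps 2 and 4: the vector key is derived from the document id, so identical
// content at different paths can never collide in vectors_vec.
function insertEmbedding(docId: number, seq: number, embedding: Uint8Array) {
  const key = `${docId}_${seq}`; // e.g. "123_0", "456_0"
  db.query("INSERT INTO vectors_vec (hash_seq, embedding) VALUES (?, ?)")
    .run(key, embedding);
  db.query("INSERT INTO content_vectors (doc_id, seq) VALUES (?, ?)")
    .run(docId, seq);
}
```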

This way, "same filename, different full path" always yields a distinct doc_id and thus a distinct hash_seq, so UNIQUE errors caused by shared content hashes go away, and search results correctly attribute snippets to the exact path.
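
For step 5, the lookup side might be sketched like this (again assumption-level: documents.path is a guessed column name, and the IN-list plumbing is illustrative; the join condition mirrors the suggestion above).

```ts
import { Database } from "bun:sqlite";

const db = new Database("index.db");

// Resolve doc_id-scoped keys returned by the vector search (e.g. "123_0")
// back to concrete document paths and chunk numbers.
function resolvePaths(hashSeqs: string[]) {
  if (hashSeqs.length === 0) return [];
  const placeholders = hashSeqs.map(() => "?").join(", ");
  return db.query(`
    SELECT d.path, cv.doc_id, cv.seq
    FROM content_vectors cv
    JOIN documents d ON d.id = cv.doc_id
    WHERE (cv.doc_id || '_' || cv.seq) IN (${placeholders})
  `).all(...hashSeqs) as { path: string; doc_id: number; seq: number }[];
}
```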
