UNIQUE constraint on vectors_vec when same content exists at different paths (same filename, different directories)

## Environment

- **qmd**: @tobilu/qmd@1.0.0
- **sqlite-vec**: 0.1.7-alpha.2 (darwin-x64)
- **Bun**: 1.3.9

## Problem

`qmd embed` fails with:

```
SQLiteError: UNIQUE constraint failed on vectors_vec primary key
```

when indexing collections that contain **the same filename in different directory paths** (e.g. template files copied to multiple project folders). Example paths (generic):

- `collection/project-a/module-x/readme.md`
- `collection/project-a/module-y/readme.md`
- `collection/project-b/module-x/readme.md`
- `collection/project-a/admin/architecture.md`
- `collection/project-c/admin/architecture.md`

## Root cause

- `vectors_vec` primary key is `hash_seq = hash_${seq}` where `hash` is the **content hash** (from `content` table).
- Multiple documents with **identical content** (same file copied to different paths) share the same `hash`.
- Embedding is driven by `getHashesForEmbedding()` which returns **one row per unique content hash** (`GROUP BY d.hash`). So we only embed each content once and insert `(hash_seq, embedding)` once per `(hash, seq)`.
- The UNIQUE error still occurs when:
  1. **vec0 virtual table** does not honour `INSERT OR REPLACE` (so a second insert of the same `hash_seq` fails), or
  2. **vectors_vec and content_vectors get out of sync** (e.g. a previous run wrote to `vectors_vec` but failed before writing `content_vectors`; next run tries to embed the same hash again and hits UNIQUE).

In either case, the underlying design issue is: **the vector key is content-addressable (`hash_seq`) instead of document-addressable**. When the same content appears at different paths, the system treats them as one logical unit for vectors, which leads to duplicate key collisions when sync or REPLACE behaviour is not perfect.

## Suggested fix

Make the vector primary key **document-scoped** so that **same content in different paths** gets **distinct keys**:

1. **Schema**: Change `content_vectors` from `(hash, seq)` to `(doc_id, seq)` where `doc_id` is `documents.id`. So each document (path) has its own set of chunk rows.
2. **vectors_vec**: Use `hash_seq = doc_id_seq` (e.g. `"123_0"`, `"456_0"`) so that different documents never share the same `hash_seq`.
3. **getHashesForEmbedding**: Return one row **per document** that needs embedding (no `GROUP BY d.hash`), e.g. `LEFT JOIN content_vectors v ON d.id = v.doc_id AND v.seq = 0 WHERE v.doc_id IS NULL`.
4. **insertEmbedding**: Accept `(doc_id, seq, ...)` and write `hash_seq = doc_id_seq`, and insert into `content_vectors (doc_id, seq, ...)`.
5. **searchVec / lookup**: Join `content_vectors` with `documents` on `doc_id` and match `(cv.doc_id || '_' || cv.seq) IN (hashSeqs)`.
6. **Migration**: If `content_vectors` exists without a `doc_id` column, drop `content_vectors` and `vectors_vec` and let the user re-run `qmd embed` (or implement a one-time migration that duplicates vector rows per document for existing hashes).

This way, "same filename, different full path" always yields distinct `doc_id` and thus distinct `hash_seq`, and UNIQUE errors from shared content hash go away. Search results also correctly attribute snippets to the exact path.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UNIQUE constraint on vectors_vec when same content exists at different paths (same filename, different directories) #188

Environment

Problem

Root cause

Suggested fix

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

UNIQUE constraint on vectors_vec when same content exists at different paths (same filename, different directories) #188

Description

Environment

Problem

Root cause

Suggested fix

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions