feat(embed): skip oversized files with configurable size limit #153

Closed
brettdavies wants to merge 1 commit into tobi:main from brettdavies:feat/embed-file-size-limit

brettdavies commented Feb 11, 2026

Summary

Add a configurable content size limit to qmd embed that skips oversized files before tokenization. This prevents OOM on constrained systems and improves embedding quality by excluding very large documents, which produce so many chunks that their semantic signal is diluted.

Changes

  • Add getEmbedBreakdown() SQL query in store.ts to split pending docs into embeddable vs too-large
  • Add DEFAULT_MAX_EMBED_FILE_BYTES constant (5MB) in store.ts
  • Add getMaxEmbedFileBytes() helper with QMD_MAX_EMBED_FILE_BYTES env var override
  • Add --no-size-limit CLI flag to bypass size checks for one-off full embeds
  • Show "Skipped: N exceed X size limit" in qmd status output
  • Replace new TextEncoder().encode(str).length with Buffer.byteLength(str, 'utf8') (avoids unnecessary memory allocation in the chunking loop)
  • Add 20 automated tests (10 env var parsing, 5 SQL query unit, 5 CLI integration)
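
For reviewers unfamiliar with the byte-length change, here is a minimal sketch of why the two expressions are interchangeable for measuring UTF-8 size. The sample string is made up for illustration, and the snippet assumes a Node.js-compatible runtime where Buffer is a global.

```typescript
// Both expressions report the UTF-8 byte length of a string.
const sample = "héllo, wörld ✓";

// TextEncoder materializes a full Uint8Array only to read its length.
const viaEncoder = new TextEncoder().encode(sample).length;

// Buffer.byteLength computes the same number without building the byte array.
const viaBuffer = Buffer.byteLength(sample, "utf8");

// The byte length is larger than the UTF-16 length because of multi-byte characters.
console.log({ viaEncoder, viaBuffer, utf16Length: sample.length });
```

Inside a chunking loop that measures many substrings, skipping the temporary array is where the saving comes from.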

Type of Change

  • feat: New feature (non-breaking change which adds functionality)
  • test: Adding or updating tests

Related Issues/Stories

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing completed
  • All tests passing

Test Summary:

  • Unit tests: 15 passing (10 getMaxEmbedFileBytes env var parsing + 5 getEmbedBreakdown SQL query)
  • Integration tests: 5 passing (status display with/without skipped files, embed skip behavior, --no-size-limit flag, help text)

Files Modified

Modified:

  • src/store.ts -- add getEmbedBreakdown() function and DEFAULT_MAX_EMBED_FILE_BYTES constant
  • src/qmd.ts -- add getMaxEmbedFileBytes(), --no-size-limit flag, skip logic in vectorIndex(), breakdown display in status
  • src/store.test.ts -- add describe("getEmbedBreakdown") with 5 unit tests
  • src/cli.test.ts -- add describe("CLI Embed File Size Limit") with 5 integration tests

Created:

  • src/embed-config.test.ts -- 10 unit tests for getMaxEmbedFileBytes() env var parsing

Deleted:

  • None

Key Features

  • Files exceeding the size limit are skipped before model loading and tokenization, so qmd embed stays fast even when all files are too large
  • qmd status shows a breakdown: pending count vs skipped count
  • --no-size-limit flag overrides the limit for a single run without changing config
  • QMD_MAX_EMBED_FILE_BYTES env var allows per-environment tuning
  • Invalid env var values produce a stderr warning and fall back to the 5MB default (prevents silent NaN bypass)
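
The env var fallback can be sketched roughly as follows. This is an illustration rather than the code in this PR: the `env` and `warn` parameters are hypothetical (added so the example is testable), and the exact set of values treated as invalid (non-integers, zero, negatives) is an assumption.

```typescript
// 5MB default named in this PR (5 * 1024 * 1024 bytes).
const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024;

// Hypothetical signature: env and warn are injected here for illustration only.
function getMaxEmbedFileBytes(
  env: Record<string, string | undefined>,
  warn: (msg: string) => void = console.error, // console.error writes to stderr
): number {
  const raw = env.QMD_MAX_EMBED_FILE_BYTES;
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_MAX_EMBED_FILE_BYTES;
  }

  const parsed = Number(raw);
  // Number("abc") is NaN, and `size > NaN` is always false, so an unchecked
  // NaN limit would silently let every file through.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    warn(
      `Invalid QMD_MAX_EMBED_FILE_BYTES="${raw}", falling back to ${DEFAULT_MAX_EMBED_FILE_BYTES}`,
    );
    return DEFAULT_MAX_EMBED_FILE_BYTES;
  }
  return parsed;
}

console.log(getMaxEmbedFileBytes({ QMD_MAX_EMBED_FILE_BYTES: "10485760" }));
```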

Configuration

# Default: 5MB limit
qmd embed

# Override via env var (e.g., 10MB)
QMD_MAX_EMBED_FILE_BYTES=10485760 qmd embed

# Bypass all limits
qmd embed --no-size-limit
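
To make the three invocations above concrete, here is a rough sketch of how a flag and a byte limit could resolve into one effective limit and split pending documents. The PR does the split in SQL via getEmbedBreakdown(); this in-memory version is a stand-in, and `PendingDoc`, `effectiveLimit`, and `splitBySize` are hypothetical names. Treating a file exactly at the limit as embeddable is also an assumption.

```typescript
interface PendingDoc {
  path: string;
  bytes: number;
}

// Hypothetical helper: --no-size-limit is modeled as an infinite limit.
function effectiveLimit(noSizeLimit: boolean, maxBytes: number): number {
  return noSizeLimit ? Number.POSITIVE_INFINITY : maxBytes;
}

// Hypothetical in-memory stand-in for the embeddable vs too-large split.
function splitBySize(docs: PendingDoc[], limit: number) {
  return {
    embeddable: docs.filter((d) => d.bytes <= limit),
    tooLarge: docs.filter((d) => d.bytes > limit),
  };
}

const docs: PendingDoc[] = [
  { path: "notes.md", bytes: 12_000 },
  { path: "export.json", bytes: 9_000_000 },
];

const limited = splitBySize(docs, effectiveLimit(false, 5 * 1024 * 1024));
const unlimited = splitBySize(docs, effectiveLimit(true, 5 * 1024 * 1024));

console.log(`pending: ${limited.embeddable.length}, skipped: ${limited.tooLarge.length}`);
console.log(`with --no-size-limit, skipped: ${unlimited.tooLarge.length}`);
```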

Motivation

Large files cause two problems during embedding:

  1. OOM on constrained systems -- tokenizing a 10MB+ file creates millions of tokens in memory
  2. Poor embedding quality -- a 5MB file produces ~625 chunks at 800 tokens each; semantic meaning gets diluted and one file dominates search results

Skipping with a clear warning is better than crashing or producing bad embeddings.
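
The chunk count in item 2 depends on how many bytes a token covers on average, which varies by tokenizer and content. The ~625 figure corresponds to roughly 10 bytes per token (625 chunks x 800 tokens = 500,000 tokens for about 5,000,000 bytes). The sketch below makes that arithmetic explicit; `estimateChunks` is a hypothetical helper, chunk overlap is ignored, and the 4 bytes-per-token row is only a rough assumption for English prose.

```typescript
// Rough estimate: bytes -> tokens -> fixed-size chunks (no overlap).
function estimateChunks(
  fileBytes: number,
  bytesPerToken: number,
  tokensPerChunk = 800,
): number {
  return Math.ceil(fileBytes / bytesPerToken / tokensPerChunk);
}

const fiveMB = 5 * 1024 * 1024;
for (const bytesPerToken of [4, 10]) {
  console.log(
    `${bytesPerToken} bytes/token: ~${estimateChunks(fiveMB, bytesPerToken)} chunks`,
  );
}
```

Under either assumption a single 5MB file yields hundreds to over a thousand chunks, which is the dilution problem described above.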

Future Possibilities

If per-extension limits are useful, they could be added via per-extension constants, YAML config per collection, or per-extension env vars. Happy to add any of these if you'd find them useful.

Breaking Changes

  • No breaking changes

Deployment Notes

  • No special deployment steps required

Checklist

  • Code follows project conventions and style guidelines
  • Commit messages follow Conventional Commits
  • Self-review of code completed
  • Tests added/updated and passing
  • No new warnings or errors introduced
  • Changes are backward compatible (or breaking changes documented)

Add content size limit during embedding (default 5MB). Files exceeding
the limit are skipped with a warning before tokenization.

- Add getEmbedBreakdown() to split pending docs into embeddable vs too-large
- Add DEFAULT_MAX_EMBED_FILE_BYTES constant (5MB) in store.ts
- Add getMaxEmbedFileBytes() helper with QMD_MAX_EMBED_FILE_BYTES env var
- Add --no-size-limit CLI flag to bypass size checks
- Show "Skipped: N exceed X size limit" in qmd status output
- Replace TextEncoder.encode().length with Buffer.byteLength() (avoids
  unnecessary memory allocation in the chunking loop)
- Add 20 automated tests (10 env var, 5 SQL query, 5 CLI integration)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

brettdavies commented Feb 12, 2026

Superseded by #157 (branch recreated after a rebase to resolve conflicts with #149).

Development

Successfully merging this pull request may close these issues.

qmd query fails with large files: reranker context size exceeded
