
feat(embed): skip oversized files with configurable size limit (#1) #157

Open
brettdavies wants to merge 1 commit into tobi:main from brettdavies:feat/embed-file-size-limit

Replaces #153 (branch was recreated during a rebase to resolve merge conflicts with #149).

Summary

Add a configurable file size limit to qmd embed that skips oversized files before tokenization. This prevents OOM on constrained systems and improves embedding quality by keeping out very large documents, which split into many chunks whose meaning is diluted.

Changes

  • Add getEmbedBreakdown() SQL query in store.ts to split pending docs into embeddable vs too-large
  • Add DEFAULT_MAX_EMBED_FILE_BYTES constant (5MB) in store.ts
  • Add getMaxEmbedFileBytes() helper with QMD_MAX_EMBED_FILE_BYTES env var override
  • Add --no-size-limit CLI flag to bypass size checks for one-off full embeds
  • Show "Skipped: N exceed X size limit" in qmd status output
  • Replace new TextEncoder().encode(str).length with Buffer.byteLength(str, 'utf8') (avoids unnecessary memory allocation in the chunking loop)
  • Add 20 automated tests (10 env var parsing, 5 SQL query unit, 5 CLI integration)
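The TextEncoder replacement in the list above can be illustrated with a minimal sketch. The two helper names are hypothetical and are not taken from src/store.ts or src/qmd.ts; the point is only that both approaches report the same UTF-8 length, while the encoder version builds a full byte array first.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: measures UTF-8 size by encoding the whole string.
// This allocates a Uint8Array as large as the content just to read .length.
function byteLengthViaEncoder(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Hypothetical helper: asks Node for the encoded length directly,
// without materializing the encoded bytes.
function byteLengthViaBuffer(text: string): number {
  return Buffer.byteLength(text, "utf8");
}

// Multi-byte characters count as more than one byte in both versions.
const sample = "héllo";
console.log(byteLengthViaEncoder(sample), byteLengthViaBuffer(sample));
```

In a chunking loop that measures many substrings, avoiding the intermediate array is where the saving would come from.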

Type of Change

  • feat: New feature (non-breaking change which adds functionality)
  • test: Adding or updating tests

Related Issues/Stories

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing completed
  • All tests passing

Test Summary:

  • Unit tests: 15 passing (10 getMaxEmbedFileBytes env var parsing + 5 getEmbedBreakdown SQL query)
  • Integration tests: 5 passing (status display with/without skipped files, embed skip behavior, --no-size-limit flag, help text)

Files Modified

Modified:

  • src/store.ts -- add getEmbedBreakdown() function and DEFAULT_MAX_EMBED_FILE_BYTES constant
  • src/qmd.ts -- add getMaxEmbedFileBytes(), --no-size-limit flag, skip logic in vectorIndex(), breakdown display in status
  • src/store.test.ts -- add describe("getEmbedBreakdown") with 5 unit tests
  • src/cli.test.ts -- add describe("CLI Embed File Size Limit") with 5 integration tests

Created:

  • src/embed-config.test.ts -- 10 unit tests for getMaxEmbedFileBytes() env var parsing

Key Features

  • Files exceeding the size limit are skipped before model loading and tokenization, so qmd embed stays fast even when all files are too large
  • qmd status shows a breakdown: pending count vs skipped count
  • --no-size-limit flag overrides the limit for a single run without changing config
  • QMD_MAX_EMBED_FILE_BYTES env var allows per-environment tuning
  • Invalid env var values produce a stderr warning and fall back to the 5MB default (prevents silent NaN bypass)
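The "silent NaN bypass" mentioned above follows from how JavaScript compares numbers: any comparison such as size > NaN is false, so an unparsable limit would let every file through. A rough sketch of the validation, assuming 5MB means 5 * 1024 * 1024 bytes and that zero or negative values count as invalid (neither detail is confirmed by this description). The function name and the warn parameter are hypothetical; the real getMaxEmbedFileBytes() in src/qmd.ts may differ.

```typescript
// Assumption: 5MB is 5 * 1024 * 1024 bytes, matching the 10485760 example for 10MB.
const DEFAULT_MAX_EMBED_FILE_BYTES = 5 * 1024 * 1024;

// Hypothetical stand-in for getMaxEmbedFileBytes(). The raw env value is passed
// in so the parsing can be exercised without touching process.env.
function parseMaxEmbedFileBytes(
  raw: string | undefined,
  warn: (message: string) => void = (message) => console.error(message),
): number {
  // Unset or empty: use the default without a warning.
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_MAX_EMBED_FILE_BYTES;
  }
  // Math.floor keeps the limit an integer byte count; NaN stays NaN.
  const floored = Math.floor(Number(raw));
  if (!Number.isFinite(floored) || floored < 1) {
    warn(`Invalid QMD_MAX_EMBED_FILE_BYTES "${raw}"; using default ${DEFAULT_MAX_EMBED_FILE_BYTES}`);
    return DEFAULT_MAX_EMBED_FILE_BYTES;
  }
  return floored;
}
```

A caller would pass process.env.QMD_MAX_EMBED_FILE_BYTES as raw.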

Configuration

# Default: 5MB limit
qmd embed

# Override via env var (e.g., 10MB)
QMD_MAX_EMBED_FILE_BYTES=10485760 qmd embed

# Bypass the size limit for this run
qmd embed --no-size-limit
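
How the configured limit and the flag combine can be sketched in memory. This is only an illustration: the PR does the split in SQL via getEmbedBreakdown(), and the PendingDoc shape, the function names, and the choice to treat "exceeding" as strictly greater than the limit in UTF-8 bytes are all assumptions.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical document shape; the real rows come from the SQLite store.
interface PendingDoc {
  path: string;
  content: string;
}

interface EmbedSplit {
  embeddable: PendingDoc[];
  tooLarge: PendingDoc[];
}

// --no-size-limit is modeled as an infinite limit, so nothing is skipped.
function effectiveLimit(maxBytes: number, noSizeLimit: boolean): number {
  return noSizeLimit ? Number.POSITIVE_INFINITY : maxBytes;
}

// Assumption: a document is skipped only when its UTF-8 size is strictly
// greater than the limit.
function splitBySize(docs: PendingDoc[], limitBytes: number): EmbedSplit {
  const split: EmbedSplit = { embeddable: [], tooLarge: [] };
  for (const doc of docs) {
    const size = Buffer.byteLength(doc.content, "utf8");
    (size > limitBytes ? split.tooLarge : split.embeddable).push(doc);
  }
  return split;
}
```

The tooLarge count is what a status line such as "Skipped: N exceed X size limit" would report, and only the embeddable list would reach tokenization.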

Motivation

Large files cause two problems during embedding:

  1. OOM on constrained systems -- tokenizing a 10MB+ file creates millions of tokens in memory
  2. Poor embedding quality -- a 5MB file produces ~625 chunks at 800 tokens each; semantic meaning gets diluted and one file dominates search results

Skipping with a clear warning is better than crashing or producing bad embeddings.
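
For the chunk estimate above, 625 chunks at 800 tokens each is 500,000 tokens, which for a 5,000,000-byte file corresponds to about 10 bytes per token. The PR does not state which ratio was used, so the sketch below takes it as a parameter. The function name and the bytesPerToken parameter are hypothetical.

```typescript
// Rough chunk-count estimate for a file of a given size.
// bytesPerToken is an assumed average, not a value taken from the PR.
function estimateChunkCount(
  fileBytes: number,
  bytesPerToken: number,
  tokensPerChunk: number = 800,
): number {
  if (bytesPerToken <= 0 || tokensPerChunk <= 0) {
    throw new RangeError("bytesPerToken and tokensPerChunk must be positive");
  }
  if (fileBytes <= 0) return 0;
  const tokens = fileBytes / bytesPerToken;
  // A partial final chunk still counts as a chunk.
  return Math.ceil(tokens / tokensPerChunk);
}

// 5,000,000 bytes at an assumed 10 bytes per token gives 625 chunks.
console.log(estimateChunkCount(5_000_000, 10));
```

A lower bytes-per-token ratio yields more chunks for the same file, so the dilution concern grows rather than shrinks.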

Breaking Changes

  • No breaking changes

Checklist

  • Code follows project conventions and style guidelines
  • Commit messages follow Conventional Commits
  • Self-review of code completed
  • Tests added/updated and passing
  • No new warnings or errors introduced
  • Changes are backward compatible (or breaking changes documented)

* feat(embed): skip oversized files with configurable size limit

Add a file size limit (default 5MB) to `qmd embed` that skips files
exceeding the threshold. Configurable via QMD_MAX_EMBED_FILE_BYTES env
var or bypassed with --no-size-limit flag.

- Skip files over size limit with yellow warning during embed
- Show "Skipped" count separately from "Pending" in `qmd status`
- Add getEmbedBreakdown() query to distinguish actionable vs too-large
- Refactor vectorIndex() to options object pattern
- Validate env var with Math.floor for integer byte values
- Add 10 unit tests for getMaxEmbedFileBytes config parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(embed): add tests for file size limit feature

Cover getEmbedBreakdown SQL query (5 unit tests) and CLI behavior
(5 integration tests) for status display, embed skip, --no-size-limit
flag, and help text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

qmd query fails with large files: reranker context size exceeded
