- Install Dependencies

  ```bash
  cd apps/codex
  npm install
  ```

- Run Initial Catalog (categorizes ALL existing content)

  ```bash
  npm run index -- --validate
  ```

This will:
- Scan all markdown files in `weaves/`, `docs/`, `wiki/`
- Extract keywords using TF-IDF
- Auto-categorize using NLP (no LLM, pure statistical)
- Generate `codex-index.json` (searchable index)
- Generate `codex-report.json` (analytics)
- Output validation errors/warnings/suggestions

First run takes ~30 seconds for 100 files.
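The TF-IDF keyword step above can be sketched as follows. This is an illustrative stand-in, not the actual `scripts/auto-index.js` implementation:

```javascript
// Minimal TF-IDF keyword extraction sketch (hypothetical helper).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || []
}

function extractKeywords(doc, corpus, topN = 5) {
  const terms = tokenize(doc)
  const tf = new Map()
  for (const t of terms) tf.set(t, (tf.get(t) || 0) + 1)

  // Smoothed inverse document frequency over the whole corpus
  const idf = term => {
    const df = corpus.filter(d => tokenize(d).includes(term)).length
    return Math.log(corpus.length / (1 + df)) + 1
  }

  return [...tf.entries()]
    .map(([term, count]) => ({ term, score: (count / terms.length) * idf(term) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map(k => k.term)
}
```

The real indexer also applies stop-word filtering and n-gram extraction; this sketch shows only the core term weighting.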
Automatic on Every PR Merge:
- GitHub Actions triggers `build-index.yml` on push to `main`
- Runs `npm run index -- --validate` (static NLP only)
- Generates updated index files
- Pushes to `index` branch for consumption
- No AI/LLM calls - pure TF-IDF, n-grams, vocabulary matching
Static NLP Tools Used:
- TF-IDF: Keyword extraction (no external API)
- N-gram extraction: Common phrases (local computation)
- Vocabulary matching: Controlled taxonomy (regex/string matching)
- Readability scoring: Flesch-Kincaid (formula-based)
- Sentiment heuristics: Simple keyword patterns
Cost: $0 - All processing is local/in-CI
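For a sense of what the readability scorer computes, the Flesch reading-ease formula can be sketched as below. The vowel-group syllable counter is a crude assumption; real implementations use better heuristics:

```javascript
// Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
// Syllable counting via vowel groups is a rough approximation (assumption).
function countSyllables(word) {
  const groups = word.toLowerCase().match(/[aeiouy]+/g)
  return groups ? groups.length : 1
}

function fleschReadingEase(text) {
  const sentences = (text.match(/[.!?]+/g) || ['.']).length
  const words = text.match(/[a-zA-Z]+/g) || []
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0)
  return 206.835 - 1.015 * (words.length / sentences) - 84.6 * (syllables / words.length)
}
```

Higher scores mean easier text; short words and short sentences score highest.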
Add to framersai/codex repository settings:
```bash
# Required for auto-merge workflow
GH_PAT=ghp_xxxxxxxxxxxxxxxxxxxx  # GitHub Personal Access Token (repo scope)

# AI Enhancement (OPTIONAL - only if you want AI-powered PR analysis)
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx

# Auto-merge control (default: false - requires manual approval)
AUTO_CATALOG_MERGE=false  # Set to 'true' to auto-merge re-catalog PRs

# Configuration (optional)
AI_PROVIDER=disabled  # Set to 'disabled' to skip AI entirely
```

To enable AI enhancement:

```bash
OPENAI_API_KEY=sk-...
```

To enable auto-merge for re-catalog PRs:

```bash
AUTO_CATALOG_MERGE=true
# Default: false (requires manual approval)
# Recommended: keep false to review metadata changes
```

To disable AI enhancement:

```bash
AI_PROVIDER=disabled
# Or just don't set OPENAI_API_KEY
```

Create `.env` in `apps/codex/`:

```bash
# Optional - only for testing AI enhancement locally
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
AI_PROVIDER=openai
```

Note: The indexer and validator work WITHOUT any API keys. AI is only for optional PR enhancement.
```bash
cd apps/codex
chmod +x scripts/retrigger-full-catalog.sh

# Dry run first (see what would change)
./scripts/retrigger-full-catalog.sh --dry-run

# Create PR with changes (requires manual approval)
./scripts/retrigger-full-catalog.sh

# Force auto-merge (one-time override)
./scripts/retrigger-full-catalog.sh --auto-merge
```

What it does:
- Runs full static NLP analysis on ALL files
- Creates branch: `catalog/full-reindex-{timestamp}`
- Commits updated index files
- Creates PR with detailed summary
- Waits for manual approval (unless `AUTO_CATALOG_MERGE=true`)
Requirements:
- `GH_PAT` environment variable
```bash
# Via GitHub CLI
gh workflow run build-index.yml --repo framersai/codex

# Or via web UI
# Go to: https://github.com/framersai/codex/actions/workflows/build-index.yml
# Click "Run workflow" → "Run workflow"
```

What it does:
- Runs indexer in CI
- Pushes directly to `index` branch (no PR)
- Updates live immediately
```bash
cd apps/codex
npm run index -- --validate

# Review changes
cat codex-report.json | jq '.summary'

# Commit manually
git add codex-index.json codex-report.json
git commit -m "chore: re-index all content"
git push
```

See full guide: RECATALOG_GUIDE.md
Frame.dev’s advanced search UI consumes a separate static artifact, codex-search.json, which contains:
- BM25 postings (term → docId, term frequency)
- Document metadata (path, weave, loom, summary, doc length)
- Packed Float32 embeddings (all-MiniLM-L6-v2, mean pooled, normalized)
Generate it after the main index:
```bash
cd apps/codex
npm run build:search

# Commit alongside codex-index.json to publish updated search data
git add codex-search.json
```

This command uses @xenova/transformers entirely in Node.js (no Python, no API keys) and produces a fully static JSON blob that can be hosted on GitHub Pages or any CDN.
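Scoring a query against the BM25 postings can be sketched as below. The index field names (`postings`, `docLengths`, `avgDocLength`, `totalDocs`) are assumptions about the artifact's shape, not its documented schema:

```javascript
// BM25 scoring sketch over a term → {docId: termFrequency} postings map.
// k1 and b are the standard BM25 free parameters.
function bm25Score(queryTerms, docId, index, k1 = 1.2, b = 0.75) {
  const { postings, docLengths, avgDocLength, totalDocs } = index
  let score = 0
  for (const term of queryTerms) {
    const posting = postings[term]
    if (!posting) continue
    const tf = posting[docId] || 0
    if (!tf) continue
    const df = Object.keys(posting).length // documents containing the term
    const idf = Math.log(1 + (totalDocs - df + 0.5) / (df + 0.5))
    const norm = (tf * (k1 + 1)) /
      (tf + k1 * (1 - b + b * (docLengths[docId] / avgDocLength)))
    score += idf * norm
  }
  return score
}
```

The embeddings are used separately for semantic reranking; BM25 handles the lexical pass.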
Frame Codex uses @framers/sql-storage-adapter for intelligent incremental indexing.
On First Run:
- Creates `.cache/codex.db` (better-sqlite3 in CI, IndexedDB in browser)
- Analyzes ALL files with static NLP (TF-IDF, n-grams)
- Stores: file path, SHA hash, mtime, analysis JSON, keywords
- Generates `codex-index.json` and `codex-report.json`
- Time: ~30 seconds for 100 files
On Subsequent Runs:
- Reads cache database
- Computes diff: `SELECT path, sha FROM files`
- Compares current filesystem SHA vs cached SHA
- Only re-processes changed files (added, modified)
- Merges cached + new analyses
- Time: ~2-5 seconds for 5 changed files (85-95% speedup)
```sql
-- File metadata and analysis cache
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  sha TEXT NOT NULL,            -- SHA-256 of content
  mtime INTEGER NOT NULL,       -- Last modified timestamp
  size INTEGER NOT NULL,        -- File size in bytes
  analysis TEXT NOT NULL,       -- JSON analysis result
  indexed_at INTEGER NOT NULL   -- When indexed
);

-- Keyword cache (for TF-IDF optimization)
CREATE TABLE keywords (
  file_path TEXT NOT NULL,
  keyword TEXT NOT NULL,
  tfidf_score REAL NOT NULL,
  frequency INTEGER NOT NULL,
  PRIMARY KEY (file_path, keyword)
);

-- Loom/Weave aggregate statistics
CREATE TABLE stats (
  scope TEXT PRIMARY KEY,        -- Loom or weave path
  scope_type TEXT NOT NULL,      -- 'loom' or 'weave'
  total_files INTEGER NOT NULL,
  total_keywords INTEGER NOT NULL,
  avg_difficulty TEXT,
  subjects TEXT,                 -- JSON array
  topics TEXT,                   -- JSON array
  last_updated INTEGER NOT NULL
);
```

```javascript
// Pseudo-code: diff the current filesystem state against the cache DB
const crypto = require('crypto')
const fs = require('fs')

function calculateSHA(content) {
  return crypto.createHash('sha256').update(content).digest('hex')
}

async function computeDiff(currentFiles) {
  const cached = await db.all('SELECT path, sha FROM files')
  const cachedMap = new Map(cached.map(f => [f.path, f.sha]))
  const added = []
  const modified = []
  const unchanged = []
  for (const file of currentFiles) {
    const currentSha = calculateSHA(fs.readFileSync(file))
    const cachedSha = cachedMap.get(file)
    if (!cachedSha) {
      added.push(file)
    } else if (cachedSha !== currentSha) {
      modified.push(file)
    } else {
      unchanged.push(file)
    }
  }
  const deleted = [...cachedMap.keys()].filter(p => !currentFiles.includes(p))
  return { added, modified, deleted, unchanged }
}
```

GitHub Actions:
```yaml
- uses: actions/cache@v4
  with:
    path: .cache/codex.db
    key: codex-cache-${{ hashFiles('weaves/**/*.md') }}
```

Browser:
- Automatic via IndexedDB (persistent across sessions)
- Quota managed by browser (typically 50MB-1GB)
```bash
# Disable SQL caching (use full indexing)
SQL_CACHE_DISABLED=true

# Clear cache before building
npm run index -- --clear-cache
```

When a file is uploaded/submitted, the system:
1. **Analyzes Content** (TF-IDF keywords, n-grams)
2. **Checks SQL Cache** for similar files in existing looms
3. **Matches Against Existing Looms:**
   - Compares keywords to cached loom vocabularies (from `stats` table)
   - Calculates similarity scores (cosine similarity)
   - Finds best-matching loom (threshold: 0.6)
4. **Decision Logic:**
   - IF similarity > 0.8 → place in existing loom (high confidence)
   - ELSE IF similarity > 0.6 → suggest existing loom + offer a new-loom option
   - ELSE → create a new loom (content is sufficiently unique)
5. **Folder Structure:**

   ```
   weaves/
     [detected-weave]/       # Based on primary subject
       [topic-folder]/       # Any folder = loom (auto-detected)
         subtopic/
           [filename].md     # Strand (markdown file)
   ```
```javascript
// Pseudo-code: pick the loom whose aggregate keywords best match the upload
function findBestLoom(uploadedContent) {
  const uploadKeywords = extractKeywords(uploadedContent)
  let bestLoom = null
  let bestScore = 0
  for (const loom of existingLooms) {
    const loomKeywords = aggregateKeywords(loom.strands)
    const similarity = cosineSimilarity(uploadKeywords, loomKeywords)
    if (similarity > bestScore) {
      bestScore = similarity
      bestLoom = loom
    }
  }
  return { loom: bestLoom, confidence: bestScore }
}
```

Automatic (Static NLP):
- Runs on every PR
- Validates placement
- Suggests better loom if similarity is low
- Posts comment: "Consider moving to `weaves/<slug>/better-match/`"
AI-Powered (Optional):
- Only if `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` is set
- Analyzes semantic meaning (not just keywords)
- Suggests structural improvements
- Only runs if PR author is a Weaver (for auto-apply)
Client-Side Flow:
- User clicks "Submit" on frame.dev/codex
- Provides GitHub Personal Access Token (stored in localStorage)
- Client-side JS calls GitHub API directly
- Creates PR with proper metadata
- GitHub handles all auth via PAT
Why This Works:
- No backend server needed
- No database for user accounts
- GitHub is the auth provider
- Rate limiting via GitHub API limits
When the user clicks "Submit via GitHub":

```
https://github.com/framersai/codex/compare/main...user:branch?
  quick_pull=1&
  title=Add: [Auto-Generated Title]&
  body=[Pre-filled PR template with metadata]
```
User just needs to:
- Be logged into GitHub
- Click "Create Pull Request"
- Done!
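Assembling that prefilled URL client-side can be sketched as follows; the parameter names mirror the example above, and `URLSearchParams` handles the encoding:

```javascript
// Build a GitHub "compare" URL that opens a prefilled PR creation form.
function buildPrUrl({ owner, repo, user, branch, title, body }) {
  const params = new URLSearchParams({ quick_pull: '1', title, body })
  return `https://github.com/${owner}/${repo}/compare/main...${user}:${branch}?${params}`
}
```

The user lands on the compare page already authenticated, so clicking "Create Pull Request" is the only remaining step.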
Problem: Re-analyzing entire weave on every PR is expensive
Solution: Incremental analysis with caching
```javascript
// Only analyze the affected loom + neighbors
function analyzeAffectedContent(changedFile) {
  const loom = detectLoom(changedFile)
  const relatedLooms = findRelatedLooms(loom) // Based on shared tags
  // Only process these looms, not the entire weave
  const scope = [loom, ...relatedLooms]
  return analyzeLooms(scope)
}
```

Cache Key: `loom-id:last-modified-timestamp`
```javascript
// Check cache before processing
const cacheKey = `${loomId}:${lastModified}`
const cached = await redis.get(cacheKey)
if (cached) {
  return JSON.parse(cached)
}

// Process and cache
const result = await analyzeLoom(loom)
await redis.setex(cacheKey, 3600, JSON.stringify(result)) // 1 hour TTL
```

```javascript
function findRelatedLooms(loom) {
  const loomTags = loom.metadata.tags
  return allLooms.filter(otherLoom => {
    const sharedTags = intersection(loomTags, otherLoom.metadata.tags)
    return sharedTags.length >= 2 // At least 2 shared tags
  })
}
```

Without Caching:
- 100 strands × 0.5s = 50 seconds per PR
- Expensive for large weaves
With Loom-Scoped + Caching:
- 1 loom (5 strands) × 0.5s = 2.5 seconds
- 2 related looms (10 strands) × 0.5s = 5 seconds
- Total: ~7.5 seconds (85% reduction)
Cache Hit Rate:
- Most PRs affect 1-2 looms
- Related looms rarely change simultaneously
- Expected hit rate: 70-80%
- Effective time: ~2-3 seconds per PR
Add to scripts/auto-index.js:
```javascript
class CachedIndexer extends CodexIndexer {
  constructor() {
    super()
    this.cache = new Map() // In-memory for CI, Redis for production
  }

  async processLoomIncremental(loomPath, changedFiles) {
    const loomId = path.basename(loomPath)
    const lastModified = this.getLastModified(loomPath)
    const cacheKey = `${loomId}:${lastModified}`

    // Check cache
    if (this.cache.has(cacheKey)) {
      console.log(`✓ Cache hit: ${loomId}`)
      return this.cache.get(cacheKey)
    }

    // Process only changed strands + neighbors
    const affectedStrands = this.findAffectedStrands(loomPath, changedFiles)
    const result = await this.processStrands(affectedStrands)

    // Cache result
    this.cache.set(cacheKey, result)
    return result
  }
}
```

Loom-Level (Always):
- Total strands in loom
- Average difficulty
- Topic distribution
- Vocabulary frequency
- Cost: O(n) where n = strands in loom
Weave-Level (On-Demand Only):
- Total strands across all looms
- Cross-loom relationships
- Global vocabulary
- Cost: O(n) where n = all strands in weave
```javascript
// Efficient: Aggregate only affected looms
function aggregateLoomStats(loom) {
  return {
    totalStrands: loom.strands.length,
    avgDifficulty: mean(loom.strands.map(s => s.difficulty)),
    topics: countBy(loom.strands, 'taxonomy.topics'),
    keywords: extractTopKeywords(loom.strands, 20)
  }
}

// Expensive: Only run on a full re-index
function aggregateWeaveStats(weave) {
  return weave.looms.map(aggregateLoomStats).reduce(merge)
}
```

Trigger Conditions:
- Manual full re-index (user-initiated)
- New loom created (affects weave structure)
- Loom deleted/moved
- Weekly scheduled job (off-peak hours)
NOT on:
- Individual strand updates
- Metadata-only changes
- PR reviews/comments
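The trigger conditions above could be encoded as a simple guard. The event type strings here are illustrative assumptions, not the project's actual event names:

```javascript
// Decide whether an event warrants a full weave-level re-aggregation.
// Event type names are hypothetical.
function shouldFullReindex(event) {
  const structuralChanges = ['loom-created', 'loom-deleted', 'loom-moved']
  if (event.type === 'manual' || event.type === 'scheduled') return true
  if (structuralChanges.includes(event.type)) return true
  // Strand edits, metadata-only changes, and PR comments stay loom-scoped
  return false
}
```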
Pros:
- Faster iteration (no compile step)
- Simpler CI (no build artifacts)
- Node.js native (no transpilation)
- Easier for contributors (lower barrier)
- Scripts run directly (`node script.js`)
Cons:
- No type safety
- Harder to refactor
- IDE autocomplete less reliable
Recommendation: Keep JavaScript for scripts, use TypeScript for UI
Rationale:
1. **Scripts** (`auto-index.js`, `validate.js`, `ai-enhance.js`):
   - Run in CI/Node.js directly
   - Simple, focused logic
   - Rarely refactored
   - Keep as JS
2. **UI Components** (`codex-submit.tsx`, `codex-stats.tsx`):
   - Already TypeScript
   - Complex state management
   - Frequent updates
   - Already TS ✓
If Converting Scripts to TS:

```bash
# Would need:
npm install -D typescript @types/node ts-node
npx tsc scripts/*.ts --outDir dist/
node dist/auto-index.js
```

Cost/Benefit:
- Conversion effort: 2-4 hours
- Ongoing maintenance: +10% time
- Bug reduction: ~15-20%
- Verdict: Not worth it for simple scripts
1. **User Action:**
   - Visits frame.dev/codex
   - Clicks "Contribute" → "Submit Content"
   - Pastes markdown or uploads file

2. **Client-Side Processing:**

   ```javascript
   // Extract keywords (TF-IDF)
   const keywords = extractKeywords(content)
   // Generate summary
   const summary = generateSummary(content)
   // Detect difficulty
   const difficulty = detectDifficulty(content)
   // Find best loom
   const { loom, confidence } = await findBestLoom(keywords)
   ```

3. **User Reviews Metadata:**
   - Auto-filled: title, summary, tags, difficulty
   - Suggested loom: `weaves/technology/programming/`
   - User can edit or accept

4. **PR Creation:**

   ```javascript
   // Create branch
   await github.git.createRef({
     ref: `refs/heads/submit/${Date.now()}`,
     sha: mainSha
   })

   // Add file
   await github.repos.createOrUpdateFileContents({
     path: `weaves/technology/programming/${slug}.md`,
     content: base64(frontmatter + content),
     branch: branchName
   })

   // Create PR
   await github.pulls.create({
     title: `Add: ${title}`,
     head: branchName,
     base: 'main'
   })
   ```
5. **Automated Validation (GitHub Actions):**
   ```yaml
   - Run schema validation
   - Run static NLP analysis
   - Check for duplicates
   - Verify loom placement
   - [Optional] Run AI enhancement
   ```
6. **Auto-Merge (If Weaver):**

   ```javascript
   if (isWeaver(author) && validationPassed) {
     await github.pulls.merge({ pull_number })
   }
   ```

7. **Index Rebuild:**

   ```bash
   npm run index -- --validate
   # Only processes affected loom + neighbors
   # Uses cached results for unchanged looms
   # Completes in ~3-5 seconds
   ```

8. **Live on Site:**
   - New index pushed to `index` branch
   - frame.dev/codex fetches updated index
   - Content searchable immediately

Total Time: 10-30 seconds (most of it is GitHub API calls)
```bash
cd apps/codex
npm run validate
npm run index -- --validate

# View report
cat codex-report.json | jq '.summary'

# Test single file
node scripts/auto-index.js --files "weaves/tech/python/intro.md"

# View extracted keywords
node -e "
const indexer = require('./scripts/auto-index.js')
const content = require('fs').readFileSync('path/to/file.md', 'utf8')
console.log(indexer.extractKeywords(content))
"

# View workflow runs
gh run list --workflow=build-index.yml --limit 10

# View specific run
gh run view <run-id> --log
```

Q: Do I need API keys to use the indexer?
A: No. Static NLP works without any API keys. AI enhancement (OpenAI) is optional.
Q: How much does AI enhancement cost?
A: Varies by content length:
- 100-500 words: ~$0.01-0.03/PR
- 500-2K words: ~$0.03-0.08/PR
- 2K-10K words: ~$0.08-0.20/PR
- 10K-100K words: ~$0.20-2.00/PR
Can be disabled with AI_PROVIDER=disabled.
Q: Can I run everything locally?
A: Yes. npm run index works offline. AI enhancement needs API keys.
Q: How do I add a new subject/topic to vocabulary?
A: Edit scripts/auto-index.js → VOCABULARY object → commit.
Q: What if the auto-categorization is wrong?
A: The AI/human reviewer can suggest a different loom in PR comments.
Q: How do I become a Weaver?
A: Submit 5+ high-quality PRs. Maintainers will add you to WEAVERS.txt.
- ✅ Set up GitHub secrets (`GH_PAT` required, `OPENAI_API_KEY` optional)
- ✅ Run initial catalog: `npm run index -- --validate`
- ✅ Test submission UI at frame.dev/codex
- ✅ Monitor first few PRs for accuracy
- ✅ Add trusted contributors to WEAVERS.txt
- ✅ Enable caching (Redis) for production scale