| id | recatalog-guide |
|---|---|
| slug | recatalog-guide |
| title | Frame Codex Re-Catalog Guide |
| summary | Complete guide to triggering full re-indexing and metadata updates for all Frame Codex content |
| version | 1.0.0 |
| contentType | markdown |
| difficulty | intermediate |
| taxonomy | |
| tags | |

This guide explains how to trigger a complete re-indexing and metadata update for all Frame Codex content.
Run a full re-catalog when:
✅ Initial setup - First time setting up the repository
✅ Schema changes - After updating validation rules or metadata schema
✅ Vocabulary updates - After adding new subjects/topics to controlled vocabulary
✅ Bulk imports - After importing large amounts of content
✅ Quality audit - Periodic review of all content (monthly/quarterly)
❌ NOT needed for:
- Individual PR merges (automatic via GitHub Actions)
- Small metadata fixes (handled by normal PR flow)
- Content updates without schema changes
- Go to https://github.com/framersai/codex/actions/workflows/build-index.yml
- Click the "Run workflow" dropdown (top right)
- Select branch: `main`
- Click the "Run workflow" button
- Wait ~1-2 minutes for completion
gh workflow run build-index.yml --repo framersai/codex
What It Does:
- Runs `npm run index -- --validate` on ALL files
- Generates `codex-index.json` and `codex-report.json`
- Pushes to the `index` branch (no PR needed)
- Updates live immediately on frame.dev/codex
Cost: $0 (static NLP only, no AI calls)
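For automation that cannot use the web UI or the `gh` CLI, the same manual trigger can be sent through GitHub's REST endpoint for workflow dispatch events. The following is a minimal sketch, not part of the Codex tooling: the function names are hypothetical, and it assumes a token that is allowed to run workflows on the repository.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_dispatch_request(token, repo="framersai/codex",
                           workflow="build-index.yml", ref="main"):
    """Build (but do not send) a workflow_dispatch request for the index workflow."""
    url = f"{API_ROOT}/repos/{repo}/actions/workflows/{workflow}/dispatches"
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def trigger_reindex(token):
    """Send the dispatch request; GitHub replies 204 No Content on success."""
    with urllib.request.urlopen(build_dispatch_request(token)) as response:
        return response.status
```

Calling `trigger_reindex(os.environ["GH_PAT"])` would queue the run on `main`. Whether `GH_PAT` carries the required permission is an assumption to verify for your token.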
cd apps/codex
chmod +x scripts/retrigger-full-catalog.sh
./scripts/retrigger-full-catalog.sh
What It Does:
- Runs full static NLP analysis on ALL files
- Updates `codex-index.json` and `codex-report.json`
- Creates a new branch: `catalog/full-reindex-{timestamp}`
- Commits changes
- Pushes to GitHub
- Creates a PR (requires manual approval by default)
- Optionally auto-merges if `AUTO_CATALOG_MERGE=true`
Options:
# Dry run (see what would change, no PR)
./scripts/retrigger-full-catalog.sh --dry-run
# Force auto-merge (overrides AUTO_CATALOG_MERGE setting)
./scripts/retrigger-full-catalog.sh --auto-merge
Requirements:
- `GH_PAT` environment variable (for PR creation)
- Git configured with user name/email
Cost: $0 (static NLP only)
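A quick preflight check can confirm both requirements before the script runs. This is a sketch with hypothetical helper names. It only checks that `GH_PAT` is non-empty and that `git config` returns a user name and email, not that the token is valid.

```python
import os
import subprocess


def git_config_value(key):
    """Return `git config --get <key>`, or "" if the key is unset or git is missing."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", key],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout.strip()


def missing_requirements(env, lookup=git_config_value):
    """List the requirements from this guide that are not satisfied."""
    missing = []
    if not env.get("GH_PAT"):
        missing.append("GH_PAT")
    for key in ("user.name", "user.email"):
        if not lookup(key):
            missing.append(f"git config {key}")
    return missing
```

For example, `missing_requirements(os.environ)` returns an empty list when everything is in place.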
For testing or local development:
cd apps/codex
npm install
npm run index -- --validate
Output Files:
- `codex-index.json` - Full searchable index
- `codex-report.json` - Analytics and validation report
To Deploy:
git add codex-index.json codex-report.json
git commit -m "chore: manual re-index"
git push
By default, full re-catalog PRs require manual review and approval. This is the recommended setting to catch any unexpected metadata changes.
Set this GitHub secret to enable automatic merging:
AUTO_CATALOG_MERGE=true
When enabled:
- Re-catalog PRs will auto-merge after validation passes
- No human review required
- Faster iteration, but less oversight
When to enable:
- High trust in automation
- Frequent re-catalogs needed
- Well-tested vocabulary and schema
When to keep disabled (recommended):
- Initial setup phase
- Testing new categorization rules
- Want to review metadata changes
- Prefer human oversight
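The precedence between the `--auto-merge` flag, the `AUTO_CATALOG_MERGE` setting, and the default can be summarized in a few lines. This is an illustrative sketch of the rules as this guide describes them, not the script's actual code. The function name is hypothetical, and treating `--dry-run` as always disabling the merge is an assumption based on the fact that a dry run creates no PR.

```python
def should_auto_merge(args, env):
    """Decide whether a re-catalog PR would be merged automatically."""
    if "--dry-run" in args:
        return False  # assumption: a dry run creates no PR, so nothing to merge
    if "--auto-merge" in args:
        return True   # the flag overrides the AUTO_CATALOG_MERGE setting
    # Default: manual review unless the setting is exactly "true"
    return env.get("AUTO_CATALOG_MERGE", "") == "true"
```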
# With auto-merge
AUTO_CATALOG_MERGE=true ./scripts/retrigger-full-catalog.sh
# Without auto-merge (default)
./scripts/retrigger-full-catalog.sh
# Force auto-merge (one-time override)
./scripts/retrigger-full-catalog.sh --auto-merge

- Keyword Extraction (TF-IDF)
  - Identifies most important terms
  - Filters stop words
  - Ranks by relevance
- Phrase Detection (N-grams)
  - Finds common 2-3 word phrases
  - Identifies repeated patterns
  - Suggests tags
- Category Matching
  - Matches against controlled vocabulary
  - Assigns subjects and topics
  - Calculates confidence scores
- Difficulty Detection
  - Heuristic analysis of complexity indicators
  - Keyword-based classification
  - Assigns beginner/intermediate/advanced/expert
- Summary Generation
  - Extractive summarization
  - Picks most representative sentence
  - Truncates to 300 characters
- Validation
  - Schema compliance
  - Required fields check
  - Content quality rules
  - Duplicate detection
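To give a feel for the first two steps, here is a small sketch of TF-IDF keyword ranking and n-gram phrase counting. It is an illustration only, not the Codex indexer's implementation. The stop-word list, the smoothed IDF formula, and all function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real indexer would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(doc, corpus, k=5):
    """Rank the terms of `doc` by TF-IDF against `corpus` (which should include `doc`)."""
    doc_tokens = tokenize(doc)
    term_counts = Counter(doc_tokens)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_terms)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def common_phrases(doc, n=2, min_count=2):
    """Return n-word phrases that occur at least `min_count` times.

    Stop words are removed before the n-grams are formed, so a phrase may
    join words that were separated by a stop word in the original text.
    """
    tokens = tokenize(doc)
    grams = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [gram for gram, count in grams.most_common() if count >= min_count]
```

Terms that are frequent in one file but rare across the corpus rank highest, which is why they make reasonable keyword candidates. Repeated phrases are natural tag suggestions.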
❌ Manual metadata is preserved:
- Explicitly set titles, summaries, tags
- User-defined relationships
- Custom categorization
- Version numbers
- Author information
✅ Only auto-generated fields are updated:
- `metadata.autoGenerated.*`
- Missing fields (if not explicitly set)
- Validation warnings/suggestions
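The preservation rule can be pictured as a merge in which explicit values always win. The sketch below is hypothetical: it assumes each file's metadata is a flat dictionary with an `autoGenerated` key, and it treats empty values as "not explicitly set". Neither assumption is confirmed by this guide.

```python
def merge_metadata(existing, generated):
    """Merge freshly generated metadata into existing metadata without
    overwriting anything the author set explicitly."""
    merged = dict(existing)
    for field, value in generated.items():
        if field == "autoGenerated":
            continue  # handled separately below
        # Assumption: None, "" and [] count as "not explicitly set"
        if field not in merged or merged[field] in (None, "", []):
            merged[field] = value
    # Auto-generated fields are always refreshed
    merged["autoGenerated"] = generated.get("autoGenerated", {})
    return merged
```

With this rule, a hand-written title survives a re-catalog while stale auto-generated keywords are replaced.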
- Metadata Changes
  - Are auto-tags accurate?
  - Is difficulty level appropriate?
  - Are subjects/topics correct?
- Categorization Confidence
  - Check `confidence` scores in report
  - Low confidence (<0.5) may need manual review
- Validation Issues
  - Review `codex-report.json` → `validation.fileErrors`
  - Fix any schema violations
  - Address quality warnings
- Vocabulary Suggestions
  - Review `vocabulary.suggestedAdditions`
  - Consider adding frequent terms to controlled vocabulary
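The low-confidence check lends itself to a small helper. This sketch assumes you have already extracted a list of per-file entries from `codex-report.json`. The `path` and `confidence` keys are hypothetical, since this guide does not specify where the report stores them.

```python
def files_needing_review(entries, threshold=0.5):
    """Return paths of entries whose confidence is below `threshold`,
    lowest confidence first. Entries without a score are treated as 0.0."""
    flagged = [e for e in entries if e.get("confidence", 0.0) < threshold]
    flagged.sort(key=lambda e: e.get("confidence", 0.0))
    return [e["path"] for e in flagged]
```

Sorting by ascending confidence puts the files most likely to be miscategorized at the top of the review queue.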
- Spot-check 5-10 random files for accuracy
- Review files with low confidence scores
- Check for any unexpected categorization changes
- Verify no content was lost or corrupted
- Review vocabulary suggestions for next iteration
Cost: $0
- TF-IDF, n-grams, vocabulary matching
- Runs locally in GitHub Actions
- No external API calls
Only runs if OPENAI_API_KEY is set:
| Content Length | Words | Tokens (est.) | Cost/PR |
|---|---|---|---|
| Short article | 100-500 | 150-750 | $0.01-0.03 |
| Medium article | 500-2K | 750-3K | $0.03-0.08 |
| Long article | 2K-10K | 3K-15K | $0.08-0.20 |
| Documentation | 10K-50K | 15K-75K | $0.20-1.00 |
| Large corpus | 50K-100K | 75K-150K | $1.00-2.00 |
Calculation:
- GPT-4 Turbo: $0.01/1K input tokens, $0.03/1K output tokens
- Average PR: ~2K input, ~500 output = ~$0.035
- Varies significantly based on content length
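The per-PR figure follows directly from the two token prices. As a worked example using only the rates quoted above (the function name is illustrative):

```python
INPUT_PRICE_PER_1K = 0.01   # USD per 1K input tokens (GPT-4 Turbo rate quoted above)
OUTPUT_PRICE_PER_1K = 0.03  # USD per 1K output tokens


def estimate_cost(input_tokens, output_tokens):
    """Estimated USD cost of one AI-enhanced analysis."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)


# Average PR from the guide: ~2K input tokens, ~500 output tokens
average_pr = estimate_cost(2000, 500)  # 0.02 + 0.015, about $0.035
```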
Monthly Budget Estimate:
- 10 PRs/month × $0.05 avg = ~$0.50/month
- 50 PRs/month × $0.05 avg = ~$2.50/month
- 200 PRs/month × $0.05 avg = ~$10/month
Full Re-Catalog (100 files):
- 100 files × $0.05 avg = ~$5.00 per full run
- Recommended: Run monthly or quarterly
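To combine PR volume and full re-catalog runs into a single monthly figure, the same $0.05 average can be applied to both. This is a rough sketch with a hypothetical function name. It assumes AI enhancement is enabled for every PR and every file in a full run.

```python
AVG_COST_PER_ITEM = 0.05  # USD, the average used in the estimates above


def projected_monthly_spend(prs_per_month, full_runs_per_month=0.0, files_per_run=100):
    """Rough monthly USD spend: per-PR analysis plus full re-catalog runs."""
    pr_cost = prs_per_month * AVG_COST_PER_ITEM
    recatalog_cost = full_runs_per_month * files_per_run * AVG_COST_PER_ITEM
    return pr_cost + recatalog_cost
```

For instance, 50 PRs plus one monthly full run over 100 files comes to about $2.50 + $5.00 = $7.50. A quarterly schedule corresponds to `full_runs_per_month=1/3`.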
The index is already up to date. No action needed.
Fix errors in the files listed in codex-report.json → validation.fileErrors, then re-run.
Check that GH_PAT is set and has the repo scope. Verify that the token hasn't expired.
Check branch protection rules. Ensure GH_PAT has sufficient permissions.
Add to .github/workflows/build-index.yml:
on:
  schedule:
    - cron: '0 2 * * 0' # Every Sunday at 2 AM UTC

on:
  schedule:
    - cron: '0 2 1 * *' # First day of month at 2 AM UTC

- ✅ Run initial catalog: `./scripts/retrigger-full-catalog.sh --dry-run`
- ✅ Review output in `codex-report.json`
- ✅ If satisfied, run without `--dry-run` to create PR
- ✅ Review and merge PR
- ✅ Set up scheduled re-catalogs (optional)
- ✅ Configure `AUTO_CATALOG_MERGE` based on your workflow