recatalog-guide

slug: recatalog-guide
title: Frame Codex Re-Catalog Guide
summary: Complete guide to triggering full re-indexing and metadata updates for all Frame Codex content
version: 1.0.0
contentType: markdown
difficulty: intermediate
taxonomy:
  subjects: technology, knowledge
  topics: deployment, best-practices

Frame Codex Re-Catalog Guide

This guide explains how to trigger a complete re-indexing and metadata update for all Frame Codex content.

When to Re-Catalog

Run a full re-catalog when:

✅ Initial setup - First time setting up the repository
✅ Schema changes - After updating validation rules or metadata schema
✅ Vocabulary updates - After adding new subjects/topics to controlled vocabulary
✅ Bulk imports - After importing large amounts of content
✅ Quality audit - Periodic review of all content (monthly/quarterly)

❌ NOT needed for:

Individual PR merges (automatic via GitHub Actions)
Small metadata fixes (handled by normal PR flow)
Content updates without schema changes

Method 1: GitHub Actions (Recommended)

Via GitHub Web UI

1. Go to https://github.com/framersai/codex/actions/workflows/build-index.yml
2. Click the "Run workflow" dropdown (top right)
3. Select the branch: main
4. Click the "Run workflow" button
5. Wait ~1-2 minutes for completion

Via GitHub CLI

gh workflow run build-index.yml --repo framersai/codex

What It Does:

Runs npm run index -- --validate on ALL files
Generates codex-index.json and codex-report.json
Pushes to index branch (no PR needed)
Changes go live on frame.dev/codex immediately

Cost: $0 (static NLP only, no AI calls)

Method 2: Local Script with PR Creation

Using the Re-Catalog Script

cd apps/codex
chmod +x scripts/retrigger-full-catalog.sh
./scripts/retrigger-full-catalog.sh

What It Does:

Runs full static NLP analysis on ALL files
Updates codex-index.json and codex-report.json
Creates a new branch: catalog/full-reindex-{timestamp}
Commits changes
Pushes to GitHub
Creates a PR (requires manual approval by default)
Optionally auto-merges if AUTO_CATALOG_MERGE=true
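
The steps above can be sketched as a small planner. This is an illustration only, not the contents of retrigger-full-catalog.sh: the timestamp format in `branch_name` and the assumption that a dry run stops before any Git operation are both hypothetical.

```python
from datetime import datetime, timezone


def branch_name(now: datetime) -> str:
    """Build a name following the catalog/full-reindex-{timestamp} pattern.

    The YYYYMMDD-HHMMSS format is an assumption; the real script may differ.
    """
    return f"catalog/full-reindex-{now.strftime('%Y%m%d-%H%M%S')}"


def plan_steps(dry_run: bool, auto_merge: bool) -> list[str]:
    """Return the ordered steps a run would perform (illustrative only)."""
    steps = [
        "run static NLP analysis",
        "update codex-index.json and codex-report.json",
    ]
    if dry_run:
        # Assumed: a dry run reports what would change and never touches Git.
        return steps + ["report changes"]
    steps += ["create branch", "commit changes", "push to GitHub", "create PR"]
    steps.append("auto-merge PR" if auto_merge else "wait for manual approval")
    return steps


print(branch_name(datetime.now(timezone.utc)))
print(plan_steps(dry_run=False, auto_merge=False))
```

The second call lists every step up to the PR and ends with "wait for manual approval", which matches the default behavior described above.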

Options:

# Dry run (see what would change, no PR)
./scripts/retrigger-full-catalog.sh --dry-run

# Force auto-merge (overrides AUTO_CATALOG_MERGE setting)
./scripts/retrigger-full-catalog.sh --auto-merge

Requirements:

GH_PAT environment variable (for PR creation)
Git configured with user name/email
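
Before running the script, the two requirements above can be checked up front. The sketch below is a hypothetical pre-flight helper, not part of the repository; it assumes `git` is on the PATH and reads the identity with the standard `git config --get` command.

```python
import subprocess
from typing import Mapping


def read_git_identity() -> dict[str, str]:
    """Read user.name and user.email via `git config --get` (empty if unset)."""
    identity = {}
    for key in ("user.name", "user.email"):
        result = subprocess.run(
            ["git", "config", "--get", key], capture_output=True, text=True
        )
        identity[key] = result.stdout.strip()
    return identity


def missing_requirements(
    env: Mapping[str, str], git_identity: Mapping[str, str]
) -> list[str]:
    """List the prerequisites that are not satisfied."""
    missing = []
    if not env.get("GH_PAT"):
        missing.append("GH_PAT environment variable")
    for key in ("user.name", "user.email"):
        if not git_identity.get(key):
            missing.append(f"git config {key}")
    return missing
```

Calling `missing_requirements(os.environ, read_git_identity())` returns an empty list when everything is in place. The check only confirms that GH_PAT is set; it does not verify the token's scope or expiry.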

Cost: $0 (static NLP only)

Method 3: Manual Local Re-Index

For testing or local development:

cd apps/codex
npm install
npm run index -- --validate

Output Files:

codex-index.json - Full searchable index
codex-report.json - Analytics and validation report

To Deploy:

git add codex-index.json codex-report.json
git commit -m "chore: manual re-index"
git push

Auto-Merge Configuration

Default Behavior: Manual Approval Required

By default, full re-catalog PRs require manual review and approval. This is the recommended setting to catch any unexpected metadata changes.

Enable Auto-Merge

Set this GitHub secret to enable automatic merging:

AUTO_CATALOG_MERGE=true

When enabled:

Re-catalog PRs will auto-merge after validation passes
No human review required
Faster iteration, but less oversight

When to enable:

High trust in automation
Frequent re-catalogs needed
Well-tested vocabulary and schema

When to keep disabled (recommended):

Initial setup phase
Testing new categorization rules
You want to review metadata changes
You prefer human oversight

Toggle via Script

# With auto-merge
AUTO_CATALOG_MERGE=true ./scripts/retrigger-full-catalog.sh

# Without auto-merge (default)
./scripts/retrigger-full-catalog.sh

# Force auto-merge (one-time override)
./scripts/retrigger-full-catalog.sh --auto-merge
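
The precedence between the flag and the variable can be summarized in a few lines. This is a sketch of the rules as this guide describes them, not the script's actual code; treating the value case-insensitively is an assumption.

```python
from typing import Mapping


def auto_merge_enabled(env: Mapping[str, str], force_flag: bool = False) -> bool:
    """Decide whether a re-catalog PR should auto-merge.

    --auto-merge (force_flag) overrides the AUTO_CATALOG_MERGE setting;
    otherwise auto-merge is on only when the variable is "true".
    """
    if force_flag:
        return True
    # Assumed: comparison ignores case and surrounding whitespace.
    return env.get("AUTO_CATALOG_MERGE", "").strip().lower() == "true"


print(auto_merge_enabled({}))                               # default: manual approval
print(auto_merge_enabled({"AUTO_CATALOG_MERGE": "true"}))   # enabled via variable
print(auto_merge_enabled({}, force_flag=True))              # one-time override
```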

What Gets Updated

Static NLP Analysis (Always Runs)

Keyword Extraction (TF-IDF)
- Identifies most important terms
- Filters stop words
- Ranks by relevance
Phrase Detection (N-grams)
- Finds common 2-3 word phrases
- Identifies repeated patterns
- Suggests tags
Category Matching
- Matches against controlled vocabulary
- Assigns subjects and topics
- Calculates confidence scores
Difficulty Detection
- Heuristic analysis of complexity indicators
- Keyword-based classification
- Assigns beginner/intermediate/advanced/expert
Summary Generation
- Extractive summarization
- Picks most representative sentence
- Truncates to 300 characters
Validation
- Schema compliance
- Required fields check
- Content quality rules
- Duplicate detection
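
To make the first few techniques concrete, here is a minimal sketch of TF-IDF keyword ranking, n-gram phrase counting, and summary truncation. It is not the indexer's implementation: the stop-word list, the smoothed IDF formula, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_keywords(doc: str, corpus: list[str], top_n: int = 3) -> list[str]:
    """Rank the terms of `doc` by TF-IDF against `corpus` (which includes `doc`)."""
    doc_tokens = tokenize(doc)
    term_counts = Counter(doc_tokens)
    corpus_terms = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def common_phrases(text: str, n: int = 2, min_count: int = 2) -> list[str]:
    """Return n-word phrases (after stop-word removal) seen at least min_count times."""
    tokens = tokenize(text)
    grams = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sorted(phrase for phrase, count in grams.items() if count >= min_count)


def truncate_summary(sentence: str, limit: int = 300) -> str:
    """Cut a candidate summary sentence down to `limit` characters."""
    return sentence if len(sentence) <= limit else sentence[:limit].rstrip()
```

A term that is frequent in one file but rare across the corpus scores highest, which is why TF-IDF suits keyword extraction. Repeated bigrams and trigrams found by `common_phrases` are natural tag suggestions.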

What Does NOT Get Updated

❌ Manual metadata is preserved:

Explicitly set titles, summaries, tags
User-defined relationships
Custom categorization
Version numbers
Author information

✅ Only auto-generated fields are updated:

metadata.autoGenerated.*
Missing fields (if not explicitly set)
Validation warnings/suggestions
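
The preservation rule can be pictured as a merge in which explicit values always win. The sketch below is an assumption about how such a merge could work, not the indexer's code; only the `autoGenerated` key comes from this guide, and "explicitly set" is approximated as "the key is present".

```python
import copy


def merge_metadata(existing: dict, generated: dict) -> dict:
    """Merge freshly generated metadata into a file's existing metadata.

    - Fields already present in `existing` are kept untouched.
    - Fields that are missing are filled in from `generated`.
    - The `autoGenerated` sub-object is always replaced.
    """
    merged = copy.deepcopy(existing)
    for key, value in generated.items():
        if key == "autoGenerated":
            continue
        merged.setdefault(key, copy.deepcopy(value))
    merged["autoGenerated"] = copy.deepcopy(generated.get("autoGenerated", {}))
    return merged


existing = {"title": "My Title", "autoGenerated": {"keywords": ["old"]}}
generated = {
    "title": "Auto Title",
    "summary": "Auto summary",
    "autoGenerated": {"keywords": ["new"]},
}
print(merge_metadata(existing, generated))
```

Here the hand-written title survives, the missing summary is filled in, and the auto-generated keywords are refreshed.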

Reviewing Re-Catalog PRs

What to Check

Metadata Changes
- Are auto-tags accurate?
- Is difficulty level appropriate?
- Are subjects/topics correct?
Categorization Confidence
- Check confidence scores in report
- Low confidence (<0.5) may need manual review
Validation Issues
- Review codex-report.json → validation.fileErrors
- Fix any schema violations
- Address quality warnings
Vocabulary Suggestions
- Review vocabulary.suggestedAdditions
- Consider adding frequent terms to controlled vocabulary
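
Flagging low-confidence files is easy to automate once the scores are available. The entry shape used below (`path` and `confidence` keys) is hypothetical, since this guide does not document where the report stores per-file scores; adapt the field names to the real codex-report.json.

```python
def needs_review(file_entries: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the paths whose categorization confidence is below `threshold`.

    Entries without a confidence score are treated as 0.0, so they are
    flagged for review as well (an assumption, chosen to err on the safe side).
    """
    return sorted(
        entry["path"]
        for entry in file_entries
        if entry.get("confidence", 0.0) < threshold
    )


entries = [
    {"path": "docs/a.md", "confidence": 0.92},
    {"path": "docs/b.md", "confidence": 0.41},
    {"path": "docs/c.md"},
]
print(needs_review(entries))
```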

Approval Checklist

Spot-check 5-10 random files for accuracy
Review files with low confidence scores
Check for any unexpected categorization changes
Verify no content was lost or corrupted
Review vocabulary suggestions for next iteration
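
For the spot-check item, a random sample avoids always reviewing the same files. This helper is a hypothetical convenience, not part of the repository; the `seed` parameter only makes a sample reproducible for a given review.

```python
import random
from typing import Optional


def pick_spot_checks(paths: list[str], k: int = 5, seed: Optional[int] = None) -> list[str]:
    """Pick up to `k` distinct files at random for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(paths, min(k, len(paths))))


all_files = [f"docs/file-{i}.md" for i in range(40)]
print(pick_spot_checks(all_files, k=5))
```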

Cost Estimates (AI Enhancement)

Static NLP (Default)

Cost: $0

TF-IDF, n-grams, vocabulary matching
Runs entirely inside the GitHub Actions runner
No external API calls

AI Enhancement (Optional)

Only runs if OPENAI_API_KEY is set:

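| Content Length | Words | Tokens (est.) | Cost per PR (USD) |
|----------------|-------|---------------|-------------------|
| Short article | 100-500 | 150-750 | $0.01-0.03 |
| Medium article | 500-2K | 750-3K | $0.03-0.08 |
| Long article | 2K-10K | 3K-15K | $0.08-0.20 |
| Documentation | 10K-50K | 15K-75K | $0.20-1.00 |
| Large corpus | 50K-100K | 75K-150K | $1.00-2.00 |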
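
The token column in the table follows a ratio of about 1.5 tokens per word (100 words to 150 tokens, 2K words to 3K tokens). The sketch below encodes that ratio and the table's size buckets; treating each upper bound as inclusive is an assumption, because the table's ranges overlap at their edges.

```python
TOKENS_PER_WORD = 1.5  # ratio implied by the table above

# (upper word bound, label) pairs taken from the table rows.
BUCKETS = [
    (500, "Short article"),
    (2_000, "Medium article"),
    (10_000, "Long article"),
    (50_000, "Documentation"),
    (100_000, "Large corpus"),
]


def estimate_tokens(word_count: int) -> int:
    """Estimate the token count for a given word count."""
    return round(word_count * TOKENS_PER_WORD)


def size_bucket(word_count: int) -> str:
    """Map a word count to the table's content-length label."""
    for upper_bound, label in BUCKETS:
        if word_count <= upper_bound:
            return label
    # Beyond the table's range; assumed to stay in the last bucket.
    return "Large corpus"


print(estimate_tokens(2_000), size_bucket(2_000))
```

Real token counts depend on the tokenizer and the text, so the ratio is only a planning estimate.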

Calculation:

GPT-4 Turbo: $0.01/1K input tokens, $0.03/1K output tokens
Average PR: ~2K input, ~500 output = ~$0.035
Varies significantly based on content length
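
The per-PR figure can be reproduced directly from the two prices quoted above. The function name is illustrative, and the prices are the ones stated in this guide; check current provider pricing before relying on them.

```python
INPUT_PRICE_PER_1K = 0.01   # USD per 1K input tokens, as quoted above
OUTPUT_PRICE_PER_1K = 0.03  # USD per 1K output tokens, as quoted above


def pr_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD of AI enhancement for one PR."""
    input_cost = input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost


# Average PR: 2K input tokens and 500 output tokens.
print(f"${pr_cost(2_000, 500):.3f}")
```

For the average PR this gives $0.02 for input plus $0.015 for output, which is the ~$0.035 quoted above.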

Monthly Budget Estimate:

10 PRs/month × $0.05 avg = ~$0.50/month
50 PRs/month × $0.05 avg = ~$2.50/month
200 PRs/month × $0.05 avg = ~$10/month

Full Re-Catalog (100 files):

100 files × $0.05 avg = ~$5.00 per full run
Recommended: run AI-enhanced full re-catalogs monthly or quarterly; static-only runs cost $0 and can run more often
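
The budget figures above are simple multiplications, sketched here so they can be re-run with other volumes. The $0.05 average and the 100-file corpus are the values used in this guide; the function names are illustrative.

```python
def monthly_budget(prs_per_month: int, avg_cost_per_pr: float = 0.05) -> float:
    """Estimated monthly AI-enhancement spend in USD."""
    return round(prs_per_month * avg_cost_per_pr, 2)


def full_recatalog_cost(file_count: int = 100, avg_cost_per_file: float = 0.05) -> float:
    """Estimated cost in USD of one AI-enhanced full re-catalog."""
    return round(file_count * avg_cost_per_file, 2)


def yearly_recatalog_cost(runs_per_year: int, file_count: int = 100) -> float:
    """Estimated yearly cost in USD of AI-enhanced full re-catalogs."""
    return round(runs_per_year * full_recatalog_cost(file_count), 2)


for prs in (10, 50, 200):
    print(prs, monthly_budget(prs))
print("monthly schedule:", yearly_recatalog_cost(12))
print("quarterly schedule:", yearly_recatalog_cost(4))
```

With these inputs, a monthly schedule comes to about $60 per year and a quarterly one to about $20 per year for a 100-file corpus.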

Troubleshooting

"No changes detected"

The index is already up to date. No action needed.

"Validation failed"

Fix errors in the files listed in codex-report.json → validation.fileErrors, then re-run.
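
A short script can list the failing files instead of reading the report by hand. The `validation.fileErrors` path comes from this guide, but its shape is not documented here; the sketch assumes a mapping from file path to a list of error messages.

```python
import json


def load_report(path: str = "codex-report.json") -> dict:
    """Load the report written by the indexer."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def files_with_errors(report: dict) -> dict[str, list[str]]:
    """Return only the files that have at least one validation error.

    Assumes validation.fileErrors maps file paths to lists of messages.
    """
    file_errors = report.get("validation", {}).get("fileErrors", {})
    return {path: errors for path, errors in file_errors.items() if errors}


sample_report = {
    "validation": {
        "fileErrors": {
            "docs/a.md": ["missing required field: title"],
            "docs/b.md": [],
        }
    }
}
for path, errors in files_with_errors(sample_report).items():
    print(path, errors)
```

Against a real run, replace `sample_report` with `load_report()` executed from the directory that contains codex-report.json.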

"PR creation failed"

Check that GH_PAT is set and has the repo scope. Verify that the token hasn't expired.

"Auto-merge failed"

Check branch protection rules. Ensure GH_PAT has sufficient permissions.

Scheduled Re-Catalogs

Weekly (Recommended)

Add a schedule trigger to the on: block of .github/workflows/build-index.yml, keeping any existing triggers:

on:
  schedule:
    - cron: '0 2 * * 0'  # Every Sunday at 2 AM UTC

Monthly

on:
  schedule:
    - cron: '0 2 1 * *'  # First day of month at 2 AM UTC
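
To double-check what the two cron expressions mean, the sketch below computes the next run time for each. It hard-codes the two schedules rather than parsing cron syntax, and it expects a timezone-aware datetime; actual start times on GitHub Actions may lag behind the scheduled minute.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime) -> datetime:
    """Next Sunday 02:00 UTC strictly after `now` (cron: 0 2 * * 0)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (6 - now.weekday()) % 7  # Python: Monday=0 ... Sunday=6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def next_monthly_run(now: datetime) -> datetime:
    """Next first-of-month 02:00 UTC strictly after `now` (cron: 0 2 1 * *)."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(day=1, hour=2, minute=0, second=0, microsecond=0)
    if candidate <= now:
        if now.month == 12:
            candidate = candidate.replace(year=now.year + 1, month=1)
        else:
            candidate = candidate.replace(month=now.month + 1)
    return candidate


now = datetime.now(timezone.utc)
print("weekly: ", next_weekly_run(now).isoformat())
print("monthly:", next_monthly_run(now).isoformat())
```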

Next Steps

✅ Run initial catalog: ./scripts/retrigger-full-catalog.sh --dry-run
✅ Review output in codex-report.json
✅ If satisfied, run without --dry-run to create PR
✅ Review and merge PR
✅ Set up scheduled re-catalogs (optional)
✅ Configure AUTO_CATALOG_MERGE based on your workflow
