@tsbhangu
Contributor

Summary

  • Add database schema and migration for websites table
  • Add API endpoints for website indexing
  • Integrate with IndexSourceDb for source tracking
  • Add Turbopuffer sync for vector search

Details

This PR builds on the website crawler infrastructure (#4656) to add the API and database layer:

Database:

  • New websites table migration
  • WebsiteDb model with full metadata support
  • Integration with IndexSourceDb for job tracking
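To make the database items above more concrete, here is a rough sketch of what a websites table with per-page metadata could look like. The migration itself is not shown in this PR description, so every column name below is an assumption, and the stdlib `sqlite3` module is used only so the sketch runs on its own; it is not the project's actual database layer. The helper names `open_db`, `upsert_page`, and `get_page` are likewise hypothetical.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple

# Hypothetical schema: one row per crawled page, keyed by (domain, url).
SCHEMA = """
CREATE TABLE websites (
    id       INTEGER PRIMARY KEY,
    domain   TEXT NOT NULL,
    url      TEXT NOT NULL,
    title    TEXT,
    content  TEXT,
    metadata TEXT NOT NULL DEFAULT '{}',  -- JSON-encoded page metadata
    UNIQUE (domain, url)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def upsert_page(conn: sqlite3.Connection, domain: str, url: str,
                title: str, content: str, metadata: Dict[str, Any]) -> None:
    """Insert a page, or overwrite the stored copy if (domain, url) already exists."""
    conn.execute(
        """
        INSERT INTO websites (domain, url, title, content, metadata)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (domain, url) DO UPDATE SET
            title = excluded.title,
            content = excluded.content,
            metadata = excluded.metadata
        """,
        (domain, url, title, content, json.dumps(metadata)),
    )
    conn.commit()


def get_page(conn: sqlite3.Connection, domain: str, url: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (title, metadata) for a stored page, or None if it is not indexed."""
    row = conn.execute(
        "SELECT title, metadata FROM websites WHERE domain = ? AND url = ?",
        (domain, url),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

Keying on `(domain, url)` means a re-crawl overwrites the existing row instead of creating a duplicate, which is one plausible way to support the reindex endpoint.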

API Endpoints:

  • POST /sources/website/{domain}/index - Start website crawling
  • GET /sources/website/{domain}/status - Check crawl job status
  • GET /sources/website/{domain}/{website_id} - Get specific page
  • GET /sources/website/{domain} - List all indexed pages
  • POST /sources/website/{domain}/reindex - Re-crawl website
  • DELETE /sources/website/{domain}/delete - Delete specific website
  • DELETE /sources/website/{domain}/delete-all - Delete all websites
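For orientation, the routes listed above can be collected in a small lookup table. This is only a sketch: the `ROUTES` mapping, its operation names, and the `build_request` helper are hypothetical and not taken from the PR. Only the HTTP methods and path templates come from the endpoint list.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Method and path template per operation, copied from the endpoint list.
# The dictionary keys are illustrative names, not identifiers from the PR.
ROUTES: Dict[str, Tuple[str, str]] = {
    "index": ("POST", "/sources/website/{domain}/index"),
    "status": ("GET", "/sources/website/{domain}/status"),
    "get_page": ("GET", "/sources/website/{domain}/{website_id}"),
    "list_pages": ("GET", "/sources/website/{domain}"),
    "reindex": ("POST", "/sources/website/{domain}/reindex"),
    "delete": ("DELETE", "/sources/website/{domain}/delete"),
    "delete_all": ("DELETE", "/sources/website/{domain}/delete-all"),
}


def build_request(operation: str, domain: str,
                  website_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, path) for an operation, percent-encoding path parameters."""
    method, template = ROUTES[operation]
    params = {"domain": quote(domain, safe="")}
    if "{website_id}" in template:
        if website_id is None:
            raise ValueError(f"{operation!r} requires a website_id")
        params["website_id"] = quote(website_id, safe="")
    return method, template.format(**params)
```

Laid out this way, it is easy to see that `GET .../{domain}/status` and `GET .../{domain}/{website_id}` have the same shape, so a router would need to match the literal `status` segment before the parameterized one.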

Features:

  • Background job processing for crawling
  • Real-time status tracking
  • Automatic Turbopuffer sync for search
  • Proper error handling and rollback
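The interaction between background processing, status tracking, and rollback can be sketched with a thread and an in-memory tracker. This is an illustration under assumptions, not the PR's implementation: `JobStatus`, `CrawlJob`, and `JobTracker` are hypothetical names, the status values are invented, and a plain dictionary stands in for the database.

```python
import threading
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable, Dict, Iterable, List


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class CrawlJob:
    domain: str
    status: JobStatus = JobStatus.PENDING
    pages_indexed: int = 0
    error: str = ""


class JobTracker:
    """In-memory stand-in for persistent job tracking (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, CrawlJob] = {}
        self.store: Dict[str, List[str]] = {}  # domain -> committed page URLs

    def start(self, domain: str, crawl: Callable[[str], Iterable[str]]) -> threading.Thread:
        """Register a job and run the crawl on a background thread."""
        with self._lock:
            self._jobs[domain] = CrawlJob(domain)
        thread = threading.Thread(target=self._run, args=(domain, crawl), daemon=True)
        thread.start()
        return thread

    def status(self, domain: str) -> CrawlJob:
        """Return a snapshot copy so callers never see a half-updated job."""
        with self._lock:
            return replace(self._jobs[domain])

    def _update(self, domain: str, **changes) -> None:
        with self._lock:
            job = self._jobs[domain]
            for name, value in changes.items():
                setattr(job, name, value)

    def _run(self, domain: str, crawl: Callable[[str], Iterable[str]]) -> None:
        self._update(domain, status=JobStatus.RUNNING)
        staged: List[str] = []
        try:
            for url in crawl(domain):
                staged.append(url)
                self._update(domain, pages_indexed=len(staged))
            with self._lock:
                self.store[domain] = staged  # commit only after the whole crawl succeeds
            self._update(domain, status=JobStatus.COMPLETED)
        except Exception as exc:
            # Rollback: staged pages are discarded and nothing reaches the store.
            self._update(domain, status=JobStatus.FAILED, pages_indexed=0, error=str(exc))
```

A status endpoint would then only need to call `tracker.status(domain)` while the crawl is still running.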

Dependencies

Test plan

  • Database migrations verified
  • Background jobs tested with real crawling

@tsbhangu tsbhangu requested a review from eyw520 as a code owner October 31, 2025 23:16
@vercel
Contributor

vercel bot commented Oct 31, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                    Deployment   Preview   Updated (UTC)
dev.ferndocs.com           Ready        Preview   Nov 5, 2025 5:48pm
fern-dashboard             Ready        Preview   Nov 5, 2025 5:48pm
fern-dashboard-dev         Ready        Preview   Nov 5, 2025 5:48pm
ferndocs.com               Ready        Preview   Nov 5, 2025 5:48pm
preview.ferndocs.com       Ready        Preview   Nov 5, 2025 5:48pm
prod-assets.ferndocs.com   Ready        Preview   Nov 5, 2025 5:48pm
prod.ferndocs.com          Ready        Preview   Nov 5, 2025 5:48pm

1 Skipped Deployment

Project                    Deployment   Preview   Updated (UTC)
fern-platform              Ignored                Nov 5, 2025 5:48pm

@tsbhangu tsbhangu force-pushed the tanvir/website-database-api-routes branch from 01a17ba to 9925581 Compare October 31, 2025 23:20
tsbhangu and others added 6 commits November 4, 2025 21:39
- Extract crawl_website_job to utils/website/jobs.py for better separation
- Add WebsiteCrawlConfig domain model with default values
- Implement selective sync functions for websites (sync_websites_to_tpuf, sync_websites_to_query_index)
- Track website IDs during crawl for incremental syncing
- Update delete operations to use selective deletion
- Add comprehensive test suite (12 route tests + 10 sync tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
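The commit above mentions a config model with defaults and selective syncing of only the website IDs tracked during a crawl. A minimal sketch of both ideas follows. The field names and default values in `WebsiteCrawlConfig` are assumptions (the commit does not list them), and `sync_selected` is a hypothetical stand-in for the real sync functions, using plain dictionaries in place of the database and the search index.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class WebsiteCrawlConfig:
    """Crawl settings with defaults. Fields and values here are illustrative only."""
    max_pages: int = 100
    max_depth: int = 3
    delay_seconds: float = 0.5


def sync_selected(source: Dict[str, str], index: Dict[str, str],
                  website_ids: Iterable[str]) -> List[str]:
    """Bring only the given IDs in `index` in line with `source`.

    IDs present in the source are upserted; IDs no longer in the source are
    removed from the index. Entries for any other ID are left untouched.
    """
    touched: List[str] = []
    for website_id in website_ids:
        if website_id in source:
            index[website_id] = source[website_id]
        else:
            index.pop(website_id, None)
        touched.append(website_id)
    return touched
```

Compared with rebuilding the whole index, the work here scales with the number of pages a crawl actually changed, which is presumably the motivation for tracking IDs during the crawl.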
…ation

- Fixed af951c45da91 to reference 1a06a4d351f9 instead of missing 2d743e49aaa1
- Created merge migration to combine two branches from initial schema
- Regenerated websites table migration with proper revision chain
- Migration chain: 1a06a4d351f9 -> [af951c45da91, 62afaf912daa] -> 7440621afbb0 -> 8e63cf285ea3

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
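The migration chain in the commit above can be checked mechanically by treating revisions as a graph. The sketch below uses only the revision IDs quoted in the commit message; the `PARENTS` mapping and the helper functions are hypothetical and written against the stdlib `graphlib` module, not against the project's migration tool.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# revision -> parent revisions, transcribed from the chain in the commit message.
PARENTS: Dict[str, Set[str]] = {
    "1a06a4d351f9": set(),
    "af951c45da91": {"1a06a4d351f9"},
    "62afaf912daa": {"1a06a4d351f9"},
    "7440621afbb0": {"af951c45da91", "62afaf912daa"},  # two parents: the merge point
    "8e63cf285ea3": {"7440621afbb0"},
}


def upgrade_order(parents: Dict[str, Set[str]]) -> List[str]:
    """Return revisions in an order where every parent precedes its children."""
    return list(TopologicalSorter(parents).static_order())


def heads(parents: Dict[str, Set[str]]) -> List[str]:
    """Revisions that no other revision builds on. A healthy chain has exactly one."""
    referenced = set().union(*parents.values())
    return sorted(rev for rev in parents if rev not in referenced)


def missing_revisions(parents: Dict[str, Set[str]]) -> List[str]:
    """Parents that are referenced but never defined, i.e. a broken chain."""
    referenced = set().union(*parents.values())
    return sorted(referenced - set(parents))
```

Pointing `af951c45da91` back at `2d743e49aaa1` in this mapping makes `missing_revisions` report that ID, which mirrors the broken reference the commit describes fixing. Without the merge revision, `heads` would return two entries instead of one.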