@tsbhangu

Summary

  • Add website crawling utilities (crawler, content extractor, chunker)
  • Add comprehensive test coverage for all crawling functionality
  • Add Python dependencies for web scraping (beautifulsoup4, lxml, html2text)

Details

This PR adds the foundational infrastructure for crawling and indexing websites:

New utilities:

  • DocumentationCrawler: Crawls websites starting from a base URL with configurable filters
  • ContentExtractor: Extracts and cleans content from HTML pages
  • MarkdownChunker: Splits markdown content into semantic chunks
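The "configurable filters" on the crawler could look roughly like the following. This is a minimal sketch using only the standard library, not the PR's actual code: the function names `should_crawl` and `normalize_link`, the `include_paths` / `exclude_paths` parameters, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def should_crawl(url, base_url, include_paths=(), exclude_paths=()):
    """Return True if `url` shares the base URL's host and passes the path filters.

    Hypothetical helper: the parameter names are illustrative, not the PR's API.
    """
    target, base = urlparse(url), urlparse(base_url)
    # Only follow web links (skips mailto:, javascript:, and similar schemes).
    if target.scheme not in ("http", "https"):
        return False
    # Domain filter: stay on the same host as the base URL.
    if target.netloc.lower() != base.netloc.lower():
        return False
    path = target.path or "/"
    # Path filters: exclusions win, then an optional allow-list of prefixes.
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    if include_paths and not any(path.startswith(p) for p in include_paths):
        return False
    return True


def normalize_link(page_url, href):
    """Resolve a possibly relative href against the page URL and drop the fragment."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute
```

Dropping the fragment before queueing a link keeps `#section` anchors on the same page from being crawled as separate URLs.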

Key features:

  • Configurable domain/path filtering
  • Extracts metadata (title, keywords, description)
  • Semantic chunking that preserves context
  • Comprehensive error handling
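One way to read "semantic chunking that preserves context" is to split at markdown headings and attach the trail of parent headings to each chunk. The sketch below illustrates that idea under stated assumptions; `chunk_by_headings` and the chunk dictionary shape are hypothetical and may differ from how `MarkdownChunker` actually works.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_by_headings(markdown):
    """Split markdown at ATX headings; each chunk records its heading trail.

    Hypothetical helper for illustration, not the PR's MarkdownChunker.
    """
    chunks = []
    trail = []  # (level, title) pairs forming the current heading path
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in trail], "text": text})
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # close the previous chunk under the old heading trail
            level = len(match.group(1))
            # Pop headings at the same or deeper level before descending.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Carrying the heading trail means a chunk retrieved on its own still says which page section it came from.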

Tests:

  • Added unit tests for each utility class.

This is pure utility code with no API surface or database changes.

Test plan

  • All tests pass locally

@tsbhangu tsbhangu requested a review from eyw520 as a code owner October 31, 2025 23:12

- Add website crawling utilities (crawler, content extractor, chunker)
- Add comprehensive test coverage for all crawling functionality
- Add Python dependencies for web scraping (beautifulsoup4, lxml, html2text)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Remove unnecessary website_indexer.py file
- Update test imports to use direct module imports
- Update FastAPI version from 0.116.2 to 0.120.1 to match app branch
- Regenerate poetry.lock file

When large sections were split and some chunks were filtered out for
being below min_chunk_size, part_number values were non-sequential and
total_parts was incorrect.

Fix: Filter chunks first, then assign sequential part numbers based
on the filtered count.

Add test: test_part_numbers_sequential_after_filtering
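
The filter-then-number order described in this commit can be sketched as follows. The field names `part_number`, `total_parts`, and `min_chunk_size` come from the commit message; the function name `number_parts`, the dictionary shape, and measuring size in characters are assumptions for illustration.

```python
def number_parts(pieces, min_chunk_size):
    """Drop pieces shorter than min_chunk_size, then number the survivors 1..N.

    Hypothetical helper: size is assumed to be a character count.
    """
    # Filter first, so numbering only ever sees the chunks that are kept.
    kept = [p for p in pieces if len(p.strip()) >= min_chunk_size]
    total = len(kept)
    return [
        {"text": text, "part_number": index, "total_parts": total}
        for index, text in enumerate(kept, start=1)
    ]
```

Numbering before filtering would instead leave gaps (for example parts 1 and 3 of 3) whenever a middle piece is dropped, which is the bug the commit describes.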
@tsbhangu tsbhangu merged commit 68f5f74 into app Nov 3, 2025
20 checks passed
@tsbhangu tsbhangu deleted the tanvir/website-crawler-infrastructure branch November 3, 2025 16:07