Add website crawler infrastructure #4656

tsbhangu · 2025-10-31T23:12:46Z

Summary

Add website crawling utilities (crawler, content extractor, chunker)
Add comprehensive test coverage for all crawling functionality
Add Python dependencies for web scraping (beautifulsoup4, lxml, html2text)

Details

This PR adds the foundational infrastructure for crawling and indexing websites:

New utilities:

DocumentationCrawler: Crawls websites starting from a base URL with configurable filters
ContentExtractor: Extracts and cleans content from HTML pages
MarkdownChunker: Splits markdown content into semantic chunks
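To illustrate what heading-based chunking can look like, here is a minimal sketch. It is not the MarkdownChunker from this PR: the `Chunk` dataclass and `chunk_markdown` function are hypothetical names, and the sketch assumes sections are delimited by ATX headings (`#` to `######`) and ignores the case of `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text ("" for text before any heading)
    text: str


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split markdown at ATX headings, keeping each heading with its body.

    Hypothetical helper for illustration only.
    """
    heading_re = re.compile(r"^#{1,6}\s+(.*)$")
    chunks: list[Chunk] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = heading_re.match(line)
        if match:
            # Close the previous section before starting a new one.
            text = "\n".join(body).strip()
            if text:
                chunks.append(Chunk(heading, text))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    # Flush the final section.
    text = "\n".join(body).strip()
    if text:
        chunks.append(Chunk(heading, text))
    return chunks
```

Carrying the heading alongside each chunk is one simple way to preserve context when chunks are later indexed on their own.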

Key features:

Configurable domain/path filtering
Extracts metadata (title, keywords, description)
Semantic chunking that preserves context
Comprehensive error handling
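As a rough illustration of domain/path filtering, the sketch below decides whether a discovered link should be crawled. The function name `is_allowed` and its parameters are assumptions for this example, not the DocumentationCrawler API; it uses only the standard library.

```python
from urllib.parse import urlparse


def is_allowed(
    url: str,
    allowed_domains: set[str],
    path_prefixes: tuple[str, ...] = ("/",),
) -> bool:
    """Return True if the URL is http(s), on an allowed host, under an allowed path.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    host = (parsed.hostname or "").lower()
    if host not in allowed_domains:
        return False
    path = parsed.path or "/"  # treat a bare domain as the root path
    return path.startswith(path_prefixes)
```

Note that a plain prefix check means `/guide` also matches `/guidebook`; a stricter filter would compare whole path segments.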

Tests:

Added unit tests for each utility class.

This is pure utility code with no API surface or database changes.

Test plan

All tests pass locally

vercel · 2025-10-31T23:12:54Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Updated (UTC)
dev.ferndocs.com	Ready	Preview	Nov 1, 2025 0:50am
fern-dashboard	Ready	Preview	Nov 1, 2025 0:50am
fern-dashboard-dev	Ready	Preview	Nov 1, 2025 0:50am
ferndocs.com	Ready	Preview	Nov 1, 2025 0:50am
preview.ferndocs.com	Ready	Preview	Nov 1, 2025 0:50am
prod-assets.ferndocs.com	Ready	Preview	Nov 1, 2025 0:50am
prod.ferndocs.com	Ready	Preview	Nov 1, 2025 0:50am

1 Skipped Deployment

Project	Deployment	Preview	Updated (UTC)
fern-platform	Ignored		Nov 1, 2025 0:50am

- Add website crawling utilities (crawler, content extractor, chunker)
- Add comprehensive test coverage for all crawling functionality
- Add Python dependencies for web scraping (beautifulsoup4, lxml, html2text)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>


servers/fai/src/fai/utils/website/chunker.py

When a large section is split and some of the resulting chunks are filtered out for being below min_chunk_size, the part_number values were non-sequential and total_parts was incorrect.

Fix: Filter chunks first, then assign sequential part numbers based on the filtered count.

Add test: test_part_numbers_sequential_after_filtering
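The filter-then-number order described above can be sketched as follows. The names `part_number`, `total_parts`, and `min_chunk_size` come from the description; the `number_parts` function, the dict shape, and measuring size as character length are assumptions made for illustration, not the code in chunker.py.

```python
def number_parts(pieces: list[str], min_chunk_size: int) -> list[dict]:
    """Drop undersized pieces first, then number the survivors sequentially.

    Hypothetical helper for illustration only.
    """
    # Step 1: filter. Numbering before this step is what leaves gaps.
    kept = [piece for piece in pieces if len(piece) >= min_chunk_size]
    # Step 2: number from 1 using the filtered count as the total.
    total = len(kept)
    return [
        {"text": text, "part_number": index, "total_parts": total}
        for index, text in enumerate(kept, start=1)
    ]
```

With three pieces where the middle one is too small, numbering before filtering would yield parts 1 and 3 of 3; the order above yields parts 1 and 2 of 2.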

tsbhangu requested a review from eyw520 as a code owner October 31, 2025 23:12

vercel bot deployed to Preview – ferndocs.com October 31, 2025 23:13 View deployment

tsbhangu mentioned this pull request Oct 31, 2025

Tanvir/website database api routes #4657

Merged

vercel bot deployed to Preview – dev.ferndocs.com October 31, 2025 23:18 View deployment

vercel bot deployed to Preview – prod-assets.ferndocs.com October 31, 2025 23:18 View deployment

vercel bot deployed to Preview – preview.ferndocs.com October 31, 2025 23:18 View deployment

vercel bot deployed to Preview – prod.ferndocs.com October 31, 2025 23:18 View deployment

tsbhangu force-pushed the tanvir/website-crawler-infrastructure branch from a59b8fa to 6d208ac Compare October 31, 2025 23:20

tsbhangu had a problem deploying to Fern Dev October 31, 2025 23:20 — with GitHub Actions Failure

vercel bot deployed to Preview – ferndocs.com October 31, 2025 23:20 View deployment

Fix test signatures for _chunk_section method

9ef9b83

vercel bot deployed to Preview – ferndocs.com October 31, 2025 23:22 View deployment

tsbhangu temporarily deployed to Fern Dev October 31, 2025 23:22 — with GitHub Actions Inactive

Apply code formatting and fix line length issues

24e9f5d

vercel bot deployed to Preview – ferndocs.com October 31, 2025 23:25 View deployment

tsbhangu temporarily deployed to Fern Dev October 31, 2025 23:25 — with GitHub Actions Inactive

vercel bot deployed to Preview – prod.ferndocs.com October 31, 2025 23:31 View deployment

vercel bot deployed to Preview – dev.ferndocs.com October 31, 2025 23:31 View deployment

vercel bot deployed to Preview – preview.ferndocs.com October 31, 2025 23:31 View deployment

vercel bot deployed to Preview – prod-assets.ferndocs.com October 31, 2025 23:31 View deployment

Remove website_indexer.py and update FastAPI to match app branch

e2c40d8

- Remove unnecessary website_indexer.py file
- Update test imports to use direct module imports
- Update FastAPI version from 0.116.2 to 0.120.1 to match app branch
- Regenerate poetry.lock file

tsbhangu temporarily deployed to Fern Dev October 31, 2025 23:31 — with GitHub Actions Inactive

vercel bot deployed to Preview – ferndocs.com October 31, 2025 23:31 View deployment

vercel bot deployed to Preview – prod.ferndocs.com October 31, 2025 23:37 View deployment

vercel bot deployed to Preview – dev.ferndocs.com October 31, 2025 23:37 View deployment

vercel bot deployed to Preview – prod-assets.ferndocs.com October 31, 2025 23:37 View deployment

vercel bot deployed to Preview – preview.ferndocs.com October 31, 2025 23:37 View deployment

vercel bot deployed to Preview – fern-dashboard October 31, 2025 23:38 View deployment

vercel bot deployed to Preview – fern-dashboard-dev October 31, 2025 23:38 View deployment

vercel bot reviewed Oct 31, 2025

servers/fai/src/fai/utils/website/chunker.py (outdated)

tsbhangu temporarily deployed to Fern Dev November 1, 2025 00:43 — with GitHub Actions Inactive

vercel bot deployed to Preview – ferndocs.com November 1, 2025 00:43 View deployment

vercel bot deployed to Preview – dev.ferndocs.com November 1, 2025 00:48 View deployment

vercel bot deployed to Preview – prod.ferndocs.com November 1, 2025 00:48 View deployment

vercel bot deployed to Preview – preview.ferndocs.com November 1, 2025 00:48 View deployment

vercel bot deployed to Preview – prod-assets.ferndocs.com November 1, 2025 00:48 View deployment

vercel bot deployed to Preview – fern-dashboard November 1, 2025 00:49 View deployment

vercel bot deployed to Preview – fern-dashboard-dev November 1, 2025 00:50 View deployment

tsbhangu enabled auto-merge (squash) November 2, 2025 15:34

eyw520 approved these changes Nov 3, 2025


tsbhangu merged commit 68f5f74 into app Nov 3, 2025
20 checks passed

tsbhangu deleted the tanvir/website-crawler-infrastructure branch November 3, 2025 16:07


3 participants
