Releases: laurentftech/KidSearch-Backend
Version 1.0.0 - Initial Release
This is the first official release of the KidSearch Crawler, a high-performance, asynchronous web crawler that populates a Meilisearch instance with content from various web sources. This initial version provides a flexible framework for data collection, with features for crawling modern websites efficiently and respectfully.
Key Features
- Asynchronous Crawling: Built with asyncio and aiohttp for high-speed, concurrent crawling of multiple sites.
- Flexible Data Sources: Supports both standard HTML websites and structured JSON APIs as content sources.
- Incremental Indexing: Utilizes a local cache to intelligently re-index only pages that have changed, significantly speeding up subsequent crawls.
- Crawl Resumption: Automatically saves its state, so a large site that was cut off by a page limit in one session resumes from where it stopped in the next.
- Intelligent Content Extraction: Leverages trafilatura for robust main content detection, with fallbacks to custom heuristics and manual CSS selectors for complex layouts.
- Multi-lingual Support: Automatically detects the language of HTML pages and allows manual setting for JSON sources, enabling language-specific filtering.
- Good Web Citizenship: Fully respects robots.txt directives, including Crawl-delay, and comes with a built-in list of common URL patterns to exclude (e.g., login pages, shopping carts).
- Rich Configuration: All crawl targets, rules, and parameters are managed through a single, easy-to-understand sites.yml file.
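To illustrate the incremental indexing idea from the list above, here is a minimal sketch of a local cache that re-indexes a page only when its content has changed. It is not the crawler's actual implementation: the `CrawlCache` class, the JSON file layout, and the choice of SHA-256 content hashes are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Return a stable fingerprint of the extracted page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class CrawlCache:
    """Hypothetical URL -> content-hash cache persisted as a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.hashes = {}
        # Load hashes saved by a previous crawl, if any.
        if self.path.exists():
            self.hashes = json.loads(self.path.read_text(encoding="utf-8"))

    def has_changed(self, url: str, text: str) -> bool:
        """True if the page is new or its content differs from the cached hash."""
        return self.hashes.get(url) != content_hash(text)

    def record(self, url: str, text: str) -> None:
        """Remember the current content hash for this URL."""
        self.hashes[url] = content_hash(text)

    def save(self) -> None:
        """Write the cache to disk so the next crawl can reuse it."""
        self.path.write_text(json.dumps(self.hashes), encoding="utf-8")
```

In a crawl loop, a page would be sent to the index only when `has_changed` returns true, followed by `record`, with `save` called once at the end of the session.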
This release establishes a solid foundation for the KidSearch project's data indexing pipeline.
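As an illustration of the robots.txt handling described under Good Web Citizenship, the following sketch uses Python's standard `urllib.robotparser` to check whether a URL may be fetched and which delay to apply. The crawler's own code may work differently; the sample robots.txt content, the `KidSearchBot` user agent name, and the `fetch_plan` helper with its default delay are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /login
Disallow: /cart
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser, user_agent, url, default_delay=0.5):
    """Return (allowed, delay_in_seconds) for one URL.

    Falls back to default_delay (a hypothetical setting) when the site
    does not declare a Crawl-delay for this user agent.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else default_delay


policy = build_policy(ROBOTS_TXT)
print(fetch_plan(policy, "KidSearchBot", "https://example.org/articles/1"))
print(fetch_plan(policy, "KidSearchBot", "https://example.org/login"))
```

An asynchronous crawler would typically wait for the returned delay between requests to the same host and skip any URL for which `allowed` is false.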