Releases · laurentftech/KidSearch-Backend

Version 1.0.0 - Initial Release

This is the first official release of the KidSearch Crawler, a high-performance, asynchronous web crawler that populates a Meilisearch instance with content from various web sources. This initial version provides a flexible framework for data collection, with features for crawling modern websites efficiently and respectfully.

Key Features

Asynchronous Crawling: Built with asyncio and aiohttp for high-speed, concurrent crawling of multiple sites.
Flexible Data Sources: Supports both standard HTML websites and structured JSON APIs as content sources.
Incremental Indexing: Utilizes a local cache to intelligently re-index only pages that have changed, significantly speeding up subsequent crawls.
Crawl Resumption: Automatically saves its state and resumes crawling large sites that were not fully indexed in a previous session due to page limits.
Intelligent Content Extraction: Leverages trafilatura for robust main content detection, with fallbacks to custom heuristics and manual CSS selectors for complex layouts.
Multi-lingual Support: Automatically detects the language of HTML pages and allows manual setting for JSON sources, enabling language-specific filtering.
Good Web Citizenship: Fully respects robots.txt directives, including Crawl-delay, and comes with a built-in list of common URL patterns to exclude (e.g., login pages, shopping carts).
Rich Configuration: All crawl targets, rules, and parameters are managed through a single, easy-to-understand sites.yml file.
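
The incremental indexing idea listed above can be illustrated with a small sketch. This is not the crawler's actual code: it assumes the cache is a plain mapping from URL to a SHA-256 hash of the extracted text, and the function names (`content_hash`, `pages_to_reindex`, `load_cache`, `save_cache`) are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reindex(pages: dict[str, str], cache: dict[str, str]) -> list[str]:
    """Return the URLs whose content is new or differs from the cached hash.

    `pages` maps URL -> extracted text; `cache` maps URL -> last seen hash.
    The cache is updated in place so the next crawl compares against
    the hashes recorded here.
    """
    changed = []
    for url, text in pages.items():
        digest = content_hash(text)
        if cache.get(url) != digest:
            changed.append(url)
            cache[url] = digest
    return changed


def load_cache(path: Path) -> dict[str, str]:
    """Load the hash cache from a JSON file, or start empty if it is missing."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_cache(path: Path, cache: dict[str, str]) -> None:
    """Persist the hash cache as JSON for the next crawl session."""
    path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

On a first crawl every page counts as changed. On later crawls only pages whose text hash differs would be sent to the search index again, which is where the speed-up comes from.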

This release establishes a solid foundation for the KidSearch project's data indexing pipeline.
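
As a rough illustration of the robots.txt handling mentioned under Key Features, the sketch below uses Python's standard `urllib.robotparser` module to check whether a URL may be fetched and which Crawl-delay applies. It is an assumption about how such a check could look, not the project's implementation. The sample robots.txt content and the helper names `build_parser` and `crawl_policy` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /login
Disallow: /cart
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_policy(parser: RobotFileParser, user_agent: str, url: str) -> tuple[bool, float]:
    """Return (allowed, delay_in_seconds) for one URL and user agent.

    The delay falls back to 0.0 when robots.txt sets no Crawl-delay.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else 0.0
```

With a hypothetical user agent such as `"KidSearchBot"`, a URL under `/login` would be reported as disallowed, while an ordinary article URL would be allowed with a two-second delay between requests.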
