🚀 Thread Count Optimization
A fundamental challenge in web crawler design is determining the optimal number of threads. In this context, "optimal" refers to the highest crawl speed (pages per second) achieved without compromising the number of pages found or broken links detected. The goal is to maximize efficiency while avoiding missed data, server-side blocking, or throttling.
Each experiment crawled a fixed URL at a fixed depth, varying only the number of threads. For each run, we recorded crawl time, visited pages, links found, and broken links. This allowed us to analyze crawl rate, consistency, and the impact of concurrency on both crawler behavior and the target site's response.
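The measurement loop can be sketched as follows. This is a simplified illustration, not BLC's actual code: `fetch` is a stand-in for a real HTTP request (here it just simulates I/O latency), and the URL list is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP fetch; simulates network latency
    # and returns the URL with an HTTP-style status code.
    time.sleep(0.005)
    return url, 200

def benchmark(urls, thread_count):
    """Crawl the URL list with a fixed-size thread pool; return (pages/sec, broken links)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        results = list(pool.map(fetch, urls))
    elapsed = time.perf_counter() - start
    broken = sum(1 for _, status in results if status >= 400)
    return len(results) / elapsed, broken

if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    for threads in (1, 8, 14, 20, 25):
        rate, broken = benchmark(urls, threads)
        print(f"{threads:>2} threads: {rate:7.1f} pages/sec, {broken} broken links")
```

With the simulated latency, crawl rate scales roughly linearly with thread count until the pool saturates, which mirrors the shape of the real measurements below.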
🧷 Important: All experiments were performed under a specific hardware, network, and software configuration. The results are subject to external influences such as network conditions, dynamic web content, or server-side rate limits. Additionally, websites may change behavior across repeated crawls. Therefore, these findings should be considered indicative, not absolute—other setups may yield different outcomes.
Increasing thread count generally improved crawl speed—up to a point. Each site had its own concurrency limit, typically between 14 and 25 threads. Beyond this, performance either plateaued or degraded. In most cases, the number of found and broken links remained consistent, indicating crawl reliability. However, some sites responded to high concurrency with throttling or blocking, reducing performance.
📎 Full experimental data: BLC Experiments Excel
- IKEA Israel: A crawler-friendly site. Performance improved steadily, peaking at 25.57 pages/sec with around 20 threads. Beyond that, throughput slightly declined.
- Wikipedia: Also crawler-friendly. Performance peaked at 11 pages/sec with around 14 threads, with no major gains from additional threads.
The optimal thread count is highly dependent on the specific target website and system setup. While concurrency usually boosts performance, its benefits taper off beyond a site-specific limit due to system overhead or anti-bot mechanisms.
🔧 Rule of Thumb: On the tested setup, 20 threads was the practical sweet spot. Beyond that, crawler-friendly sites showed no improvement, while stricter sites began throttling or blocking requests.
For different environments, re-calibration is recommended to find the local optimum.
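One way to make that re-calibration easy is to keep the tested default but allow an override. A minimal sketch, assuming a hypothetical `BLC_THREADS` environment variable (not part of the actual tool):

```python
import os

# 20 threads was the practical sweet spot on the tested setup.
# BLC_THREADS is a hypothetical override so other environments
# can re-calibrate without changing code.
DEFAULT_THREADS = 20

def thread_count(env=os.environ):
    """Return the configured thread count, falling back to the tested default."""
    return int(env.get("BLC_THREADS", DEFAULT_THREADS))
```

The returned value would then be passed as `max_workers` to the crawler's thread pool.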
- 🚀 Usage Instructions: Learn how to configure, run, and customize BLC.
- 📐 High-Level Design: Explore the project's origin and architecture.
- 🛠️ Implementation Notes: Tools, technologies, and key implementation decisions.
- 🚫 Sites That Restrict Automated Crawling: See what was done and achieved.
- 🚀 Thread Count Optimization: Discussion on thread number optimization.
- 📄 Sample Outputs: See it in action!