🚀 Thread Count Optimization

Yohay Ohayon edited this page Apr 18, 2025 · 2 revisions

🧩 Thread Optimization in Web Crawling

A fundamental challenge in web crawler design is determining the optimal number of threads. In this context, "optimal" refers to the highest crawl speed (pages per second) achieved without compromising the number of pages found or broken links detected. The goal is to maximize efficiency while avoiding missed data, server-side blocking, or throttling.


🧪 Experiment Methodology

Each experiment crawled a fixed URL at a fixed depth, varying only the number of threads. For each run, we recorded crawl time, visited pages, links found, and broken links. From these we analyzed crawl rate, the consistency of results across runs, and the effect of concurrency on both the crawler's behavior and the target site's response.
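The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual harness: `RunResult`, `run_experiment`, and the `crawl(threads)` callable that returns `(pages, links, broken)` are all hypothetical names assumed for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class RunResult:
    """Metrics recorded for one crawl at a given thread count."""
    threads: int
    seconds: float
    pages_visited: int
    links_found: int
    broken_links: int

    @property
    def pages_per_sec(self) -> float:
        # Crawl rate; guard against a zero-length timing.
        return self.pages_visited / self.seconds if self.seconds > 0 else 0.0


# Hypothetical crawler entry point: takes a thread count and returns
# (pages_visited, links_found, broken_links) for a fixed URL and depth.
CrawlFn = Callable[[int], Tuple[int, int, int]]


def run_experiment(crawl: CrawlFn, thread_counts: Iterable[int]) -> List[RunResult]:
    """Run the same crawl once per thread count and time each run."""
    results = []
    for n in thread_counts:
        start = time.perf_counter()
        pages, links, broken = crawl(n)
        elapsed = time.perf_counter() - start
        results.append(RunResult(n, elapsed, pages, links, broken))
    return results
```

Passing the crawler in as a callable keeps the URL and depth fixed inside it, so the thread count is the only variable the loop changes.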

🧷 Important: All experiments were performed under a specific hardware, network, and software configuration. The results are subject to external influences such as network conditions, dynamic web content, or server-side rate limits. Additionally, websites may change behavior across repeated crawls. Therefore, these findings should be considered indicative, not absolute—other setups may yield different outcomes.


📊 Results Summary

Increasing thread count generally improved crawl speed—up to a point. Each site had its own concurrency limit, typically between 14 and 25 threads. Beyond this, performance either plateaued or degraded. In most cases, the number of found and broken links remained consistent, indicating crawl reliability. However, some sites responded to high concurrency with throttling or blocking, reducing performance.
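One way to make "improved, plateaued, or degraded" concrete is to compare each run's crawl rate with the previous, lower thread count. The sketch below is an assumption about how that could be done, not the analysis used for the published data; `classify_steps` and the 5% `plateau_band` are hypothetical choices.

```python
from typing import List, Tuple


def classify_steps(
    rates: List[Tuple[int, float]], plateau_band: float = 0.05
) -> List[Tuple[int, str]]:
    """Label each thread-count step relative to the one before it.

    rates: (threads, pages_per_sec) pairs sorted by thread count,
           with positive rates.
    plateau_band: relative change treated as noise (assumed 5%).
    """
    labels = []
    for (_, prev), (threads, cur) in zip(rates, rates[1:]):
        if prev <= 0:
            raise ValueError("rates must be positive")
        change = (cur - prev) / prev
        if change > plateau_band:
            label = "improved"
        elif change < -plateau_band:
            label = "degraded"
        else:
            label = "plateaued"
        labels.append((threads, label))
    return labels
```

A "degraded" step on an otherwise crawler-friendly site may point to local overhead, while the same label on a stricter site may indicate throttling; the rate alone does not distinguish the two.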

📎 Full experimental data: BLC Experiments Excel


📌 Illustrative Examples

  • IKEA Israel: A crawler-friendly site. Performance improved steadily, peaking at 25.57 pages/sec with around 20 threads. Beyond that, throughput slightly declined.

  • Wikipedia: Also crawler-friendly. Performance peaked at 11 pages/sec with around 14 threads, with no major gains from additional threads.


✅ Conclusion & Recommendation

The optimal thread count is highly dependent on the specific target website and system setup. While concurrency usually boosts performance, its benefits taper off beyond a site-specific limit due to system overhead or anti-bot mechanisms.

🔧 Rule of Thumb: On the tested setup, 20 threads was the practical sweet spot. Beyond that, crawler-friendly sites showed no improvement, while stricter sites began throttling or blocking requests.

For different environments, re-calibration is recommended to find the local optimum.
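As a rough sketch of such a re-calibration, the function below picks the smallest thread count whose crawl rate is close to the best observed rate, considering only runs whose link counts match the lowest-concurrency baseline. The function name, the tuple layout, the exact-match consistency check, and the 5% tolerance are all assumptions made for illustration.

```python
from typing import List, Tuple

# (threads, pages_per_sec, links_found, broken_links)
Run = Tuple[int, float, int, int]


def pick_thread_count(runs: List[Run], tolerance: float = 0.05) -> int:
    """Return the smallest thread count within `tolerance` of the peak rate.

    Runs whose link counts differ from the lowest-thread baseline are
    excluded, on the assumption that a mismatch signals missed data or
    blocking rather than a genuine speedup.
    """
    if not runs:
        raise ValueError("no runs to evaluate")
    baseline = min(runs, key=lambda r: r[0])
    consistent = [r for r in runs if r[2] == baseline[2] and r[3] == baseline[3]]
    peak = max(r[1] for r in consistent)
    candidates = [r for r in consistent if r[1] >= peak * (1 - tolerance)]
    return min(candidates, key=lambda r: r[0])[0]


# Made-up numbers for illustration only, not measured results.
example_runs = [
    (1, 2.0, 500, 3),
    (10, 15.0, 500, 3),
    (20, 25.0, 500, 3),
    (30, 24.5, 500, 3),
    (40, 30.0, 420, 1),  # faster, but link counts diverge from the baseline
]
chosen = pick_thread_count(example_runs)
```

Preferring the smallest qualifying thread count keeps load on the target site low when extra threads bring no measurable gain. On sites with dynamic content, an exact match on link counts may be too strict, and a small allowed difference could be substituted.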
