
Commit d38128c

metalwarrior665 authored

feat(academy): mention pw.request and curl-impersonate in anti-scraping (#993)

I'm not very happy that this quick start is getting hard to read, so it would be nice to refactor it eventually.

Co-authored-by: Andrey Bykov <[email protected]>
Co-authored-by: Michał Olender <[email protected]>

1 parent 2742969

1 file changed: 1 addition, 0 deletions

sources/academy/webscraping/anti_scraping/index.md

@@ -27,6 +27,7 @@ If you don't have time to read about the theory behind anti-scraping protections
- Use a browser to pass bot capture challenges. We recommend [Playwright with Firefox](https://crawlee.dev/docs/examples/playwright-crawler-firefox) because Firefox is less commonly used for scraping than Chromium. You can also play with [non-headless mode](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#headless) and adjust other [fingerprint settings](https://crawlee.dev/api/browser-pool/interface/FingerprintGeneratorOptions); a sketch of such a setup follows the list.
- Consider extracting data from **[private APIs](../api_scraping/index.md)** or **mobile app APIs**. They are usually much less protected; see the sketch after the list.
- Increase the number of request retries significantly, to at least 10, with [`maxRequestRetries: 10`](https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#maxRequestRetries). Rotate sessions after every error with [`maxErrorScore: 1`](https://crawlee.dev/api/core/interface/SessionOptions#maxErrorScore); a combined config sketch appears at the end of this section.
+- If you cannot afford to use browsers for performance reasons, you can try [Playwright.request](https://playwright.dev/docs/api/class-playwright#playwright-request) or [curl-impersonate](https://www.npmjs.com/package/node-libcurl) as the HTTP library for the [Cheerio](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [Basic](https://crawlee.dev/api/basic-crawler/class/BasicCrawler) Crawlers instead of their default [got-scraping](https://crawlee.dev/docs/guides/got-scraping) HTTP back end. These libraries have access to native code, which offers much finer control over HTTP traffic and mimics real browsers more closely than a plain Node.js implementation like `got-scraping` can. They should become part of Crawlee itself in the future; a standalone sketch follows the list.
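
For the first bullet, here is a minimal sketch of a Firefox-based setup with Crawlee's `PlaywrightCrawler`, assuming Crawlee v3 and Playwright are installed; the fingerprint values and URL are illustrative, not recommendations:

```ts
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox }, // Firefox is less common in scraping
    headless: false, // non-headless mode can pass some bot checks
    browserPoolOptions: {
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'], // keep fingerprints consistent with the real browser
                operatingSystems: ['windows'],
            },
        },
    },
    async requestHandler({ request, page }) {
        console.log(`Scraping ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```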
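
For the private-API route, a hypothetical example: suppose the DevTools Network tab shows the site loading its data from a JSON endpoint. You can often call that endpoint directly with plain HTTP; the URL below is made up:

```ts
import { gotScraping } from 'got-scraping';

// Hypothetical endpoint spotted in the DevTools Network tab.
const { body } = await gotScraping({
    url: 'https://example.com/api/v1/products?page=1',
    responseType: 'json',
});

console.log(body); // already-structured data, no HTML parsing needed
```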

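Swapping the HTTP back end is not yet a built-in Crawlee option, so as a standalone illustration of the last bullet, here is Playwright's request API fetching a page and handing the HTML to Cheerio (the URL is a placeholder):

```ts
import { request } from 'playwright';
import * as cheerio from 'cheerio';

// Playwright's HTTP stack sends browser-like TLS and headers.
const context = await request.newContext();
const response = await context.get('https://example.com');
const $ = cheerio.load(await response.text());

console.log($('title').text());
await context.dispose();
```
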
In the vast majority of cases, this configuration should lead to success. Success doesn't mean that every request will go through unblocked; that is not realistic. Some IP address and fingerprint combinations will still be blocked, but the automatic retry system takes care of that. If you can get at least 10% of your requests through, you can still scrape the whole website with enough retries. The default [SessionPool](https://crawlee.dev/api/core/class/SessionPool) configuration will preserve the working sessions, and the success rate will gradually increase.
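
Assuming a `CheerioCrawler`, the retry and session settings mentioned above could be wired together like this; a sketch under those assumptions, not a drop-in config:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 10, // expect some requests to fail; retry aggressively
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxErrorScore: 1, // rotate the session after its first error
        },
    },
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']); // placeholder URL
```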
