|
1 | 1 | ## Website Checker
|
| 2 | +Website Checker is a simple actor that lets you check any website for performance and blocking. |
| 3 | + |
| 4 | +### Features |
| 5 | +The actor provides these useful features out of the box: |
| 6 | +- Collects response status codes |
| 7 | +- Recognizes the most common captchas |
| 8 | +- Saves HTML snapshots and screenshots (if Puppeteer is chosen) |
| 9 | +- Enables choosing between the Cheerio (plain HTTP) and Puppeteer (browser) scrapers |
| 10 | +- Enables re-scraping start URLs or enqueueing with a familiar link selector + pseudo URLs system |
| 11 | +- Handles different failure states like timeouts and network errors |
| 12 | +- Enables basic proxy and browser configuration |
| 13 | + |
| 14 | +#### Planned features |
| 15 | +- Usage calculation/stats |
| 16 | +- Better automatic workloads/workload actors |
| 17 | +- Add support for Playwright + Firefox |
| 18 | + |
| 19 | +### How to use |
| 20 | +The most common use case is a quick check of how aggressively the target site blocks requests. In that case, just supply a start URL, ideally a category or product page. You can either set `replicateStartUrls` or add enqueueing with `linkSelector` + `pseudoUrls`; both are good options for testing different proxies. You can test a few different proxy groups and compare the `cheerio` and `puppeteer` options. |
| 21 | + |
| 22 | +In the end, you will get simple statistics about the blocking rate. It is recommended to check a few screenshots to make sure the actor recognized the page status correctly. You can get the detailed output (per URL) via the KV store or the dataset (the KV output is sorted by response status, while the dataset is simply ordered by scraping order). |
| 23 | + |
| 24 | +#### Checker workloads |
| 25 | +To make your life easier, you can use other actors that start several checker runs at once and aggregate the results. This way you can test more sites at once, or compare different Cheerio/browser and proxy combinations. |
| 26 | + |
| 27 | +All of these actors are very young, so we welcome any feature ideas: |
| 28 | +- [lukaskrivka/website-checker-workload](https://apify.com/lukaskrivka/website-checker-workload) |
| 29 | +- [vaclavrut/website-checker-starter](https://apify.com/vaclavrut/website-checker-starter) |
| 30 | + |
| 31 | +### Input |
| 32 | +Please follow the [actor's input page](https://apify.com/lukaskrivka/website-checker/input-schema) for a detailed explanation. Most input fields have reasonable defaults. |
2 | 33 |
|
3 | 34 | ### Example output
|
4 | 35 |
|
|
10 | 41 | "accessDenied": 0,
|
11 | 42 | "recaptcha": 0,
|
12 | 43 | "distilCaptcha": 24,
|
| 44 | + "hCaptcha": 0, |
13 | 45 | "statusCodes": {
|
14 | 46 | "200": 3,
|
15 | 47 | "401": 2,
|
16 | 48 | "403": 5,
|
17 | 49 | "405": 24
|
18 | 50 | },
|
| 51 | + "success": 3, |
19 | 52 | "total": 43
|
20 | 53 | }
|
21 | 54 | ```
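The summary above can be reduced to a single headline figure. The sketch below is a minimal illustration, assuming the blocking rate is taken as the share of requests not counted in `success`. That definition and the helper names `success_rate` and `status_code_shares` are assumptions made for this example, not part of the actor. Only fields visible in the example output are used.

```python
# Subset of the example output; only fields shown in this README are used.
stats = {
    "accessDenied": 0,
    "recaptcha": 0,
    "distilCaptcha": 24,
    "hCaptcha": 0,
    "statusCodes": {"200": 3, "401": 2, "403": 5, "405": 24},
    "success": 3,
    "total": 43,
}


def success_rate(stats: dict) -> float:
    """Share of all requests counted as successful (0.0 if nothing was checked)."""
    total = stats.get("total", 0)
    return stats.get("success", 0) / total if total else 0.0


def status_code_shares(stats: dict) -> dict:
    """Share of all requests that ended with each response status code."""
    total = stats.get("total", 0)
    if not total:
        return {}
    return {code: count / total for code, count in stats.get("statusCodes", {}).items()}


rate = success_rate(stats)
# Assumption: everything that is not a success is treated as blocked or failed.
print(f"success rate: {rate:.1%}")
print(f"blocked or failed: {1 - rate:.1%}")
for code, share in sorted(status_code_shares(stats).items()):
    print(f"HTTP {code}: {share:.1%}")
```

With the example numbers, 3 of 43 requests succeeded, which is roughly 7%.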
|
22 | 55 |
|
23 | 56 | #### Detailed output with URLs, screenshots and HTML links
|
24 | 57 | https://api.apify.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true
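The detailed output is stored as a key-value store record, so its URL follows the pattern of the link above. The following is a minimal sketch, assuming the record is JSON and the store can be read without an API token. `detailed_output_url` and `fetch_detailed_output` are hypothetical helper names, and the store ID is the one from the example link.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def detailed_output_url(store_id: str, key: str = "DETAILED-OUTPUT") -> str:
    """Build the record URL in the same shape as the example link."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}?disableRedirect=true"
    )


def fetch_detailed_output(store_id: str, timeout: float = 30.0):
    """Download and parse the record (assumes a JSON body and a readable store)."""
    with urllib.request.urlopen(detailed_output_url(store_id), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Store ID taken from the example link; replace it with the ID of your own run's store.
    print(detailed_output_url("zT3zxpd53Wv9m9ukQ"))
```

The structure of the returned object is not described here, so inspect it before relying on specific fields.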
|
| 58 | + |
| 59 | +### Changelog |
| 60 | +See the history of changes in the [CHANGELOG](https://github.com/metalwarrior665/actor-website-checker/blob/master/CHANGELOG.md). |