Commit a3dd8b0

more readme and Dockerfile update
1 parent bb973c3 commit a3dd8b0

4 files changed: +39 -12 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -6,4 +6,5 @@
 - Removed `useGoogleBotHeaders` option (we don't want to impersonate Google anyway)
 - Updated `apify` from `0.18.1` to `1.3.1`
 - `saveSnapshots` is `true` by default
+- Added recognition of Amazon's `hCaptcha`
 - `success` and `wasSuccess` metrics added to output. Success is measured by status being less than 400 and no captcha

Dockerfile

Lines changed: 1 addition & 11 deletions
@@ -1,15 +1,5 @@
-# Dockerfile contains instructions how to build a Docker image that will contain
-# all the code and configuration needed to run your actor. For a full
-# Dockerfile reference, see https://docs.docker.com/engine/reference/builder/

-# First, specify the base Docker image. Apify provides the following base images
-# for your convenience:
-# apify/actor-node-basic (Node.js 10 on Alpine Linux, small and fast image)
-# apify/actor-node-chrome (Node.js 10 + Chrome on Debian)
-# apify/actor-node-chrome-xvfb (Node.js 10 + Chrome + Xvfb on Debian)
-# For more information, see https://apify.com/docs/actor#base-images
-# Note that you can use any other image from Docker Hub.
-FROM apify/actor-node-chrome-xvfb
+FROM apify/actor-node-puppeteer-chrome

 # Second, copy just package.json and package-lock.json since they are the only files
 # that affect NPM install in the next step

README.md

Lines changed: 36 additions & 0 deletions
@@ -1,4 +1,35 @@
 ## Website Checker
+Website checker is a simple actor that allows you to scan any website for performance and blocking.
+
+### Features
+The actor provides these useful features out of the box:
+- Collects response status codes
+- Recognizes the most common captchas
+- Saves HTML snapshots and screenshots (if Puppeteer is chosen)
+- Enables choosing between Cheerio (plain HTTP) and Puppeteer (browser) scraper
+- Enables re-scraping start URLs or enqueueing with a familiar link selector + pseudo URLs system
+- Handles different failure states like timeouts and network errors
+- Enables basic proxy and browser configuration
+
+#### Planned features
+- Usage calculation/stats
+- Better automatic workloads/workload actors
+- Add support for Playwright + Firefox
+
+### How to use
+The most common use-case is to do a quick check on how aggressively the target site is blocking. In that case just supply a start URL, ideally a category one or product one. You can either set `replicateStartUrls` or add enqueueing with `linkSelector` + `pseudoUrls`, both are good options to test different proxies. You can test a few different proxy groups and compare `cheerio` vs `puppeteer` options.
+
+In the end you will get a simple statistics about the blocking rate. It is recommended to check a few screenshots just to make sure the actor correctly recognized the page status. You can get to the detailed output (per URL) via KV store or dataset (the KV output sorts by response status while dataset is simply ordered by scraping order).
+
+#### Checker workloads
+To make your life easier, you can use other actors that will start more checker runs at once and aggregate the result. This way you can test more sites at once or different cheerio/browser and proxy combinations and compare those.
+
+All of these actors are very young so we are glad for any feature ideas:
+[lukaskrivka/website-checker-workload](https://apify.com/lukaskrivka/website-checker-workload)
+[vaclavrut/website-checker-starter](https://apify.com/vaclavrut/website-checker-starter)
+
+### Input
+Please follow the [actor's input page](https://apify.com/lukaskrivka/website-checker/input-schema) for a detailed explanation. Most input fields have reasonable defaults.

 ### Example output


@@ -10,15 +41,20 @@
 "accessDenied": 0,
 "recaptcha": 0,
 "distilCaptcha": 24,
+"hCaptcha": 0,
 "statusCodes": {
 "200": 3,
 "401": 2,
 "403": 5,
 "405": 24
 },
+"success": 3,
 "total": 43
 }
 ```

 #### Detailed output with URLs, screenshots and HTML links
 https://api.apify.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true
+
+### Changelog
+Check history of changes in the [CHANGELOG](https://github.com/metalwarrior665/actor-website-checker/blob/master/CHANGELOG.md)
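
The README text in this diff says the run ends with statistics about the blocking rate, and the example output now carries `success` next to `total`. As a rough sketch of how a reader might turn those two fields into a single number: the helper name `blockingRate` is hypothetical, and treating every non-successful request as blocked is an assumption (timeouts and network errors would be counted too). Only the fields visible in the example output are used.

```javascript
// Aggregated output copied from the fields visible in the README example above.
const output = {
    accessDenied: 0,
    recaptcha: 0,
    distilCaptcha: 24,
    hCaptcha: 0,
    statusCodes: { 200: 3, 401: 2, 403: 5, 405: 24 },
    success: 3,
    total: 43,
};

// Hypothetical helper: share of requests that were NOT counted as successful.
// Assumption: anything that is not a success is treated as blocked.
function blockingRate({ success, total }) {
    if (total === 0) return null; // nothing was checked, so no rate can be given
    return 1 - success / total;
}

console.log(blockingRate(output).toFixed(3)); // prints 0.930
```

For the example above, 3 successes out of 43 requests gives a non-success share of about 93%.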

src/main.js

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@ Apify.main(async () => {

 const wasSuccess = statusCode < 400 && captchas.length === 0;
 if (wasSuccess) {
-    state.wasSuccess.push({ url: request.url, screenshotUrl, htmlUrl });
+    state.success.push({ url: request.url, screenshotUrl, htmlUrl });
 }

 await Apify.pushData({
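
The hunk above renames the state bucket from `wasSuccess` to `success`. The success rule itself (`statusCode < 400 && captchas.length === 0`) is unchanged. A minimal standalone sketch of that rule, and of how per-request results could roll up into counts shaped like the README example, might look as follows. The names `isSuccess` and `summarize` and the result shape are hypothetical and do not come from the actor's source; only the boolean condition does.

```javascript
// The condition is taken from the diff; the function name is hypothetical.
function isSuccess(statusCode, captchas) {
    return statusCode < 400 && captchas.length === 0;
}

// Hypothetical aggregation: counts status codes, successes and the total
// for a list of { statusCode, captchas } results.
function summarize(results) {
    const summary = { statusCodes: {}, success: 0, total: 0 };
    for (const { statusCode, captchas } of results) {
        summary.total += 1;
        summary.statusCodes[statusCode] = (summary.statusCodes[statusCode] || 0) + 1;
        if (isSuccess(statusCode, captchas)) summary.success += 1;
    }
    return summary;
}

const summary = summarize([
    { statusCode: 200, captchas: [] },           // success
    { statusCode: 200, captchas: ['hCaptcha'] }, // 200 but a captcha was detected
    { statusCode: 403, captchas: [] },           // blocked by status code
]);
console.log(summary.success, summary.total); // prints 1 3
```

Note that a `200` response with a detected captcha does not count as a success, which is why `success` can be lower than the `"200"` status count.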
