Commit 549ff76 (merge of 2 parents: 2ec1840 + aa70641)

resolve conflict


64 files changed (+25273 / −653 lines)

.eslintrc.json

Lines changed: 14 additions & 0 deletions
```json
{
    "extends": ["@apify/ts"],
    "parser": "@typescript-eslint/parser",
    "parserOptions": {
        "project": "./tsconfig.eslint.json",
        "sourceType": "module",
        "ecmaVersion": 2020
    },
    "env": {
        "es6": true,
        "es2017": true,
        "es2020": true
    }
}
```
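The ESLint config above points `parserOptions.project` at `./tsconfig.eslint.json`, which is not part of this diff. A minimal sketch of what such a file typically looks like (the `extends` target and `include` globs here are assumptions, not this repo's actual file):

```json
{
    "extends": "./tsconfig.json",
    "include": ["src/**/*", ".eslintrc.json"]
}
```

A separate ESLint-only tsconfig like this is commonly used so that linted files (configs, tests) can be covered by type-aware rules without widening the build's own `include`.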

.gitignore

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,3 +1,3 @@
-apify_storage
 node_modules
-apify_storage
+apify_storage
+dist
```

.npmignore

Lines changed: 0 additions & 4 deletions
This file was deleted.

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -1,10 +1,15 @@
 ### 2021-07-29
+
 *Features*
+
 - Pushing metadata about each page to dataset
 - Added recognition of Amazon's `hCaptcha`
 - `success` and `wasSuccess` metrics added to output. Success is measured by status being less than 400 and no captcha
 
 *Changes*
+
 - Removed `useGoogleBotHeaders` option (we don't want to impersonate Google anyway)
 - Updated `apify` from `0.18.1` to `1.3.1`
 - `saveSnapshots` is `true` by default
+- Added recognition of Amazon's `hCaptcha`
+- `success` and `wasSuccess` metrics added to output. Success is measured by status being less than 400 and no captcha
```
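The `success` rule stated in the changelog (status below 400 and no captcha recognized) can be sketched as follows; the interface and function names are illustrative, not taken from the actor's source:

```typescript
// Hypothetical shape of one per-URL check result (names are assumptions).
interface PageCheckResult {
    statusCode: number;
    wasCaptcha: boolean;
}

// A check counts as successful when the HTTP status is below 400
// and no captcha was recognized on the page, per the changelog entry.
function wasSuccess(result: PageCheckResult): boolean {
    return result.statusCode < 400 && !result.wasCaptcha;
}

console.log(wasSuccess({ statusCode: 200, wasCaptcha: false })); // true
console.log(wasSuccess({ statusCode: 403, wasCaptcha: false })); // false
console.log(wasSuccess({ statusCode: 200, wasCaptcha: true })); // false
```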

Dockerfile

Lines changed: 0 additions & 22 deletions
This file was deleted.

INPUT_SCHEMA.json

Lines changed: 0 additions & 103 deletions
This file was deleted.

README.md

Lines changed: 11 additions & 2 deletions
````diff
@@ -1,8 +1,11 @@
 ## Website Checker
+
 Website checker is a simple actor that allows you to scan any website for performance and blocking.
 
 ### Features
+
 The actor provides these useful features out of the box:
+
 - Collects response status codes
 - Recognizes the most common captchas
 - Saves HTML snapshots and screenshots (if Puppeteer is chosen)
@@ -12,28 +15,33 @@ The actor provides these useful features out of the box:
 - Enables basic proxy and browser configuration
 
 #### Planned features
+
 - Usage calculation/stats
 - Better automatic workloads/workload actors
 - Add support for Playwright + Firefox
 
 ### How to use
+
 The most common use-case is to do a quick check on how aggressively the target site is blocking. In that case just supply a start URL, ideally a category one or product one. You can either set `replicateStartUrls` or add enqueueing with `linkSelector` + `pseudoUrls`, both are good options to test different proxies. You can test a few different proxy groups and compare `cheerio` vs `puppeteer` options.
 
 In the end you will get a simple statistics about the blocking rate. It is recommended to check a few screenshots just to make sure the actor correctly recognized the page status. You can get to the detailed output (per URL) via KV store or dataset (the KV output sorts by response status while dataset is simply ordered by scraping order).
 
 #### Checker workloads
+
 To make your life easier, you can use other actors that will start more checker runs at once and aggregate the result. This way you can test more sites at once or different cheerio/browser and proxy combinations and compare those.
 
 All of these actors are very young so we are glad for any feature ideas:
 [lukaskrivka/website-checker-workload](https://apify.com/lukaskrivka/website-checker-workload)
 [vaclavrut/website-checker-starter](https://apify.com/vaclavrut/website-checker-starter)
 
 ### Input
+
 Please follow the [actor's input page](https://apify.com/lukaskrivka/website-checker/input-schema) for a detailed explanation. Most input fields have reasonable defaults.
 
 ### Example output
 
 #### Simple output
+
 ```
 {
     "timeouted": 0,
@@ -54,7 +62,8 @@ Please follow the [actor's input page](https://apify.com/lukaskrivka/website-che
 ```
 
 #### Detailed output with URLs, screenshots and HTML links
-https://api.apify.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true
+<https://api.apify.com/v2/key-value-stores/zT3zxpd53Wv9m9ukQ/records/DETAILED-OUTPUT?disableRedirect=true>
 
 ### Changelog
-Check history of changes in the [CHANGELOG](https://github.com/metalwarrior665/actor-website-checker/blob/master/CHANGELOG.md)
+
+Check history of changes in the [CHANGELOG](https://github.com/metalwarrior665/actor-website-checker/blob/master/CHANGELOG.md)
````
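The "simple statistics about the blocking rate" that the README describes could be aggregated roughly along these lines; apart from `timeouted` (which appears in the sample output), the field and function names here are hypothetical:

```typescript
// Hypothetical per-URL result shape; statusCode is null on timeout.
interface UrlResult {
    statusCode: number | null;
    wasCaptcha: boolean;
}

// Roll per-URL results up into a small summary object resembling
// the "simple output" the README mentions (field names are illustrative).
function summarize(results: UrlResult[]) {
    const timeouted = results.filter((r) => r.statusCode === null).length;
    const captchas = results.filter((r) => r.wasCaptcha).length;
    const success = results.filter(
        (r) => r.statusCode !== null && r.statusCode < 400 && !r.wasCaptcha,
    ).length;
    return { timeouted, captchas, success, total: results.length };
}

console.log(
    summarize([
        { statusCode: 200, wasCaptcha: false },
        { statusCode: 403, wasCaptcha: true },
        { statusCode: null, wasCaptcha: false },
    ]),
); // { timeouted: 1, captchas: 1, success: 1, total: 3 }
```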

apify.json

Lines changed: 0 additions & 7 deletions
This file was deleted.

checker-cheerio/Dockerfile

Lines changed: 25 additions & 0 deletions
```dockerfile
FROM apify/actor-node:16

# Copy the package.json and package-lock.json files, since they are the only files
# that affect NPM install in the next step
COPY package*.json ./

# Install NPM packages, skip optional and development dependencies to keep the
# image small. Avoid logging too much and print the dependency tree for debugging

# Log Node version
RUN echo "Node.js version:" && node --version

# Log npm version
RUN echo "NPM version:" && npm --version

# Install all runtime dependencies
RUN npm --quiet set progress=false \
    && npm ci --only=prod --no-optional \
    && echo "Installed NPM packages:" \
    && (npm ls || true)

# Next, copy the remaining files and directories with the built source code.
# Since we do this after NPM install, quick build will be really fast
# for simple source file changes.
COPY ./src ./dist
```

checker-cheerio/INPUT_SCHEMA.json

Lines changed: 99 additions & 0 deletions
```json
{
    "title": "Web Checker",
    "description": "The web checker actor loads <b>URLs to check</b> and checks for common captchas, status codes returned from crawling, as well as calculates the price a user may pay. <b>TODO: Needs to be more descriptive!!</b>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urlsToCheck": {
            "title": "URLs to check",
            "type": "array",
            "description": "A static list of URLs to check for captchas. To be able to add new URLs on the fly, enable the <b>Use request queue</b> option.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> in README.",
            "sectionCaption": "Checker Options",
            "sectionDescription": "Options that will be passed to the checkers",
            "editor": "requestListSources",
            "prefill": [
                {
                    "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011"
                }
            ]
        },
        "proxyConfiguration": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
            "default": {},
            "editor": "proxy",
            "prefill": {
                "useApifyProxy": false
            }
        },
        "saveSnapshot": {
            "title": "Enabled",
            "type": "boolean",
            "description": "Will save HTML for Cheerio and HTML + screenshot for Puppeteer/Playwright",
            "editor": "checkbox",
            "groupCaption": "Save Snapshots"
        },
        "linkSelector": {
            "title": "Link Selector",
            "type": "string",
            "description": "A CSS selector saying which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. This setting only applies if <b>Use request queue</b> is enabled. To filter the links added to the queue, use the <b>Pseudo-URLs</b> setting.<br><br>If <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.",
            "sectionCaption": "Crawler Options",
            "sectionDescription": "Specific options that are relevant for crawlers",
            "editor": "textfield",
            "prefill": "a[href]",
            "minLength": 1
        },
        "pseudoUrls": {
            "title": "Pseudo-URLs",
            "type": "array",
            "description": "Specifies what kind of URLs found by <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. This setting only applies if the <b>Use request queue</b> option is enabled.<br><br>If <b>Pseudo-URLs</b> are omitted, the actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.",
            "default": [],
            "editor": "pseudoUrls",
            "prefill": [
                {
                    "purl": "https://www.amazon.com[.*]/dp/[.*]"
                }
            ]
        },
        "repeatChecksOnProvidedUrls": {
            "title": "Repeat checks on provided URLs",
            "type": "integer",
            "description": "Will access each URL multiple times. Useful to test the same URL or bypass blocking of the first page.",
            "editor": "number"
        },
        "maxNumberOfPagesCheckedPerDomain": {
            "title": "Max number of pages checked per domain",
            "type": "integer",
            "description": "The maximum number of pages that the checker will load. The checker will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.",
            "default": 100,
            "editor": "number"
        },
        "maxConcurrentPagesCheckedPerDomain": {
            "title": "Maximum concurrent pages checked per domain",
            "type": "integer",
            "description": "Specifies the maximum number of pages that can be processed by the checker in parallel for one domain. The checker automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target website.",
            "default": 50,
            "editor": "number",
            "minimum": 1
        },
        "maxConcurrentDomainsChecked": {
            "title": "Maximum number of concurrent domains checked",
            "type": "integer",
            "description": "Specifies the maximum number of domains that should be checked at a time. This setting is relevant when passing in more than one URL to check.",
            "default": 5,
            "editor": "number",
            "minimum": 1,
            "maximum": 10
        },
        "retireBrowserInstanceAfterRequestCount": {
            "title": "Retire browser instance after request count",
            "type": "integer",
            "description": "How often will the browser itself rotate. Pick a higher number for smaller consumption, pick a lower number to rotate (test) more proxies.",
            "default": 10,
            "editor": "number",
            "minimum": 1
        }
    },
    "required": ["urlsToCheck"]
}
```
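The pseudo-URL syntax described in the `pseudoUrls` field above (a URL with regular-expression fragments enclosed in `[]` brackets) can be illustrated with a rough matcher. This is a simplified sketch, not Apify's actual implementation; a real one would also handle escaping of `[` inside the literal parts:

```typescript
// Escape regex metacharacters in the literal (non-bracketed) parts of a purl.
function escapeLiteral(text: string): string {
    return text.replace(/[.*+?^${}()|\\/[\]]/g, "\\$&");
}

// Convert a pseudo-URL like "https://www.amazon.com[.*]/dp/[.*]" into a RegExp:
// text outside [...] is matched literally, text inside is kept as raw regex.
// Assumes a well-formed purl (every "[" has a matching "]").
function pseudoUrlToRegExp(purl: string): RegExp {
    let pattern = "";
    let rest = purl;
    while (rest.length > 0) {
        const open = rest.indexOf("[");
        if (open === -1) {
            pattern += escapeLiteral(rest);
            break;
        }
        const close = rest.indexOf("]", open);
        pattern += escapeLiteral(rest.slice(0, open));
        pattern += rest.slice(open + 1, close); // raw regex fragment
        rest = rest.slice(close + 1);
    }
    return new RegExp(`^${pattern}$`);
}

const re = pseudoUrlToRegExp("https://www.amazon.com[.*]/dp/[.*]");
console.log(re.test("https://www.amazon.com/dp/B08N5WRWNW")); // true
console.log(re.test("https://www.amazon.com/gp/cart")); // false
```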

0 commit comments