|
| 1 | +{ |
| 2 | + "title": "Web Checker", |
| 3 | +  "description": "The Web Checker actor loads a list of <b>URLs to check</b> and tests each one for common captchas and blocking, records the HTTP status codes returned from crawling, and estimates the platform cost a user may pay for the check.", |
| 4 | + "type": "object", |
| 5 | + "schemaVersion": 1, |
| 6 | + "properties": { |
| 7 | + "urlsToCheck": { |
| 8 | + "title": "URLs to check", |
| 9 | + "type": "array", |
| 10 | + "description": "A static list of URLs to check for captchas. To be able to add new URLs on the fly, enable the <b>Use request queue</b> option.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> in README.", |
| 11 | + "sectionCaption": "Checker Options", |
| 12 | + "sectionDescription": "Options that will be passed to the checkers", |
| 13 | + "editor": "requestListSources", |
| 14 | + "prefill": [ |
| 15 | + { |
| 16 | + "url": "https://www.amazon.com/b?ie=UTF8&node=11392907011" |
| 17 | + } |
| 18 | + ] |
| 19 | + }, |
| 20 | + "proxyConfiguration": { |
| 21 | + "title": "Proxy Configuration", |
| 22 | + "type": "object", |
| 23 | + "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.", |
| 24 | + "default": {}, |
| 25 | + "editor": "proxy", |
| 26 | + "prefill": { |
| 27 | + "useApifyProxy": false |
| 28 | + } |
| 29 | + }, |
| 30 | + "saveSnapshot": { |
| 31 | + "title": "Enabled", |
| 32 | + "type": "boolean", |
| 33 | + "description": "Will save HTML for Cheerio and HTML + screenshot for Puppeteer/Playwright", |
| 34 | + "editor": "checkbox", |
| 35 | + "groupCaption": "Save Snapshots" |
| 36 | + }, |
| 37 | + "linkSelector": { |
| 38 | + "title": "Link Selector", |
| 39 | + "type": "string", |
| 40 | +      "description": "A CSS selector saying which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. This setting only applies if <b>Use request queue</b> is enabled. To filter the links added to the queue, use the <b>Pseudo-URLs</b> setting.<br><br>If <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.", |
| 41 | + "sectionCaption": "Crawler Options", |
| 42 | + "sectionDescription": "Specific options that are relevant for crawlers", |
| 43 | + "editor": "textfield", |
| 44 | + "prefill": "a[href]", |
| 45 | + "minLength": 1 |
| 46 | + }, |
| 47 | + "pseudoUrls": { |
| 48 | + "title": "Pseudo-URLs", |
| 49 | + "type": "array", |
| 50 | + "description": "Specifies what kind of URLs found by <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. This setting only applies if the <b>Use request queue</b> option is enabled.<br><br>If <b>Pseudo-URLs</b> are omitted, the actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.", |
| 51 | + "default": [], |
| 52 | + "editor": "pseudoUrls", |
| 53 | + "prefill": [ |
| 54 | + { |
| 55 | + "purl": "https://www.amazon.com[.*]/dp/[.*]" |
| 56 | + } |
| 57 | + ] |
| 58 | + }, |
| 59 | + "repeatChecksOnProvidedUrls": { |
| 60 | + "title": "Repeat checks on provided URLs", |
| 61 | + "type": "integer", |
| 62 | + "description": "Will access each URL multiple times. Useful to test the same URL or bypass blocking of the first page.", |
| 63 | + "editor": "number" |
| 64 | + }, |
| 65 | + "maxNumberOfPagesCheckedPerDomain": { |
| 66 | + "title": "Max number of pages checked per domain", |
| 67 | + "type": "integer", |
| 68 | + "description": "The maximum number of pages that the checker will load. The checker will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.", |
| 69 | + "default": 100, |
| 70 | + "editor": "number" |
| 71 | + }, |
| 72 | + "maxConcurrentPagesCheckedPerDomain": { |
| 73 | + "title": "Maximum concurrent pages checked per domain", |
| 74 | + "type": "integer", |
| 75 | + "description": "Specifies the maximum number of pages that can be processed by the checker in parallel for one domain. The checker automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target website.", |
| 76 | + "default": 50, |
| 77 | + "editor": "number", |
| 78 | + "minimum": 1 |
| 79 | + }, |
| 80 | + "maxConcurrentDomainsChecked": { |
| 81 | + "title": "Maximum number of concurrent domains checked", |
| 82 | + "type": "integer", |
| 83 | + "description": "Specifies the maximum number of domains that should be checked at a time. This setting is relevant when passing in more than one URL to check.", |
| 84 | + "default": 5, |
| 85 | + "editor": "number", |
| 86 | + "minimum": 1, |
| 87 | + "maximum": 10 |
| 88 | + }, |
| 89 | + "retireBrowserInstanceAfterRequestCount": { |
| 90 | + "title": "Retire browser instance after request count", |
| 91 | + "type": "integer", |
| 92 | + "description": "How often will the browser itself rotate. Pick a higher number for smaller consumption, pick a lower number to rotate (test) more proxies.", |
| 93 | + "default": 10, |
| 94 | + "editor": "number", |
| 95 | + "minimum": 1 |
| 96 | + } |
| 97 | + }, |
| 98 | + "required": ["urlsToCheck"] |
| 99 | +} |