Commit 1008549

Merge pull request #450 from apify/new-content-migrations

docs: migrations

2 parents e5a5a65 + dd398a3
---
title: Running a web server on the Apify platform
description: A web server running in an actor can act as a communication channel with the outside world. Learn how to easily set one up with Node.js.
menuWeight: 5.4
paths:
- apify-platform/running-a-web-server
---

# Running a web server on the Apify platform

Sometimes, an actor needs a channel for communication with other systems (or humans). This channel might be used to receive commands, to provide info about progress, or both. To implement this, we will run an HTTP web server inside the actor that will provide:

- An API to receive commands.
- An HTML page displaying output data.

Running a web server in an actor is a piece of cake! Each actor run is available at a unique URL (container URL), which always takes the form `https://CONTAINER-KEY.runs.apify.net`. This URL is available in the [**actor run** object](https://docs.apify.com/api/v2#/reference/actor-runs/run-object-and-its-storages/get-run) returned by the Apify API, as well as in the Apify console.
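For illustration, here's a minimal sketch of reading the container URL from the run object via the Apify API (assuming Node.js 18+ with a global `fetch`; the `RUN_ID` and `APIFY_TOKEN` environment variables are placeholders of ours, and we assume the run object exposes the URL as `containerUrl` - check the run object reference linked above):

```JavaScript
// Minimal sketch: fetch the run object and read its container URL.
// RUN_ID and APIFY_TOKEN are placeholder environment variables of ours.
const { RUN_ID, APIFY_TOKEN } = process.env;

const response = await fetch(`https://api.apify.com/v2/actor-runs/${RUN_ID}?token=${APIFY_TOKEN}`);
const { data } = await response.json();

// Assumed field name; see the run object documentation.
console.log(data.containerUrl); // e.g. https://CONTAINER-KEY.runs.apify.net
```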
If you start a web server on the port defined by the **APIFY_CONTAINER_PORT** environment variable (the default value is **4321**), the container URL becomes available and is displayed in the **Live View** tab in the actor run console.

For more details, see [the documentation](https://docs.apify.com/actor/run#container-web-server).
## [](#building-the-actor) Building the actor

Let's try to build the following actor:

- The actor will provide an API to receive URLs to be processed.
- For each URL, the actor will create a screenshot.
- The screenshot will be stored in the key-value store.
- The actor will provide a web page displaying thumbnails linked to screenshots and an HTML form to submit new URLs.

To achieve this, we will use the following technologies:

- The [Express.js](https://expressjs.com) framework to create the server.
- [Puppeteer](https://pptr.dev) to grab screenshots.
- The [Apify SDK](https://sdk.apify.com) to access the Apify storages where the screenshots will be stored.

Our server needs two paths:

- `/` - The index path will display a page with a form to submit a new URL and the thumbnails of processed URLs.
- `/add-url` - Will provide an API for adding new URLs using an HTTP POST request.

First, we'll import `express` and create an Express.js app. Then, we'll add some middleware that will allow us to receive form submissions.
```JavaScript
import Apify from 'apify';
import express from 'express';

const app = express();

app.use(express.json());
app.use(express.urlencoded({ extended: true }));
```
Now we need to read the following environment variables:

- **APIFY_CONTAINER_PORT** contains the port number on which we must start the server.
- **APIFY_CONTAINER_URL** contains the URL under which we can access the container.
- **APIFY_DEFAULT_KEY_VALUE_STORE_ID** is simply the ID of this actor's default key-value store, where we can store the screenshots.
```JavaScript
const {
    APIFY_CONTAINER_PORT,
    APIFY_CONTAINER_URL,
    APIFY_DEFAULT_KEY_VALUE_STORE_ID,
} = process.env;
```
Next, we'll create an array of the processed URLs, where the **n**th URL has its screenshot stored under the key **n**.jpg in the key-value store.

```JavaScript
const processedUrls = [];
```

After that, the index route is ready to be defined.
```JavaScript
app.get('/', (req, res) => {
    let listItems = '';

    // For each of the processed URLs, add a list item with its thumbnail.
    processedUrls.forEach((url, index) => {
        const imageUrl = `https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${index}.jpg`;

        // Display the screenshots below the form
        listItems += `<li>
            <a href="${imageUrl}" target="_blank">
                <img src="${imageUrl}" width="300px" />
                <br />
                ${url}
            </a>
        </li>`;
    });

    const pageHtml = `<html>
        <head><title>Example</title></head>
        <body>
            <form method="POST" action="${APIFY_CONTAINER_URL}/add-url">
                URL: <input type="text" name="url" placeholder="http://example.com" />
                <input type="submit" value="Add" />
                <hr />
                <ul>${listItems}</ul>
            </form>
        </body>
    </html>`;

    res.send(pageHtml);
});
```
And then a second path that receives the new URL submitted using the HTML form; after the URL is processed, it redirects the user back to the root path.
```JavaScript
app.post('/add-url', async (req, res) => {
    const { url } = req.body;
    console.log(`Got new URL: ${url}`);

    // Start a Chrome browser and open a new page ...
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();

    // ... go to our URL and grab a screenshot ...
    await page.goto(url);
    const screenshot = await page.screenshot({ type: 'jpeg' });

    // ... close the browser ...
    await page.close();
    await browser.close();

    // ... save the screenshot to the key-value store and add the URL to processedUrls.
    await Apify.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
    processedUrls.push(url);

    res.redirect('/');
});
```
And finally, we need to start the web server.
```JavaScript
// Start the web server!
app.listen(APIFY_CONTAINER_PORT, () => {
    console.log(`Application is listening at URL ${APIFY_CONTAINER_URL}.`);
});
```

### [](#final-code) Final code
```JavaScript
import Apify from 'apify';
import express from 'express';

const app = express();

app.use(express.json());
app.use(express.urlencoded({ extended: true }));

const {
    APIFY_CONTAINER_PORT,
    APIFY_CONTAINER_URL,
    APIFY_DEFAULT_KEY_VALUE_STORE_ID,
} = process.env;

const processedUrls = [];

app.get('/', (req, res) => {
    let listItems = '';

    // For each of the processed URLs, add a list item with its thumbnail.
    processedUrls.forEach((url, index) => {
        const imageUrl = `https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${index}.jpg`;

        // Display the screenshots below the form
        listItems += `<li>
            <a href="${imageUrl}" target="_blank">
                <img src="${imageUrl}" width="300px" />
                <br />
                ${url}
            </a>
        </li>`;
    });

    const pageHtml = `<html>
        <head><title>Example</title></head>
        <body>
            <form method="POST" action="${APIFY_CONTAINER_URL}/add-url">
                URL: <input type="text" name="url" placeholder="http://example.com" />
                <input type="submit" value="Add" />
                <hr />
                <ul>${listItems}</ul>
            </form>
        </body>
    </html>`;

    res.send(pageHtml);
});

app.post('/add-url', async (req, res) => {
    const { url } = req.body;
    console.log(`Got new URL: ${url}`);

    // Start a Chrome browser and open a new page ...
    const browser = await Apify.launchPuppeteer();
    const page = await browser.newPage();

    // ... go to our URL and grab a screenshot ...
    await page.goto(url);
    const screenshot = await page.screenshot({ type: 'jpeg' });

    // ... close the browser ...
    await page.close();
    await browser.close();

    // ... save the screenshot to the key-value store and add the URL to processedUrls.
    await Apify.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
    processedUrls.push(url);

    res.redirect('/');
});

app.listen(APIFY_CONTAINER_PORT, () => {
    console.log(`Application is listening at URL ${APIFY_CONTAINER_URL}.`);
});
```
When we deploy and run this actor on the Apify platform, we can open the **Live View** tab in the actor run console and submit a URL to the actor through the form. After the URL is successfully submitted, it appears in the actor log.
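Since `/add-url` is an ordinary HTTP endpoint, we can also submit URLs programmatically instead of using the form. A minimal sketch, assuming Node.js 18+ with a global `fetch` and a placeholder container URL:

```JavaScript
// Placeholder: copy the real container URL from your actor run.
const containerUrl = 'https://CONTAINER-KEY.runs.apify.net';

// Submit a new URL the same way the HTML form does (form-encoded POST).
await fetch(`${containerUrl}/add-url`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ url: 'http://example.com' }),
});
```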
With that, we're done! And our application works like a charm :)

The complete code of this actor is available [here](https://www.apify.com/apify/example-web-server). You can run it there or copy it to your account.
---
title: How to choose the right scraper for the job
description: Learn how to choose the best scraper for your use case by understanding a few basic concepts.
menuWeight: 20
category: tutorials
paths:
- choosing-the-right-scraper
---

# [](#choosing-the-right-scraper) Choosing the right scraper for the job

There are two main ways you can proceed with building your crawler:

1. Using plain HTTP requests.
2. Using an automated browser.

We will briefly go through the pros and cons of both and cover the basic steps for determining which one you should go with.
## [](#performance) Performance

First, let's discuss performance. Plain HTTP request-based scraping will **always** be faster than browser-based scraping. When using plain requests, the page's HTML is not rendered, no JavaScript is executed, no images are loaded, etc. Also, there's no memory used by the browser, and there are no CPU-hungry operations.

If it were only a question of performance, you'd of course use request-based scraping every time; however, it's unfortunately not that simple.
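To make the difference concrete, here is a sketch of the same extraction done both ways (our own example, assuming Node.js 18+ with a global `fetch` plus the `cheerio` and `puppeteer` packages; the URL and selector are made up):

```JavaScript
import * as cheerio from 'cheerio';
import puppeteer from 'puppeteer';

// 1. Plain HTTP request: download the raw HTML string and parse it.
// No rendering, no JavaScript execution, no images - fast and cheap.
const html = await (await fetch('https://example.com')).text();
const $ = cheerio.load(html);
console.log($('h1').text());

// 2. Automated browser: render the page, execute its JavaScript, load assets.
// Much heavier on CPU and memory, but sees the page as a real user would.
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.$eval('h1', (el) => el.textContent));
await browser.close();
```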
## [](#dynamic-pages) Dynamic pages & blocking

Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages]({{@link dealing_with_dynamic_pages.md}})). Another problem is blocking. If the website is collecting a [browser fingerprint]({{@link anti_scraping/techniques/fingerprinting.md}}), it is very easy for it to distinguish between a real user and a bot (crawler) and block access.
## [](#making-the-choice) Making the choice

When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick Javascript Switcher]({{@link tools/quick_javascript_switcher.md}}) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman]({{@link tools/postman.md}}) or [Insomnia]({{@link tools/insomnia.md}}), or try sending a few requests programmatically, as in the sketch below. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
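For example, a quick programmatic check might look like this (a minimal sketch assuming Node.js 18+ with a global `fetch`; the endpoint is a made-up stand-in for an XHR request you spotted in the **Network** tab):

```JavaScript
// Made-up endpoint standing in for an XHR request from the Network tab.
const response = await fetch('https://example.com/api/products?page=1');

console.log(response.status); // e.g. 200 - or 403 if you're blocked straight away
console.log(await response.json()); // is the data you need in here?
```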
It also depends, of course, on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers. In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users.
