
Commit 97ee440

Merge branch 'master' into anti-scraping-revamp
2 parents 2767c9b + 2adc11a commit 97ee440

24 files changed, with 46 additions and 42 deletions.

24 files changed

+46
-42
lines changed

content/academy/api_scraping.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ Especially for [dynamic sites](https://blog.apify.com/what-is-a-dynamic-page/),

### 4. Easy on the target website

-Depending on the website, sending large amounts of requests to their pages could result in a slight performance decrase on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website.
+Depending on the website, sending large amounts of requests to their pages could result in a slight performance decrease on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website.

## [](#disadvantages) Disdvantages of API Scraping

@@ -65,7 +65,7 @@ APIs come in all different shapes and sizes. That means every API will vary in n

JSON responses are the most ideal, as they are easily manipulatable in JavaScript code. In general, no serious parsing is necessary, and the data can be easily filtered and formatted to fit a scraper's output schema.

-APIs which ouput HTML are generally returning the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response.
+APIs which output HTML are generally returning the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response.

### 2. Encoded data

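For a concrete sense of the "filtered and formatted" point in the paragraph above, here is a minimal sketch (not taken from the lesson) of reshaping a JSON API response into a scraper's output schema; the endpoint and field names are made up:

```JavaScript
// Hypothetical endpoint and fields - shown only to illustrate how little
// parsing a JSON API response needs compared to scraping HTML.
import { gotScraping } from 'got-scraping';

const { body } = await gotScraping({
    url: 'https://api.example.com/products?limit=20',
    responseType: 'json',
});

// Filter the raw items and map them onto the scraper's output schema
const output = body.items
    .filter((item) => item.inStock)
    .map((item) => ({
        title: item.name,
        price: item.price,
        url: item.link,
    }));

console.log(output);
```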
content/academy/api_scraping/general_api_scraping/cookies_headers_tokens.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ paths:

Unfortunately, most APIs will require a valid cookie to be included in the `cookie` field within a request's headers in order to be authorized. Other APIs may require special tokens, or other data that validates the request.

-Luckily, there are ways to retrive and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are:
+Luckily, there are ways to retrieve and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are:

## [](#cookies) Cookies

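As a quick illustration of the `cookie` header mentioned above (not part of this change), a cookie can be attached to a request's headers like so; the cookie value and URL are placeholders:

```JavaScript
// Placeholder cookie and URL - the point is only where the cookie goes:
// inside the request's "headers" object under the "cookie" key.
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://api.example.com/private/items',
    headers: {
        cookie: 'session_id=abc123; locale=en-US',
    },
    responseType: 'json',
});

console.log(response.body);
```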
@@ -88,7 +88,7 @@ const response = await gotScraping({

For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman]({{@link tools/postman.md}}) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud.

-Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programatically instead.
+Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead.

Here is a way you could dynamically scrape the `client_id` using Puppeteer:

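The snippet the lesson refers to lies outside this hunk. As a rough sketch of the approach it describes (watching the page's traffic and pulling `client_id` out of a request URL; the target URL and wait condition are assumptions):

```JavaScript
// Rough sketch, not the lesson's exact code: listen to responses on
// soundcloud.com and grab the "client_id" query parameter from the first
// underlying request that carries one.
import puppeteer from 'puppeteer';

const scrapeClientId = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    let clientId = null;

    page.on('response', (response) => {
        const url = new URL(response.url());
        const id = url.searchParams.get('client_id');
        if (id) clientId = id;
    });

    await page.goto('https://soundcloud.com/', { waitUntil: 'networkidle2' });
    await browser.close();

    return clientId;
};

console.log(await scrapeClientId());
```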
content/academy/api_scraping/graphql_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ In this section, we'll be scraping [cheddar.com](https://cheddar.com)'s GraphQL

![GraphQL endpoint]({{@asset api_scraping/images/graphql-endpoint.webp}})

-As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulleproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away.
+As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulletproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away.

![GraphQL payload]({{@asset api_scraping/images/graphql-payload.webp}})

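To make the "request payload" hint concrete: a GraphQL call is a POST whose JSON body carries a `query` string (and usually `variables`). A hypothetical example follows; the endpoint, query, and fields are invented:

```JavaScript
// Invented endpoint and schema - only the payload shape (query + variables)
// is the giveaway described above.
import { gotScraping } from 'got-scraping';

const { body } = await gotScraping({
    url: 'https://example.com/graphql',
    method: 'POST',
    json: {
        query: `
            query PostsBySlug($slug: String!) {
                posts(slug: $slug) {
                    title
                    publishedAt
                }
            }`,
        variables: { slug: 'news' },
    },
    responseType: 'json',
});

console.log(body.data);
```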
content/academy/api_scraping/graphql_scraping/custom_queries.md

Lines changed: 6 additions & 4 deletions
@@ -313,13 +313,15 @@ export default scrapeAppToken;

## Wrap up

-<!-- We are actively working on writing the GraphQL scraping guide, so stay tuned for more content here! In the meantime, take a moment to review the skills you learned in this section:
+<!-- We are actively working on writing the GraphQL scraping guide, so stay tuned for more content here! -->
+
+If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work!
+
+Take a moment to review the skills you learned in this section:

1. Modifying the variables of copied GraphQL queries
2. Introspecting a GraphQL API
3. Visualizing and understanding a GraphQL API introspection
4. Writing custom queries
5. Dealing with cursor-based relay pagination
-6. Writing a GraphQL scraper with custom queries -->
-
-If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work!
+6. Writing a GraphQL scraper with custom queries

content/academy/apify_platform/deploying_your_code/input_schema.md

Lines changed: 2 additions & 2 deletions
@@ -71,7 +71,7 @@ Within our new **numbers** property, there are two more fields we must specify.

## [](#required-fields) Required fields

-The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minumum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the actor's run. This can be done by using the **required** field, and passing in the names of the fields you'd like to require.
+The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minimum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the actor's run. This can be done by using the **required** field, and passing in the names of the fields you'd like to require.

```JSON
{
@@ -99,7 +99,7 @@ Here is what the output schema we wrote will render on the platform:

![Rendered UI from input schema]({{@asset apify_platform/deploying_your_code/images/rendered-ui.webp}})

-Later on, we'll be buildng more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the actor and not become overwhelmed.
+Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the actor and not become overwhelmed.

It is not expected to memorize all of the fields that properties can take, or the different editor types available, which is why it's always good to reference the [input schema documentation](https://docs.apify.com/actors/development/input-schema) when writing a schema.

content/academy/expert_scraping_with_apify.md

Lines changed: 1 addition & 1 deletion
@@ -61,6 +61,6 @@ Part of this course will be learning more in-depth about actors; however, some b

## [](#next) Next up

-[Next up]({{@link expert_scraping_with_apify/apify_sdk.md}}), we'll be learning in-depth about the most important tool in your actor-developemt toolbelt: The **Apify SDK**.
+[Next up]({{@link expert_scraping_with_apify/apify_sdk.md}}), we'll be learning in-depth about the most important tool in your actor-development toolbelt: The **Apify SDK**.

> Each lesson will have a short _(and optional)_ quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers.

content/academy/expert_scraping_with_apify/apify_api_and_client.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
---
title: V - Apify API & client
-description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - throught the API, and through a client.
+description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
menuWeight: 6.5
paths:
- expert-scraping-with-apify/apify-api-and-client

content/academy/expert_scraping_with_apify/solutions/actor_building.md

Lines changed: 3 additions & 3 deletions
@@ -439,7 +439,7 @@ If Amazon hasn't changed any of their selectors or significantly updated any of

## [](#calling-another-actor) Calling another actor

-If you remember from our project's requirements outlined in the previous lesson, once the crawler has finished running, we have to email ourselves a public link to the dataset by using a [public actor which sends emails](https://console.apify.com/actors/e643gqfZae2TfQEbA/?addFromActorId=e643gqfZae2TfQEbA#/console). Luckily, the ability to do this programmatically is avaialable right within the Apify SDK with the [`Apify.call()`](https://sdk.apify.com/docs/api/apify#apifycallactid-input-options) function.
+If you remember from our project's requirements outlined in the previous lesson, once the crawler has finished running, we have to email ourselves a public link to the dataset by using a [public actor which sends emails](https://console.apify.com/actors/e643gqfZae2TfQEbA/?addFromActorId=e643gqfZae2TfQEbA#/console). Luckily, the ability to do this programmatically is available right within the Apify SDK with the [`Apify.call()`](https://sdk.apify.com/docs/api/apify#apifycallactid-input-options) function.

Let's add a bit of code to the end of our actor to send this email:

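The code the lesson adds is outside this hunk; a rough sketch of the idea with `Apify.call()` follows. The email actor's input fields and the dataset URL format are assumptions here - check the actor's own input schema before copying this:

```JavaScript
// Sketch only: call an email-sending actor once the crawl has finished.
// The actor name and input fields ("to", "subject", "text") are assumed.
const dataset = await Apify.openDataset();
const { id: datasetId } = await dataset.getInfo();

await Apify.call('apify/send-mail', {
    to: 'you@example.com',
    subject: 'Amazon crawl finished',
    text: `Dataset items: https://api.apify.com/v2/datasets/${datasetId}/items?format=json`,
});
```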
@@ -506,7 +506,7 @@ Let's try it out now! Input **iphone** into the box labeled **keyword**, click *

**Q: When using Puppeteer or Playwright, how can you still use jQuery with the SDK?**

-**A:** There are two ways. You can either use the [injectJQuery](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) utility function which will enable you to use jQuery inside of `page.evalute()`, or you can use Cheerio to load the page's content like this:
+**A:** There are two ways. You can either use the [injectJQuery](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) utility function which will enable you to use jQuery inside of `page.evaluate()`, or you can use Cheerio to load the page's content like this:

```JavaScript
const $ = cheerio.load(await page.content());
@@ -558,7 +558,7 @@ const title = await page.evaluate(() => $('title').text());

**Q: What is the difference between the RequestList and the RequestQueue?**

-The main differece is that once a request list has been created, no more requests can be dynamically added to it. When you want to dynamically add (or remove) requests, a requst queue must be used.
+The main difference is that once a request list has been created, no more requests can be dynamically added to it. When you want to dynamically add (or remove) requests, a request queue must be used.

Request lists are better when adding a large batch of requests, as the RequestQueue is not optimized to handle the mass adding of requests. Additionally, the RequestList doesn't consume any platform credits.

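A minimal sketch of the distinction in the answer above, in Apify SDK v2 style (the URLs are placeholders): a RequestList is created once from a fixed batch of sources, while a RequestQueue keeps accepting requests during the run.

```JavaScript
// Static batch: defined up front, nothing can be added to it later
const requestList = await Apify.openRequestList('start-urls', [
    { url: 'https://example.com/category/1' },
    { url: 'https://example.com/category/2' },
]);

// Dynamic queue: requests can keep being added while the crawler runs,
// e.g. for links discovered on already-visited pages
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com/discovered-detail-page' });
```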
content/academy/expert_scraping_with_apify/solutions/integrating_webhooks.md

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
---
-title: II - Integrating webooks
+title: II - Integrating webhooks
description: Learn how to integrate webhooks into your actors. Webhooks are a super powerful tool, and can be used to do almost anything!
menuWeight: 2
paths:
@@ -73,15 +73,15 @@ const filtered = items.reduce((acc, curr) => {
    // Grab the price of the item matching our current
    // item's ASIN in the map. If it doesn't exist, set
    // "prevPrice" to null
-    const prevPrive = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
+    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;

    // Grab the price of our current offer
    const price = +curr.offer.slice(1);

    // If the item doesn't yet exist in the map, add it.
    // Or, if the current offer's price is less than the
    // saved one, replace the saved one
-    if (!acc[curr.asin] || prevPrive > price) acc[curr.asin] = curr;
+    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

    // Return the map
    return acc;
@@ -106,10 +106,10 @@ Apify.main(async () => {
    const { items } = await dataset.getData();

    const filtered = items.reduce((acc, curr) => {
-        const prevPrive = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
+        const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
        const price = +curr.offer.slice(1);

-        if (!acc[curr.asin] || prevPrive > price) acc[curr.asin] = curr;
+        if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

        return acc;
    }, {});

content/academy/puppeteer_playwright.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ paths:

# [](#puppeteer-playwright-course) Puppeteer & Playwright course

-[Puppeteeer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser.
+[Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser.

> A headless browser is just a regular browser like the one you're using right now, but without the user-interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them.

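For anyone new to the idea of "automating a headless browser", a minimal Puppeteer example (the Playwright equivalent is nearly identical; the URL is a placeholder):

```JavaScript
// Launch a headless browser, open a page, read its title, and shut down.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto('https://example.com');
console.log(await page.title());

await browser.close();
```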