
Commit 97ee440

Merge branch 'master' into anti-scraping-revamp
2 parents 2767c9b + 2adc11a commit 97ee440

24 files changed, with 46 additions and 42 deletions.

24 files changed

+46
-42
lines changed

content/academy/api_scraping.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ Especially for [dynamic sites](https://blog.apify.com/what-is-a-dynamic-page/),

### 4. Easy on the target website

-Depending on the website, sending large amounts of requests to their pages could result in a slight performance decrase on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website.
+Depending on the website, sending large amounts of requests to their pages could result in a slight performance decrease on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website.

## [](#disadvantages) Disdvantages of API Scraping

@@ -65,7 +65,7 @@ APIs come in all different shapes and sizes. That means every API will vary in n

JSON responses are the most ideal, as they are easily manipulatable in JavaScript code. In general, no serious parsing is necessary, and the data can be easily filtered and formatted to fit a scraper's output schema.

-APIs which ouput HTML are generally returning the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response.
+APIs which output HTML are generally returning the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response.

### 2. Encoded data

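For a concrete sense of the "filtered and formatted" point in the paragraph above, here is a minimal sketch (not taken from the lesson) of reshaping a JSON API response into a scraper's output schema; the endpoint and field names are made up:

```JavaScript
// Hypothetical endpoint and fields - shown only to illustrate how little
// parsing a JSON API response needs compared to scraping HTML.
import { gotScraping } from 'got-scraping';

const { body } = await gotScraping({
    url: 'https://api.example.com/products?limit=20',
    responseType: 'json',
});

// Filter the raw items and map them onto the scraper's output schema
const output = body.items
    .filter((item) => item.inStock)
    .map((item) => ({
        title: item.name,
        price: item.price,
        url: item.link,
    }));

console.log(output);
```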
content/academy/api_scraping/general_api_scraping/cookies_headers_tokens.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ paths:

Unfortunately, most APIs will require a valid cookie to be included in the `cookie` field within a request's headers in order to be authorized. Other APIs may require special tokens, or other data that validates the request.

-Luckily, there are ways to retrive and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are:
+Luckily, there are ways to retrieve and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are:

## [](#cookies) Cookies

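As a quick illustration of the `cookie` header mentioned above (not part of this change), a cookie can be attached to a request's headers like so; the cookie value and URL are placeholders:

```JavaScript
// Placeholder cookie and URL - the point is only where the cookie goes:
// inside the request's "headers" object under the "cookie" key.
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://api.example.com/private/items',
    headers: {
        cookie: 'session_id=abc123; locale=en-US',
    },
    responseType: 'json',
});

console.log(response.body);
```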
@@ -88,7 +88,7 @@ const response = await gotScraping({

For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman]({{@link tools/postman.md}}) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud.

-Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programatically instead.
+Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a simple way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead.

Here is a way you could dynamically scrape the `client_id` using Puppeteer:

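The snippet the lesson refers to lies outside this hunk. As a rough sketch of the approach it describes (watching the page's traffic and pulling `client_id` out of a request URL; the target URL and wait condition are assumptions):

```JavaScript
// Rough sketch, not the lesson's exact code: listen to responses on
// soundcloud.com and grab the "client_id" query parameter from the first
// underlying request that carries one.
import puppeteer from 'puppeteer';

const scrapeClientId = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    let clientId = null;

    page.on('response', (response) => {
        const url = new URL(response.url());
        const id = url.searchParams.get('client_id');
        if (id) clientId = id;
    });

    await page.goto('https://soundcloud.com/', { waitUntil: 'networkidle2' });
    await browser.close();

    return clientId;
};

console.log(await scrapeClientId());
```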
content/academy/api_scraping/graphql_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ In this section, we'll be scraping [cheddar.com](https://cheddar.com)'s GraphQL

![GraphQL endpoint]({{@asset api_scraping/images/graphql-endpoint.webp}})

-As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulleproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away.
+As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulletproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away.

![GraphQL payload]({{@asset api_scraping/images/graphql-payload.webp}})

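To make the "request payload" hint concrete: a GraphQL call is a POST whose JSON body carries a `query` string (and usually `variables`). A hypothetical example follows; the endpoint, query, and fields are invented:

```JavaScript
// Invented endpoint and schema - only the payload shape (query + variables)
// is the giveaway described above.
import { gotScraping } from 'got-scraping';

const { body } = await gotScraping({
    url: 'https://example.com/graphql',
    method: 'POST',
    json: {
        query: `
            query PostsBySlug($slug: String!) {
                posts(slug: $slug) {
                    title
                    publishedAt
                }
            }`,
        variables: { slug: 'news' },
    },
    responseType: 'json',
});

console.log(body.data);
```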
content/academy/api_scraping/graphql_scraping/custom_queries.md

Lines changed: 6 additions & 4 deletions
@@ -313,13 +313,15 @@ export default scrapeAppToken;

## Wrap up

-<!-- We are actively working on writing the GraphQL scraping guide, so stay tuned for more content here! In the meantime, take a moment to review the skills you learned in this section:
+<!-- We are actively working on writing the GraphQL scraping guide, so stay tuned for more content here! -->
+
+If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work!
+
+Take a moment to review the skills you learned in this section:

1. Modifying the variables of copied GraphQL queries
2. Introspecting a GraphQL API
3. Visualizing and understanding a GraphQL API introspection
4. Writing custom queries
5. Dealing with cursor-based relay pagination
-6. Writing a GraphQL scraper with custom queries -->
-
-If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work!
+6. Writing a GraphQL scraper with custom queries

content/academy/apify_platform/deploying_your_code/input_schema.md

Lines changed: 2 additions & 2 deletions
@@ -71,7 +71,7 @@ Within our new **numbers** property, there are two more fields we must specify.

## [](#required-fields) Required fields

-The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minumum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the actor's run. This can be done by using the **required** field, and passing in the names of the fields you'd like to require.
+The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minimum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the actor's run. This can be done by using the **required** field, and passing in the names of the fields you'd like to require.

```JSON
{
@@ -99,7 +99,7 @@ Here is what the output schema we wrote will render on the platform:

![Rendered UI from input schema]({{@asset apify_platform/deploying_your_code/images/rendered-ui.webp}})

-Later on, we'll be buildng more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the actor and not become overwhelmed.
+Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to easily understand the actor and not become overwhelmed.

It is not expected to memorize all of the fields that properties can take, or the different editor types available, which is why it's always good to reference the [input schema documentation](https://docs.apify.com/actors/development/input-schema) when writing a schema.

content/academy/expert_scraping_with_apify.md

Lines changed: 1 addition & 1 deletion
@@ -61,6 +61,6 @@ Part of this course will be learning more in-depth about actors; however, some b

## [](#next) Next up

-[Next up]({{@link expert_scraping_with_apify/apify_sdk.md}}), we'll be learning in-depth about the most important tool in your actor-developemt toolbelt: The **Apify SDK**.
+[Next up]({{@link expert_scraping_with_apify/apify_sdk.md}}), we'll be learning in-depth about the most important tool in your actor-development toolbelt: The **Apify SDK**.

> Each lesson will have a short _(and optional)_ quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers.

content/academy/expert_scraping_with_apify/apify_api_and_client.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
---
title: V - Apify API & client
-description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - throught the API, and through a client.
+description: Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
menuWeight: 6.5
paths:
- expert-scraping-with-apify/apify-api-and-client

content/academy/expert_scraping_with_apify/solutions/actor_building.md

Lines changed: 3 additions & 3 deletions
@@ -439,7 +439,7 @@ If Amazon hasn't changed any of their selectors or significantly updated any of

## [](#calling-another-actor) Calling another actor

-If you remember from our project's requirements outlined in the previous lesson, once the crawler has finished running, we have to email ourselves a public link to the dataset by using a [public actor which sends emails](https://console.apify.com/actors/e643gqfZae2TfQEbA/?addFromActorId=e643gqfZae2TfQEbA#/console). Luckily, the ability to do this programmatically is avaialable right within the Apify SDK with the [`Apify.call()`](https://sdk.apify.com/docs/api/apify#apifycallactid-input-options) function.
+If you remember from our project's requirements outlined in the previous lesson, once the crawler has finished running, we have to email ourselves a public link to the dataset by using a [public actor which sends emails](https://console.apify.com/actors/e643gqfZae2TfQEbA/?addFromActorId=e643gqfZae2TfQEbA#/console). Luckily, the ability to do this programmatically is available right within the Apify SDK with the [`Apify.call()`](https://sdk.apify.com/docs/api/apify#apifycallactid-input-options) function.

Let's add a bit of code to the end of our actor to send this email:

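The code the lesson adds is outside this hunk; a rough sketch of the idea with `Apify.call()` follows. The email actor's input fields and the dataset URL format are assumptions here - check the actor's own input schema before copying this:

```JavaScript
// Sketch only: call an email-sending actor once the crawl has finished.
// The actor name and input fields ("to", "subject", "text") are assumed.
const dataset = await Apify.openDataset();
const { id: datasetId } = await dataset.getInfo();

await Apify.call('apify/send-mail', {
    to: 'you@example.com',
    subject: 'Amazon crawl finished',
    text: `Dataset items: https://api.apify.com/v2/datasets/${datasetId}/items?format=json`,
});
```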
@@ -506,7 +506,7 @@ Let's try it out now! Input **iphone** into the box labeled **keyword**, click *

**Q: When using Puppeteer or Playwright, how can you still use jQuery with the SDK?**

-**A:** There are two ways. You can either use the [injectJQuery](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) utility function which will enable you to use jQuery inside of `page.evalute()`, or you can use Cheerio to load the page's content like this:
+**A:** There are two ways. You can either use the [injectJQuery](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) utility function which will enable you to use jQuery inside of `page.evaluate()`, or you can use Cheerio to load the page's content like this:

```JavaScript
const $ = cheerio.load(await page.content());
@@ -558,7 +558,7 @@ const title = await page.evaluate(() => $('title').text());

**Q: What is the difference between the RequestList and the RequestQueue?**

-The main differece is that once a request list has been created, no more requests can be dynamically added to it. When you want to dynamically add (or remove) requests, a requst queue must be used.
+The main difference is that once a request list has been created, no more requests can be dynamically added to it. When you want to dynamically add (or remove) requests, a request queue must be used.

Request lists are better when adding a large batch of requests, as the RequestQueue is not optimized to handle the mass adding of requests. Additionally, the RequestList doesn't consume any platform credits.

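A minimal sketch of the distinction in the answer above, in Apify SDK v2 style (the URLs are placeholders): a RequestList is created once from a fixed batch of sources, while a RequestQueue keeps accepting requests during the run.

```JavaScript
// Static batch: defined up front, nothing can be added to it later
const requestList = await Apify.openRequestList('start-urls', [
    { url: 'https://example.com/category/1' },
    { url: 'https://example.com/category/2' },
]);

// Dynamic queue: requests can keep being added while the crawler runs,
// e.g. for links discovered on already-visited pages
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com/discovered-detail-page' });
```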
content/academy/expert_scraping_with_apify/solutions/integrating_webhooks.md

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
---
-title: II - Integrating webooks
+title: II - Integrating webhooks
description: Learn how to integrate webhooks into your actors. Webhooks are a super powerful tool, and can be used to do almost anything!
menuWeight: 2
paths:
@@ -73,15 +73,15 @@ const filtered = items.reduce((acc, curr) => {
    // Grab the price of the item matching our current
    // item's ASIN in the map. If it doesn't exist, set
    // "prevPrice" to null
-    const prevPrive = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
+    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;

    // Grab the price of our current offer
    const price = +curr.offer.slice(1);

    // If the item doesn't yet exist in the map, add it.
    // Or, if the current offer's price is less than the
    // saved one, replace the saved one
-    if (!acc[curr.asin] || prevPrive > price) acc[curr.asin] = curr;
+    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

    // Return the map
    return acc;
@@ -106,10 +106,10 @@ Apify.main(async () => {
    const { items } = await dataset.getData();

    const filtered = items.reduce((acc, curr) => {
-        const prevPrive = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
+        const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
        const price = +curr.offer.slice(1);

-        if (!acc[curr.asin] || prevPrive > price) acc[curr.asin] = curr;
+        if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

        return acc;
    }, {});

content/academy/puppeteer_playwright.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ paths:

# [](#puppeteer-playwright-course) Puppeteer & Playwright course

-[Puppeteeer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser.
+[Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are both libraries which allow you to write code in Node.js which automates a headless browser.

> A headless browser is just a regular browser like the one you're using right now, but without the user-interface. Because they don't have a UI, they generally perform faster as they don't render any visual content. For an in-depth understanding of headless browsers, check out [this short article](https://blog.arhg.net/2009/10/what-is-headless-browser.html) about them.

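For anyone new to the idea of "automating a headless browser", a minimal Puppeteer example (the Playwright equivalent is nearly identical; the URL is a placeholder):

```JavaScript
// Launch a headless browser, open a page, read its title, and shut down.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto('https://example.com');
console.log(await page.title());

await browser.close();
```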