
Commit d28068b

fix: build error
1 parent a8d6145 commit d28068b

9 files changed: +7 additions, −10 deletions

content/academy/anti_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ Solely based on the way the bots operate. It compares data-rich pages visits
By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder and requiring more maintenance of the bot.

-One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link web_scraping_for_beginners/data_collection/js_in_html.md}}))
+One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping]({{@link api_scraping.md}}) and [JavaScript objects within HTML]({{@link tutorials/js_in_html.md}}))
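The edited line above recommends limiting reliance on HTML elements in favor of embedded data. A minimal sketch of that fallback pattern, assuming a scraper that tries a fragile CSS-style extractor first and falls back to JSON embedded in the page — the selector, field name, and HTML snippet are all made up for illustration:

```javascript
// Hedged sketch: chain of extractors, tried in order, so a single class
// rename does not break the scraper. In a real crawler these would be
// Cheerio queries; plain string matching keeps the toy self-contained.
const extractors = [
  // Fragile: depends on the site's current markup (hypothetical selector).
  (html) => html.match(/<span class="price">([^<]+)<\/span>/)?.[1] ?? null,
  // Resilient: reads the value from embedded JSON data instead.
  (html) => {
    const m = html.match(/"price"\s*:\s*"([^"]+)"/);
    return m ? m[1] : null;
  },
];

const extractFirst = (html) => {
  for (const fn of extractors) {
    const value = fn(html);
    if (value !== null) return value;
  }
  return null;
};

// The old <span class="price"> markup is gone after a site redesign,
// but the embedded JSON still carries the price.
const html = '<div data-state=\'{"price":"$19.99"}\'></div>';
console.log(extractFirst(html)); // $19.99
```

The ordering matters only for performance here; the point is that the scraper survives the markup change without maintenance.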
### IP session consistency

content/academy/expert_scraping_with_apify/solutions/actor_building.md

Lines changed: 1 addition & 1 deletion
@@ -486,7 +486,7 @@ Some specific functionality of jQuery is not available with Cheerio. Follow offi
**A:** One should use CheerioCrawler when scraping any non-dynamic content. For scraping any content that doesn't require the loading of JavaScript in order to receive all of the data (such as with server-side rendered HTML pages and APIs), CheerioCrawler should be used. It is limited though, as it can only make static requests. This means that if a piece of data is loaded using JavaScript from an API call that the page makes, CheerioCrawler will never see that piece of data.

-> Learn more about dynamic pages in the [**dynamic pages**]({{@link concepts/dynamic_pages.md}}) lesson, and learn how to overcome their challenges in the [**API scraping**]({{@link api_scraping.md}}) course and the [**JSON in HTML**]({{@link web_scraping_for_beginners/data_collection/js_in_html.md}}) lesson.
+> Learn more about dynamic pages in the [**dynamic pages**]({{@link concepts/dynamic_pages.md}}) lesson, and learn how to overcome their challenges in the [**API scraping**]({{@link api_scraping.md}}) course and the [**JSON in HTML**]({{@link tutorials/js_in_html.md}}) lesson.

Additionally, if the job being done requires some sort of interaction with the page, PlaywrightCrawler/PuppeteerCrawler should be used, as CheerioCrawler runs out of the context of the browser.
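The answer above — that a static crawler never sees data the page's own JavaScript loads later — can be illustrated with a toy contrast between the raw server response and the hydrated value. The HTML and API payload below are made up; no real crawler or network request is involved:

```javascript
// What a plain HTTP GET returns — all a static HTML parser ever sees.
const serverHtml = '<span id="stock">loading</span>';

// What the page's own JavaScript would fetch and inject in a real browser.
const apiPayload = { stock: 42 };

const staticView = serverHtml.match(/<span id="stock">([^<]*)<\/span>/)[1];
const browserView = String(apiPayload.stock); // value after client-side hydration

console.log(staticView);  // "loading" — the JS-loaded value never appears
console.log(browserView); // "42" — only a browser-based crawler sees this
```

This is the dividing line the lesson draws: if the value you need only exists in `browserView`, a static request cannot deliver it, and a browser-based crawler (or a direct call to the page's API) is required.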

content/academy/tutorials/js_in_html.md

Lines changed: 3 additions & 7 deletions
@@ -14,15 +14,15 @@ The advantages of using these objects instead of parsing the HTML are that parsi
> **Note:** In this tutorial, we'll be using [SoundCloud's website](https://soundcloud.com) as an example target, but the techniques described here can be applied to any site.

-## [](#locating-json-in-html) Locating JSON objects within `<script>` tags
+## [](#locating-json-in-html) Locating JSON objects within script tags

Using our DevTools, we can inspect our [target page](https://soundcloud.com/tiesto/tracks), or right-click the page and click **View Page Source** to see the DOM. Next, we'll find a value on the page that we can predict would be in a potential API response. For our page, we'll use the **Tracks** count of `845`. On the **View Page Source** page, we'll press **CMD** + **F** and type in this value, which will show all matches for it within the DOM. This method can expose `<script>` tag objects which hold the target data.

-![Find the value within the DOM using CMD + F]({{@asset web_scraping_for_beginners/data_collection/images/view-845.webp}})
+![Find the value within the DOM using CMD + F]({{@asset tutorials/images/view-845.webp}})

These data objects will usually be attached to the window object (often prefixed with two underscores - `__`). When scrolling to the beginning of the script tag on our **View Page Source** page, we see that the name of our target object is `__sc_hydration`. Heading back to DevTools and typing this into the console, the object is displayed.

-![View the target data in the window object using the console in DevTools]({{@asset web_scraping_for_beginners/data_collection/images/view-object-in-window.webp}})
+![View the target data in the window object using the console in DevTools]({{@asset tutorials/images/view-object-in-window.webp}})

## [](#parsing-objects) Parsing

@@ -52,7 +52,3 @@ console.log(data)
```

Which of these methods you use totally depends on the type of crawler you are using. Grabbing the data directly from the `window` object within the context of the browser using Puppeteer is of course the most reliable solution; however, it is less performant than making a static HTTP request and parsing the object directly from the downloaded HTML.
-
-## [](#next) Next up
-
-Next up are the [**Basics of crawling**]({{@link web_scraping_for_beginners/crawling.md}}), where we will learn how to move between web pages and scrape data from all of them. We will build a scraper that first collects all the products on [Fakestore](https://demo-webstore.apify.org/), and then crawls each of them to scrape the data for each product separately.
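The lesson's second method — parsing the object straight out of the downloaded HTML without a browser — can be sketched in a few lines. The HTML snippet and payload below are stand-ins, not SoundCloud's real markup; only the `__sc_hydration` object name follows the lesson's example:

```javascript
// Extracting an embedded JSON object from raw HTML with a static request's
// result — no Puppeteer needed. The page content here is a made-up toy.
const html = `
<html><body>
<script>window.__sc_hydration = [{"hydratable":"user","data":{"username":"tiesto","track_count":845}}];</script>
</body></html>`;

// Capture the array literal assigned to window.__sc_hydration.
// The `s` (dotAll) flag lets `.` span newlines if the object is multi-line.
const match = html.match(/window\.__sc_hydration\s*=\s*(\[.*?\]);/s);
const hydration = JSON.parse(match[1]);

console.log(hydration[0].data.track_count); // 845
```

In a real scraper the regex boundary would need care (the object must be valid JSON up to the captured delimiter), which is why grabbing the already-parsed object from `window` in a browser context is the more reliable, if slower, option.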

content/academy/web_scraping_for_beginners/data_collection/save_to_csv.md

Lines changed: 1 addition & 1 deletion
@@ -121,4 +121,4 @@ This marks the end of the **Basics of data collection** section of Web scraping

## [](#next) Next up

-The [next lesson]({{@link web_scraping_for_beginners/data_collection/js_in_html.md}}) will be all about finding JavaScript objects inside HTML documents, and scraping the data from there instead of from HTML elements.
+Next up are the [**Basics of crawling**]({{@link web_scraping_for_beginners/crawling.md}}), where we will learn how to move between web pages and scrape data from all of them. We will build a scraper that first collects all the products on [Fakestore](https://demo-webstore.apify.org/), and then crawls each of them to scrape the data for each product separately.

content/docs/tutorials/quick_start.md

Lines changed: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ The above actor (and many others) uses the `apify` [NPM package](https://www.npm
If you are building your own actors, you'll probably prefer to host the source code on Git. To do that, follow these steps:

[//]: # (TODO: Repo below is outdated, we should probably update the actor there too)
+
1. Create a new Git repository.
2. Copy the boilerplate actor code from the [apify/quick-start](https://github.com/apify/actor-quick-start) actor.
3. Set **Source type** to **Git repository** for your actor in the app.

0 commit comments
