
Commit ea7d38b

Authored by honzajavorek
Merge pull request #1073 from honzajavorek/honzajavorek/vale-actor
fix: fine-tune Vale and fix all outstanding errors
2 parents (1740bed + 0c50b9c), commit ea7d38b

48 files changed: +68 additions, -68 deletions

Large commits have some content hidden by default; only a subset of the 48 changed files appears below.


.github/styles/Apify/Capitalization.yml

Lines changed: 2 additions & 2 deletions
@@ -10,11 +10,11 @@ tokens:
   # Also no . followed by a word character (avoids 'actors.md')
   - '(?<![\/\-#\w])actors(?![\/\}])(?!\.\w)'

-  # Before the word there should be no: /, -, #, ., word character
+  # Before the word there should be no: /, -, #, ., `, word character
   # (avoids anchors, URLs, identifiers, code, and words like 'factors')
   #
   # After the word there should be no: /, }, -, word character (avoids paths or URLs)
   # Also no " =" (avoids code like "actor = ...")
   # Also no . followed by a word character (avoids 'actor.md' or 'actor.update()')
-  - '(?<![\/\-#\.\w`])actor(?![\/\}\-\w])(?! =)(?!\.\w)'
+  - '(?<![\/\-#\.`\w])actor(?![\/\}\-\w])(?! =)(?!\.\w)'
 nonword: false
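
For a quick sanity check of what the tuned pattern does and doesn't flag, here's a minimal sketch in Node.js (not part of the commit; the test strings are made up). Node's regex engine supports the same lookbehind/lookahead syntax used here:

```js
// The 'actor' token copied from the new line of the diff above.
const pattern = /(?<![\/\-#\.`\w])actor(?![\/\}\-\w])(?! =)(?!\.\w)/;

console.log(pattern.test('Run the actor locally.'));     // true - plain prose is flagged
console.log(pattern.test('See actor.update() docs.'));   // false - method call is ignored
console.log(pattern.test('actor = new Actor()'));        // false - assignment is ignored
console.log(pattern.test('https://apify.com/actor-x'));  // false - URL/path is ignored
```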

sources/academy/glossary/concepts/index.md

Lines changed: 2 additions & 2 deletions
@@ -12,8 +12,8 @@ slug: /concepts

 ---

-There are some terms and concepts you'll see frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson.
+You'll see some terms and concepts frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however it would be inconvenient for our readers to explain these terms each time they appear in a lesson.

-Because of this slight dilemma, and because there are no outside resources which compile all of these concepts into an educational and digestible form, we've decided to do just that. So, welcome to the **Concepts** section of the Apify Academy's **Glossary**!
+Because of this slight dilemma, and because there are no outside resources which compile all of these concepts into an educational and digestible form, we've decided to do just that. Welcome to the **Concepts** section of the Apify Academy's **Glossary**!

 > It's important to note that there is no specific order to these concepts. All of them range in their relevance and importance to your every day scraping endeavors.

sources/academy/platform/expert_scraping_with_apify/index.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Before developing a pro-level Apify scraper, there are some important things you

 ### Crawlee, Apify SDK, and the Apify CLI {#crawlee-apify-sdk-and-cli}

-If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5-10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson](../../webscraping/scraping_basics_javascript/crawling/pro_scraping.md) in the **Web scraping for beginners** course (and ideally follow along). To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](../apify_platform.md) category.
+If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson](../../webscraping/scraping_basics_javascript/crawling/pro_scraping.md) in the **Web scraping for beginners** course (and ideally follow along). To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](../apify_platform.md) category.

 The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson](../../glossary/tools/apify_cli.md).


sources/academy/platform/expert_scraping_with_apify/solutions/saving_stats.md

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => {

 **Q: Is storing these types of values necessary for every single Actor?**

-**A:** For small Actors, it might be a waste of time to do this. For large-scale Actors, it can be extremely helpful when debugging and most definitely worth the extra 10-20 minutes of development time. Usually though, the default statistics from the Crawlee and the SDK might be enough for simple run stats.
+**A:** For small Actors, it might be a waste of time to do this. For large-scale Actors, it can be extremely helpful when debugging and most definitely worth the extra 10–20 minutes of development time. Usually though, the default statistics from the Crawlee and the SDK might be enough for simple run stats.

 **Q: What is the difference between the `failedRequestHandler` and `errorHandler`?**

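Since the answer to that last question is cut off in this diff, a hedged sketch of where the two hooks live on a Crawlee crawler may help; both option names are real Crawlee options, while the handler bodies are illustrative:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // ... scraping logic that may throw ...
    },
    // errorHandler runs after every failed attempt that will still be retried.
    errorHandler({ request }, error) {
        request.userData.lastError = error.message;
    },
    // failedRequestHandler runs once per request, after all retries are exhausted.
    failedRequestHandler({ request }, error) {
        console.error(`${request.url} failed ${request.retryCount} times: ${error.message}`);
    },
});
```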

sources/academy/platform/get_most_of_actors/index.md

Lines changed: 1 addition & 1 deletion
@@ -22,4 +22,4 @@ In this section, we will go over some of the practical steps you can take to ens

 ## Next up {#next}

-So, without further ado, let's jump [right into the next lesson](./naming_your_actor.md)!
+Without further ado, let's jump [right into the next lesson](./naming_your_actor.md)!

sources/academy/platform/get_most_of_actors/naming_your_actor.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ slug: /get-most-of-actors/naming-your-actor

 Naming your Actor can be tricky. Especially when you've spent a long time coding and are excited to show your brand-new creation to the world. To help users find your Actor, we've introduced naming standards. These standards improve your Actor's [search engine optimization (SEO)](https://en.wikipedia.org/wiki/Search_engine_optimization) and maintain consistency in the [Apify Store](https://apify.com/store).

-> Your Actor's name should be 3-63 characters long.
+> Your Actor's name should be 3–63 characters long.

 ## Scrapers {#scrapers}


sources/academy/tutorials/api/retry_failed_requests.md

Lines changed: 2 additions & 2 deletions
@@ -9,9 +9,9 @@ slug: /api/retry-failed-requests

 ---

-There are many reasons why requests for a scraper could fail. The most common causes are different page layouts or proxy blocking issues ([check here on how to effectively analyze errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors)). Both [Apify](https://apify.com) and [Crawlee](https://crawlee.dev/) allow you to restart your scraper run from the point where it ended, but there is no native functionality to re-scrape only failed requests. Usually, you also want to first analyze the problem, update the code, and build it before trying again.
+Requests of a scraper can fail for many reasons. The most common causes are different page layouts or proxy blocking issues ([check here on how to effectively analyze errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors)). Both [Apify](https://apify.com) and [Crawlee](https://crawlee.dev/) allow you to restart your scraper run from the point where it ended, but there is no native functionality to re-scrape only failed requests. Usually, you also want to first analyze the problem, update the code, and build it before trying again.

-If you attempt to restart an already finished run, it will likely immediately finish because all the requests in the [request queue](https://crawlee.dev/docs/guides/request-storage) are marked as handled. So you need to update the failed requests in the queue to be marked as pending again.
+If you attempt to restart an already finished run, it will likely immediately finish because all the requests in the [request queue](https://crawlee.dev/docs/guides/request-storage) are marked as handled. You need to update the failed requests in the queue to be marked as pending again.

 The additional complication is that the [Request](https://crawlee.dev/api/core/class/Request) object doesn't have anything like the `isFailed` property. We have to approximate it using other fields. Fortunately, we can use the `errorMessages` and `retryCount` properties to identify failed requests. Unless the user explicitly has overridden these properties, we can identify failed requests with a larger amount of `errorMessages` than `retryCount`. That happens because the last error that doesn't cause a retry anymore is added to `errorMessages`.

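To illustrate the heuristic that diff describes, here's a hedged sketch using the `apify-client` package; `MY-QUEUE-ID` is a placeholder, and resetting `handledAt` to mark a request as pending again is an assumption to verify against the current API docs:

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const queue = client.requestQueue('MY-QUEUE-ID'); // placeholder queue ID

let exclusiveStartId;
for (;;) {
    // Page through the queue in batches.
    const { items } = await queue.listRequests({ exclusiveStartId, limit: 1000 });
    if (items.length === 0) break;

    for (const request of items) {
        // Failed requests end up with more errorMessages than retryCount,
        // because the final, non-retried error is still appended.
        const errors = request.errorMessages?.length ?? 0;
        if (errors > (request.retryCount ?? 0)) {
            // Clear handledAt so the request counts as pending again (assumption).
            await queue.updateRequest({ ...request, handledAt: null });
        }
    }
    exclusiveStartId = items[items.length - 1].id;
}
```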

sources/academy/tutorials/node_js/analyzing_pages_and_fixing_errors.md

Lines changed: 4 additions & 4 deletions
@@ -77,14 +77,14 @@ Read more information about logging and error handling in our developer [best pr

 By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright](../../webscraping/puppeteer_playwright/index.md) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can easily display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

-Note that an error can happen only in a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.
+Note that an error can happen only in a few pages out of a thousand and look completely random. You cannot do much else than to save and analyze a snapshot.

 Snapshots can tell you if:

 - A website has changed its layout. This can also mean A/B testing or different content for different locations.
-- You have been blocked – you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
-- Data load later dynamically – the page is empty.
-- The page was redirected – the content is different.
+- You have been blocked—you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
+- Data load later dynamically—the page is empty.
+- The page was redirected—the content is different.

 You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson](../../webscraping/puppeteer_playwright/page/page_methods.md)

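For context, a minimal sketch of taking such a snapshot with Crawlee's Puppeteer utilities; `saveSnapshot` is the documented helper, while the key name and the try/catch placement are illustrative:

```js
import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        try {
            // ... extraction logic that may fail on a few odd pages ...
        } catch (error) {
            // Save a screenshot plus the raw HTML to the key-value store,
            // so the broken page can be inspected later in a browser.
            await puppeteerUtils.saveSnapshot(page, { key: 'ERROR-SNAPSHOT' });
            throw error; // rethrow so Crawlee still retries or fails the request
        }
    },
});
```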

sources/academy/tutorials/node_js/dealing_with_dynamic_pages.md

Lines changed: 2 additions & 2 deletions
@@ -93,7 +93,7 @@ https://demo-webstore.apify.org/_next/image?url=https%3A%2F%2Fm.media-amazon.com

 The reason this is happening is because CheerioCrawler makes static HTTP requests, so it only manages to capture the content from the `DOMContentLoaded` event. Any elements or attributes generated dynamically thereafter using JavaScript (and usually XHR/Fetch requests) are not part of the downloaded HTML, and therefore are not accessible through the `$` object.

-So, what's the solution? We need to use something that is able to allow the page to follow through with the entire load process - a headless browser.
+What's the solution? We need to use something that is able to allow the page to follow through with the entire load process - a headless browser.

 ## Scraping dynamic content {#scraping-dynamic-content}

@@ -142,7 +142,7 @@ After running this one, we can see that our results look different from before.

 Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling.

-So, we've gotta scroll down the page to load these images. Luckily, because we're using Crawlee, we don't have to write the logic that will achieve that, because a utility function specifically for Puppeteer called [`infiniteScroll`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#infiniteScroll) already exists right in the library, and can be accessed through `utils.puppeteer`. Let's add it to our code now:
+We've gotta scroll down the page to load these images. Luckily, because we're using Crawlee, we don't have to write the logic that will achieve that, because a utility function specifically for Puppeteer called [`infiniteScroll`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#infiniteScroll) already exists right in the library, and can be accessed through `utils.puppeteer`. Let's add it to our code now:

 <RunnableCodeBlock className="language-js" type="puppeteer">
 	{Example}
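For reference, a hedged sketch of what that addition looks like inside a request handler; `infiniteScroll` is the documented Crawlee utility linked above, while the surrounding crawler wiring is illustrative:

```js
import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, parseWithCheerio }) {
        // Keep scrolling until no new content loads, so lazy-loaded
        // images receive their real URLs.
        await puppeteerUtils.infiniteScroll(page);

        // The fully rendered DOM (including image sources) can now be queried.
        const $ = await parseWithCheerio();
        // ... extract the img src attributes here ...
    },
});
```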

sources/academy/tutorials/node_js/js_in_html.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ These data objects will usually be attached to the window object (often prefixed

 ## Parsing {#parsing-objects}

-There are two ways to go about obtaining these objects to be used and manipulated in JavaScript code:
+You can obtain these objects to be used and manipulated in JavaScript in two ways:

 ### 1. Parsing them directly from the HTML

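As a taste of that first approach, a minimal sketch that pulls a window-attached object out of raw HTML with Cheerio; the HTML, variable name, and property are made up for illustration:

```js
import * as cheerio from 'cheerio';

// Illustrative page; real sites attach much larger objects to window.
const html = '<script>window.__INITIAL_STATE__ = {"products":[{"id":1}]};</script>';

const $ = cheerio.load(html);
const scriptContent = $('script').first().html();

// Grab the object literal assigned to the window property and parse it.
const match = scriptContent.match(/window\.__INITIAL_STATE__\s*=\s*(\{.*\})/);
const data = match ? JSON.parse(match[1]) : null;

console.log(data?.products); // [ { id: 1 } ]
```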
