
Commit 657da03

Merge pull request #1011 from honzajavorek/honzajavorek/stylistics
fix: improve stylistics and several explanations
2 parents 851f6b5 + d65b249 commit 657da03

5 files changed, +16 -4 lines changed

sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ A headless browser is simply a browser that runs without a user interface (UI).
 
 Building a Playwright scraper with Crawlee is extremely easy. To show you how easy it really is, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines of code, we'll turn it into a full headless scraper.
 
-First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large thanks to bundling all the browsers.
+First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large as it bundles all the browsers.
 
 ```shell
 npm install playwright
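For readers skimming only this diff, here is a minimal sketch of what the lesson's "few lines of code" change amounts to, assuming Crawlee's `PlaywrightCrawler` class and an illustrative start URL rather than the lesson's exact code:

```javascript
import { PlaywrightCrawler } from 'crawlee';

// The request handler now receives a Playwright `page` instead of a Cheerio `$`.
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        const title = await page.title();
        console.log(`${request.url}: ${title}`);
    },
});

// Illustrative start URL; the lesson uses its own example site.
await crawler.run(['https://example.com']);
```

Installing `playwright` alongside `crawlee` is what makes this swap possible, which is why the diff touches the installation note.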

sources/academy/webscraping/web_scraping_for_beginners/crawling/index.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ In this section, we will take a look at moving between web pages, which we call
 
 ## How do you crawl? {#how-to-crawl}
 
-Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case something happens to our scraper, we won't have to extract them again. And, finally, we will visit those URLs one by one.
+Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case our scraper crashes with an error, we won't have to extract them again. And, finally, we will visit those URLs one by one.
 
 At any point, we can extract URLs, data, or both. Crawling can be separate from data extraction, but it's not a requirement and, in most projects, it's actually easier and faster to do both at the same time. To summarize, it goes like this:
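The extract, filter, save, and visit sequence described in the changed paragraph can be sketched roughly like this, using Cheerio for link extraction; the start URL and the `/product/` filter are assumptions for illustration, not the lesson's code:

```javascript
import * as cheerio from 'cheerio';

const startUrl = 'https://example.com/collections/sales'; // illustrative only

// 1. Open the first page and extract all links, resolving relative URLs.
const response = await fetch(startUrl);
const $ = cheerio.load(await response.text());
const links = $('a[href]')
    .toArray()
    .map((el) => new URL($(el).attr('href'), startUrl).href);

// 2. Filter, so only the URLs we actually want to visit remain.
const productUrls = [...new Set(links)].filter((url) => url.includes('/product/'));

// 3. "Save" the URLs (here just in memory) and 4. visit them one by one.
for (const url of productUrls) {
    const detail = await fetch(url);
    console.log(url, detail.status);
}
```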

sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ We mentioned the benefits of developing using a dedicated scraping library in th
 2. **Fewer bugs**. Crawlee is fully unit-tested and battle-tested on millions of scraper runs.
 3. **Faster and cheaper scrapers** because Crawlee automatically scales based on system resources, and we optimize its performance regularly.
 4. **More robust scrapers**. Annoying details like retries, proxy management, error handling, and result storage are all handled out-of-the-box by Crawlee.
-5. **Helpful community**. You can [join our Discord](https://discord.gg/qkMS6pU4cF) or talk to us [on GitHub](https://github.com/apify/crawlee). We're almost always there to talk about scraping and programming in general.
+5. **Helpful community**. You can [join our Discord](https://discord.gg/qkMS6pU4cF) or talk to us [on GitHub](https://github.com/apify/crawlee/discussions). We're almost always there to talk about scraping and programming in general.
 
 :::tip
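To give a sense of what "handled out-of-the-box" means in item 4 of the list above, here is a hedged sketch using Crawlee's `CheerioCrawler` and `Dataset`; the selector, retry count, and start URL are placeholders, not code from the course:

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Failed requests are retried automatically up to this many times.
    maxRequestRetries: 3,
    requestHandler: async ({ request, $ }) => {
        // Results go to Crawlee's default dataset storage without extra setup.
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
        });
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```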

sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md

Lines changed: 12 additions & 0 deletions
@@ -55,6 +55,12 @@ It will produce a result like this, but it **won't be** the Sony subwoofer.
 
 ![Query a selector with JavaScript](./images/devtools-collection-query.png)
 
+:::note About the missing semicolon
+
+In the screenshot, there is a missing semicolon `;` at the end of the line. In JavaScript, semicolons are optional, so it makes no difference.
+
+:::
+
 When we look more closely by hovering over the result in the Console, we find that instead of the Sony subwoofer, we found a JBL Flip speaker. Why? Because earlier we explained that `document.querySelector('.product-item')` finds the **first element** with the `product-item` class, and the JBL speaker is the first product in the list.
 
 ![Hover over a query result](./images/devtools-collection-query-hover.png)
@@ -73,6 +79,12 @@ It will return a `NodeList` (a type of array) with many results. Expand the resu
 
 Naturally, this is the method we use mostly in web scraping, because we're usually interested in scraping all the products from a page, not just a single product.
 
+:::note Elements or nodes?
+
+The list is called a `NodeList`, because browsers understand an HTML document as a tree of nodes. Most of the nodes are HTML elements, but there can also be text nodes for plain text, and others.
+
+:::
+
 ## How to choose good selectors {#choose-good-selectors}
 
 There are always multiple ways to select an element using CSS selectors. Try to choose selectors that are **simple**, **human-readable**, **unique** and **semantically connected** to the data. Selectors that meet these criteria are sometimes called **resilient selectors**, because they're the most reliable and least likely to change with website updates. If you can, avoid randomly generated attributes like `class="F4jsL8"`. They change often and without warning.
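To make the difference between the two query methods and the `NodeList` note concrete, here is a small snippet that could be pasted into the DevTools Console on the lesson's example shop; the `.product-item` class comes from the lesson, everything else is illustrative:

```javascript
// querySelector returns only the FIRST matching element (the JBL speaker in the lesson).
const first = document.querySelector('.product-item');

// querySelectorAll returns a NodeList with ALL matching elements.
const all = document.querySelectorAll('.product-item');
console.log(all.length); // number of product cards on the page

// A NodeList is not a true Array, but it can be converted into one.
const texts = Array.from(all).map((item) => item.textContent.trim());
console.log(texts); // the full text content of each product card
```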

sources/academy/webscraping/web_scraping_for_beginners/introduction.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Web data extraction (or collection) is a process that takes a web page, like an
 
 ## What is crawling? {#what-is-crawling}
 
-Where web data extraction focuses on a single page, web crawling (sometimes called spidering 🕷) is all about movement between pages or websites. The purpose of crawling is to travel across the website to find pages with the information we want. Crawling and collection can happen simultaneously, while moving from page to page, or separately, where one scraper focuses solely on finding pages with data and another scraper collects the data. The main purpose of crawling is to collect URLs or other links that can be used to move around.
+Where web data extraction focuses on a single page, web crawling (sometimes called spidering 🕷) is all about movement between pages or websites. The purpose of crawling is to travel across the website to find pages with the information we want. Crawling and collection can happen either simultaneously, while moving from page to page, or separately, where one scraper focuses solely on finding pages with data, and another scraper collects the data. The main purpose of crawling is to collect URLs or other links that can be used to move around.
 
 ## What is web scraping? {#what-is-web-scraping}
