
Commit 657da03

Merge pull request #1011 from honzajavorek/honzajavorek/stylistics
fix: improve stylistics and several explanations
2 parents 851f6b5 + d65b249 commit 657da03

5 files changed, +16 -4 lines changed

sources/academy/webscraping/web_scraping_for_beginners/crawling/headless_browser.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ A headless browser is simply a browser that runs without a user interface (UI).
 
 Building a Playwright scraper with Crawlee is extremely easy. To show you how easy it really is, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines of code, we'll turn it into a full headless scraper.
 
-First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large thanks to bundling all the browsers.
+First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large as it bundles all the browsers.
 
 ```shell
 npm install playwright
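For readers skimming only this diff, here is a minimal sketch of what the lesson's "few lines of code" change amounts to, assuming Crawlee's `PlaywrightCrawler` class and an illustrative start URL rather than the lesson's exact code:

```javascript
import { PlaywrightCrawler } from 'crawlee';

// The request handler now receives a Playwright `page` instead of a Cheerio `$`.
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        const title = await page.title();
        console.log(`${request.url}: ${title}`);
    },
});

// Illustrative start URL; the lesson uses its own example site.
await crawler.run(['https://example.com']);
```

Installing `playwright` alongside `crawlee` is what makes this swap possible, which is why the diff touches the installation note.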

sources/academy/webscraping/web_scraping_for_beginners/crawling/index.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ In this section, we will take a look at moving between web pages, which we call
 
 ## How do you crawl? {#how-to-crawl}
 
-Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case something happens to our scraper, we won't have to extract them again. And, finally, we will visit those URLs one by one.
+Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case our scraper crashes with an error, we won't have to extract them again. And, finally, we will visit those URLs one by one.
 
 At any point, we can extract URLs, data, or both. Crawling can be separate from data extraction, but it's not a requirement and, in most projects, it's actually easier and faster to do both at the same time. To summarize, it goes like this:
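The extract, filter, save, and visit sequence described in the changed paragraph can be sketched roughly like this, using Cheerio for link extraction; the start URL and the `/product/` filter are assumptions for illustration, not the lesson's code:

```javascript
import * as cheerio from 'cheerio';

const startUrl = 'https://example.com/collections/sales'; // illustrative only

// 1. Open the first page and extract all links, resolving relative URLs.
const response = await fetch(startUrl);
const $ = cheerio.load(await response.text());
const links = $('a[href]')
    .toArray()
    .map((el) => new URL($(el).attr('href'), startUrl).href);

// 2. Filter, so only the URLs we actually want to visit remain.
const productUrls = [...new Set(links)].filter((url) => url.includes('/product/'));

// 3. "Save" the URLs (here just in memory) and 4. visit them one by one.
for (const url of productUrls) {
    const detail = await fetch(url);
    console.log(url, detail.status);
}
```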

sources/academy/webscraping/web_scraping_for_beginners/crawling/pro_scraping.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ We mentioned the benefits of developing using a dedicated scraping library in th
 2. **Fewer bugs**. Crawlee is fully unit-tested and battle-tested on millions of scraper runs.
 3. **Faster and cheaper scrapers** because Crawlee automatically scales based on system resources, and we optimize its performance regularly.
 4. **More robust scrapers**. Annoying details like retries, proxy management, error handling, and result storage are all handled out-of-the-box by Crawlee.
-5. **Helpful community**. You can [join our Discord](https://discord.gg/qkMS6pU4cF) or talk to us [on GitHub](https://github.com/apify/crawlee). We're almost always there to talk about scraping and programming in general.
+5. **Helpful community**. You can [join our Discord](https://discord.gg/qkMS6pU4cF) or talk to us [on GitHub](https://github.com/apify/crawlee/discussions). We're almost always there to talk about scraping and programming in general.
 
 :::tip
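To give a sense of what "handled out-of-the-box" means in item 4 of the list above, here is a hedged sketch using Crawlee's `CheerioCrawler` and `Dataset`; the selector, retry count, and start URL are placeholders, not code from the course:

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Failed requests are retried automatically up to this many times.
    maxRequestRetries: 3,
    requestHandler: async ({ request, $ }) => {
        // Results go to Crawlee's default dataset storage without extra setup.
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
        });
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```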

sources/academy/webscraping/web_scraping_for_beginners/data_extraction/using_devtools.md

Lines changed: 12 additions & 0 deletions
@@ -55,6 +55,12 @@ It will produce a result like this, but it **won't be** the Sony subwoofer.
 
 ![Query a selector with JavaScript](./images/devtools-collection-query.png)
 
+:::note About the missing semicolon
+
+In the screenshot, there is a missing semicolon `;` at the end of the line. In JavaScript, semicolons are optional, so it makes no difference.
+
+:::
+
 When we look more closely by hovering over the result in the Console, we find that instead of the Sony subwoofer, we found a JBL Flip speaker. Why? Because earlier we explained that `document.querySelector('.product-item')` finds the **first element** with the `product-item` class, and the JBL speaker is the first product in the list.
 
 ![Hover over a query result](./images/devtools-collection-query-hover.png)
@@ -73,6 +79,12 @@ It will return a `NodeList` (a type of array) with many results. Expand the resu
 
 Naturally, this is the method we use mostly in web scraping, because we're usually interested in scraping all the products from a page, not just a single product.
 
+:::note Elements or nodes?
+
+The list is called a `NodeList`, because browsers understand an HTML document as a tree of nodes. Most of the nodes are HTML elements, but there can also be text nodes for plain text, and others.
+
+:::
+
 ## How to choose good selectors {#choose-good-selectors}
 
 There are always multiple ways to select an element using CSS selectors. Try to choose selectors that are **simple**, **human-readable**, **unique** and **semantically connected** to the data. Selectors that meet these criteria are sometimes called **resilient selectors**, because they're the most reliable and least likely to change with website updates. If you can, avoid randomly generated attributes like `class="F4jsL8"`. They change often and without warning.
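To make the difference between the two query methods and the `NodeList` note concrete, here is a small snippet that could be pasted into the DevTools Console on the lesson's example shop; the `.product-item` class comes from the lesson, everything else is illustrative:

```javascript
// querySelector returns only the FIRST matching element (the JBL speaker in the lesson).
const first = document.querySelector('.product-item');

// querySelectorAll returns a NodeList with ALL matching elements.
const all = document.querySelectorAll('.product-item');
console.log(all.length); // number of product cards on the page

// A NodeList is not a true Array, but it can be converted into one.
const texts = Array.from(all).map((item) => item.textContent.trim());
console.log(texts); // the full text content of each product card
```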

sources/academy/webscraping/web_scraping_for_beginners/introduction.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Web data extraction (or collection) is a process that takes a web page, like an
 
 ## What is crawling? {#what-is-crawling}
 
-Where web data extraction focuses on a single page, web crawling (sometimes called spidering 🕷) is all about movement between pages or websites. The purpose of crawling is to travel across the website to find pages with the information we want. Crawling and collection can happen simultaneously, while moving from page to page, or separately, where one scraper focuses solely on finding pages with data and another scraper collects the data. The main purpose of crawling is to collect URLs or other links that can be used to move around.
+Where web data extraction focuses on a single page, web crawling (sometimes called spidering 🕷) is all about movement between pages or websites. The purpose of crawling is to travel across the website to find pages with the information we want. Crawling and collection can happen either simultaneously, while moving from page to page, or separately, where one scraper focuses solely on finding pages with data, and another scraper collects the data. The main purpose of crawling is to collect URLs or other links that can be used to move around.
 
 ## What is web scraping? {#what-is-web-scraping}
