Commit 20affee

style: few improvements to the text
1 parent c5a6dc0 commit 20affee

File tree

1 file changed: +6 -14 lines changed


sources/academy/webscraping/scraping_basics_python/12_framework.md

Lines changed: 6 additions & 14 deletions
@@ -13,14 +13,14 @@ slug: /scraping-basics-python/framework
 Before rewriting our code, let's point out several caveats in our current solution:
 
 - **Hard to maintain:** All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages.
-- **Slow:** The program runs sequentially, which is considerate toward the target website, but downloading even two product pages in parallel could improve speed by 200%.
+- **Slow:** The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient.
 - **No logging:** The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs.
-- **Boilerplate code:** We implement tasks like downloading and parsing HTML or exporting to CSV with custom code that feels like [boilerplate](https://en.wikipedia.org/wiki/Boilerplate_code). We could replace it with standardized solutions.
+- **Boilerplate code:** We implement downloading and parsing HTML, or exporting data to CSV, although we're not the first people to meet and solve these problems.
 - **Prone to anti-scraping:** If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working.
 - **Browser means rewrite:** We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program.
 - **No error handling:** The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors.
 
-In this lesson, we'll tackle all the above issues by using a scraping framework while keeping the code concise.
+In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework.
 
 :::info Why Crawlee and not Scrapy
 
@@ -104,21 +104,13 @@ Sales
 └───────────────────────────────┴──────────┘
 ```
 
-If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the diagnostics, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.
+If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.
 
 ## Crawling product detail pages
 
+## Extracting data
 
-
-
-<!--
-
-
-
-pip install 'crawlee[beautifulsoup]'
-
-
-
--->
+## Saving data
 
 :::danger Work in progress
 
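For reference, here is a minimal sketch of the kind of Crawlee scraper the paragraph in the diff describes. It is not code from this commit: it assumes the `BeautifulSoupCrawler` that ships with `pip install 'crawlee[beautifulsoup]'` (mentioned in the removed comment), and the Warehouse store sales URL used elsewhere in the course; the handler name is illustrative.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request that has no more specific
    # route; here it receives the product listing page.
    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext):
        # context.soup is the downloaded page, already parsed by BeautifulSoup.
        print(context.soup.title.text.strip())

    # URL assumed from earlier lessons in the course.
    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections/sales'])


if __name__ == '__main__':
    asyncio.run(main())
```

Running a program along these lines should produce Crawlee's log output with the page title, `Sales`, printed among the status lines, as the paragraph above describes.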