Commit 6747490

docs(academy-puppeteer): clarify disadvantages of browsers and unified cheerio parsing
1 parent 3571277 commit 6747490

2 files changed, 14 additions and 2 deletions

sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md

Lines changed: 6 additions & 0 deletions
@@ -19,6 +19,12 @@ Now that we know how to execute scripts on a page, we're ready to learn a bit ab
 1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`.
 2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio).
 
+::: tip
+
+If you are using Crawlee, we highly recommend the [parseWithCheerio](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio) function for unified data extraction syntax. This way, switching between browser and plain HTTP scraping is a breeze.
+
+:::
+
 ## Setup
 
 Here is the base setup for our code, which we'll build on in this lesson:
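
A minimal sketch of the two extraction approaches listed in the hunk above, plus the `parseWithCheerio` tip it adds. The `h2` selector and all function names are hypothetical and chosen only for illustration. `page` is assumed to be a Playwright or Puppeteer page, `cheerio` the Cheerio module, and `requestHandler` is assumed to be registered with a Crawlee crawler.

```javascript
// Selector and function names are illustrative assumptions, not part of the lesson.
const SELECTOR = 'h2';

// 1. Extract inside the browser: the callback runs in the page context,
//    receives real DOM elements, and must return serializable data.
async function extractInBrowser(page) {
    return page.$$eval(SELECTOR, (elements) => elements.map((el) => el.textContent.trim()));
}

// Shared Cheerio logic: works with any Cheerio root function `$`.
function extractTitles($) {
    return $(SELECTOR).toArray().map((el) => $(el).text().trim());
}

// 2. Extract in Node.js: take the rendered HTML out of the browser
//    and parse it with Cheerio (passed in as `cheerio`).
async function extractInNode(page, cheerio) {
    const html = await page.content();
    return extractTitles(cheerio.load(html));
}

// Crawlee variant: the crawling context is assumed to provide `parseWithCheerio`,
// so the same Cheerio logic is reused without touching the browser API directly.
async function requestHandler({ parseWithCheerio, request, log }) {
    const $ = await parseWithCheerio();
    log.info(`${request.url}: ${extractTitles($).join(', ')}`);
}
```

With Crawlee, the handler could be passed as `new PlaywrightCrawler({ requestHandler })`. Keeping the parsing in `extractTitles` is a design choice intended to let the same code carry over if the crawler is later swapped for a plain HTTP one.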

sources/academy/webscraping/puppeteer_playwright/index.md

Lines changed: 8 additions & 2 deletions
@@ -21,11 +21,17 @@ Both packages were developed by the same team and are very similar, which is why
 
 > Each lesson's activity will contain examples for both libraries, but we recommend using Playwright, as it is newer and has more features and better [documentation](https://playwright.dev/docs/intro).
 
-## Advantages of using a headless browser {#advantages-of-headless-browsers}
+## Advantages of using a headless browser {#advantages-of-headless-browsers}
 
 When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc.
 
-Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped).
+Additionally, since the requests aren't static, [dynamic content](../../glossary/concepts/dynamic_pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped). Use [headful](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) mode to see exactly what the browser is doing.
+
+Browsers can also be effective for [overcoming anti-scraping measures](../anti_scraping/index.md), especially if the website is running [JavaScript browser challenges](../anti_scraping/techniques/browser_challenges.md).
+
+## Disadvantages of headless browsers
+
+Browsers are slow and expensive to run compared with plain HTTP requests. In the follow-up courses, the Apify Academy will show you how to scrape websites without a browser. Every website can potentially be reverse-engineered into a series of quick and cheap HTTP calls, but it might require significant effort and specialized knowledge.
 
 ## Setup {#setup}
 
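
A minimal sketch of the headful mode mentioned in the hunk above. `openVisibly` is a hypothetical helper, not part of the lesson. It assumes it is given a Playwright browser type such as `chromium` (Puppeteer's `launch()` accepts the same `headless: false` option) and a URL of your choosing.

```javascript
// Hypothetical helper: launches a visible (headful) browser, loads a page,
// and returns its title. `browserType` is assumed to be e.g. Playwright's `chromium`.
async function openVisibly(browserType, url) {
    // `headless: false` opens a real browser window so each step can be watched.
    const browser = await browserType.launch({ headless: false });
    try {
        const page = await browser.newPage();
        await page.goto(url);
        return await page.title();
    } finally {
        // Always close the browser, even if navigation fails.
        await browser.close();
    }
}
```

For example, with Playwright installed this could be called as `openVisibly(require('playwright').chromium, 'https://example.com')`. Passing the browser type in as an argument keeps the sketch independent of which library launches the browser.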