
Commit eaef0aa

add crawlee parsel

1 parent cee1d9c

File tree

2 files changed: +69 -5 lines changed

docs/03_guides/02_crawlee.mdx
14 additions & 5 deletions
@@ -6,32 +6,41 @@ title: Using Crawlee
 import CodeBlock from '@theme/CodeBlock';
 
 import CrawleeBeautifulSoupExample from '!!raw-loader!./code/02_crawlee_beautifulsoup.py';
+import CrawleeParselExample from '!!raw-loader!./code/02_crawlee_parsel.py';
 import CrawleePlaywrightExample from '!!raw-loader!./code/02_crawlee_playwright.py';
 
 In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
 
 ## Introduction
 
-`Crawlee` is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs.
+[Crawlee](https://crawlee.dev/python) is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs.
 
-In this guide, you'll learn how to use Crawlee with `BeautifulSoupCrawler` and `PlaywrightCrawler` to build Apify Actors for web scraping.
+In this guide, you'll learn how to use Crawlee with [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) to build Apify Actors for web scraping.
 
 ## Actor with BeautifulSoupCrawler
 
-The `BeautifulSoupCrawler` is ideal for extracting data from static HTML pages. It uses `BeautifulSoup` for parsing and [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, `BeautifulSoupCrawler` is a great choice for your scraping tasks. Below is an example of how to use `BeautifulSoupCrawler` in an Apify Actor.
+The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is ideal for extracting data from static HTML pages. It uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing and [`ImpitHttpClient`](https://crawlee.dev/python/api/class/ImpitHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is a great choice for your scraping tasks. Below is an example of how to use it in an Apify Actor.
 
 <CodeBlock className="language-python">
 {CrawleeBeautifulSoupExample}
 </CodeBlock>
 
+## Actor with ParselCrawler
+
+The [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) works in the same way as [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), but it uses the [Parsel](https://parsel.readthedocs.io/en/latest/) library for HTML parsing, which allows more powerful and flexible data extraction through [XPath](https://en.wikipedia.org/wiki/XPath) selectors and is generally faster than [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Below is an example of how to use [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) in an Apify Actor.
+
+<CodeBlock className="language-python">
+{CrawleeParselExample}
+</CodeBlock>
+
 ## Actor with PlaywrightCrawler
 
-The `PlaywrightCrawler` is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use `PlaywrightCrawler` in an Apify Actor.
+The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) in an Apify Actor.
 
 <CodeBlock className="language-python">
 {CrawleePlaywrightExample}
 </CodeBlock>
 
 ## Conclusion
 
-In this guide, you learned how to use the `Crawlee` library in your Apify Actors. By using the `BeautifulSoupCrawler` and `PlaywrightCrawler` crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
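
The `02_crawlee_beautifulsoup.py` example that the guide imports is not part of this commit. For context, here is a minimal sketch of what such an Actor might look like, assuming it mirrors the Parsel example added below but extracts data through the crawler's BeautifulSoup context; the handler body and field names are illustrative, not the actual file:

```python
from __future__ import annotations

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor


async def main() -> None:
    async with Actor:
        # Read start URLs from the Actor input, with a default fallback.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Cap the crawl size; remove or raise the limit to crawl all links.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            Actor.log.info(f'Scraping {context.request.url}...')

            # Extract data with BeautifulSoup rather than Parsel selectors.
            soup = context.soup
            await context.push_data({
                'url': context.request.url,
                'title': soup.title.string if soup.title else None,
                'h1s': [h1.get_text() for h1 in soup.find_all('h1')],
            })

            # Follow links discovered on the page.
            await context.enqueue_links()

        await crawler.run(start_urls)
```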
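
The `02_crawlee_playwright.py` example imported by the guide is also absent from this commit. For context, a minimal sketch of a browser-based variant under the same assumptions; the `headless` flag and the extracted fields are illustrative:

```python
from __future__ import annotations

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor


async def main() -> None:
    async with Actor:
        # Read start URLs from the Actor input, with a default fallback.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # A browser-based crawler that renders pages with Playwright.
        crawler = PlaywrightCrawler(max_requests_per_crawl=50, headless=True)

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info(f'Scraping {context.request.url}...')

            # The rendered page is available for JavaScript-generated content.
            await context.push_data({
                'url': context.request.url,
                'title': await context.page.title(),
            })

            await context.enqueue_links()

        await crawler.run(start_urls)
```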
docs/03_guides/code/02_crawlee_parsel.py
55 additions & 0 deletions
@@ -0,0 +1,55 @@
+from __future__ import annotations
+
+from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
+
+from apify import Actor
+
+
+async def main() -> None:
+    # Enter the context of the Actor.
+    async with Actor:
+        # Retrieve the Actor input, and use default values if not provided.
+        actor_input = await Actor.get_input() or {}
+        start_urls = [
+            url.get('url')
+            for url in actor_input.get(
+                'start_urls',
+                [{'url': 'https://apify.com'}],
+            )
+        ]
+
+        # Exit if no start URLs are provided.
+        if not start_urls:
+            Actor.log.info('No start URLs specified in Actor input, exiting...')
+            await Actor.exit()
+
+        # Create a crawler.
+        crawler = ParselCrawler(
+            # Limit the crawl to max requests.
+            # Remove or increase it for crawling all links.
+            max_requests_per_crawl=50,
+        )
+
+        # Define a request handler, which will be called for every request.
+        @crawler.router.default_handler
+        async def request_handler(context: ParselCrawlingContext) -> None:
+            url = context.request.url
+            Actor.log.info(f'Scraping {url}...')
+
+            # Extract the desired data.
+            data = {
+                'url': context.request.url,
+                'title': context.selector.xpath('//title/text()').get(),
+                'h1s': context.selector.xpath('//h1/text()').getall(),
+                'h2s': context.selector.xpath('//h2/text()').getall(),
+                'h3s': context.selector.xpath('//h3/text()').getall(),
+            }
+
+            # Store the extracted data to the default dataset.
+            await context.push_data(data)
+
+            # Enqueue additional links found on the current page.
+            await context.enqueue_links()
+
+        # Run the crawler with the starting requests.
+        await crawler.run(start_urls)
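
The new file defines only `main()`; the entrypoint that runs it is not part of this commit. A minimal sketch of such an entrypoint, assuming the standard Apify Python Actor template layout where the example above is saved as `src/main.py` (the path and module name are assumptions):

```python
# src/__main__.py (hypothetical entrypoint; run with `python -m src`)
import asyncio

from .main import main

if __name__ == '__main__':
    asyncio.run(main())
```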
