
Commit eaef0aa

add crawlee parsel

1 parent cee1d9c

File tree

2 files changed: +69 -5 lines changed

docs/03_guides/02_crawlee.mdx
14 additions & 5 deletions
@@ -6,32 +6,41 @@ title: Using Crawlee
 import CodeBlock from '@theme/CodeBlock';
 
 import CrawleeBeautifulSoupExample from '!!raw-loader!./code/02_crawlee_beautifulsoup.py';
+import CrawleeParselExample from '!!raw-loader!./code/02_crawlee_parsel.py';
 import CrawleePlaywrightExample from '!!raw-loader!./code/02_crawlee_playwright.py';
 
 In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
 
 ## Introduction
 
-`Crawlee` is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs.
+[Crawlee](https://crawlee.dev/python) is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs.
 
-In this guide, you'll learn how to use Crawlee with `BeautifulSoupCrawler` and `PlaywrightCrawler` to build Apify Actors for web scraping.
+In this guide, you'll learn how to use Crawlee with [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) to build Apify Actors for web scraping.
 
 ## Actor with BeautifulSoupCrawler
 
-The `BeautifulSoupCrawler` is ideal for extracting data from static HTML pages. It uses `BeautifulSoup` for parsing and [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, `BeautifulSoupCrawler` is a great choice for your scraping tasks. Below is an example of how to use `BeautifulSoupCrawler` in an Apify Actor.
+The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is ideal for extracting data from static HTML pages. It uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing and [`ImpitHttpClient`](https://crawlee.dev/python/api/class/ImpitHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is a great choice for your scraping tasks. Below is an example of how to use it in an Apify Actor.
 
 <CodeBlock className="language-python">
 {CrawleeBeautifulSoupExample}
 </CodeBlock>
 
+## Actor with ParselCrawler
+
+The [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) works in the same way as [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), but it uses the [Parsel](https://parsel.readthedocs.io/en/latest/) library for HTML parsing, which allows more powerful and flexible data extraction through [XPath](https://en.wikipedia.org/wiki/XPath) selectors and is generally faster than [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Below is an example of how to use [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) in an Apify Actor.
+
+<CodeBlock className="language-python">
+{CrawleeParselExample}
+</CodeBlock>
+
 ## Actor with PlaywrightCrawler
 
-The `PlaywrightCrawler` is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use `PlaywrightCrawler` in an Apify Actor.
+The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) in an Apify Actor.
 
 <CodeBlock className="language-python">
 {CrawleePlaywrightExample}
 </CodeBlock>
 
 ## Conclusion
 
-In this guide, you learned how to use the `Crawlee` library in your Apify Actors. By using the `BeautifulSoupCrawler` and `PlaywrightCrawler` crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
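
The `02_crawlee_beautifulsoup.py` example that the guide imports is not part of this commit. For context, here is a minimal sketch of what such an Actor might look like, assuming it mirrors the Parsel example added below but extracts data through the crawler's BeautifulSoup context; the handler body and field names are illustrative, not the actual file:

```python
from __future__ import annotations

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor


async def main() -> None:
    async with Actor:
        # Read start URLs from the Actor input, with a default fallback.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # Cap the crawl size; remove or raise the limit to crawl all links.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            Actor.log.info(f'Scraping {context.request.url}...')

            # Extract data with BeautifulSoup rather than Parsel selectors.
            soup = context.soup
            await context.push_data({
                'url': context.request.url,
                'title': soup.title.string if soup.title else None,
                'h1s': [h1.get_text() for h1 in soup.find_all('h1')],
            })

            # Follow links discovered on the page.
            await context.enqueue_links()

        await crawler.run(start_urls)
```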
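
The `02_crawlee_playwright.py` example imported by the guide is also absent from this commit. For context, a minimal sketch of a browser-based variant under the same assumptions; the `headless` flag and the extracted fields are illustrative:

```python
from __future__ import annotations

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor


async def main() -> None:
    async with Actor:
        # Read start URLs from the Actor input, with a default fallback.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]

        # A browser-based crawler that renders pages with Playwright.
        crawler = PlaywrightCrawler(max_requests_per_crawl=50, headless=True)

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info(f'Scraping {context.request.url}...')

            # The rendered page is available for JavaScript-generated content.
            await context.push_data({
                'url': context.request.url,
                'title': await context.page.title(),
            })

            await context.enqueue_links()

        await crawler.run(start_urls)
```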
docs/03_guides/code/02_crawlee_parsel.py
55 additions & 0 deletions
@@ -0,0 +1,55 @@
+from __future__ import annotations
+
+from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
+
+from apify import Actor
+
+
+async def main() -> None:
+    # Enter the context of the Actor.
+    async with Actor:
+        # Retrieve the Actor input, and use default values if not provided.
+        actor_input = await Actor.get_input() or {}
+        start_urls = [
+            url.get('url')
+            for url in actor_input.get(
+                'start_urls',
+                [{'url': 'https://apify.com'}],
+            )
+        ]
+
+        # Exit if no start URLs are provided.
+        if not start_urls:
+            Actor.log.info('No start URLs specified in Actor input, exiting...')
+            await Actor.exit()
+
+        # Create a crawler.
+        crawler = ParselCrawler(
+            # Limit the crawl to max requests.
+            # Remove or increase it for crawling all links.
+            max_requests_per_crawl=50,
+        )
+
+        # Define a request handler, which will be called for every request.
+        @crawler.router.default_handler
+        async def request_handler(context: ParselCrawlingContext) -> None:
+            url = context.request.url
+            Actor.log.info(f'Scraping {url}...')
+
+            # Extract the desired data.
+            data = {
+                'url': context.request.url,
+                'title': context.selector.xpath('//title/text()').get(),
+                'h1s': context.selector.xpath('//h1/text()').getall(),
+                'h2s': context.selector.xpath('//h2/text()').getall(),
+                'h3s': context.selector.xpath('//h3/text()').getall(),
+            }
+
+            # Store the extracted data to the default dataset.
+            await context.push_data(data)
+
+            # Enqueue additional links found on the current page.
+            await context.enqueue_links()
+
+        # Run the crawler with the starting requests.
+        await crawler.run(start_urls)
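
The new file defines only `main()`; the entrypoint that runs it is not part of this commit. A minimal sketch of such an entrypoint, assuming the standard Apify Python Actor template layout where the example above is saved as `src/main.py` (the path and module name are assumptions):

```python
# src/__main__.py (hypothetical entrypoint; run with `python -m src`)
import asyncio

from .main import main

if __name__ == '__main__':
    asyncio.run(main())
```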
