
Commit f3e3473

add parsel impit

1 parent eaef0aa

11 files changed: +127 −20 lines

docs/03_guides/01_beautifulsoup_httpx.mdx

Lines changed: 5 additions & 5 deletions

@@ -11,20 +11,20 @@ In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.co
 
 ## Introduction
 
-`BeautifulSoup` is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
+[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
 
-`HTTPX` is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
+[HTTPX](https://www.python-httpx.org/) is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
 
-To create an `Actor` which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the `BeautifulSoup` and `HTTPX` libraries preinstalled, allowing you to begin development immediately.
+To create an Actor which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries preinstalled, allowing you to begin development immediately.
 
 ## Example Actor
 
-Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses `HTTPX` for fetching pages and `BeautifulSoup` for parsing their content to extract titles and links to other pages.
+Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
 
 <CodeBlock className="language-python">
 {BeautifulSoupHttpxExample}
 </CodeBlock>
 
 ## Conclusion
 
-In this guide, you learned how to use the `BeautifulSoup` with the `HTTPX` in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+In this guide, you learned how to use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
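
As a quick reference alongside this diff, here is a minimal, self-contained sketch of the fetch-and-parse pattern the guide describes: HTTPX retrieves a page and BeautifulSoup reads its title. The helper name and target URL are illustrative, not part of the commit.

import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch_title(url: str) -> str | None:
    # Fetch the page body asynchronously with HTTPX.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url)

    # Parse the HTML and return the <title> text, if present.
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.title.string if soup.title else None


print(asyncio.run(fetch_title('https://apify.com')))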

docs/03_guides/02_parsel_impit.mdx

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+---
+id: parsel-impit
+title: Using Parsel with Impit
+---
+
+import CodeBlock from '@theme/CodeBlock';
+
+import ParselImpitExample from '!!raw-loader!./code/02_parsel_impit.py';
+
+In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
+
+## Introduction
+
+[Parsel](https://github.com/scrapy/parsel) is a Python library for extracting data from HTML and XML documents using CSS selectors and [XPath](https://en.wikipedia.org/wiki/XPath) expressions. It offers an intuitive API for navigating and extracting structured data, making it a popular choice for web scraping. Compared to [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), it also delivers better performance.
+
+[Impit](https://github.com/apify/impit) is Apify's high-performance HTTP client for Python. It supports both synchronous and asynchronous workflows and is built for large-scale web scraping, where making thousands of requests efficiently is essential. With built-in browser impersonation and anti-blocking features, it simplifies handling modern websites.
+
+## Example Actor
+
+The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
+
+<CodeBlock className="language-python">
+{ParselImpitExample}
+</CodeBlock>
+
+## Conclusion
+
+In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
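
Since the new guide leans on Parsel's CSS selector and XPath support, a minimal sketch of both extraction styles may help; the HTML snippet below is made up for illustration.

import parsel

html = '<html><head><title>Example</title></head><body><h1>Hello</h1><a href="/a">A</a></body></html>'
selector = parsel.Selector(text=html)

# CSS selectors: ::text extracts text nodes, ::attr(...) extracts attribute values.
print(selector.css('title::text').get())       # 'Example'
print(selector.css('a::attr(href)').getall())  # ['/a']

# The same document queried with XPath.
print(selector.xpath('//h1/text()').get())     # 'Hello'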

docs/03_guides/02_crawlee.mdx renamed to docs/03_guides/05_crawlee.mdx

Lines changed: 3 additions & 3 deletions

@@ -5,9 +5,9 @@ title: Using Crawlee
 
 import CodeBlock from '@theme/CodeBlock';
 
-import CrawleeBeautifulSoupExample from '!!raw-loader!./code/02_crawlee_beautifulsoup.py';
-import CrawleeParselExample from '!!raw-loader!./code/02_crawlee_parsel.py';
-import CrawleePlaywrightExample from '!!raw-loader!./code/02_crawlee_playwright.py';
+import CrawleeBeautifulSoupExample from '!!raw-loader!./code/05_crawlee_beautifulsoup.py';
+import CrawleeParselExample from '!!raw-loader!./code/05_crawlee_parsel.py';
+import CrawleePlaywrightExample from '!!raw-loader!./code/05_crawlee_playwright.py';
 
 In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
File renamed without changes.

docs/03_guides/code/01_beautifulsoup_httpx.py

Lines changed: 2 additions & 4 deletions

@@ -1,9 +1,7 @@
-from __future__ import annotations
-
 from urllib.parse import urljoin
 
+import httpx
 from bs4 import BeautifulSoup
-from httpx import AsyncClient
 
 from apify import Actor, Request
 
@@ -32,7+30,7 @@ async def main() -> None:
             await request_queue.add_request(new_request)
 
         # Create an HTTPX client to fetch the HTML content of the URLs.
-        async with AsyncClient() as client:
+        async with httpx.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url
docs/03_guides/code/02_parsel_impit.py

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
+from urllib.parse import urljoin
+
+import impit
+import parsel
+
+from apify import Actor, Request
+
+
+async def main() -> None:
+    # Enter the context of the Actor.
+    async with Actor:
+        # Retrieve the Actor input, and use default values if not provided.
+        actor_input = await Actor.get_input() or {}
+        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
+        max_depth = actor_input.get('max_depth', 1)
+
+        # Exit if no start URLs are provided.
+        if not start_urls:
+            Actor.log.info('No start URLs specified in Actor input, exiting...')
+            await Actor.exit()
+
+        # Open the default request queue for handling URLs to be processed.
+        request_queue = await Actor.open_request_queue()
+
+        # Enqueue the start URLs with an initial crawl depth of 0.
+        for start_url in start_urls:
+            url = start_url.get('url')
+            Actor.log.info(f'Enqueuing {url} ...')
+            new_request = Request.from_url(url, user_data={'depth': 0})
+            await request_queue.add_request(new_request)
+
+        # Create an Impit client to fetch the HTML content of the URLs.
+        async with impit.AsyncClient() as client:
+            # Process the URLs from the request queue.
+            while request := await request_queue.fetch_next_request():
+                url = request.url
+
+                if not isinstance(request.user_data['depth'], (str, int)):
+                    raise TypeError('Request.depth is an unexpected type.')
+
+                depth = int(request.user_data['depth'])
+                Actor.log.info(f'Scraping {url} (depth={depth}) ...')
+
+                try:
+                    # Fetch the HTTP response from the specified URL using Impit.
+                    response = await client.get(url)
+
+                    # Parse the HTML content using Parsel Selector.
+                    selector = parsel.Selector(text=response.text)
+
+                    # If the current depth is less than max_depth, find nested links
+                    # and enqueue them.
+                    if depth < max_depth:
+                        # Extract all links using CSS selector
+                        links = selector.css('a::attr(href)').getall()
+                        for link_href in links:
+                            link_url = urljoin(url, link_href)
+
+                            if link_url.startswith(('http://', 'https://')):
+                                Actor.log.info(f'Enqueuing {link_url} ...')
+                                new_request = Request.from_url(
+                                    link_url,
+                                    user_data={'depth': depth + 1},
+                                )
+                                await request_queue.add_request(new_request)
+
+                    # Extract the desired data using Parsel selectors.
+                    title = selector.css('title::text').get()
+                    h1s = selector.css('h1::text').getall()
+                    h2s = selector.css('h2::text').getall()
+                    h3s = selector.css('h3::text').getall()
+
+                    data = {
+                        'url': url,
+                        'title': title,
+                        'h1s': h1s,
+                        'h2s': h2s,
+                        'h3s': h3s,
+                    }
+
+                    # Store the extracted data to the default dataset.
+                    await Actor.push_data(data)
+
+                except Exception:
+                    Actor.log.exception(f'Cannot extract data from {url}.')
+
+                finally:
+                    # Mark the request as handled to ensure it is not processed again.
+                    await request_queue.mark_request_as_handled(request)
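
The file above uses a plain impit.AsyncClient. The new guide's introduction also advertises Impit's browser impersonation; the sketch below shows how that might be enabled. The browser parameter name is an assumption taken from Impit's documentation, so verify it against the current API.

import asyncio

import impit


async def fetch(url: str) -> str:
    # browser='chrome' asks Impit to impersonate a Chrome-like TLS and header
    # fingerprint (parameter name assumed; check the impit docs).
    async with impit.AsyncClient(browser='chrome') as client:
        response = await client.get(url)
        return response.text


print(len(asyncio.run(fetch('https://apify.com'))))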

docs/03_guides/code/03_playwright.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 from urllib.parse import urljoin
 
 from playwright.async_api import async_playwright
docs/03_guides/code/04_selenium.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 import asyncio
 from urllib.parse import urljoin
docs/03_guides/code/02_crawlee_beautifulsoup.py renamed to docs/03_guides/code/05_crawlee_beautifulsoup.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
 
 from apify import Actor
File renamed without changes.
