
Commit f3e3473

add parsel impit

1 parent eaef0aa

11 files changed: +127 −20 lines

docs/03_guides/01_beautifulsoup_httpx.mdx

Lines changed: 5 additions & 5 deletions

@@ -11,20 +11,20 @@ In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.co
 
 ## Introduction
 
-`BeautifulSoup` is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
+[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
 
-`HTTPX` is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
+[HTTPX](https://www.python-httpx.org/) is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
 
-To create an `Actor` which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the `BeautifulSoup` and `HTTPX` libraries preinstalled, allowing you to begin development immediately.
+To create an Actor which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries preinstalled, allowing you to begin development immediately.
 
 ## Example Actor
 
-Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses `HTTPX` for fetching pages and `BeautifulSoup` for parsing their content to extract titles and links to other pages.
+Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
 
 <CodeBlock className="language-python">
 {BeautifulSoupHttpxExample}
 </CodeBlock>
 
 ## Conclusion
 
-In this guide, you learned how to use the `BeautifulSoup` with the `HTTPX` in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+In this guide, you learned how to use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
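
As a quick reference alongside this diff, here is a minimal, self-contained sketch of the fetch-and-parse pattern the guide describes: HTTPX retrieves a page and BeautifulSoup reads its title. The helper name and target URL are illustrative, not part of the commit.

import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch_title(url: str) -> str | None:
    # Fetch the page body asynchronously with HTTPX.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url)

    # Parse the HTML and return the <title> text, if present.
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.title.string if soup.title else None


print(asyncio.run(fetch_title('https://apify.com')))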

docs/03_guides/02_parsel_impit.mdx

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+---
+id: parsel-impit
+title: Using Parsel with Impit
+---
+
+import CodeBlock from '@theme/CodeBlock';
+
+import ParselImpitExample from '!!raw-loader!./code/02_parsel_impit.py';
+
+In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
+
+## Introduction
+
+[Parsel](https://github.com/scrapy/parsel) is a Python library for extracting data from HTML and XML documents using CSS selectors and [XPath](https://en.wikipedia.org/wiki/XPath) expressions. It offers an intuitive API for navigating and extracting structured data, making it a popular choice for web scraping. Compared to [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), it also delivers better performance.
+
+[Impit](https://github.com/apify/impit) is Apify's high-performance HTTP client for Python. It supports both synchronous and asynchronous workflows and is built for large-scale web scraping, where making thousands of requests efficiently is essential. With built-in browser impersonation and anti-blocking features, it simplifies handling modern websites.
+
+## Example Actor
+
+The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
+
+<CodeBlock className="language-python">
+{ParselImpitExample}
+</CodeBlock>
+
+## Conclusion
+
+In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
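
Since the new guide leans on Parsel's CSS selector and XPath support, a minimal sketch of both extraction styles may help; the HTML snippet below is made up for illustration.

import parsel

html = '<html><head><title>Example</title></head><body><h1>Hello</h1><a href="/a">A</a></body></html>'
selector = parsel.Selector(text=html)

# CSS selectors: ::text extracts text nodes, ::attr(...) extracts attribute values.
print(selector.css('title::text').get())       # 'Example'
print(selector.css('a::attr(href)').getall())  # ['/a']

# The same document queried with XPath.
print(selector.xpath('//h1/text()').get())     # 'Hello'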

docs/03_guides/02_crawlee.mdx renamed to docs/03_guides/05_crawlee.mdx

Lines changed: 3 additions & 3 deletions

@@ -5,9 +5,9 @@ title: Using Crawlee
 
 import CodeBlock from '@theme/CodeBlock';
 
-import CrawleeBeautifulSoupExample from '!!raw-loader!./code/02_crawlee_beautifulsoup.py';
-import CrawleeParselExample from '!!raw-loader!./code/02_crawlee_parsel.py';
-import CrawleePlaywrightExample from '!!raw-loader!./code/02_crawlee_playwright.py';
+import CrawleeBeautifulSoupExample from '!!raw-loader!./code/05_crawlee_beautifulsoup.py';
+import CrawleeParselExample from '!!raw-loader!./code/05_crawlee_parsel.py';
+import CrawleePlaywrightExample from '!!raw-loader!./code/05_crawlee_playwright.py';
 
 In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
File renamed without changes.

docs/03_guides/code/01_beautifulsoup_httpx.py

Lines changed: 2 additions & 4 deletions

@@ -1,9 +1,7 @@
-from __future__ import annotations
-
 from urllib.parse import urljoin
 
+import httpx
 from bs4 import BeautifulSoup
-from httpx import AsyncClient
 
 from apify import Actor, Request
 
@@ -32,7+30,7 @@ async def main() -> None:
             await request_queue.add_request(new_request)
 
         # Create an HTTPX client to fetch the HTML content of the URLs.
-        async with AsyncClient() as client:
+        async with httpx.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url
docs/03_guides/code/02_parsel_impit.py

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
+from urllib.parse import urljoin
+
+import impit
+import parsel
+
+from apify import Actor, Request
+
+
+async def main() -> None:
+    # Enter the context of the Actor.
+    async with Actor:
+        # Retrieve the Actor input, and use default values if not provided.
+        actor_input = await Actor.get_input() or {}
+        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
+        max_depth = actor_input.get('max_depth', 1)
+
+        # Exit if no start URLs are provided.
+        if not start_urls:
+            Actor.log.info('No start URLs specified in Actor input, exiting...')
+            await Actor.exit()
+
+        # Open the default request queue for handling URLs to be processed.
+        request_queue = await Actor.open_request_queue()
+
+        # Enqueue the start URLs with an initial crawl depth of 0.
+        for start_url in start_urls:
+            url = start_url.get('url')
+            Actor.log.info(f'Enqueuing {url} ...')
+            new_request = Request.from_url(url, user_data={'depth': 0})
+            await request_queue.add_request(new_request)
+
+        # Create an Impit client to fetch the HTML content of the URLs.
+        async with impit.AsyncClient() as client:
+            # Process the URLs from the request queue.
+            while request := await request_queue.fetch_next_request():
+                url = request.url
+
+                if not isinstance(request.user_data['depth'], (str, int)):
+                    raise TypeError('Request.depth is an unexpected type.')
+
+                depth = int(request.user_data['depth'])
+                Actor.log.info(f'Scraping {url} (depth={depth}) ...')
+
+                try:
+                    # Fetch the HTTP response from the specified URL using Impit.
+                    response = await client.get(url)
+
+                    # Parse the HTML content using Parsel Selector.
+                    selector = parsel.Selector(text=response.text)
+
+                    # If the current depth is less than max_depth, find nested links
+                    # and enqueue them.
+                    if depth < max_depth:
+                        # Extract all links using CSS selector
+                        links = selector.css('a::attr(href)').getall()
+                        for link_href in links:
+                            link_url = urljoin(url, link_href)
+
+                            if link_url.startswith(('http://', 'https://')):
+                                Actor.log.info(f'Enqueuing {link_url} ...')
+                                new_request = Request.from_url(
+                                    link_url,
+                                    user_data={'depth': depth + 1},
+                                )
+                                await request_queue.add_request(new_request)
+
+                    # Extract the desired data using Parsel selectors.
+                    title = selector.css('title::text').get()
+                    h1s = selector.css('h1::text').getall()
+                    h2s = selector.css('h2::text').getall()
+                    h3s = selector.css('h3::text').getall()
+
+                    data = {
+                        'url': url,
+                        'title': title,
+                        'h1s': h1s,
+                        'h2s': h2s,
+                        'h3s': h3s,
+                    }
+
+                    # Store the extracted data to the default dataset.
+                    await Actor.push_data(data)
+
+                except Exception:
+                    Actor.log.exception(f'Cannot extract data from {url}.')
+
+                finally:
+                    # Mark the request as handled to ensure it is not processed again.
+                    await request_queue.mark_request_as_handled(request)
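
The file above uses a plain impit.AsyncClient. The new guide's introduction also advertises Impit's browser impersonation; the sketch below shows how that might be enabled. The browser parameter name is an assumption taken from Impit's documentation, so verify it against the current API.

import asyncio

import impit


async def fetch(url: str) -> str:
    # browser='chrome' asks Impit to impersonate a Chrome-like TLS and header
    # fingerprint (parameter name assumed; check the impit docs).
    async with impit.AsyncClient(browser='chrome') as client:
        response = await client.get(url)
        return response.text


print(len(asyncio.run(fetch('https://apify.com'))))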

docs/03_guides/code/03_playwright.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 from urllib.parse import urljoin
 
 from playwright.async_api import async_playwright
docs/03_guides/code/04_selenium.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 import asyncio
 from urllib.parse import urljoin
docs/03_guides/code/02_crawlee_beautifulsoup.py renamed to docs/03_guides/code/05_crawlee_beautifulsoup.py

Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
 
 from apify import Actor
File renamed without changes.
