docs: Add Parsel and Crawlee Parsel guides #577
Merged
Commits
cee1d9c reorder of sections (vdusek)
eaef0aa add crawlee parsel (vdusek)
f3e3473 add parsel impit (vdusek)
d8e3b51 fix scrapy integration test path (vdusek)
4342563 rm unnecessary future import (vdusek)
704b76d address the feedback (vdusek)
File renamed without changes.
This file was deleted.
This file was deleted.
@@ -0,0 +1,30 @@
---
id: beautifulsoup-httpx
title: Using BeautifulSoup with HTTPX
---

import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupHttpxExample from '!!raw-loader!./code/01_beautifulsoup_httpx.py';

In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors.

## Introduction

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
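
For a quick feel for the API, here is a minimal, self-contained example (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, just to illustrate the API.
html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)    # Hello
print(soup.a['href'])  # /about
```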

[HTTPX](https://www.python-httpx.org/) is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.
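
A minimal sketch of an asynchronous request with HTTPX might look like this:

```python
import asyncio

import httpx


async def main() -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get('https://apify.com')
        print(response.status_code, len(response.text))


asyncio.run(main())
```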

To create an Actor which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries preinstalled, allowing you to begin development immediately.

## Example Actor

Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.

<CodeBlock className="language-python">
    {BeautifulSoupHttpxExample}
</CodeBlock>

## Conclusion

In this guide, you learned how to use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
@@ -0,0 +1,28 @@
---
id: parsel-impit
title: Using Parsel with Impit
---

import CodeBlock from '@theme/CodeBlock';

import ParselImpitExample from '!!raw-loader!./code/02_parsel_impit.py';

In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.

## Introduction

[Parsel](https://github.com/scrapy/parsel) is a Python library for extracting data from HTML and XML documents using CSS selectors and [XPath](https://en.wikipedia.org/wiki/XPath) expressions. It offers an intuitive API for navigating and extracting structured data, making it a popular choice for web scraping. Compared to [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/), it also delivers better performance.
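
For instance (the HTML string is made up for illustration):

```python
import parsel

# A made-up HTML snippet, just to illustrate the API.
html = '<html><body><h1>Hello</h1><a href="/about">About</a></body></html>'
selector = parsel.Selector(text=html)

print(selector.css('h1::text').get())     # Hello
print(selector.xpath('//a/@href').get())  # /about
```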

[Impit](https://github.com/apify/impit) is Apify's high-performance HTTP client for Python. It supports both synchronous and asynchronous workflows and is built for large-scale web scraping, where making thousands of requests efficiently is essential. With built-in browser impersonation and anti-blocking features, it simplifies handling modern websites.
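
A minimal asynchronous request could look like the sketch below; the `browser` argument for impersonation is an assumption based on Impit's documentation, so check the current API for supported values:

```python
import asyncio

import impit


async def main() -> None:
    # `browser='chrome'` enables browser impersonation (assumed parameter;
    # see Impit's docs for the currently supported values).
    async with impit.AsyncClient(browser='chrome') as client:
        response = await client.get('https://apify.com')
        print(response.status_code)


asyncio.run(main())
```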

## Example Actor

The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.

<CodeBlock className="language-python">
    {ParselImpitExample}
</CodeBlock>

## Conclusion

In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,46 @@
---
id: crawlee
title: Using Crawlee
---

import CodeBlock from '@theme/CodeBlock';

import CrawleeBeautifulSoupExample from '!!raw-loader!./code/05_crawlee_beautifulsoup.py';
import CrawleeParselExample from '!!raw-loader!./code/05_crawlee_parsel.py';
import CrawleePlaywrightExample from '!!raw-loader!./code/05_crawlee_playwright.py';

In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.

## Introduction

[Crawlee](https://crawlee.dev/python) is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to handling dynamic JavaScript-rendered content. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs.

In this guide, you'll learn how to use Crawlee with [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) to build Apify Actors for web scraping.
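
All three crawlers share the same basic shape: you instantiate a crawler, register a default request handler on its router, and run it with a list of start URLs. Here is a minimal sketch using [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) outside the Actor context; note that exact import paths may differ between Crawlee versions:

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(max_requests_per_crawl=10)

    # The default handler is called for every request the crawler processes.
    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # ParselCrawler exposes the parsed page as `context.selector`.
        title = context.selector.css('title::text').get()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```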

## Actor with BeautifulSoupCrawler

The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is ideal for extracting data from static HTML pages. It uses [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing and [`ImpitHttpClient`](https://crawlee.dev/python/api/class/ImpitHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) is a great choice for your scraping tasks. Below is an example of how to use it in an Apify Actor.

<CodeBlock className="language-python">
    {CrawleeBeautifulSoupExample}
</CodeBlock>

## Actor with ParselCrawler

The [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) works in the same way as [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), but it uses the [Parsel](https://parsel.readthedocs.io/en/latest/) library for HTML parsing. This allows more powerful and flexible data extraction using [XPath](https://en.wikipedia.org/wiki/XPath) expressions in addition to CSS selectors, and it is generally faster than [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Below is an example of how to use [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler) in an Apify Actor.

<CodeBlock className="language-python">
    {CrawleeParselExample}
</CodeBlock>

## Actor with PlaywrightCrawler

The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) in an Apify Actor.

<CodeBlock className="language-python">
    {CrawleePlaywrightExample}
</CodeBlock>

## Conclusion

In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
File renamed without changes.
@@ -0,0 +1,89 @@
from urllib.parse import urljoin

import impit
import parsel

from apify import Actor, Request


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        max_depth = actor_input.get('max_depth', 1)

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs with an initial crawl depth of 0.
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing {url} ...')
            new_request = Request.from_url(url, user_data={'depth': 0})
            await request_queue.add_request(new_request)

        # Create an Impit client to fetch the HTML content of the URLs.
        async with impit.AsyncClient() as client:
            # Process the URLs from the request queue.
            while request := await request_queue.fetch_next_request():
                url = request.url

                if not isinstance(request.user_data['depth'], (str, int)):
                    raise TypeError('Request.depth is an unexpected type.')

                depth = int(request.user_data['depth'])
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fetch the HTTP response from the specified URL using Impit.
                    response = await client.get(url)

                    # Parse the HTML content using Parsel Selector.
                    selector = parsel.Selector(text=response.text)

                    # If the current depth is less than max_depth, find nested links
                    # and enqueue them.
                    if depth < max_depth:
                        # Extract all links using a CSS selector.
                        links = selector.css('a::attr(href)').getall()
                        for link_href in links:
                            link_url = urljoin(url, link_href)

                            if link_url.startswith(('http://', 'https://')):
                                Actor.log.info(f'Enqueuing {link_url} ...')
                                new_request = Request.from_url(
                                    link_url,
                                    user_data={'depth': depth + 1},
                                )
                                await request_queue.add_request(new_request)

                    # Extract the desired data using Parsel selectors.
                    title = selector.css('title::text').get()
                    h1s = selector.css('h1::text').getall()
                    h2s = selector.css('h2::text').getall()
                    h3s = selector.css('h3::text').getall()

                    data = {
                        'url': url,
                        'title': title,
                        'h1s': h1s,
                        'h2s': h2s,
                        'h3s': h3s,
                    }

                    # Store the extracted data to the default dataset.
                    await Actor.push_data(data)

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    # Mark the request as handled to ensure it is not processed again.
                    await request_queue.mark_request_as_handled(request)
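
For context, the `main()` coroutine above is typically launched from the Actor's entry point; in the Apify Python templates that is a small `__main__.py` roughly like the following (the exact file layout is an assumption based on the templates):

```python
# src/__main__.py (assumed layout, following the Apify Python templates)
import asyncio

from .main import main

asyncio.run(main())
```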
2 changes: 0 additions & 2 deletions
docs/02_guides/code/03_playwright.py → docs/03_guides/code/03_playwright.py
2 changes: 0 additions & 2 deletions
docs/02_guides/code/04_selenium.py → docs/03_guides/code/04_selenium.py
@@ -1,5 +1,3 @@
-from __future__ import annotations
-
 import asyncio
 from urllib.parse import urljoin
2 changes: 0 additions & 2 deletions
...2_guides/code/02_crawlee_beautifulsoup.py → ...3_guides/code/05_crawlee_beautifulsoup.py