
Commit a895901

add AdaptivePlaywrightCrawler example

1 parent 7a9e092 commit a895901

File tree

5 files changed: +89 −45 lines changed
Lines changed: 38 additions & 0 deletions

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlerStatisticState,
    AdaptivePlaywrightCrawlingContext,
)
from crawlee.statistics import Statistics

from .selectolax_parser import SelectolaxLexborParser


async def main() -> None:
    crawler: AdaptivePlaywrightCrawler = AdaptivePlaywrightCrawler(
        max_requests_per_crawl=10,
        # Use custom Selectolax parser for static content parsing.
        static_parser=SelectolaxLexborParser(),
        # Set up statistics with AdaptivePlaywrightCrawlerStatisticState.
        statistics=Statistics(state_model=AdaptivePlaywrightCrawlerStatisticState),
    )

    @crawler.router.default_handler
    async def handle_request(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        data = {
            'url': context.request.url,
            'title': await context.query_selector_one('title'),
        }

        await context.push_data(data)

        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
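The handler above extracts the page title with `context.query_selector_one('title')`. As a rough, stdlib-only illustration of what that lookup amounts to (a hypothetical helper, not part of Crawlee or selectolax):

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text content of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only capture the first <title> encountered.
        if tag == 'title' and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or '') + data


extractor = TitleExtractor()
extractor.feed('<html><head><title>Crawlee</title></head><body></body></html>')
print(extractor.title)  # Crawlee
```

In the real example, the adaptive context performs this against either the static parser's tree or the live Playwright page, depending on which rendering path was chosen.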

docs/guides/code_examples/crawler_custom_parser/selectolax_context.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -14,6 +14,8 @@ class SelectolaxLexborContext(ParsedHttpCrawlingContext[LexborHTMLParser]):
     context methods (push_data, enqueue_links, etc.) plus custom helpers.
     """

+    # It is only for convenience and not strictly necessary, as the
+    # parsed_content field is already available from the base class.
    @property
    def parser(self) -> LexborHTMLParser:
        """Convenient alias for accessing the parsed document."""
```

docs/guides/crawler_custom_parser.mdx

Lines changed: 44 additions & 45 deletions

```diff
@@ -20,6 +20,7 @@ import SelectolaxParserSource from '!!raw-loader!./code_examples/crawler_custom_
 import SelectolaxContextSource from '!!raw-loader!./code_examples/crawler_custom_parser/selectolax_context.py';
 import SelectolaxCrawlerSource from '!!raw-loader!./code_examples/crawler_custom_parser/selectolax_crawler.py';
 import SelectolaxCrawlerRunSource from '!!raw-loader!./code_examples/crawler_custom_parser/selectolax_crawler_run.py';
+import AdaptiveCrawlerRunSource from '!!raw-loader!./code_examples/crawler_custom_parser/selectolax_adaptive_run.py';

 Crawlee provides <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> and <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> as built-in solutions for HTML parsing. However, you may want to use a different parsing library that better fits your specific needs.

@@ -32,47 +33,35 @@ There are two approaches to integrate a custom parser:

 The <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> gives you direct access to raw HTTP responses, allowing you to integrate any parsing library of your choice. When using this approach, helpers like <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> and <ApiLink to="class/ExtractLinksFunction">`extract_links`</ApiLink> are not available, and it requires minimal setup.

-The following sections demonstrate how to use various parsing libraries with <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> to extract data from a page and enqueue discovered links for further crawling.
-
-### lxml
-
-[lxml](https://lxml.de/) is a high-performance XML and HTML parser that provides Python bindings to the C libraries libxml2 and libxslt. It supports XPath 1.0, XSLT 1.0, and EXSLT extensions for element selection. The `make_links_absolute` method is particularly useful for converting relative URLs to absolute ones before link extraction.
-
-<RunnableCodeBlock className="language-python" language="python">
-{LxmlParser}
-</RunnableCodeBlock>
-
-### lxml with SaxonC-HE
-
-Using [SaxonC-HE](https://pypi.org/project/saxonche/) together with lxml enables XPath 3.1 support, which provides advanced features like `distinct-values()` function and more powerful string manipulation. In this setup, lxml converts HTML to well-formed XML that SaxonC-HE can process.
-
-<RunnableCodeBlock className="language-python" language="python">
-{LxmlSaxoncheParser}
-</RunnableCodeBlock>
-
-### selectolax
-
-[selectolax](https://github.com/rushter/selectolax) is a fast HTML parser that offers two backends: the default `Modest` engine and `Lexbor`. It provides a simple API with CSS selector support. The example below uses the `Lexbor` backend for optimal performance.
-
-<RunnableCodeBlock className="language-python" language="python">
-{LexborParser}
-</RunnableCodeBlock>
-
-### PyQuery
-
-[PyQuery](https://pyquery.readthedocs.io/) brings jQuery-like syntax to Python for HTML manipulation. Built on top of `lxml`, it combines familiar jQuery CSS selectors with Python's ease of use. This is a good choice if you're comfortable with jQuery syntax and want a straightforward API for DOM traversal and manipulation.
-
-<RunnableCodeBlock className="language-python" language="python">
-{PyqueryParser}
-</RunnableCodeBlock>
-
-### Scrapling
-
-[Scrapling](https://github.com/D4Vinci/Scrapling) is a scraping library that provides both CSS selectors and XPath 1.0. It offers automatic text extraction and a Scrapy/BeautifulSoup-like API with pseudo-elements support similar to Parsel.
-
-<RunnableCodeBlock className="language-python" language="python">
-{ScraplingParser}
-</RunnableCodeBlock>
+The following examples demonstrate integration with various parsing libraries: [lxml](https://lxml.de/) for high-performance XPath 1.0 parsing, [lxml with SaxonC-HE](https://pypi.org/project/saxonche/) for XPath 3.1 support, [selectolax](https://github.com/rushter/selectolax) for fast CSS selector-based parsing, [PyQuery](https://pyquery.readthedocs.io/) for jQuery-like syntax, and [scrapling](https://github.com/D4Vinci/Scrapling) for CSS and XPath selectors with Scrapy/Parsel-like API and BeautifulSoup-style find methods.
+
+<Tabs groupId="custom_parsers">
+  <TabItem value="lxml" label="lxml">
+    <RunnableCodeBlock className="language-python" language="python">
+      {LxmlParser}
+    </RunnableCodeBlock>
+  </TabItem>
+  <TabItem value="saxonche" label="lxml with SaxonC-HE">
+    <RunnableCodeBlock className="language-python" language="python">
+      {LxmlSaxoncheParser}
+    </RunnableCodeBlock>
+  </TabItem>
+  <TabItem value="selectolax" label="selectolax">
+    <RunnableCodeBlock className="language-python" language="python">
+      {LexborParser}
+    </RunnableCodeBlock>
+  </TabItem>
+  <TabItem value="pyquery" label="PyQuery">
+    <RunnableCodeBlock className="language-python" language="python">
+      {PyqueryParser}
+    </RunnableCodeBlock>
+  </TabItem>
+  <TabItem value="scrapling" label="Scrapling">
+    <RunnableCodeBlock className="language-python" language="python">
+      {ScraplingParser}
+    </RunnableCodeBlock>
+  </TabItem>
+</Tabs>

 ## Creating a custom crawler

@@ -106,11 +95,21 @@ The crawler class connects the parser and context. Extend <ApiLink to="class/Abs

 ### Using the crawler

-The custom crawler works like any built-in crawler. Request handlers receive your custom context with full access to framework helpers like <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink>:
+The custom crawler works like any built-in crawler. Request handlers receive your custom context with full access to framework helpers like <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink>. Additionally, the custom parser can be used with <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> for adaptive crawling:
+
+<Tabs groupId="crawlers">
+  <TabItem value="selectolax_crawler" label="SelectolaxCrawler">
+    <CodeBlock className="language-python" language="python">
+      {SelectolaxCrawlerRunSource}
+    </CodeBlock>
+  </TabItem>
+  <TabItem value="adaptive_playwright_crawler" label="AdaptivePlaywrightCrawler with SelectolaxParser">
+    <CodeBlock className="language-python" language="python">
+      {AdaptiveCrawlerRunSource}
+    </CodeBlock>
+  </TabItem>
+</Tabs>

-<CodeBlock className="language-python" language="python">
-{SelectolaxCrawlerRunSource}
-</CodeBlock>

 ## Conclusion
```

src/crawlee/crawlers/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -23,12 +23,14 @@
     'AdaptivePlaywrightCrawler',
     'AdaptivePlaywrightCrawlingContext',
     'AdaptivePlaywrightPreNavCrawlingContext',
+    'AdaptivePlaywrightCrawlerStatisticState',
     'RenderingType',
     'RenderingTypePrediction',
     'RenderingTypePredictor',
 ):
     from ._adaptive_playwright import (
         AdaptivePlaywrightCrawler,
+        AdaptivePlaywrightCrawlerStatisticState,
         AdaptivePlaywrightCrawlingContext,
         AdaptivePlaywrightPreNavCrawlingContext,
         RenderingType,
```
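These re-exports sit behind a guard so that names from optional extras stay importable without eagerly loading heavy dependencies. A related stdlib-only technique is PEP 562 module-level `__getattr__`, shown here as a generic sketch (this is not Crawlee's `_try_import` implementation; `math.sqrt` just stands in for a lazily resolved export):

```python
import importlib
import sys

# Map public names to the module that provides them; resolution is deferred
# until the attribute is first accessed (PEP 562 module __getattr__).
_LAZY_EXPORTS = {'sqrt': 'math'}


def __getattr__(name):
    if name in _LAZY_EXPORTS:
        module = importlib.import_module(_LAZY_EXPORTS[name])
        value = getattr(module, name)
        globals()[name] = value  # cache so __getattr__ is not called again
        return value
    raise AttributeError(f'module {__name__!r} has no attribute {name!r}')


# Attribute access on the module object triggers the hook.
this_module = sys.modules[__name__]
print(this_module.sqrt(16.0))  # 4.0
```

The payoff is the same as in the diff: importing the package stays cheap, and an ImportError for a missing optional dependency surfaces only when the guarded name is actually used.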

src/crawlee/crawlers/_adaptive_playwright/__init__.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -15,9 +15,12 @@
 from ._rendering_type_predictor import RenderingType, RenderingTypePrediction, RenderingTypePredictor
 with _try_import(__name__, 'BeautifulSoupCrawlingContext'):
     from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawler
+with _try_import(__name__, 'AdaptivePlaywrightCrawlerStatisticState'):
+    from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawlerStatisticState

 __all__ = [
     'AdaptivePlaywrightCrawler',
+    'AdaptivePlaywrightCrawlerStatisticState',
     'AdaptivePlaywrightCrawlingContext',
     'AdaptivePlaywrightPreNavCrawlingContext',
     'RenderingType',
```
