Commit 6394389

Pijukatel and janbuchar authored and committed
feat: Add adaptive context helpers (apify#964)
### Description

Add adaptive context helpers and documentation for AdaptivePlaywrightCrawler.

### Issues

- Closes: apify#249

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Jan Buchar <[email protected]>
1 parent 32f222c commit 6394389

22 files changed, +684 -69 lines changed

docs/examples/code/adaptive_playwright_crawler.py

Lines changed: 19 additions & 15 deletions
@@ -1,4 +1,5 @@
 import asyncio
+from datetime import timedelta

 from playwright.async_api import Route

@@ -10,40 +11,43 @@


 async def main() -> None:
+    # Crawler created by following factory method will use `beautifulsoup`
+    # for parsing static content.
     crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
         max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False}
     )

-    @crawler.router.handler(label='label')
+    @crawler.router.default_handler
     async def request_handler_for_label(
         context: AdaptivePlaywrightCrawlingContext,
     ) -> None:
-        # Do some processing using `page`
-        some_locator = context.page.locator('div').first
-        await some_locator.wait_for()
-        # Do stuff with locator...
-        context.log.info(f'Playwright processing of: {context.request.url} ...')
-
-    @crawler.router.default_handler
-    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
-        context.log.info(f'User handler processing: {context.request.url} ...')
         # Do some processing using `parsed_content`
         context.log.info(context.parsed_content.title)

+        # Locate element h2 within 5 seconds
+        h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
+        # Do stuff with element found by the selector
+        context.log.info(h2)
+
         # Find more links and enqueue them.
         await context.enqueue_links()
-        await context.push_data({'Top crawler Url': context.request.url})
+        # Save some data.
+        await context.push_data({'Visited url': context.request.url})

     @crawler.pre_navigation_hook
     async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
-        """Hook executed both in static sub crawler and playwright sub crawler."""
-        # Trying to access context.page in this hook would raise `AdaptiveContextError`
-        # for pages crawled without playwright.
+        """Hook executed both in static sub crawler and playwright sub crawler.
+
+        Trying to access `context.page` in this hook would raise `AdaptiveContextError`
+        for pages crawled without playwright."""
         context.log.info(f'pre navigation hook for: {context.request.url} ...')

     @crawler.pre_navigation_hook(playwright_only=True)
     async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
-        """Hook executed only in playwright sub crawler."""
+        """Hook executed only in playwright sub crawler.
+
+        It is safe to access `page` object.
+        """

         async def some_routing_function(route: Route) -> None:
             await route.continue_()
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+---
+id: adaptive-playwright-crawler
+title: AdaptivePlaywrightCrawler
+---
+
+import ApiLink from '@site/src/components/ApiLink';
+import CodeBlock from '@theme/CodeBlock';
+
+import AdaptivePlaywrightCrawlerExample from '!!raw-loader!./code/adaptive_playwright_crawler.py';
+
+This example demonstrates how to use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and an HTTP-based crawler, such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that doing so may bring a performance benefit.
+
+A [pre-navigation hook](/python/docs/guides/adaptive-playwright-crawler#page-configuration-with-pre-navigation-hooks) can be used to perform actions before navigating to the URL. This hook provides further flexibility in controlling the environment and preparing for navigation. Hooks are executed both for pages crawled by the HTTP-based sub crawler and for pages crawled by the Playwright-based sub crawler. Use `playwright_only=True` to mark hooks that should be executed only for the Playwright sub crawler.
+
+For a more detailed description, please see the [AdaptivePlaywrightCrawler guide](/python/docs/guides/adaptive-playwright-crawler 'AdaptivePlaywrightCrawler guide').
+
+<CodeBlock className="language-python">
+    {AdaptivePlaywrightCrawlerExample}
+</CodeBlock>
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+from datetime import timedelta
+
+from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
+
+crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()
+
+
+@crawler.router.default_handler
+async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
+    # Locate element h2 within 5 seconds
+    h2 = await context.query_selector_one('h2', timedelta(milliseconds=5000))
+    # Do stuff with element found by the selector
+    context.log.info(h2)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+from crawlee.crawlers import AdaptivePlaywrightCrawler
+
+crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
+    # Arguments relevant only for PlaywrightCrawler
+    playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
+    # Arguments relevant only for BeautifulSoupCrawler
+    static_crawler_specific_kwargs={'additional_http_error_status_codes': [204]},
+    # Common arguments relevant to all crawlers
+    max_crawl_depth=5,
+)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+from crawlee.crawlers import AdaptivePlaywrightCrawler
+
+crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
+    # Arguments relevant only for PlaywrightCrawler
+    playwright_crawler_specific_kwargs={'headless': False, 'browser_type': 'chromium'},
+    # Arguments relevant only for ParselCrawler
+    static_crawler_specific_kwargs={'additional_http_error_status_codes': [204]},
+    # Common arguments relevant to all crawlers
+    max_crawl_depth=5,
+)
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+from crawlee import Request
+from crawlee._types import RequestHandlerRunResult
+from crawlee.crawlers import (
+    AdaptivePlaywrightCrawler,
+    RenderingType,
+    RenderingTypePrediction,
+    RenderingTypePredictor,
+)
+
+
+class CustomRenderingTypePredictor(RenderingTypePredictor):
+    def __init__(self) -> None:
+        self._learning_data = list[tuple[Request, RenderingType]]()
+
+    def predict(self, request: Request) -> RenderingTypePrediction:
+        # Some custom logic that produces some `RenderingTypePrediction`
+        # based on the `request` input.
+        rendering_type: RenderingType = (
+            'static' if 'abc' in request.url else 'client only'
+        )
+
+        return RenderingTypePrediction(
+            # Recommends `static` rendering type -> HTTP-based sub crawler will be used.
+            rendering_type=rendering_type,
+            # Recommends that both sub crawlers should run with 20% chance. When both sub
+            # crawlers are running, the predictor can compare results and learn.
+            # High number means that predictor is not very confident about the
+            # `rendering_type`, low number means that predictor is very confident.
+            detection_probability_recommendation=0.2,
+        )
+
+    def store_result(self, request: Request, rendering_type: RenderingType) -> None:
+        # This function allows predictor to store new learning data and retrain itself
+        # if needed. `request` is input for prediction and `rendering_type` is the correct
+        # prediction.
+        self._learning_data.append((request, rendering_type))
+        # retrain
+
+
+def result_checker(result: RequestHandlerRunResult) -> bool:
+    # Some function that inspects produced `result` and returns `True` if the result
+    # is correct.
+    return bool(result)  # Check something on result
+
+
+def result_comparator(
+    result_1: RequestHandlerRunResult, result_2: RequestHandlerRunResult
+) -> bool:
+    # Some function that inspects two results and returns `True` if they are
+    # considered equivalent. It is used when comparing results produced by HTTP-based
+    # sub crawler and playwright based sub crawler.
+    return (
+        result_1.push_data_calls == result_2.push_data_calls
+    )  # For example compare `push_data` calls.
+
+
+crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
+    rendering_type_predictor=CustomRenderingTypePredictor(),
+    result_checker=result_checker,
+    result_comparator=result_comparator,
+)
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+from playwright.async_api import Route
+
+from crawlee.crawlers import (
+    AdaptivePlaywrightCrawler,
+    AdaptivePlaywrightPreNavCrawlingContext,
+)
+
+crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()
+
+
+@crawler.pre_navigation_hook
+async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
+    """Hook executed both in static sub crawler and playwright sub crawler.
+
+    Trying to access `context.page` in this hook would raise `AdaptiveContextError`
+    for pages crawled without playwright."""
+
+    context.log.info(f'pre navigation hook for: {context.request.url}')
+
+
+@crawler.pre_navigation_hook(playwright_only=True)
+async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None:
+    """Hook executed only in playwright sub crawler."""
+
+    async def some_routing_function(route: Route) -> None:
+        await route.continue_()
+
+    await context.page.route('*/**', some_routing_function)
+    context.log.info(f'Playwright only pre navigation hook for: {context.request.url}')
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+---
+id: adaptive-playwright-crawler
+title: AdaptivePlaywrightCrawler
+description: How to use the AdaptivePlaywrightCrawler.
+---
+
+import ApiLink from '@site/src/components/ApiLink';
+import CodeBlock from '@theme/CodeBlock';
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+import AdaptivePlaywrightCrawlerInitBeautifulSoup from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_beautifulsoup.py';
+import AdaptivePlaywrightCrawlerInitParsel from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_parsel.py';
+import AdaptivePlaywrightCrawlerInitPrediction from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_prediction.py';
+import AdaptivePlaywrightCrawlerHandler from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_handler.py';
+import AdaptivePlaywrightCrawlerPreNavHooks from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_pre_nav_hooks.py';
+
+
+
+An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and an HTTP-based crawler, such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+It uses a more limited crawling context interface so that it is able to switch to HTTP-only crawling when it detects that doing so may bring a performance benefit.
+
+Detection is based on the <ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink>, whose default implementation is <ApiLink to="class/DefaultRenderingTypePredictor">`DefaultRenderingTypePredictor`</ApiLink>. It predicts which crawling method should be used and learns from already crawled pages.
+
+## When to use AdaptivePlaywrightCrawler
+
+Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, but for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.
+
+Another use case is performing selector-based data extraction without prior knowledge of whether the selector exists in the static page or is dynamically added by code executed in a browsing client.
+
+## Request handler and adaptive context helpers
+
+The request handler for <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> works on a special context type, <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink>. This context is sometimes created by the HTTP-based sub crawler and sometimes by the Playwright-based sub crawler. Due to its dynamic nature, you can't always access the [page](https://playwright.dev/python/docs/api/class-page) object. To overcome this limitation, there are four helper methods on this context that can be called regardless of how the context was created.
+
+<ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> accepts a CSS selector as its first argument and a timeout as its second argument. The function tries to locate the selector and returns once it is found (within the timeout). In practice, this means that if the HTTP-based sub crawler was used, the function finds the selector only if it is part of the static content. If not, the adaptive crawler falls back to the Playwright sub crawler and tries to locate the selector within the timeout using Playwright.
+
+<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink> accepts a CSS selector as its first argument and a timeout as its second argument. This function acts similarly to `wait_for_selector`, but it also returns one matching element, if any is found. The return value type is determined by the HTTP-based sub crawler in use. For example, it will be `Selector` for <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and `Tag` for <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+
+<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_all">`query_selector_all`</ApiLink> works the same as <ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink>, but returns all matching elements.
+
+<ApiLink to="class/AdaptivePlaywrightCrawlingContext#parse_with_static_parser">`parse_with_static_parser`</ApiLink> re-parses the whole page. The return value type is determined by the HTTP-based sub crawler in use. It has two optional arguments: `selector` and `timeout`. If they are used, the function first calls <ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> and then does the parsing. This is useful when a specific element signals that the page is already complete.
+
+See the following example of how to create a request handler and use the context helpers:
+
+<CodeBlock className="language-python">
+    {AdaptivePlaywrightCrawlerHandler}
+</CodeBlock>
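
As a rough mental model of the fallback described for `wait_for_selector`, the sketch below checks static HTML first and only then awaits a slower, browser-like renderer. This is not how crawlee implements it: `found_in_static_html`, `wait_for_tag`, and `fake_browser_render` are hypothetical names, the lookup matches plain tag names rather than CSS selectors, and only the Python standard library is used.

```python
import asyncio
from datetime import timedelta
from html.parser import HTMLParser
from typing import Awaitable, Callable


class _TagFinder(HTMLParser):
    """Records whether a given tag name occurs in the fed HTML."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self._tag = tag.lower()
        self.found = False

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == self._tag:
            self.found = True


def found_in_static_html(html: str, tag: str) -> bool:
    finder = _TagFinder(tag)
    finder.feed(html)
    return finder.found


async def wait_for_tag(
    static_html: str,
    tag: str,
    timeout: timedelta,
    render_in_browser: Callable[[], Awaitable[str]],
) -> str:
    """Return which path located the tag: 'static' or 'browser'."""
    if found_in_static_html(static_html, tag):
        return 'static'
    # Hypothetical fallback: render the page and look again, bounded by the timeout.
    rendered_html = await asyncio.wait_for(render_in_browser(), timeout.total_seconds())
    if found_in_static_html(rendered_html, tag):
        return 'browser'
    raise LookupError(f'<{tag}> not found in static or rendered HTML')


async def fake_browser_render() -> str:
    # Stand-in for a real browser: pretends JavaScript added an <h2> element.
    return '<html><body><h2>Rendered title</h2></body></html>'


static_hit = asyncio.run(
    wait_for_tag('<h2>Static title</h2>', 'h2', timedelta(seconds=5), fake_browser_render)
)
browser_hit = asyncio.run(
    wait_for_tag('<p>No heading yet</p>', 'h2', timedelta(seconds=5), fake_browser_render)
)
print(static_hit, browser_hit)
```

The cheap static check runs first, so the expensive path is only taken when the element is missing from the static content, which mirrors the behavior the helper descriptions attribute to the adaptive context.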
+
+
+## Crawler configuration
+To use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, it is recommended to use one of the prepared factory methods, which create the crawler with a specific HTTP-based sub crawler variant: <ApiLink to="class/AdaptivePlaywrightCrawler#with_beautifulsoup_static_parser">`AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser`</ApiLink> or <ApiLink to="class/AdaptivePlaywrightCrawler#with_parsel_static_parser">`AdaptivePlaywrightCrawler.with_parsel_static_parser`</ApiLink>.
+
+<ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is internally composed of two sub crawlers, and you can configure both of them in detail. For the detailed configuration options of the sub crawlers, please refer to their pages: <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
+
+The following example shows how to create and configure <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> with two different HTTP-based sub crawlers:
+
+
+<Tabs>
+    <TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler" default>
+        <CodeBlock className="language-python">
+            {AdaptivePlaywrightCrawlerInitBeautifulSoup}
+        </CodeBlock>
+    </TabItem>
+    <TabItem value="ParselCrawler" label="ParselCrawler">
+        <CodeBlock className="language-python">
+            {AdaptivePlaywrightCrawlerInitParsel}
+        </CodeBlock>
+    </TabItem>
+</Tabs>
+
+### Prediction related arguments
+
+To control which pages are crawled by which method, you can use the following arguments:
+
+`rendering_type_predictor` - An instance of <ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink> that gives recommendations about which sub crawler should be used for a specific URL. The predictor will also recommend using both sub crawlers for some pages from time to time, to check that the given recommendation was correct. The predictor should be able to learn from previous results and gradually give more reliable recommendations.
+
+`result_checker` - A function that checks the result created from crawling a page. By default, it always returns `True`.
+
+`result_comparator` - A function that compares two results (the HTTP-based sub crawler result and the Playwright-based sub crawler result) and returns `True` if they are considered the same. By default, this function compares the calls of the `push_data` context helper made by each sub crawler. This function is used by `rendering_type_predictor` to evaluate whether the HTTP-based sub crawler has the same results as the Playwright-based sub crawler.
+
+See the following example of how to pass prediction-related arguments:
+
+<CodeBlock className="language-python">
+    {AdaptivePlaywrightCrawlerInitPrediction}
+</CodeBlock>
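
To make the "learn from previous results" idea more concrete, here is a minimal standard-library sketch of one way a predictor could fill in its retraining step. Everything in it is an assumption made for illustration: `HostFrequencyPredictor` is a hypothetical class that does not subclass `RenderingTypePredictor`, it works on plain URL strings instead of `Request` objects, and the scoring formula is invented. The default predictor may work quite differently.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


class HostFrequencyPredictor:
    """Toy predictor: remembers which rendering type worked for each host."""

    def __init__(self) -> None:
        self._counts: dict = defaultdict(Counter)

    def predict(self, url: str) -> tuple:
        counts = self._counts.get(urlparse(url).netloc, Counter())
        total = sum(counts.values())
        if total == 0:
            # Nothing learned yet: assume browser rendering and always verify.
            return 'client only', 1.0
        rendering_type, hits = counts.most_common(1)[0]
        # Agreement among stored results, scaled by how much data exists (capped at 10).
        confidence = (hits / total) * (min(total, 10) / 10)
        # Keep a small floor so both sub crawlers are still compared occasionally.
        return rendering_type, max(0.1, 1.0 - confidence)

    def store_result(self, url: str, rendering_type: str) -> None:
        self._counts[urlparse(url).netloc][rendering_type] += 1


predictor = HostFrequencyPredictor()
first_guess = predictor.predict('https://example.com/a')
for _ in range(10):
    predictor.store_result('https://example.com/a', 'static')
learned_guess = predictor.predict('https://example.com/b')
print(first_guess, learned_guess)
```

The second element of each returned tuple plays the role of `detection_probability_recommendation`: it starts high while the predictor knows nothing and drops as consistent results accumulate for a host.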
+
+
+
+## Page configuration with pre-navigation hooks
+In some use cases, you may need to configure the [page](https://playwright.dev/python/docs/api/class-page) before it navigates to the target URL. For instance, you might set navigation timeouts or manipulate other page-level settings. For such cases, you can use the <ApiLink to="class/AdaptivePlaywrightCrawler#pre_navigation_hook">`pre_navigation_hook`</ApiLink> method of the <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. Hooks registered with this method are called before the page navigates to the target URL and allow you to configure the page instance. Due to the dynamic nature of <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, a hook may be executed for either the HTTP-based sub crawler or the Playwright-based sub crawler. Using the [page](https://playwright.dev/python/docs/api/class-page) object in a hook that is executed for the HTTP-based sub crawler will raise an exception. To avoid this, pass the optional argument `playwright_only=True` when registering the hook.
+
+See the following example of how to register the pre-navigation hooks:
+
+<CodeBlock className="language-python">
+    {AdaptivePlaywrightCrawlerPreNavHooks}
+</CodeBlock>
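
The two registration forms, bare `@crawler.pre_navigation_hook` and parametrized `@crawler.pre_navigation_hook(playwright_only=True)`, can be supported by a single decorator. The sketch below shows one common way to do that and how a `playwright_only` flag could filter hooks at run time. It is only an illustration under stated assumptions: `HookRegistry` and `run_hooks` are hypothetical names, the hooks are synchronous for brevity, and contexts are plain dictionaries rather than crawlee context objects.

```python
from typing import Callable, Optional

Hook = Callable[[dict], None]


class HookRegistry:
    """Hypothetical stand-in for the crawler's hook registration."""

    def __init__(self) -> None:
        self._common_hooks: list = []
        self._playwright_only_hooks: list = []

    def pre_navigation_hook(
        self, hook: Optional[Hook] = None, *, playwright_only: bool = False
    ):
        def register(fn: Hook) -> Hook:
            target = self._playwright_only_hooks if playwright_only else self._common_hooks
            target.append(fn)
            return fn

        # Bare usage passes the function directly; parametrized usage passes nothing yet.
        return register if hook is None else register(hook)

    def run_hooks(self, context: dict, *, using_playwright: bool) -> None:
        hooks = list(self._common_hooks)
        if using_playwright:
            hooks += self._playwright_only_hooks
        for hook in hooks:
            hook(context)


registry = HookRegistry()
calls = []


@registry.pre_navigation_hook
def log_url(context: dict) -> None:
    calls.append(('common', context['url']))


@registry.pre_navigation_hook(playwright_only=True)
def configure_page(context: dict) -> None:
    # Only reached when a page exists, so accessing it is safe here.
    calls.append(('playwright', context['page']))


registry.run_hooks({'url': 'https://example.com'}, using_playwright=False)
static_calls = list(calls)
calls.clear()
registry.run_hooks({'url': 'https://example.com', 'page': 'fake-page'}, using_playwright=True)
playwright_calls = list(calls)
```

Keeping the Playwright-only hooks in a separate list means a static run never calls a hook that expects a `page`, which is the failure mode the `playwright_only` argument is meant to prevent.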
