
Commit 0f23205

Pijukatel and janbuchar authored
feat: Add pre-navigation hooks router to AbstractHttpCrawler (#791)
### Description

This makes it possible for users to register their pre-navigation hooks for HTTP-based crawlers. Add tests. Update docs.

### Issues

- Closes: #635

Co-authored-by: Jan Buchar <[email protected]>
1 parent 2670635 commit 0f23205
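
A minimal usage sketch of the new API, assembled from the imports and calls visible in the diffs below. It assumes `HttpCrawler` is importable from `crawlee.http_crawler` alongside `HttpCrawlingContext`, and that `BasicCrawlingContext` exposes `request` and `log`; the hook body and URL are illustrative only:

import asyncio

from crawlee._types import BasicCrawlingContext
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # The new decorator registers a coroutine that runs before each request is sent.
    @crawler.pre_navigation_hook
    async def log_navigation(context: BasicCrawlingContext) -> None:
        # `context.log` is assumed here; any side effect (headers, sessions, logging) fits.
        context.log.info(f'About to fetch {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())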

6 files changed (+70 −4 lines)


docs/examples/beautifulsoup_crawler.mdx (1 addition, 1 deletion)

@@ -8,7 +8,7 @@ import CodeBlock from '@theme/CodeBlock';
 
 import BeautifulSoupExample from '!!raw-loader!./code/beautifulsoup_crawler.py';
 
-This example demonstrates how to use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it - the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code.
+This example demonstrates how to use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it - the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code. It also shows how you can add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before sending the request.
 
 <CodeBlock className="language-python">
   {BeautifulSoupExample}

docs/examples/code/beautifulsoup_crawler.py (7 additions, 0 deletions)

@@ -1,6 +1,7 @@
 import asyncio
 from datetime import timedelta
 
+from crawlee.basic_crawler import BasicCrawlingContext
 from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
 
 
@@ -39,6 +40,12 @@ async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
         # the data will be stored as JSON files in ./storage/datasets/default.
         await context.push_data(data)
 
+    # Register a pre-navigation hook, which will be called before each request.
+    # This hook is optional and does not need to be defined at all.
+    @crawler.pre_navigation_hook
+    async def some_hook(context: BasicCrawlingContext) -> None:
+        pass
+
     # Run the crawler with the initial list of URLs.
     await crawler.run(['https://crawlee.dev'])

docs/examples/code/parsel_crawler.py (7 additions, 0 deletions)

@@ -1,5 +1,6 @@
 import asyncio
 
+from crawlee.basic_crawler import BasicCrawlingContext
 from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext
 
 # Regex for identifying email addresses on a webpage.
@@ -30,6 +31,12 @@ async def request_handler(context: ParselCrawlingContext) -> None:
         # Enqueue all links found on the page.
         await context.enqueue_links()
 
+    # Register a pre-navigation hook, which will be called before each request.
+    # This hook is optional and does not need to be defined at all.
+    @crawler.pre_navigation_hook
+    async def some_hook(context: BasicCrawlingContext) -> None:
+        pass
+
     # Run the crawler with the initial list of URLs.
     await crawler.run(['https://github.com'])

docs/examples/parsel_crawler.mdx (1 addition, 1 deletion)

@@ -8,7 +8,7 @@ import CodeBlock from '@theme/CodeBlock';
 
 import ParselCrawlerExample from '!!raw-loader!./code/parsel_crawler.py';
 
-This example shows how to use <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using [Parsel](https://pypi.org/project/parsel/) library which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract page title, URL and emails found in the webpage. The default handler will scrape data from the current webpage and enqueue all the links found in the webpage for continuous scraping.
+This example shows how to use <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using [Parsel](https://pypi.org/project/parsel/) library which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract page title, URL and emails found in the webpage. The default handler will scrape data from the current webpage and enqueue all the links found in the webpage for continuous scraping. It also shows how you can add an optional pre-navigation hook to the crawler. Pre-navigation hooks are user-defined functions that execute before sending the request.
 
 <CodeBlock className="language-python">
   {ParselCrawlerExample}

src/crawlee/abstract_http_crawler/_abstract_http_crawler.py (19 additions, 2 deletions)

@@ -2,7 +2,7 @@
 
 import logging
 from abc import ABC
-from typing import TYPE_CHECKING, Any, Generic
+from typing import TYPE_CHECKING, Any, Callable, Generic
 
 from pydantic import ValidationError
 from typing_extensions import NotRequired, TypeVar
@@ -21,7 +21,7 @@
 from crawlee.http_clients import HttpxHttpClient
 
 if TYPE_CHECKING:
-    from collections.abc import AsyncGenerator, Iterable
+    from collections.abc import AsyncGenerator, Awaitable, Iterable
 
     from typing_extensions import Unpack
 
@@ -70,6 +70,7 @@ def __init__(
         **kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext]],
     ) -> None:
         self._parser = parser
+        self._pre_navigation_hooks: list[Callable[[BasicCrawlingContext], Awaitable[None]]] = []
 
         kwargs.setdefault(
             'http_client',
@@ -92,11 +93,19 @@ def _create_static_content_crawler_pipeline(self) -> ContextPipeline[ParsedHttpCrawlingContext[TParseResult]]:
         """Create static content crawler context pipeline with expected pipeline steps."""
         return (
             ContextPipeline()
+            .compose(self._execute_pre_navigation_hooks)
             .compose(self._make_http_request)
             .compose(self._parse_http_response)
             .compose(self._handle_blocked_request)
         )
 
+    async def _execute_pre_navigation_hooks(
+        self, context: BasicCrawlingContext
+    ) -> AsyncGenerator[BasicCrawlingContext, None]:
+        for hook in self._pre_navigation_hooks:
+            await hook(context)
+        yield context
+
     async def _parse_http_response(
         self, context: HttpCrawlingContext
     ) -> AsyncGenerator[ParsedHttpCrawlingContext[TParseResult], None]:
@@ -207,3 +216,11 @@ async def _handle_blocked_request(
         if blocked_info := self._parser.is_blocked(context.parsed_content):
             raise SessionError(blocked_info.reason)
         yield context
+
+    def pre_navigation_hook(self, hook: Callable[[BasicCrawlingContext], Awaitable[None]]) -> None:
+        """Register a hook to be called before each navigation.
+
+        Args:
+            hook: A coroutine function to be called before each navigation.
+        """
+        self._pre_navigation_hooks.append(hook)
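
The implementation above slots the hooks in as just another `ContextPipeline` step: each step is an async generator that does its work and then yields the context to the next step. A stripped-down, self-contained illustration of that pattern (toy code with a plain dict standing in for the crawling context, not the actual `ContextPipeline` implementation):

import asyncio
from collections.abc import AsyncGenerator, Awaitable, Callable

# A hook takes the (toy) context and performs side effects before navigation.
Hook = Callable[[dict], Awaitable[None]]


async def execute_pre_navigation_hooks(
    context: dict, hooks: list[Hook]
) -> AsyncGenerator[dict, None]:
    # Run every registered hook in registration order...
    for hook in hooks:
        await hook(context)
    # ...then hand the context on to the next pipeline step (the HTTP request).
    yield context


async def main() -> None:
    async def set_header(context: dict) -> None:
        context.setdefault('headers', {})['User-Agent'] = 'my-crawler'

    async for context in execute_pre_navigation_hooks({'url': 'https://example.com'}, [set_header]):
        print('next step receives:', context)


asyncio.run(main())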

tests/unit/http_crawler/test_http_crawler.py (35 additions, 0 deletions)

@@ -21,6 +21,7 @@
 
 from yarl import URL
 
+from crawlee._types import BasicCrawlingContext
 from crawlee.http_clients._base import BaseHttpClient
 from crawlee.http_crawler import HttpCrawlingContext
 
@@ -354,3 +355,37 @@ async def request_handler(context: HttpCrawlingContext) -> None:
 
     response_args = responses[0]['args']
     assert response_args == query_params, 'Reconstructed query params must match the original query params.'
+
+
+@respx.mock
+async def test_http_crawler_pre_navigation_hooks_executed_before_request() -> None:
+    """Test that pre-navigation hooks are executed in the correct order."""
+    execution_order = []
+    test_url = 'http://www.something.com'
+
+    crawler = HttpCrawler()
+
+    # Register the final context handler.
+    @crawler.router.default_handler
+    async def default_request_handler(context: HttpCrawlingContext) -> None:  # noqa: ARG001  # Unused arg in test
+        execution_order.append('final handler')
+
+    # Register the first pre-navigation hook.
+    @crawler.pre_navigation_hook
+    async def hook1(context: BasicCrawlingContext) -> None:  # noqa: ARG001  # Unused arg in test
+        execution_order.append('pre-navigation-hook 1')
+
+    # Register the second pre-navigation hook.
+    @crawler.pre_navigation_hook
+    async def hook2(context: BasicCrawlingContext) -> None:  # noqa: ARG001  # Unused arg in test
+        execution_order.append('pre-navigation-hook 2')
+
+    def mark_request_execution(request: Request) -> Response:  # noqa: ARG001  # Unused arg in test
+        # Helper function to track the execution order.
+        execution_order.append('request')
+        return Response(200)
+
+    respx.get(test_url).mock(side_effect=mark_request_execution)
+    await crawler.run([Request.from_url(url=test_url)])
+
+    assert execution_order == ['pre-navigation-hook 1', 'pre-navigation-hook 2', 'request', 'final handler']
