
Commit 8695998

Pijukatel, janbuchar and vdusek authored and committed
refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance (apify#746)
Reworked the inheritance of the HTTP-based crawlers. StaticContentCrawler is now the parent of BeautifulSoupCrawler, ParselCrawler and HttpCrawler. StaticContentCrawler is generic; the specific versions are parametrized by the type of parser used for parsing the HTTP response.

**Breaking change:** Renamed BeautifulSoupParser to BeautifulSoupParserType (it is just a string literal used to properly set up BeautifulSoup). The name BeautifulSoupParser is now used for the new class that implements the parser used by BeautifulSoupCrawler.

- Closes: [Reconsider crawler inheritance apify#350](apify#350)

---------

Co-authored-by: Jan Buchar <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
1 parent d676346 commit 8695998

21 files changed (+605 −508 lines)

docs/guides/http_crawlers.mdx

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
id: http-crawlers
title: HTTP crawlers
description: Crawlee supports multiple HTTP crawlers that can be used to extract data from server-rendered webpages.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

The generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is the parent of <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>, and it can also serve as the parent of your own crawler with custom content-parsing requirements.

It already includes almost all of the functionality needed to crawl webpages. The only missing pieces are the parser used to parse HTTP responses and a context dataclass that defines which context helpers are available to user handler functions.

## `BeautifulSoupCrawler`

<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes the result available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> through the `.soup` or `.parsed_content` attribute.

## `ParselCrawler`

<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes the result available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> through the `.selector` or `.parsed_content` attribute.

## `HttpCrawler`

<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink>, which does not parse the HTTP response at all. Use it when no parsing is required.

## Creating your own HTTP crawler

### Why?

When you want to use a custom parser for HTTP responses while the rest of the <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> functionality suits your needs.

### How?

You need to define at least two new classes and decide which type the parser's `parse` method will return.
The parser inherits from <ApiLink to="class/AbstractHttpParser">`AbstractHttpParser`</ApiLink> and has to implement all of its abstract methods.
The crawler inherits from <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> and has to implement all of its abstract methods.
The newly defined parser is then passed to the `parser` argument of the `AbstractHttpCrawler.__init__` method.

For a complete example, see one of our own HTTP-based crawlers mentioned above; a minimal sketch is also included below.
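A minimal sketch of the two classes described above, based only on the `AbstractHttpParser` and `AbstractHttpCrawler` interfaces added in this commit. The `JsonParser` / `JsonCrawler` names and the top-level `links` key are hypothetical, and the `_context_pipeline` wiring is an assumption inferred from the constructor check shown further below; the shipped crawlers compose an additional step that builds their specific context.

```python
from __future__ import annotations

import json
from typing import TYPE_CHECKING

from crawlee.abstract_http_crawler import (
    AbstractHttpCrawler,
    AbstractHttpParser,
    HttpCrawlerOptions,
    ParsedHttpCrawlingContext,
)

if TYPE_CHECKING:
    from collections.abc import Iterable

    from typing_extensions import Unpack

    from crawlee.http_clients import HttpResponse


class JsonParser(AbstractHttpParser[dict]):
    """Hypothetical parser that treats every response body as JSON."""

    async def parse(self, response: HttpResponse) -> dict:
        # Assumption: `HttpResponse.read()` returns the raw body bytes.
        return json.loads(response.read())

    def is_matching_selector(self, parsed_content: dict, selector: str) -> bool:
        # JSON has no CSS selectors, so the default blockage detection never matches.
        return False

    def find_links(self, parsed_content: dict, selector: str) -> Iterable[str]:
        # Hypothetical convention: links are listed under a top-level "links" key.
        return parsed_content.get('links', [])


class JsonCrawler(AbstractHttpCrawler[ParsedHttpCrawlingContext[dict], dict]):
    """Crawler that reuses the generic parsed context instead of defining its own."""

    def __init__(self, **kwargs: Unpack[HttpCrawlerOptions[ParsedHttpCrawlingContext[dict]]]) -> None:
        # The constructor requires a `_context_pipeline`; the prepared helper builds
        # the request -> parse -> blockage-detection steps shown in the diff below.
        kwargs['_context_pipeline'] = self._create_static_content_crawler_pipeline()
        super().__init__(parser=JsonParser(), **kwargs)
```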
src/crawlee/abstract_http_crawler/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
from ._abstract_http_crawler import AbstractHttpCrawler, HttpCrawlerOptions
from ._abstract_http_parser import AbstractHttpParser
from ._http_crawling_context import ParsedHttpCrawlingContext

__all__ = [
    'AbstractHttpCrawler',
    'AbstractHttpParser',
    'HttpCrawlerOptions',
    'ParsedHttpCrawlingContext',
]
src/crawlee/abstract_http_crawler/_abstract_http_crawler.py

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
from __future__ import annotations

import logging
from abc import ABC
from typing import TYPE_CHECKING, Any, Generic

from pydantic import ValidationError
from typing_extensions import NotRequired, TypeVar

from crawlee import EnqueueStrategy
from crawlee._request import BaseRequestData
from crawlee._utils.docs import docs_group
from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
from crawlee.abstract_http_crawler._http_crawling_context import (
    HttpCrawlingContext,
    ParsedHttpCrawlingContext,
    TParseResult,
)
from crawlee.basic_crawler import BasicCrawler, BasicCrawlerOptions, ContextPipeline
from crawlee.errors import SessionError
from crawlee.http_clients import HttpxHttpClient

if TYPE_CHECKING:
    from collections.abc import AsyncGenerator, Iterable

    from typing_extensions import Unpack

    from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction, EnqueueLinksKwargs
    from crawlee.abstract_http_crawler._abstract_http_parser import AbstractHttpParser

TCrawlingContext = TypeVar('TCrawlingContext', bound=ParsedHttpCrawlingContext)


@docs_group('Data structures')
class HttpCrawlerOptions(Generic[TCrawlingContext], BasicCrawlerOptions[TCrawlingContext]):
    """Arguments for the `AbstractHttpCrawler` constructor.

    It is intended for typing forwarded `__init__` arguments in the subclasses.
    """

    additional_http_error_status_codes: NotRequired[Iterable[int]]
    """Additional HTTP status codes to treat as errors, triggering automatic retries when encountered."""

    ignore_http_error_status_codes: NotRequired[Iterable[int]]
    """HTTP status codes typically considered errors but to be treated as successful responses."""


@docs_group('Abstract classes')
class AbstractHttpCrawler(Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext], ABC):
    """A web crawler for performing HTTP requests.

    The `AbstractHttpCrawler` builds on top of the `BasicCrawler`, which means it inherits all of its features.
    On top of that it implements HTTP communication using HTTP clients. The class allows integration with
    any HTTP client that implements the `BaseHttpClient` interface. The HTTP client is provided to the crawler
    as an input parameter to the constructor.

    `AbstractHttpCrawler` is a generic class. It is expected to be used together with a specific parser that
    parses the HTTP response and with the `TCrawlingContext` type that is made available to the user function.
    See the prepared specific versions of it, for example `BeautifulSoupCrawler`, `ParselCrawler` or `HttpCrawler`.

    HTTP client-based crawlers are ideal for websites that do not require JavaScript execution. However,
    if you need to execute client-side JavaScript, consider using a browser-based crawler like the `PlaywrightCrawler`.
    """

    def __init__(
        self,
        *,
        parser: AbstractHttpParser[TParseResult],
        additional_http_error_status_codes: Iterable[int] = (),
        ignore_http_error_status_codes: Iterable[int] = (),
        **kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext]],
    ) -> None:
        self._parser = parser

        kwargs.setdefault(
            'http_client',
            HttpxHttpClient(
                additional_http_error_status_codes=additional_http_error_status_codes,
                ignore_http_error_status_codes=ignore_http_error_status_codes,
            ),
        )

        if '_context_pipeline' not in kwargs:
            raise ValueError(
                'Please pass in a `_context_pipeline`. You should use the '
                'AbstractHttpCrawler._create_static_content_crawler_pipeline() method to initialize it.'
            )

        kwargs.setdefault('_logger', logging.getLogger(__name__))
        super().__init__(**kwargs)

    def _create_static_content_crawler_pipeline(self) -> ContextPipeline[ParsedHttpCrawlingContext[TParseResult]]:
        """Create a static content crawler context pipeline with the expected pipeline steps."""
        return (
            ContextPipeline()
            .compose(self._make_http_request)
            .compose(self._parse_http_response)
            .compose(self._handle_blocked_request)
        )

    async def _parse_http_response(
        self, context: HttpCrawlingContext
    ) -> AsyncGenerator[ParsedHttpCrawlingContext[TParseResult], None]:
        """Parse the HTTP response and create a context enhanced by the parsing result and an enqueue links function.

        Args:
            context: The current crawling context that includes the HTTP response.

        Yields:
            The original crawling context enhanced by the parsing result and the enqueue links function.
        """
        parsed_content = await self._parser.parse(context.http_response)
        yield ParsedHttpCrawlingContext.from_http_crawling_context(
            context=context,
            parsed_content=parsed_content,
            enqueue_links=self._create_enqueue_links_function(context, parsed_content),
        )

    def _create_enqueue_links_function(
        self, context: HttpCrawlingContext, parsed_content: TParseResult
    ) -> EnqueueLinksFunction:
        """Create a callback function for extracting links from parsed content and enqueuing them to the crawl.

        Args:
            context: The current crawling context.
            parsed_content: The parsed HTTP response.

        Returns:
            An async callback that extracts links from the parsed content and enqueues them to the crawl.
        """

        async def enqueue_links(
            *,
            selector: str = 'a',
            label: str | None = None,
            user_data: dict[str, Any] | None = None,
            **kwargs: Unpack[EnqueueLinksKwargs],
        ) -> None:
            kwargs.setdefault('strategy', EnqueueStrategy.SAME_HOSTNAME)

            requests = list[BaseRequestData]()
            user_data = user_data or {}
            if label is not None:
                user_data.setdefault('label', label)
            for link in self._parser.find_links(parsed_content, selector=selector):
                url = link
                if not is_url_absolute(url):
                    url = convert_to_absolute_url(context.request.url, url)
                try:
                    request = BaseRequestData.from_url(url, user_data=user_data)
                except ValidationError as exc:
                    context.log.debug(
                        f'Skipping URL "{url}" due to invalid format: {exc}. '
                        'This may be caused by a malformed URL or unsupported URL scheme. '
                        'Please ensure the URL is correct and retry.'
                    )
                    continue

                requests.append(request)

            await context.add_requests(requests, **kwargs)

        return enqueue_links

    async def _make_http_request(self, context: BasicCrawlingContext) -> AsyncGenerator[HttpCrawlingContext, None]:
        """Make an HTTP request and create a context enhanced by the HTTP response.

        Args:
            context: The current crawling context.

        Yields:
            The original crawling context enhanced by the HTTP response.
        """
        result = await self._http_client.crawl(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
            statistics=self._statistics,
        )

        yield HttpCrawlingContext.from_basic_crawling_context(context=context, http_response=result.http_response)

    async def _handle_blocked_request(
        self, context: ParsedHttpCrawlingContext[TParseResult]
    ) -> AsyncGenerator[ParsedHttpCrawlingContext[TParseResult], None]:
        """Try to detect if the request is blocked based on the HTTP status code or the parsed response content.

        Args:
            context: The current crawling context.

        Raises:
            SessionError: If the request is considered blocked.

        Yields:
            The original crawling context if no errors are detected.
        """
        if self._retry_on_blocked:
            status_code = context.http_response.status_code

            # TODO: refactor to avoid private member access
            # https://github.com/apify/crawlee-python/issues/708
            if (
                context.session
                and status_code not in self._http_client._ignore_http_error_status_codes  # noqa: SLF001
                and context.session.is_blocked_status_code(status_code=status_code)
            ):
                raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
            if blocked_info := self._parser.is_blocked(context.parsed_content):
                raise SessionError(blocked_info.reason)
        yield context
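For orientation, a hedged usage sketch of how the pipeline assembled above (request → parse → blockage detection) surfaces to user code through one of the concrete subclasses. It assumes the public `ParselCrawler` API (`router.default_handler`, `max_requests_per_crawl`, `run`), which is not part of this diff.

```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(
        # Forwarded to the HTTP client set up in AbstractHttpCrawler.__init__ above.
        additional_http_error_status_codes=[403],
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # `context.selector` / `context.parsed_content` comes from the _parse_http_response step.
        context.log.info(f'Title: {context.selector.css("title::text").get()}')
        # `enqueue_links` defaults to the 'a' selector and the SAME_HOSTNAME strategy.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```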
src/crawlee/abstract_http_crawler/_abstract_http_parser.py

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, Generic

from crawlee._utils.blocked import RETRY_CSS_SELECTORS
from crawlee._utils.docs import docs_group
from crawlee.abstract_http_crawler._http_crawling_context import TParseResult
from crawlee.basic_crawler import BlockedInfo

if TYPE_CHECKING:
    from collections.abc import Iterable

    from crawlee.http_clients import HttpResponse


@docs_group('Abstract classes')
class AbstractHttpParser(Generic[TParseResult], ABC):
    """Parser used for parsing an HTTP response and inspecting the parsed result to find links or detect blocking."""

    @abstractmethod
    async def parse(self, response: HttpResponse) -> TParseResult:
        """Parse an HTTP response.

        Args:
            response: HTTP response to be parsed.

        Returns:
            Parsed HTTP response.
        """

    def is_blocked(self, parsed_content: TParseResult) -> BlockedInfo:
        """Detect if the request is blocked and return `BlockedInfo` with additional information.

        Default implementation that expects the `is_matching_selector` abstract method to be implemented.
        Override this method if your parser detects blocking in a different way.

        Args:
            parsed_content: Parsed HTTP response. Result of the `parse` method.

        Returns:
            `BlockedInfo` object whose `reason` contains a non-empty description if blocking was detected.
            An empty `reason` signifies that no blocking was detected.
        """
        reason = ''
        if parsed_content is not None:
            matched_selectors = [
                selector for selector in RETRY_CSS_SELECTORS if self.is_matching_selector(parsed_content, selector)
            ]

            if matched_selectors:
                reason = (
                    f"Assuming the session is blocked - HTTP response matched the following selectors: "
                    f"{'; '.join(matched_selectors)}"
                )

        return BlockedInfo(reason=reason)

    @abstractmethod
    def is_matching_selector(self, parsed_content: TParseResult, selector: str) -> bool:
        """Find out whether the selector has a match in the parsed content.

        Args:
            parsed_content: Parsed HTTP response. Result of the `parse` method.
            selector: String used to define the matching pattern.

        Returns:
            True if the selector has a match in the parsed content.
        """

    @abstractmethod
    def find_links(self, parsed_content: TParseResult, selector: str) -> Iterable[str]:
        """Find all links in the parsed content using the selector.

        Args:
            parsed_content: Parsed HTTP response. Result of the `parse` method.
            selector: String used to define the matching pattern for finding links.

        Returns:
            Iterable of strings containing the found links.
        """
src/crawlee/abstract_http_crawler/_http_crawling_context.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Generic

from typing_extensions import Self, TypeVar

from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction
from crawlee._utils.docs import docs_group
from crawlee.http_clients import HttpCrawlingResult, HttpResponse

TParseResult = TypeVar('TParseResult')


@dataclass(frozen=True)
@docs_group('Data structures')
class HttpCrawlingContext(BasicCrawlingContext, HttpCrawlingResult):
    """The crawling context used by the `AbstractHttpCrawler`."""

    @classmethod
    def from_basic_crawling_context(cls, context: BasicCrawlingContext, http_response: HttpResponse) -> Self:
        """Convenience constructor that creates `HttpCrawlingContext` from existing `BasicCrawlingContext`."""
        context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
        return cls(http_response=http_response, **context_kwargs)


@dataclass(frozen=True)
@docs_group('Data structures')
class ParsedHttpCrawlingContext(Generic[TParseResult], HttpCrawlingContext):
    """The crawling context used by `AbstractHttpCrawler`.

    It provides access to key objects as well as utility functions for handling crawling tasks.
    """

    parsed_content: TParseResult
    enqueue_links: EnqueueLinksFunction

    @classmethod
    def from_http_crawling_context(
        cls, context: HttpCrawlingContext, parsed_content: TParseResult, enqueue_links: EnqueueLinksFunction
    ) -> Self:
        """Convenience constructor that creates new context from existing HttpCrawlingContext."""
        context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
        return cls(parsed_content=parsed_content, enqueue_links=enqueue_links, **context_kwargs)
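As an illustration of how a concrete crawler specializes this generic context, a subclass can expose `parsed_content` under a friendlier name. The class name below is hypothetical; the shipped `BeautifulSoupCrawlingContext` offers a similar `.soup` accessor, as noted in the docs above.

```python
from __future__ import annotations

from dataclasses import dataclass

from bs4 import BeautifulSoup

from crawlee.abstract_http_crawler import ParsedHttpCrawlingContext


@dataclass(frozen=True)
class MySoupCrawlingContext(ParsedHttpCrawlingContext[BeautifulSoup]):
    """Illustrative context exposing the parsed document under a friendlier name."""

    @property
    def soup(self) -> BeautifulSoup:
        # Alias for the generic `parsed_content` field populated by the parsing pipeline step.
        return self.parsed_content
```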
src/crawlee/basic_crawler/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 from crawlee._types import BasicCrawlingContext
 
 from ._basic_crawler import BasicCrawler, BasicCrawlerOptions
+from ._blocked_info import BlockedInfo
 from ._context_pipeline import ContextPipeline
 
-__all__ = ['BasicCrawler', 'BasicCrawlerOptions', 'BasicCrawlingContext', 'ContextPipeline']
+__all__ = ['BasicCrawler', 'BasicCrawlerOptions', 'BasicCrawlingContext', 'BlockedInfo', 'ContextPipeline']
