Merged
Changes from 44 commits
52 commits
8c8dd24
WIP
Pijukatel Nov 21, 2024
48812b1
Draft proposal for discussion.
Pijukatel Nov 21, 2024
853ee85
Remove redundant type
Pijukatel Nov 21, 2024
17e08a1
BeautifulSoupParser
Pijukatel Nov 22, 2024
188afdb
Being stuck on mypy and generics
Pijukatel Nov 22, 2024
96356d6
Almost there. Figure out the reason for casts in middleware
Pijukatel Nov 22, 2024
def0e72
Solved BScrawler. Next ParselCrawler.
Pijukatel Nov 26, 2024
54ce154
Reworked ParselCrawler
Pijukatel Nov 26, 2024
4692fe9
Ready for review.
Pijukatel Nov 26, 2024
e2e3cd9
Merge remote-tracking branch 'origin/master' into new-class-hier-curr…
Pijukatel Nov 26, 2024
bb8cd12
Edit forgotten comment.
Pijukatel Nov 26, 2024
f869be6
Remove mistaken edits in docs
Pijukatel Nov 26, 2024
81e46cd
Merge branch 'master' into new-class-hier-current-middleware
Pijukatel Nov 26, 2024
f994e32
Reformat after merge.
Pijukatel Nov 26, 2024
bbc27af
Fix CI reported issues on previous Python versions
Pijukatel Nov 26, 2024
7567164
Update docstrings in child crawlers to not repeat text after parent.
Pijukatel Nov 26, 2024
9335967
Revert incorrect docstring update.
Pijukatel Nov 26, 2024
b4877cb
Review comments
Pijukatel Nov 26, 2024
2929be1
Reverted back name change in doc strings.
Pijukatel Nov 26, 2024
19bc041
Fix CI reported issues.
Pijukatel Nov 26, 2024
fe19345
Fix incorrectly named BS argument
Pijukatel Nov 26, 2024
6ab5a09
Changes by Honza
Pijukatel Nov 27, 2024
2af695b
Polish proposed changes,
Pijukatel Nov 27, 2024
0b0f4ce
Review comments
Pijukatel Nov 27, 2024
03832fb
Review comments about internal imports in docs
Pijukatel Nov 27, 2024
005c7cf
Extract is_matching_selector from Parser and put
Pijukatel Nov 27, 2024
fc2de60
Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_cont…
Pijukatel Nov 28, 2024
578cdc0
Update src/crawlee/http_crawler/_http_crawler.py
Pijukatel Nov 28, 2024
280cecb
Update src/crawlee/parsel_crawler/_parsel_crawling_context.py
Pijukatel Nov 28, 2024
a88a5e4
Review comments.
Pijukatel Nov 28, 2024
b1c0fad
Use correctly BeautifulSoupParser type
Pijukatel Nov 28, 2024
4e3fbd5
Add doc page describing new classes.
Pijukatel Nov 28, 2024
9fc66d8
Update docs more
Pijukatel Nov 28, 2024
434bd6b
Apply suggestions from code review
Pijukatel Nov 29, 2024
18562de
Review comments.
Pijukatel Nov 29, 2024
b9255be
More review comments
Pijukatel Nov 29, 2024
d70e8a8
Update docs names
Pijukatel Nov 29, 2024
3e87db5
Update docs/guides/static_content_crawlers.mdx
Pijukatel Dec 3, 2024
460e1ac
Review comments.
Pijukatel Dec 3, 2024
8c4ec82
Review comments
Pijukatel Dec 3, 2024
e7c7817
Apply suggestions from code review
Pijukatel Dec 3, 2024
05cec1a
Rename StaticContentCrawler to AbstractContentCrawler and related fil…
Pijukatel Dec 3, 2024
bed215e
Renaming to AbstractHttpCrawler 2
Pijukatel Dec 3, 2024
c43b564
Renaming to AbstractHttpCrawler 2
Pijukatel Dec 3, 2024
a1db9e2
Apply suggestions from code review
Pijukatel Dec 3, 2024
fae917e
Review comments
Pijukatel Dec 3, 2024
b563bf9
Expand docs by short description of how to create your own HTTPbase c…
Pijukatel Dec 3, 2024
89a8e83
Update src/crawlee/abstract_http_crawler/_abstract_http_crawler.py
Pijukatel Dec 3, 2024
139b21b
Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py
Pijukatel Dec 4, 2024
bd7846f
Apply suggestions from code review
Pijukatel Dec 4, 2024
454f9ec
Review comments
Pijukatel Dec 4, 2024
6bba552
Move BlockedInfo to its own file.
Pijukatel Dec 4, 2024
24 changes: 24 additions & 0 deletions docs/guides/http_crawlers.mdx
@@ -0,0 +1,24 @@
---
id: http-crawlers
title: HTTP crawlers
description: Crawlee supports multiple HTTP crawlers that can be used to extract data from server-rendered webpages.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

The generic class <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> is the parent of <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>, and it can also be used as the parent of your own crawler with custom content-parsing requirements.

It already includes almost all the functionality needed to crawl webpages; the only missing parts are the parser to be used for parsing HTTP responses and a context dataclass that defines which context helpers are available to user handler functions.

## `BeautifulSoupCrawler`
<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> in the `.soup` or `.parsed_content` attribute.

## `ParselCrawler`
<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response and makes it available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> in the `.selector` or `.parsed_content` attribute.

## `HttpCrawler`
<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink> that does not parse the HTTP response at all and is to be used if no parsing is required.

Collaborator
I mean, this is great and definitely better than nothing. However, it is quite short and might not look good when rendered on the page. For comparison, take a look at guides like HTTP Clients or Result Storages; it should aim for similar depth and verbosity, including usage examples.

This should not be a blocker for merging, as we have been improving the docs all the time. If you decide not to update it now, please open a new issue for it. Thanks.

Collaborator Author

I was scratching my head trying to come up with something for those docs. The problem is that the only example I can think of is implementing your own HTTP-based crawler (the other examples in other files already show how to crawl). But such an example already exists in our code base: BSCrawler and ParselCrawler, so I can just point to those two.
If you think something specific is missing, please let me know and I can add that.
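For reference, a minimal usage example of the kind discussed above might look like the following sketch. It relies only on the `.soup` attribute and the `enqueue_links` helper documented in this PR; the target URL and the stored fields are illustrative.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # The parsed page is exposed as `context.soup` (an alias of `context.parsed_content`).
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})
        # Enqueue links discovered by the parser for further crawling.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```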

11 changes: 11 additions & 0 deletions src/crawlee/abstract_http_crawler/__init__.py
@@ -0,0 +1,11 @@
from ._abstract_http_crawler import AbstractHttpCrawler, HttpCrawlerOptions
from ._abstract_http_parser import AbstractHttpParser, BlockedInfo
from ._http_crawling_context import ParsedHttpCrawlingContext

__all__ = [
'AbstractHttpCrawler',
'AbstractHttpParser',
'BlockedInfo',
'HttpCrawlerOptions',
'ParsedHttpCrawlingContext',
]
204 changes: 204 additions & 0 deletions src/crawlee/abstract_http_crawler/_abstract_http_crawler.py
@@ -0,0 +1,204 @@
from __future__ import annotations

import logging
from abc import ABC
from typing import TYPE_CHECKING, Any, Generic

from pydantic import ValidationError
from typing_extensions import NotRequired, TypeVar

from crawlee import EnqueueStrategy
from crawlee._request import BaseRequestData
from crawlee._utils.docs import docs_group
from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
from crawlee.abstract_http_crawler._http_crawling_context import (
HttpCrawlingContext,
ParsedHttpCrawlingContext,
TParseResult,
)
from crawlee.basic_crawler import BasicCrawler, BasicCrawlerOptions, ContextPipeline
from crawlee.errors import SessionError
from crawlee.http_clients import HttpxHttpClient

if TYPE_CHECKING:
from collections.abc import AsyncGenerator, Iterable

from typing_extensions import Unpack

from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction, EnqueueLinksKwargs
from crawlee.abstract_http_crawler._abstract_http_parser import AbstractHttpParser

TCrawlingContext = TypeVar('TCrawlingContext', bound=ParsedHttpCrawlingContext)


@docs_group('Data structures')
class HttpCrawlerOptions(Generic[TCrawlingContext], BasicCrawlerOptions[TCrawlingContext]):
additional_http_error_status_codes: NotRequired[Iterable[int]]
"""Additional HTTP status codes to treat as errors, triggering automatic retries when encountered."""

ignore_http_error_status_codes: NotRequired[Iterable[int]]
"""HTTP status codes typically considered errors but to be treated as successful responses."""


@docs_group('Abstract classes')
class AbstractHttpCrawler(Generic[TCrawlingContext, TParseResult], BasicCrawler[TCrawlingContext], ABC):
"""A web crawler for performing HTTP requests.

The `AbstractHttpCrawler` builds on top of the `BasicCrawler`, which means it inherits all of its features. On top
of that it implements the HTTP communication using the HTTP clients. The class allows integration with
any HTTP client that implements the `BaseHttpClient` interface. The HTTP client is provided to the crawler
as an input parameter to the constructor.
`AbstractHttpCrawler` is a generic class and is expected to be used together with a specific parser that will be used
to parse the HTTP response, and with the `TCrawlingContext` type that is made available to the user handler function.
See its prepared specific versions, for example `BeautifulSoupCrawler`, `ParselCrawler` or `HttpCrawler`.

The HTTP client-based crawlers are ideal for websites that do not require JavaScript execution. However,
if you need to execute client-side JavaScript, consider using a browser-based crawler like the `PlaywrightCrawler`.
"""

def __init__(
self,
*,
parser: AbstractHttpParser[TParseResult],
additional_http_error_status_codes: Iterable[int] = (),
ignore_http_error_status_codes: Iterable[int] = (),
**kwargs: Unpack[BasicCrawlerOptions[TCrawlingContext]],
) -> None:
self._parser = parser

kwargs.setdefault(
'http_client',
HttpxHttpClient(
additional_http_error_status_codes=additional_http_error_status_codes,
ignore_http_error_status_codes=ignore_http_error_status_codes,
),
)

if '_context_pipeline' not in kwargs:
raise ValueError(
'Please pass in a `_context_pipeline`. You should use the '
'AbstractHttpCrawler._create_static_content_crawler_pipeline() method to initialize it.'
)

kwargs.setdefault('_logger', logging.getLogger(__name__))
super().__init__(**kwargs)

def _create_static_content_crawler_pipeline(self) -> ContextPipeline[ParsedHttpCrawlingContext[TParseResult]]:
"""Create static content crawler context pipeline with expected pipeline steps."""
return (
ContextPipeline()
.compose(self._make_http_request)
.compose(self._parse_http_response)
.compose(self._handle_blocked_request)
)

async def _parse_http_response(
self, context: HttpCrawlingContext
) -> AsyncGenerator[ParsedHttpCrawlingContext[TParseResult], None]:
"""Parse http response and create context enhanced by the parsing result and enqueue links function.

Args:
context: The current crawling context, that includes http response.

Yields:
The original crawling context enhanced by the parsing result and enqueue links function.
"""
parsed_content = await self._parser.parse(context.http_response)
yield ParsedHttpCrawlingContext.from_http_crawling_context(
context=context,
parsed_content=parsed_content,
enqueue_links=self._create_enqueue_links_function(context, parsed_content),
)

def _create_enqueue_links_function(
self, context: HttpCrawlingContext, parsed_content: TParseResult
) -> EnqueueLinksFunction:
"""Create a callback function for extracting links from parsed content and enqueuing them to the crawl.

Args:
context: The current crawling context.
parsed_content: The parsed http response.

Returns:
Awaitable that is used for extracting links from parsed content and enqueuing them to the crawl.
"""

async def enqueue_links(
*,
selector: str = 'a',
label: str | None = None,
user_data: dict[str, Any] | None = None,
**kwargs: Unpack[EnqueueLinksKwargs],
) -> None:
kwargs.setdefault('strategy', EnqueueStrategy.SAME_HOSTNAME)

requests = list[BaseRequestData]()
user_data = user_data or {}
if label is not None:
user_data.setdefault('label', label)
for link in self._parser.find_links(parsed_content, selector=selector):
url = link
if not is_url_absolute(url):
url = convert_to_absolute_url(context.request.url, url)
try:
request = BaseRequestData.from_url(url, user_data=user_data)
except ValidationError as exc:
context.log.debug(
f'Skipping URL "{url}" due to invalid format: {exc}. '
'This may be caused by a malformed URL or unsupported URL scheme. '
'Please ensure the URL is correct and retry.'
)
continue

requests.append(request)

await context.add_requests(requests, **kwargs)

return enqueue_links

async def _make_http_request(self, context: BasicCrawlingContext) -> AsyncGenerator[HttpCrawlingContext, None]:
"""Make http request and create context enhanced by http response.

Args:
context: The current crawling context.

Yields:
The original crawling context enhanced by http response.
"""
result = await self._http_client.crawl(
request=context.request,
session=context.session,
proxy_info=context.proxy_info,
statistics=self._statistics,
)

yield HttpCrawlingContext.from_basic_crawling_context(context=context, http_response=result.http_response)

async def _handle_blocked_request(
self, context: ParsedHttpCrawlingContext[TParseResult]
) -> AsyncGenerator[ParsedHttpCrawlingContext[TParseResult], None]:
"""Try to detect if the request is blocked based on the HTTP status code or the parsed response content.

Args:
context: The current crawling context.

Raises:
SessionError: If the request is considered blocked.

Yields:
The original crawling context if no errors are detected.
"""
if self._retry_on_blocked:
status_code = context.http_response.status_code

# TODO: refactor to avoid private member access
# https://github.com/apify/crawlee-python/issues/708
if (
context.session
and status_code not in self._http_client._ignore_http_error_status_codes # noqa: SLF001
and context.session.is_blocked_status_code(status_code=status_code)
):
raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
if blocked_info := self._parser.is_blocked(context.parsed_content):
raise SessionError(blocked_info.reason)
yield context
93 changes: 93 additions & 0 deletions src/crawlee/abstract_http_crawler/_abstract_http_parser.py
@@ -0,0 +1,93 @@
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import TYPE_CHECKING, Generic

from crawlee._utils.blocked import RETRY_CSS_SELECTORS
from crawlee._utils.docs import docs_group
from crawlee.abstract_http_crawler._http_crawling_context import TParseResult

if TYPE_CHECKING:
from collections.abc import Iterable

from crawlee.http_clients import HttpResponse


@docs_group('Classes')
@dataclass(frozen=True)
class BlockedInfo:
"""Information about whether the crawling is blocked. If reason is empty, then it means it is not blocked."""

reason: str

def __bool__(self) -> bool:
"""No reason means no blocking."""
return bool(self.reason)


@docs_group('Abstract classes')
class AbstractHttpParser(Generic[TParseResult], ABC):
"""Parser used for parsing http response and inspecting parsed result to find links or detect blocking."""

@abstractmethod
async def parse(self, response: HttpResponse) -> TParseResult:
"""Parse http response.

Args:
response: Http response to be parsed.

Returns:
Parsed http response.
"""

def is_blocked(self, parsed_content: TParseResult) -> BlockedInfo:
"""Detect if blocked and return BlockedInfo with additional information.

The default implementation relies on the abstract `is_matching_selector` method being implemented.
Override this method if your parser detects blocking in a different way.

Args:
parsed_content: Parsed http response. Result of parse method.

Returns:
`BlockedInfo` object that contains a non-empty description of the reason if blocking was detected. An empty
reason signifies that no blocking was detected.
"""
reason = ''
if parsed_content is not None:
matched_selectors = [
selector for selector in RETRY_CSS_SELECTORS if self.is_matching_selector(parsed_content, selector)
]

if matched_selectors:
reason = (
f"Assuming the session is blocked - HTTP response matched the following selectors: "
f"{'; '.join(matched_selectors)}"
)

return BlockedInfo(reason=reason)

@abstractmethod
def is_matching_selector(self, parsed_content: TParseResult, selector: str) -> bool:
"""Find if selector has match in parsed content.

Args:
parsed_content: Parsed http response. Result of parse method.
selector: String used to define matching pattern.

Returns:
True if selector has match in parsed content.
"""

@abstractmethod
def find_links(self, parsed_content: TParseResult, selector: str) -> Iterable[str]:
"""Find all links in result using selector.

Args:
parsed_content: Parsed http response. Result of parse method.
selector: String used to define matching pattern for finding links.

Returns:
Iterable of strings that contain found links.
"""
44 changes: 44 additions & 0 deletions src/crawlee/abstract_http_crawler/_http_crawling_context.py
@@ -0,0 +1,44 @@
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Generic

from typing_extensions import Self, TypeVar

from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction
from crawlee._utils.docs import docs_group
from crawlee.http_clients import HttpCrawlingResult, HttpResponse

TParseResult = TypeVar('TParseResult')


@dataclass(frozen=True)
@docs_group('Data structures')
class HttpCrawlingContext(BasicCrawlingContext, HttpCrawlingResult):
"""The crawling context used by the `AbstractHttpCrawler`."""

@classmethod
def from_basic_crawling_context(cls, context: BasicCrawlingContext, http_response: HttpResponse) -> Self:
"""Convenience constructor that creates HttpCrawlingContext from existing BasicCrawlingContext."""
context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
return cls(http_response=http_response, **context_kwargs)


@dataclass(frozen=True)
@docs_group('Data structures')
class ParsedHttpCrawlingContext(Generic[TParseResult], HttpCrawlingContext):
"""The crawling context used by AbstractHttpCrawler.

It provides access to key objects as well as utility functions for handling crawling tasks.
"""

parsed_content: TParseResult
enqueue_links: EnqueueLinksFunction

@classmethod
def from_http_crawling_context(
cls, context: HttpCrawlingContext, parsed_content: TParseResult, enqueue_links: EnqueueLinksFunction
) -> Self:
"""Convenience constructor that creates new context from existing HttpCrawlingContext."""
context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
return cls(parsed_content=parsed_content, enqueue_links=enqueue_links, **context_kwargs)
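The `from_*` convenience constructors above follow a simple pattern: copy every field of the existing frozen dataclass via `dataclasses.fields()` and add the new fields on top. The same pattern in isolation, with illustrative names that are not part of crawlee:

```python
from __future__ import annotations

from dataclasses import dataclass, fields

from typing_extensions import Self


@dataclass(frozen=True)
class BaseContext:
    url: str
    label: str


@dataclass(frozen=True)
class ParsedContext(BaseContext):
    parsed_content: str

    @classmethod
    def from_base(cls, context: BaseContext, parsed_content: str) -> Self:
        # Copy every field of the existing context and add the new one on top.
        context_kwargs = {field.name: getattr(context, field.name) for field in fields(context)}
        return cls(parsed_content=parsed_content, **context_kwargs)


base = BaseContext(url='https://example.com', label='detail')
parsed = ParsedContext.from_base(base, parsed_content='<html>…</html>')
assert parsed.url == base.url
```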
5 changes: 3 additions & 2 deletions src/crawlee/beautifulsoup_crawler/__init__.py
@@ -1,10 +1,11 @@
try:
from ._beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupParser
from ._beautifulsoup_crawler import BeautifulSoupCrawler
from ._beautifulsoup_crawling_context import BeautifulSoupCrawlingContext
from ._beautifulsoup_parser import BeautifulSoupParserType
except ImportError as exc:
raise ImportError(
"To import anything from this subpackage, you need to install the 'beautifulsoup' extra."
"For example, if you use pip, run `pip install 'crawlee[beautifulsoup]'`.",
) from exc

__all__ = ['BeautifulSoupCrawler', 'BeautifulSoupCrawlingContext', 'BeautifulSoupParser']
__all__ = ['BeautifulSoupCrawler', 'BeautifulSoupCrawlingContext', 'BeautifulSoupParserType']