refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Commits (52 total; this view shows changes from 40 commits)
All 52 commits are by Pijukatel:

- 8c8dd24 WIP
- 48812b1 Draft proposal for discussion.
- 853ee85 Remove redundant type
- 17e08a1 BeautifulSoupParser
- 188afdb Being stuck on mypy and generics
- 96356d6 Almost there. Figure out the reason for casts in middleware
- def0e72 Solved BScrawler. Next ParselCrawler.
- 54ce154 Reworked ParselCrawler
- 4692fe9 Ready for review.
- e2e3cd9 Merge remote-tracking branch 'origin/master' into new-class-hier-curr…
- bb8cd12 Edit forgotten comment.
- f869be6 Remove mistaken edits in docs
- 81e46cd Merge branch 'master' into new-class-hier-current-middleware
- f994e32 Reformat after merge.
- bbc27af Fix CI reported issues on previous Python versions
- 7567164 Update docstrings in child crawlers to not repeat text after parent.
- 9335967 Revert incorrect docstring update.
- b4877cb Review comments
- 2929be1 Reverted back name change in doc strings.
- 19bc041 Fix CI reported issues.
- fe19345 Fix incorrectly name BS argument
- 6ab5a09 Changes by Honza
- 2af695b Polish proposed changes,
- 0b0f4ce Review comments
- 03832fb Review commnets about interl imports in docs
- 005c7cf Extract is_matching_selector from Parser and put
- fc2de60 Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_cont…
- 578cdc0 Update src/crawlee/http_crawler/_http_crawler.py
- 280cecb Update src/crawlee/parsel_crawler/_parsel_crawling_context.py
- a88a5e4 Review comments.
- b1c0fad Use correctly BeautifulSoupParser type
- 4e3fbd5 Add doc page describing new classes.
- 9fc66d8 Update docs more
- 434bd6b Apply suggestions from code review
- 18562de Review comments.
- b9255be More review comments
- d70e8a8 Update docs names
- 3e87db5 Update docs/guides/static_content_crawlers.mdx
- 460e1ac Review comments.
- 8c4ec82 Review comments
- e7c7817 Apply suggestions from code review
- 05cec1a Rename StaticCOntentCrawler to AbstractContentCrawler and related fil…
- bed215e Renaming to AbstractHttpCrawler 2
- c43b564 Renaming to AbstractHttpCrawler 2
- a1db9e2 Apply suggestions from code review
- fae917e Review comments
- b563bf9 Expand docs by short description of how to create your own HTTPbase c…
- 89a8e83 Update src/crawlee/abstract_http_crawler/_abstract_http_crawler.py
- 139b21b Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py
- bd7846f Apply suggestions from code review
- 454f9ec Review comments
- 6bba552 Move BlockedInfo to its own file.
New file `docs/guides/static_content_crawlers.mdx` (filename per the commit history above), 24 added lines:

```mdx
---
id: static-content-crawlers
title: Static content crawlers
description: Crawlee supports multiple static content crawlers that can be used to extract data from server-rendered webpages.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

The generic class <ApiLink to="class/StaticContentCrawler">`StaticContentCrawler`</ApiLink> is the parent of <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>, and it can also serve as the parent of your own crawler with custom content-parsing requirements.

It already includes almost all the functionality needed to crawl webpages. The only missing parts are the parser that should be used to parse HTTP responses, and the context object that defines what is available to user handler functions.

## `BeautifulSoupCrawler`

<ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> uses <ApiLink to="class/BeautifulSoupParser">`BeautifulSoupParser`</ApiLink> to parse the HTTP response, and the result is available in <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> as `.soup` or `.parsed_content`.

## `ParselCrawler`

<ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> uses <ApiLink to="class/ParselParser">`ParselParser`</ApiLink> to parse the HTTP response, and the result is available in <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> as `.selector` or `.parsed_content`.

## `HttpCrawler`

<ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> uses <ApiLink to="class/NoParser">`NoParser`</ApiLink>, which does not parse the HTTP response at all; use it when no parsing is required.
```
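The design the guide describes — a generic crawler parameterized by a parser that produces the context's `parsed_content` — can be illustrated with a minimal, self-contained sketch. This is a toy model, not the actual crawlee API: the real `parse` is async and takes an `HttpResponse`, and the real classes carry far more state; everything below beyond the class names shown on this page is a simplifying assumption.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

TParseResult = TypeVar('TParseResult')


class StaticContentParser(ABC, Generic[TParseResult]):
    """Toy version of the parser interface: turns a raw body into a parse result."""

    @abstractmethod
    def parse(self, raw_body: bytes) -> TParseResult: ...


@dataclass(frozen=True)
class ParsedHttpCrawlingContext(Generic[TParseResult]):
    """Toy context: the parsed content is exposed to handlers as `parsed_content`."""

    url: str
    parsed_content: TParseResult


class NoParser(StaticContentParser[bytes]):
    """Pass-through parser, mirroring the role of crawlee's NoParser."""

    def parse(self, raw_body: bytes) -> bytes:
        return raw_body


# A crawler specialized with NoParser hands handlers the raw bytes unchanged.
context = ParsedHttpCrawlingContext(
    url='https://example.com',
    parsed_content=NoParser().parse(b'<html></html>'),
)
print(context.parsed_content)  # b'<html></html>'
```

Swapping `NoParser` for a BeautifulSoup- or Parsel-backed parser changes only the type of `parsed_content`; the crawling machinery stays the same, which is the point of the refactor.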
`src/crawlee/beautifulsoup_crawler/__init__.py` (filename inferred from the relative imports in the hunk):
```diff
@@ -1,10 +1,11 @@
 try:
-    from ._beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupParser
+    from ._beautifulsoup_crawler import BeautifulSoupCrawler
     from ._beautifulsoup_crawling_context import BeautifulSoupCrawlingContext
+    from ._beautifulsoup_parser import BeautifulSoupParserType
 except ImportError as exc:
     raise ImportError(
         "To import anything from this subpackage, you need to install the 'beautifulsoup' extra."
         "For example, if you use pip, run `pip install 'crawlee[beautifulsoup]'`.",
     ) from exc

-__all__ = ['BeautifulSoupCrawler', 'BeautifulSoupCrawlingContext', 'BeautifulSoupParser']
+__all__ = ['BeautifulSoupCrawler', 'BeautifulSoupCrawlingContext', 'BeautifulSoupParserType']
```
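The guarded `try`/`except ImportError` in this `__init__.py` is a general pattern for optional-extra subpackages: a missing dependency is re-raised with an install hint. A standalone sketch of the same idea (the helper name `import_with_hint` is hypothetical, not part of crawlee):

```python
import importlib
from types import ModuleType


def import_with_hint(module_name: str, extra: str, package: str = 'crawlee') -> ModuleType:
    """Import a module, re-raising ImportError with an install hint,
    like the guarded import block in the subpackage __init__.py."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"To import anything from this subpackage, you need to install the {extra!r} extra. "
            f"For example, if you use pip, run `pip install '{package}[{extra}]'`."
        ) from exc
```

Chaining with `from exc` preserves the original traceback, so the user sees both which low-level import failed and how to fix it.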
`src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_context.py` (33 changes: 14 additions & 19 deletions):
Hunk `@@ -1,26 +1,21 @@` replaces the standalone dataclass fields with inheritance from `ParsedHttpCrawlingContext[BeautifulSoup]`. Reconstructed from the scraped diff (unchanged lines appear in both versions), before:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction
from crawlee._utils.docs import docs_group
from crawlee.http_crawler import HttpCrawlingResult

if TYPE_CHECKING:
    from bs4 import BeautifulSoup


@dataclass(frozen=True)
@docs_group('Data structures')
class BeautifulSoupCrawlingContext(HttpCrawlingResult, BasicCrawlingContext):
    """The crawling context used by the `BeautifulSoupCrawler`.

    It provides access to key objects as well as utility functions for handling crawling tasks.
    """

    soup: BeautifulSoup
    """The `BeautifulSoup` object for the current page."""

    enqueue_links: EnqueueLinksFunction
    """The BeautifulSoup `EnqueueLinksFunction` implementation."""
```

and after:

```python
from __future__ import annotations

from dataclasses import dataclass, fields

from bs4 import BeautifulSoup
from typing_extensions import Self

from crawlee._utils.docs import docs_group
from crawlee.static_content_crawler._static_crawling_context import ParsedHttpCrawlingContext


@dataclass(frozen=True)
@docs_group('Data structures')
class BeautifulSoupCrawlingContext(ParsedHttpCrawlingContext[BeautifulSoup]):
    """The crawling context used by the `BeautifulSoupCrawler`.

    It provides access to key objects as well as utility functions for handling crawling tasks.
    """

    @property
    def soup(self) -> BeautifulSoup:
        """Convenience alias."""
        return self.parsed_content

    @classmethod
    def from_parsed_http_crawling_context(cls, context: ParsedHttpCrawlingContext[BeautifulSoup]) -> Self:
        """Convenience constructor that creates new context from existing ParsedHttpCrawlingContext[BeautifulSoup]."""
        return cls(**{field.name: getattr(context, field.name) for field in fields(context)})
```
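The `from_parsed_http_crawling_context` constructor relies on a reusable trick: because the subclass adds behaviour but no new dataclass fields, `dataclasses.fields()` can copy every field of an existing base instance into the subclass. A minimal self-contained sketch (`BaseContext`/`SoupContext` are toy names standing in for the crawlee classes):

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BaseContext:
    url: str
    parsed_content: str


@dataclass(frozen=True)
class SoupContext(BaseContext):
    """Adds only behaviour, no new fields, like BeautifulSoupCrawlingContext."""

    @property
    def soup(self) -> str:
        # Convenience alias for the inherited field.
        return self.parsed_content

    @classmethod
    def from_base(cls, context: BaseContext) -> SoupContext:
        # Same trick as from_parsed_http_crawling_context: copy every
        # dataclass field of the existing context into the new class.
        return cls(**{field.name: getattr(context, field.name) for field in fields(context)})


base = BaseContext(url='https://example.com', parsed_content='<html/>')
upgraded = SoupContext.from_base(base)
print(upgraded.soup)  # <html/>
```

This works even though the dataclass is frozen, because the copy goes through the subclass constructor rather than mutating the original instance.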
`src/crawlee/beautifulsoup_crawler/_beautifulsoup_parser.py` (new file, 41 additions):
```python
from __future__ import annotations

from typing import TYPE_CHECKING, Literal

from bs4 import BeautifulSoup, Tag
from typing_extensions import override

from crawlee.static_content_crawler._static_content_parser import StaticContentParser

if TYPE_CHECKING:
    from collections.abc import Iterable

    from crawlee.http_clients import HttpResponse


class BeautifulSoupParser(StaticContentParser[BeautifulSoup]):
    """Parser for parsing http response using BeautifulSoup."""

    def __init__(self, parser: BeautifulSoupParserType = 'lxml') -> None:
        self._parser = parser

    @override
    async def parse(self, response: HttpResponse) -> BeautifulSoup:
        return BeautifulSoup(response.read(), features=self._parser)

    @override
    def is_matching_selector(self, parsed_content: BeautifulSoup, selector: str) -> bool:
        return parsed_content.select_one(selector) is not None

    @override
    def find_links(self, parsed_content: BeautifulSoup, selector: str) -> Iterable[str]:
        link: Tag
        urls: list[str] = []
        for link in parsed_content.select(selector):
            url = link.attrs.get('href')
            if url:
                urls.append(url.strip())
        return urls


BeautifulSoupParserType = Literal['html.parser', 'lxml', 'xml', 'html5lib']
```
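The `find_links` contract above — select anchor elements, keep non-empty `href` values, strip whitespace — can be demonstrated without bs4 using only the standard library's `html.parser`. This is a dependency-free sketch of the same behaviour restricted to `<a>` tags (the names `LinkExtractor` and the module-level `find_links` are hypothetical, not crawlee API):

```python
from __future__ import annotations

from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect stripped href values from <a> tags, mirroring the
    `if url: urls.append(url.strip())` logic in BeautifulSoupParser.find_links."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors with a missing or empty href
                self.urls.append(href.strip())


def find_links(html: str) -> list[str]:
    extractor = LinkExtractor()
    extractor.feed(html)
    return extractor.urls


print(find_links('<a href=" /next ">next</a><a>no href</a>'))  # ['/next']
```

The real parser is more general because a CSS selector chooses which elements to inspect; the stdlib version hardcodes the default case of plain anchor tags.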
Diff of the `http_crawler` package `__init__.py` (filename inferred from its contents):
```diff
@@ -1,4 +1,10 @@
+from crawlee.http_clients import HttpCrawlingResult
+from crawlee.static_content_crawler._static_crawling_context import HttpCrawlingContext
+
 from ._http_crawler import HttpCrawler
-from ._http_crawling_context import HttpCrawlingContext, HttpCrawlingResult

-__all__ = ['HttpCrawler', 'HttpCrawlingContext', 'HttpCrawlingResult']
+__all__ = [
+    'HttpCrawler',
+    'HttpCrawlingContext',
+    'HttpCrawlingResult',
+]
```