refactor!: Refactor HttpCrawler, BeautifulSoupCrawler, ParselCrawler inheritance #746
Merged

Changes from 13 commits (of 52 commits total).

Commits (all by Pijukatel):
8c8dd24  WIP
48812b1  Draft proposal for discussion.
853ee85  Remove redundant type
17e08a1  BeautifulSoupParser
188afdb  Being stuck on mypy and generics
96356d6  Almost there. Figure out the reason for casts in middleware
def0e72  Solved BScrawler. Next ParselCrawler.
54ce154  Reworked ParselCrawler
4692fe9  Ready for review.
e2e3cd9  Merge remote-tracking branch 'origin/master' into new-class-hier-curr…
bb8cd12  Edit forgotten comment .
f869be6  Remove mistaken edits in docs
81e46cd  Merge branch 'master' into new-class-hier-current-middleware
f994e32  Reformat after merge.
bbc27af  Fix CI reported issues on previous Python versions
7567164  Update docstrings in child crawlers to not repeat text after parent.
9335967  Revert incorrect docstring update.
b4877cb  Review comments
2929be1  Reverted back name change in doc strings.
19bc041  Fix CI reported issues.
fe19345  Fix incorrectly name BS argument
6ab5a09  Changes by Honza
2af695b  Polish proposed changes,
0b0f4ce  Review comments
03832fb  Review commnets about interl imports in docs
005c7cf  Extract is_matching_selector from Parser and put
fc2de60  Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_cont…
578cdc0  Update src/crawlee/http_crawler/_http_crawler.py
280cecb  Update src/crawlee/parsel_crawler/_parsel_crawling_context.py
a88a5e4  Review comments.
b1c0fad  Use correctly BeautifulSoupParser type
4e3fbd5  Add doc page describing new classes.
9fc66d8  Update docs more
434bd6b  Apply suggestions from code review
18562de  Review comments.
b9255be  More review comments
d70e8a8  Update docs names
3e87db5  Update docs/guides/static_content_crawlers.mdx
460e1ac  Review comments.
8c4ec82  Review comments
e7c7817  Apply suggestions from code review
05cec1a  Rename StaticCOntentCrawler to AbstractContentCrawler and related fil…
bed215e  Renaming to AbstractHttpCrawler 2
c43b564  Renaming to AbstractHttpCrawler 2
a1db9e2  Apply suggestions from code review
fae917e  Review comments
b563bf9  Expand docs by short description of how to create your own HTTPbase c…
89a8e83  Update src/crawlee/abstract_http_crawler/_abstract_http_crawler.py
139b21b  Update src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawler.py
bd7846f  Apply suggestions from code review
454f9ec  Review comments
6bba552  Move BlockedInfo to its own file.
src/crawlee/beautifulsoup_crawler/_beautifulsoup_crawling_context.py (3 additions, 24 deletions)
@@ -1,26 +1,5 @@
 from __future__ import annotations
+from bs4 import BeautifulSoup
 
-from dataclasses import dataclass
-from typing import TYPE_CHECKING
-
-from crawlee._types import BasicCrawlingContext, EnqueueLinksFunction
-from crawlee._utils.docs import docs_group
-from crawlee.http_crawler import HttpCrawlingResult
-
-if TYPE_CHECKING:
-    from bs4 import BeautifulSoup
-
-
-@dataclass(frozen=True)
-@docs_group('Data structures')
-class BeautifulSoupCrawlingContext(HttpCrawlingResult, BasicCrawlingContext):
-    """The crawling context used by the `BeautifulSoupCrawler`.
-
-    It provides access to key objects as well as utility functions for handling crawling tasks.
-    """
-
-    soup: BeautifulSoup
-    """The `BeautifulSoup` object for the current page."""
-
-    enqueue_links: EnqueueLinksFunction
-    """The BeautifulSoup `EnqueueLinksFunction` implementation."""
+from crawlee.http_crawler import ParsedHttpCrawlingContext
+BeautifulSoupCrawlingContext = ParsedHttpCrawlingContext[BeautifulSoup]
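With this change, BeautifulSoupCrawlingContext stops being its own frozen dataclass and becomes a plain alias for ParsedHttpCrawlingContext[BeautifulSoup]. The sketch below is a hedged illustration of how a request handler might read against the aliased context; it assumes the generic context exposes the parsed page as parsed_content and still provides enqueue_links and log, none of which is visible in this hunk, and the URL is only an example.

# Hedged sketch, not part of the PR diff: assumes ParsedHttpCrawlingContext[BeautifulSoup]
# exposes the parsed page as `parsed_content` and provides `enqueue_links` and `log`.
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # `parsed_content` stands in for the dedicated `soup` attribute removed above.
    title = context.parsed_content.find('title')
    context.log.info(f'Title: {title.get_text() if title else "n/a"}')
    await context.enqueue_links()


if __name__ == '__main__':
    asyncio.run(crawler.run(['https://crawlee.dev']))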
src/crawlee/beautifulsoup_crawler/_beautifulsoup_parser.py (new file, 44 additions)
@@ -0,0 +1,44 @@ (entire file added)
from __future__ import annotations

from typing import TYPE_CHECKING, Iterable

from bs4 import BeautifulSoup, Tag
from typing_extensions import override

from crawlee._utils.blocked import RETRY_CSS_SELECTORS
from crawlee.http_crawler import BlockedInfo, StaticContentParser

if TYPE_CHECKING:
    from crawlee.http_clients import HttpResponse


class BeautifulSoupContentParser(StaticContentParser[BeautifulSoup]):
    """Parser for parsing http response using BeautifulSoup."""

    def __init__(self, parser: str = 'lxml') -> None:
        self._parser = parser

    @override
    async def parse(self, response: HttpResponse) -> BeautifulSoup:
        return BeautifulSoup(response.read(), parser=self._parser)

    @override
    def is_blocked(self, parsed_content: BeautifulSoup) -> BlockedInfo:
        reason = ''
        if parsed_content is not None:
            matched_selectors = [
                selector for selector in RETRY_CSS_SELECTORS if parsed_content.select_one(selector) is not None
            ]
            if matched_selectors:
                reason = f"Assuming the session is blocked - HTTP response matched the following selectors: {'; '.join(
                    matched_selectors)}"
        return BlockedInfo(reason=reason)

    @override
    def find_links(self, parsed_content: BeautifulSoup, selector: str) -> Iterable[str]:
        link: Tag
        urls: list[str] = []
        for link in parsed_content.select(selector):
            if (url := link.attrs.get('href')) is not None:
                urls.append(url.strip())  # noqa: PERF401  # Mypy has problems using `is not None` for type inference in list comprehension.
        return urls
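The class above is the whole BeautifulSoup-specific surface: parse, is_blocked, and find_links. To illustrate the contract that StaticContentParser appears to define in this diff, here is a hypothetical parser for JSON endpoints written against the same three methods. Everything except StaticContentParser, BlockedInfo, and HttpResponse is invented for illustration, and the real abstract base may require more than what this hunk shows.

# Hypothetical illustration only: a parser for JSON responses mirroring the contract used
# above. JsonContentParser and its conventions are invented for this sketch; only
# StaticContentParser, BlockedInfo, and HttpResponse come from the diff.
from __future__ import annotations

import json
from typing import TYPE_CHECKING, Any, Iterable

from typing_extensions import override

from crawlee.http_crawler import BlockedInfo, StaticContentParser

if TYPE_CHECKING:
    from crawlee.http_clients import HttpResponse


class JsonContentParser(StaticContentParser[dict[str, Any]]):
    """Parses an HTTP response body as JSON instead of HTML."""

    @override
    async def parse(self, response: HttpResponse) -> dict[str, Any]:
        return json.loads(response.read())

    @override
    def is_blocked(self, parsed_content: dict[str, Any]) -> BlockedInfo:
        # JSON endpoints do not serve the HTML block pages matched by RETRY_CSS_SELECTORS,
        # so always report an empty blocking reason.
        return BlockedInfo(reason='')

    @override
    def find_links(self, parsed_content: dict[str, Any], selector: str) -> Iterable[str]:
        # CSS selectors have no meaning for JSON; treat `selector` as a top-level key that
        # holds a list of URLs (a convention invented for this sketch).
        return [str(url) for url in parsed_content.get(selector, [])]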
src/crawlee/http_crawler/__init__.py
@@ -1,4 +1,13 @@
-from ._http_crawler import HttpCrawler
-from ._http_crawling_context import HttpCrawlingContext, HttpCrawlingResult
+from ._http_crawler import HttpCrawler, HttpCrawlerGeneric
+from ._http_crawling_context import HttpCrawlingContext, HttpCrawlingResult, ParsedHttpCrawlingContext
+from ._http_parser import BlockedInfo, StaticContentParser
 
-__all__ = ['HttpCrawler', 'HttpCrawlingContext', 'HttpCrawlingResult']
+__all__ = [
+    'BlockedInfo',
+    'HttpCrawler',
+    'HttpCrawlerGeneric',
+    'HttpCrawlingContext',
+    'HttpCrawlingResult',
+    'ParsedHttpCrawlingContext',
+    'StaticContentParser',
+]
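With the expanded __all__, the generic building blocks become part of the package's public surface and can be imported directly from crawlee.http_crawler. The short sketch below only demonstrates that the newly exported names resolve; the aliases and helpers are illustrative, not part of the PR.

# Hedged sketch: shows only that the names exported above resolve from crawlee.http_crawler.
from bs4 import BeautifulSoup

from crawlee.http_crawler import (
    BlockedInfo,
    HttpCrawlerGeneric,
    ParsedHttpCrawlingContext,
    StaticContentParser,
)

# The kind of specializations the child crawlers now build on (BeautifulSoup as an example).
SoupContext = ParsedHttpCrawlingContext[BeautifulSoup]
SoupParser = StaticContentParser[BeautifulSoup]


def crawler_summary(crawler: HttpCrawlerGeneric) -> str:
    """Illustrative only: HttpCrawlerGeneric is presumably the base the child crawlers specialize."""
    return type(crawler).__name__


def describe_block(info: BlockedInfo) -> str:
    """Illustrative helper: an empty reason means the response did not look blocked."""
    return info.reason or 'not blocked'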