
Commit d424bf5

✨ Add Core (#40)
* 💩 Repeat DI injection code
* ✏️ Fix component integration bugs with core
* ✨ Implement Spider
* ⚡️ Improve Spider's performance
* 💩 Add doc src samples
* 💩 Implement base code for core
* ♻️ Adapt design to core
* ➕ Update dependency
* ✅ Refactor old tests to new design
* 💩 Add base benchmark
* ✏️ Fix Rocketry adaptor starting issue
* 🐛 Fix Crawler start-up/shutdown issues
* ⚰️ Remove benchmarks
* 🎨 Improve schedule design to adopt core
* 📝 Add new sample doc src for processor
* 🎨 Fix imports w.r.t. new design
* ♻️ Refactor Engine to adapt to core design
* ♻️ Refactor DI injection to adapt to core design
* ⚡️ Add Spider/Processor to Core
* ♻️ Refactor Core's Spider to improve design
* ♻️ Refactor old tests to pass w.r.t. new designs
* ✅ Add tests for core's components
* 🔥 Remove extra FastAPI folders
* 🔥 Remove Playwright proto
* 🐛 Fix AioHttp design
* 🎨 Improve typing/docstrings
* ✏️ Fix doc and code typos

---------

Co-authored-by: Sadegh Yazdani
1 parent 7009996 commit d424bf5
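
The user-facing core of this commit is the `Crawler` → `Process` rename plus the new `FastCrawler` entry point. Here is a minimal sketch of the new API, assembled from the `docs_src` samples in the diff below. The spider classes, the `>>` pipeline operator, and the `cond` schedule string come straight from those samples; only the flat `wikipedia` import is assumed for illustration:

    from fastcrawler import FastCrawler, Process

    # Spider classes as defined in the wikipedia.py sample added by this
    # commit; a flat import is assumed here for illustration.
    from wikipedia import WikiArticleFinder, WikiArticleRetirever

    # ">>" chains spiders into a pipeline; "cond" is a schedule condition string.
    app = FastCrawler(
        Process(WikiArticleFinder >> WikiArticleRetirever, cond="every 3 minute"),
    )
    app.run()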


52 files changed: +1,237 −489 lines

.github/ISSUE_TEMPLATE/PULL_request_template.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 🚨Please review the [guidelines for contributing](../CONTRIBUTING.md) to this repository.

 - [ ] Make sure you are requesting to **pull a topic/feature/bugfix branch** (right side). Don't request your master!
-- [ ] Make sure you have 100% test coverage
+- [ ] Make sure you have 100% test coverage
 - [ ] Check the commit's or even all commits' message styles matches our requested structure.
 - [ ] Check your code linting

.github/ISSUE_TEMPLATE/bug.yml

Lines changed: 1 addition & 1 deletion
@@ -44,4 +44,4 @@ body:
       value: |-
         ## Thanks 🙏
     validations:
-      required: false
+      required: false

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@ blank_issues_enabled: false
 contact_links:
   - name: Question or Problem
     about: Ask a question or ask about a problem in GitHub Discussions.
-    url: https://t.me/fastcrawler
+    url: https://t.me/fastcrawler

.pre-commit-config.yaml

Lines changed: 6 additions & 6 deletions
@@ -14,12 +14,12 @@ repos:
     rev: 6.0.0
     hooks:
       - id: flake8
-  - repo: 'https://github.com/pre-commit/mirrors-mypy'
-    rev: v1.4.1
-    hooks:
-      - id: mypy
-        name: mypy (fastcrawler)
-        files: ^fastcrawler/
+  # - repo: 'https://github.com/pre-commit/mirrors-mypy'
+  #   rev: v1.4.1
+  #   hooks:
+  #     - id: mypy
+  #       name: mypy (fastcrawler)
+  #       files: ^fastcrawler/
 # - id: mypy
 #   name: mypy (test)
 #   files: ^test/

docs_src/initilizing_project/sample1/wikipedia.py

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # pylint: disable-all


-from fastcrawler import BaseModel, Crawler, CSSField, Spider, XPATHField
+from fastcrawler import BaseModel, CSSField, Process, Spider, XPATHField
 from fastcrawler.engine import AioHttpEngine


@@ -36,4 +36,4 @@ async def save_data(self, data: ArticleData):
         ...  # save parsed data to database


-wiki_spider = Crawler(WikiArticleFinder >> WikiArticleRetirever)
+wiki_spider = Process(WikiArticleFinder >> WikiArticleRetirever)

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# pylint: disable-all
+
+
+from fastcrawler import BaseModel, Crawler, CSSField, Spider, XPATHField
+from fastcrawler.engine import AioHttpEngine
+
+
+class PageResolver(BaseModel):
+    class Config:
+        url_resolver = XPATHField("//a[contains(@href, 'en.Imdbpedia')]/@href")
+
+
+class ArticleData(BaseModel):
+    title: str = CSSField("h1.firstHeading", extract="text")  # gets text
+    body: str = CSSField("div.mw-body-content > div.mw-parser-output")  # gets inner HTML
+
+
+class ImdbBaseSpider(Spider):
+    engine = AioHttpEngine
+    concurrency = 100
+
+
+class ImdbArticleFinder(ImdbBaseSpider):
+    data_model = PageResolver
+    req_count = 1_000_000
+    start_url = [
+        "https://meta.Imdbmedia.org/Imdb/List_of_Imdbpedias",
+    ]
+
+
+class ImdbArticleRetirever(ImdbBaseSpider):
+    data_model = ArticleData
+    req_count = 1_000_000
+
+    async def save_data(self, data: ArticleData):
+        ...  # save parsed data to database

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+from fastcrawler import FastCrawler, Process
+
+from .imdb import ImdbArticleFinder, ImdbArticleRetirever
+from .wikipedia import WikiArticleFinder, WikiArticleRetirever
+
+app = FastCrawler(
+    Process(ImdbArticleFinder >> ImdbArticleRetirever, cond="every 3 minute"),
+    Process(WikiArticleFinder >> WikiArticleRetirever, cond="every 3 minute"),
+)
+app.run()

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# pylint: disable-all
+
+
+from fastcrawler import BaseModel, CSSField, Process, Spider, XPATHField
+from fastcrawler.engine import AioHttpEngine
+
+
+class PageResolver(BaseModel):
+    class Config:
+        url_resolver = XPATHField("//a[contains(@href, 'en.wikipedia')]/@href")
+
+
+class ArticleData(BaseModel):
+    title: str = CSSField("h1.firstHeading", extract="text")  # gets text
+    body: str = CSSField("div.mw-body-content > div.mw-parser-output")  # gets inner HTML
+
+
+class WikiBaseSpider(Spider):
+    engine = AioHttpEngine
+    concurrency = 100
+
+
+class WikiArticleFinder(WikiBaseSpider):
+    data_model = PageResolver
+    req_count = 1_000_000
+    start_url = [
+        "https://meta.wikimedia.org/wiki/List_of_Wikipedias",
+    ]
+
+
+class WikiArticleRetirever(WikiBaseSpider):
+    data_model = ArticleData
+    req_count = 1_000_000
+
+    async def save_data(self, data: ArticleData):
+        ...  # save parsed data to database
+
+
+wiki_spider = Process(WikiArticleFinder >> WikiArticleRetirever)

docs_src/processor/tutorial001.py

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+import asyncio
+
+from fastcrawler import BaseModel, Depends, Process, Spider, XPATHField
+from fastcrawler.schedule import ProcessController, RocketryApplication
+
+
+class PersonData(BaseModel):
+    name: str = XPATHField(query="//td[1]", extract="text")
+    age: int = XPATHField(query="//td[2]", extract="text")
+
+
+class PersonPage(BaseModel):
+    person: list[PersonData] = XPATHField(query="//table//tr", many=True)
+
+
+async def get_urls():
+    return {f"http://localhost:8000/persons/{id}" for id in range(20)}
+
+
+class MySpider(Spider):
+    engine_request_limit = 20
+    data_model = PersonPage
+    start_url = Depends(get_urls)
+
+    async def save(self, all_data: list[PersonPage]):
+        assert all_data is not None
+        assert len(all_data) == 20
+
+
+async def main():
+    process = Process(
+        spider=MySpider(),
+        cond="every 1 second",
+        controller=ProcessController(app=RocketryApplication()),
+    )
+    await process.add_spiders()
+    assert len(await process.controller.app.get_all_tasks()) == 1
+    await process.start(silent=False)
+
+
+asyncio.run(main())

fastcrawler/__init__.py

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,7 @@
-from .core import Crawler, FastCrawler, Spider
+from .core import FastCrawler, Process, Spider
 from .engine import AioHttpEngine
 from .parsers import BaseModel, CSSField, RegexField, XPATHField
-from .schedule import RocketryApplication, RocketryController
+from .schedule import ProcessController, RocketryApplication
 from .utils import Depends

 __all__ = [
@@ -11,9 +11,9 @@
     "RegexField",
     "Depends",
     "Spider",
-    "Crawler",
+    "Process",
     "FastCrawler",
     "RocketryApplication",
-    "RocketryController",
+    "ProcessController",
     "AioHttpEngine",
 ]
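
For code migrating across this commit, the export changes boil down to two renames. A hedged sketch of the updated import surface, limited to names visible in the `__all__` diff above:

    # Before this commit:
    #     from fastcrawler import Crawler, RocketryController
    # After this commit (names per the updated __all__):
    from fastcrawler import (
        AioHttpEngine,
        Depends,
        FastCrawler,
        Process,            # formerly Crawler
        ProcessController,  # formerly RocketryController
        RegexField,
        RocketryApplication,
        Spider,
    )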
