Commit 246cfc4

feat: Add example of how to integrate Camoufox into PlaywrightCrawler (#789)
### Description

Show how to integrate `Camoufox` into `PlaywrightCrawler`.
Added one docs page and one code example.
Added the `playwright-camoufox` template to the `crawlee create` CLI command.
Fixed wrong template indentation when `Apify integration` was selected in the CLI.

### Issues

- Closes: #684
1 parent 53184d0 commit 246cfc4

13 files changed: +179 -16 lines changed
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+import asyncio
+
+# Camoufox is an external package and needs to be installed. It is not included in crawlee.
+from camoufox import AsyncNewBrowser
+from typing_extensions import override
+
+from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
+from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
+
+
+class CamoufoxPlugin(PlaywrightBrowserPlugin):
+    """Example browser plugin that uses Camoufox browser, but otherwise keeps the functionality of
+    PlaywrightBrowserPlugin."""
+
+    @override
+    async def new_browser(self) -> PlaywrightBrowserController:
+        if not self._playwright:
+            raise RuntimeError('Playwright browser plugin is not initialized.')
+
+        return PlaywrightBrowserController(
+            browser=await AsyncNewBrowser(self._playwright, headless=True, **self._browser_options),
+            max_open_pages_per_browser=1,  # Increase if camoufox can handle it in your use case.
+            header_generator=None,  # This turns off crawlee's header generation. Camoufox has its own.
+        )
+
+
+async def main() -> None:
+    crawler = PlaywrightCrawler(
+        # Limit the crawl to max requests. Remove or increase it for crawling all links.
+        max_requests_per_crawl=10,
+        # Custom browser pool. This gives users full control over browsers used by the crawler.
+        browser_pool=BrowserPool(plugins=[CamoufoxPlugin()]),
+    )
+
+    # Define the default request handler, which will be called for every request.
+    @crawler.router.default_handler
+    async def request_handler(context: PlaywrightCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url} ...')
+
+        # Extract some data from the page using Playwright's API.
+        posts = await context.page.query_selector_all('.athing')
+        for post in posts:
+            # Get the HTML element for the title within each post.
+            title_element = await post.query_selector('.title a')
+
+            # Extract the data we want from the element.
+            title = await title_element.inner_text() if title_element else None
+
+            # Push the extracted data to the default dataset.
+            await context.push_data({'title': title})
+
+        # Find a link to the next page and enqueue it if it exists.
+        await context.enqueue_links(selector='.morelink')
+
+    # Run the crawler with the initial list of URLs.
+    await crawler.run(['https://news.ycombinator.com/'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+---
+id: playwright-crawler-with-camoufox
+title: Playwright crawler with Camoufox
+---
+
+import ApiLink from '@site/src/components/ApiLink';
+import CodeBlock from '@theme/CodeBlock';
+
+import PlaywrightCrawlerExampleWithCamoufox from '!!raw-loader!./code/playwright_crawler_with_camoufox.py';
+
+This example demonstrates how to integrate Camoufox into <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> using <ApiLink to="class/BrowserPool">`BrowserPool`</ApiLink> with a custom <ApiLink to="class/PlaywrightBrowserPlugin">`PlaywrightBrowserPlugin`</ApiLink>.
+
+Camoufox is a stealthy, minimalistic build of Firefox. For details, please visit its homepage: https://camoufox.com/.
+To run this example, you need to install camoufox, because it is an external tool and is not part of crawlee. For installation instructions, see https://pypi.org/project/camoufox/.
+
+**Warning!** Camoufox uses a custom build of Firefox. This build can be hundreds of MB in size.
+You can either pre-download it with `python3 -m camoufox fetch`, or let camoufox download it automatically the first time it runs and does not find an existing binary.
+For more details, please refer to: https://github.com/daijro/camoufox/tree/main/pythonlib#camoufox-python-interface
+
+**Project template -** You can generate a Python project that includes the Camoufox integration through the crawlee CLI. Call `crawlee create` and pick `Playwright-camoufox` when asked for the crawler type.
+
+The example code after the `PlaywrightCrawler` instantiation is similar to the example describing the use of the Playwright crawler. The main difference is that in this example Camoufox is used as the browser through `BrowserPool`.
+
+<CodeBlock className="language-python">
+    {PlaywrightCrawlerExampleWithCamoufox}
+</CodeBlock>
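Because camoufox is not a crawlee dependency, the example above fails at import time when the package is missing. The following is a minimal pre-flight sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of crawlee or camoufox.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


if __name__ == '__main__':
    # Hypothetical check before starting the crawler example.
    if not is_installed('camoufox'):
        print('camoufox is not installed; see https://pypi.org/project/camoufox/')
```

This only checks that the Python package is importable. It does not check whether the Camoufox browser binary has already been fetched.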

pyproject.toml

Lines changed: 5 additions & 2 deletions
@@ -205,8 +205,11 @@ warn_unreachable = true
 warn_unused_ignores = true
 
 [[tool.mypy.overrides]]
-# Example code that shows integration of crawlee imports apify, despite apify not being dependency of crawlee.
-module = "apify"
+# Example code sometimes shows integration of crawlee with an external tool that is not a dependency of crawlee.
+module = [
+    "apify",  # Example code shows integration of apify and crawlee.
+    "camoufox"  # Example code shows integration of camoufox and crawlee.
+]
 ignore_missing_imports = true
 
 [tool.basedpyright]

src/crawlee/project_template/cookiecutter.json

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
 {
     "project_name": "crawlee-python-project",
     "__package_name": "{{ cookiecutter.project_name|lower|replace('-', '_') }}",
-    "crawler_type": ["beautifulsoup", "parsel", "playwright"],
+    "crawler_type": ["beautifulsoup", "parsel", "playwright", "playwright-camoufox"],
+    "__crawler_type": "{{ cookiecutter.crawler_type|lower|replace('-', '_') }}",
     "http_client": ["httpx", "curl-impersonate"],
     "package_manager": ["poetry", "pip", "manual"],
     "enable_apify_integration": false,

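The `__crawler_type` entry applies the filter chain `lower|replace('-', '_')` to the selected crawler type. A plain-Python sketch of the same transformation follows; the function name `to_module_name` is hypothetical, and the idea that the result is meant for file or module names (where hyphens are not valid) is an assumption rather than something the diff states.

```python
def to_module_name(crawler_type: str) -> str:
    """Mimic the Jinja filter chain `lower|replace('-', '_')` in plain Python."""
    return crawler_type.lower().replace('-', '_')


# The choices listed in cookiecutter.json after this change.
crawler_types = ['beautifulsoup', 'parsel', 'playwright', 'playwright-camoufox']
normalized = {name: to_module_name(name) for name in crawler_types}
```

Only the hyphenated choice changes: `playwright-camoufox` maps to `playwright_camoufox`, while the other three map to themselves.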
src/crawlee/project_template/templates/main.py

Lines changed: 14 additions & 11 deletions
@@ -20,7 +20,11 @@
 # % endif
 # % endblock
 # % endfilter
+# % if self.pre_main is defined
 
+{{self.pre_main()}}
+
+# % endif
 async def main() -> None:
     """The crawler entry point."""
     # % filter truncate(0, end='')
@@ -30,17 +34,16 @@ async def main() -> None:
 
     # % if cookiecutter.enable_apify_integration
     async with Actor:
-        # % filter indent(width=8, first=False)
-        {{ self.instantiation() }}
-        # % endfilter
+        # % set indent_width = 8
     # % else
-        # % filter indent(width=4, first=False)
-    {{ self.instantiation() }}
-        # % endfilter
+        # % set indent_width = 4
     # % endif
+# % filter indent(width=indent_width, first=True)
+{{self.instantiation()}}
 
-    await crawler.run(
-        [
-            '{{ cookiecutter.start_url }}',
-        ]
-    )
+await crawler.run(
+    [
+        '{{ cookiecutter.start_url }}',
+    ]
+)
+# % endfilter
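The template change wraps both the crawler instantiation and the `crawler.run(...)` call in a single `indent` filter whose width depends on whether the body sits inside `async with Actor:`. The sketch below approximates that behavior with the standard library's `textwrap.indent`, which, like Jinja's `indent` filter with `first=True` and the default `blank=False`, indents the first line and leaves blank lines alone. The helper name `indent_block`, the sample body, and the `https://example.com` URL are assumptions made for illustration.

```python
import textwrap


def indent_block(text: str, width: int) -> str:
    """Indent every non-blank line, including the first, by `width` spaces."""
    return textwrap.indent(text, ' ' * width)


# Stand-in for the rendered instantiation block plus the run call.
body = "crawler = PlaywrightCrawler()\n\nawait crawler.run(['https://example.com'])"

plain = indent_block(body, 4)       # body directly inside `async def main()`
with_actor = indent_block(body, 8)  # body inside `async with Actor:`
```

Indenting the whole body in one place keeps the instantiation and the run call at the same depth in both variants.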
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# % extends 'main.py'
+
+# % block import
+from camoufox import AsyncNewBrowser
+from typing_extensions import override
+
+from crawlee._utils.context import ensure_context
+from crawlee.browsers import PlaywrightBrowserPlugin, PlaywrightBrowserController, BrowserPool
+from crawlee.playwright_crawler import PlaywrightCrawler
+# % endblock
+
+# % block pre_main
+class CamoufoxPlugin(PlaywrightBrowserPlugin):
+    """Example browser plugin that uses Camoufox Browser, but otherwise keeps the functionality of
+    PlaywrightBrowserPlugin."""
+
+    @ensure_context
+    @override
+    async def new_browser(self) -> PlaywrightBrowserController:
+        if not self._playwright:
+            raise RuntimeError('Playwright browser plugin is not initialized.')
+
+        return PlaywrightBrowserController(
+            browser=await AsyncNewBrowser(self._playwright, headless=True),
+            max_open_pages_per_browser=1,  # Increase if camoufox can handle it in your use case.
+            header_generator=None,  # This turns off crawlee's header generation. Camoufox has its own.
+        )
+# % endblock
+
+# % block instantiation
+crawler = PlaywrightCrawler(
+    max_requests_per_crawl=10,
+    request_handler=router,
+    browser_pool=BrowserPool(plugins=[CamoufoxPlugin()])
+)
+# % endblock
File renamed without changes.
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+from crawlee.playwright_crawler import PlaywrightCrawlingContext
+from crawlee.router import Router
+
+router = Router[PlaywrightCrawlingContext]()
+
+
+@router.default_handler
+async def default_handler(context: PlaywrightCrawlingContext) -> None:
+    """Default request handler."""
+    context.log.info(f'Processing {context.request.url} ...')
+    title = await context.page.query_selector('title')
+    await context.push_data(
+        {
+            'url': context.request.loaded_url,
+            'title': await title.inner_text() if title else None,
+        }
+    )
+
+    await context.enqueue_links()

src/crawlee/project_template/{{cookiecutter.project_name}}/Dockerfile

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@
 # You can also use any other image from Docker Hub.
 # % if cookiecutter.crawler_type == 'playwright'
 FROM apify/actor-python-playwright:3.13
+# % elif cookiecutter.crawler_type == 'playwright-camoufox'
+# Currently camoufox has issues installing on Python 3.13
+FROM apify/actor-python-playwright:3.12
 # % else
 FROM apify/actor-python:3.13
 # % endif

src/crawlee/project_template/{{cookiecutter.project_name}}/_pyproject.toml

Lines changed: 7 additions & 0 deletions
@@ -1,4 +1,8 @@
+# % if cookiecutter.crawler_type == 'playwright-camoufox'
+# % set extras = ['playwright']
+# % else
 # % set extras = [cookiecutter.crawler_type]
+# % endif
 # % if cookiecutter.http_client == 'curl-impersonate'
 # % do extras.append('curl-impersonate')
 # % endif
@@ -14,6 +18,9 @@ package-mode = false
 
 [tool.poetry.dependencies]
 python = "^3.9"
+# % if cookiecutter.crawler_type == 'playwright-camoufox'
+camoufox = {version = ">=0.4.5", extras = ["geoip"]}
+# % endif
 # % if cookiecutter.enable_apify_integration
 apify = "*"
 # % endif
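The extras logic in this template can be read as a small decision function. The sketch below restates it in plain Python, assuming the Camoufox choice is identified by the `playwright-camoufox` value listed in `cookiecutter.json`; the function name `select_extras` is hypothetical and does not exist in crawlee.

```python
def select_extras(crawler_type: str, http_client: str) -> list[str]:
    """Return the crawlee extras for a template choice (illustrative only)."""
    # The Camoufox template reuses the 'playwright' extra, as in the diff above.
    if crawler_type == 'playwright-camoufox':
        extras = ['playwright']
    else:
        extras = [crawler_type]
    # The curl-impersonate HTTP client adds its own extra.
    if http_client == 'curl-impersonate':
        extras.append('curl-impersonate')
    return extras
```

For example, `select_extras('playwright-camoufox', 'httpx')` yields only the `playwright` extra, and the `camoufox` package itself is added as a separate dependency by the template.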
