Commit b2b4724

Merge remote-tracking branch 'origin/master' into crawler-persistance
2 parents: 52daca0 + af237da


50 files changed: +1860 -1156

.github/workflows/build_and_deploy_docs.yaml

Lines changed: 2 additions & 2 deletions

@@ -30,7 +30,7 @@ jobs:
           ref: ${{ github.event_name == 'workflow_call' && inputs.ref || github.ref }}

       - name: Set up Node
-        uses: actions/setup-node@v5
+        uses: actions/setup-node@v6
         with:
           node-version: ${{ env.NODE_VERSION }}

@@ -40,7 +40,7 @@ jobs:
           python-version: ${{ env.PYTHON_VERSION }}

       - name: Set up uv package manager
-        uses: astral-sh/setup-uv@v6
+        uses: astral-sh/setup-uv@v7
         with:
           python-version: ${{ env.PYTHON_VERSION }}

.github/workflows/templates_e2e_tests.yaml

Lines changed: 2 additions & 2 deletions

@@ -27,7 +27,7 @@ jobs:
         uses: actions/checkout@v5

       - name: Setup node
-        uses: actions/setup-node@v5
+        uses: actions/setup-node@v6
         with:
           node-version: ${{ env.NODE_VERSION }}

@@ -44,7 +44,7 @@ jobs:
         run: pipx install poetry

       - name: Set up uv package manager
-        uses: astral-sh/setup-uv@v6
+        uses: astral-sh/setup-uv@v7
         with:
           python-version: ${{ env.PYTHON_VERSION }}

CHANGELOG.md

Lines changed: 34 additions & 2 deletions

@@ -3,17 +3,39 @@
 All notable changes to this project will be documented in this file.

 <!-- git-cliff-unreleased-start -->
-## 1.0.1 - **not yet released**
+## 1.0.3 - **not yet released**
+
+### 🐛 Bug Fixes
+
+- Add support for Pydantic v2.12 ([#1471](https://github.com/apify/crawlee-python/pull/1471)) ([35c1108](https://github.com/apify/crawlee-python/commit/35c110878c2f445a2866be2522ea8703e9b371dd)) by [@Mantisus](https://github.com/Mantisus), closes [#1464](https://github.com/apify/crawlee-python/issues/1464)
+- Fix database version warning message ([#1485](https://github.com/apify/crawlee-python/pull/1485)) ([18a545e](https://github.com/apify/crawlee-python/commit/18a545ee8add92e844acd0068f9cb8580a82e1c9)) by [@Mantisus](https://github.com/Mantisus)
+- Fix `reclaim_request` in `SqlRequestQueueClient` to correctly update the request state ([#1486](https://github.com/apify/crawlee-python/pull/1486)) ([1502469](https://github.com/apify/crawlee-python/commit/150246957f8f7f1ceb77bb77e3a02a903c50cae1)) by [@Mantisus](https://github.com/Mantisus), closes [#1484](https://github.com/apify/crawlee-python/issues/1484)
+- Fix `KeyValueStore.auto_saved_value` failing in some scenarios ([#1438](https://github.com/apify/crawlee-python/pull/1438)) ([b35dee7](https://github.com/apify/crawlee-python/commit/b35dee78180e57161b826641d45a61b8d8f6ef51)) by [@Pijukatel](https://github.com/Pijukatel), closes [#1354](https://github.com/apify/crawlee-python/issues/1354)
+
+
+<!-- git-cliff-unreleased-end -->
+## [1.0.2](https://github.com/apify/crawlee-python/releases/tag/v1.0.2) (2025-10-08)
+
+### 🐛 Bug Fixes
+
+- Use Self type in the open() method of storage clients ([#1462](https://github.com/apify/crawlee-python/pull/1462)) ([4ec6f6c](https://github.com/apify/crawlee-python/commit/4ec6f6c08f81632197f602ff99151338b3eba6e7)) by [@janbuchar](https://github.com/janbuchar)
+- Add storages name validation ([#1457](https://github.com/apify/crawlee-python/pull/1457)) ([84de11a](https://github.com/apify/crawlee-python/commit/84de11a3a603503076f5b7df487c9abab68a9015)) by [@Mantisus](https://github.com/Mantisus), closes [#1434](https://github.com/apify/crawlee-python/issues/1434)
+- Pin pydantic version to <2.12.0 to avoid compatibility issues ([#1467](https://github.com/apify/crawlee-python/pull/1467)) ([f11b86f](https://github.com/apify/crawlee-python/commit/f11b86f7ed57f98e83dc1b52f15f2017a919bf59)) by [@vdusek](https://github.com/vdusek)
+
+
+## [1.0.1](https://github.com/apify/crawlee-python/releases/tag/v1.0.1) (2025-10-06)

 ### 🐛 Bug Fixes

 - Fix memory leak in `PlaywrightCrawler` on browser context creation ([#1446](https://github.com/apify/crawlee-python/pull/1446)) ([bb181e5](https://github.com/apify/crawlee-python/commit/bb181e58d8070fba38e62d6e57fe981a00e5f035)) by [@Pijukatel](https://github.com/Pijukatel), closes [#1443](https://github.com/apify/crawlee-python/issues/1443)
 - Update templates to handle optional httpx client ([#1440](https://github.com/apify/crawlee-python/pull/1440)) ([c087efd](https://github.com/apify/crawlee-python/commit/c087efd39baedf46ca3e5cae1ddc1acd6396e6c1)) by [@Pijukatel](https://github.com/Pijukatel)


-<!-- git-cliff-unreleased-end -->
 ## [1.0.0](https://github.com/apify/crawlee-python/releases/tag/v1.0.0) (2025-09-29)

+- Check out the [Release blog post](https://crawlee.dev/blog/crawlee-for-python-v1) for more details.
+- Check out the [Upgrading guide](https://crawlee.dev/python/docs/upgrading/upgrading-to-v1) to ensure a smooth update.
+
 ### 🚀 Features

 - Add utility for load and parse Sitemap and `SitemapRequestLoader` ([#1169](https://github.com/apify/crawlee-python/pull/1169)) ([66599f8](https://github.com/apify/crawlee-python/commit/66599f8d085f3a8622e130019b6fdce2325737de)) by [@Mantisus](https://github.com/Mantisus), closes [#1161](https://github.com/apify/crawlee-python/issues/1161)

@@ -196,6 +218,9 @@ All notable changes to this project will be documented in this file.

 ## [0.6.0](https://github.com/apify/crawlee-python/releases/tag/v0.6.0) (2025-03-03)

+- Check out the [Release blog post](https://crawlee.dev/blog/crawlee-for-python-v06) for more details.
+- Check out the [Upgrading guide](https://crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v06) to ensure a smooth update.
+
 ### 🚀 Features

 - Integrate browserforge fingerprints ([#829](https://github.com/apify/crawlee-python/pull/829)) ([2b156b4](https://github.com/apify/crawlee-python/commit/2b156b4ba688f9111195422e6058dff30eb1f782)) by [@Pijukatel](https://github.com/Pijukatel), closes [#549](https://github.com/apify/crawlee-python/issues/549)

@@ -276,6 +301,9 @@ All notable changes to this project will be documented in this file.

 ## [0.5.0](https://github.com/apify/crawlee-python/releases/tag/v0.5.0) (2025-01-02)

+- Check out the [Release blog post](https://crawlee.dev/blog/crawlee-for-python-v05) for more details.
+- Check out the [Upgrading guide](https://crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v05) to ensure a smooth update.
+
 ### 🚀 Features

 - Add possibility to use None as no proxy in tiered proxies ([#760](https://github.com/apify/crawlee-python/pull/760)) ([0fbd017](https://github.com/apify/crawlee-python/commit/0fbd01723b9fe2e3410e0f358cab2f22848b08d0)) by [@Pijukatel](https://github.com/Pijukatel), closes [#687](https://github.com/apify/crawlee-python/issues/687)

@@ -367,6 +395,8 @@ All notable changes to this project will be documented in this file.

 ## [0.4.0](https://github.com/apify/crawlee-python/releases/tag/v0.4.0) (2024-11-01)

+- Check out the [Upgrading guide](https://crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v04) to ensure a smooth update.
+
 ### 🚀 Features

 - [**breaking**] Add headers in unique key computation ([#609](https://github.com/apify/crawlee-python/pull/609)) ([6c4746f](https://github.com/apify/crawlee-python/commit/6c4746fa8ff86952a812b32a1d70dc910e76b43e)) by [@Prathamesh010](https://github.com/Prathamesh010), closes [#548](https://github.com/apify/crawlee-python/issues/548)

@@ -476,6 +506,8 @@ All notable changes to this project will be documented in this file.

 ## [0.3.0](https://github.com/apify/crawlee-python/releases/tag/v0.3.0) (2024-08-27)

+- Check out the [Upgrading guide](https://crawlee.dev/python/docs/upgrading/upgrading-to-v0x#upgrading-to-v03) to ensure a smooth update.
+
 ### 🚀 Features

 - Implement ParselCrawler that adds support for Parsel ([#348](https://github.com/apify/crawlee-python/pull/348)) ([a3832e5](https://github.com/apify/crawlee-python/commit/a3832e527f022f32cce4a80055da3b7967b74522)) by [@asymness](https://github.com/asymness), closes [#335](https://github.com/apify/crawlee-python/issues/335)
code_examples/using_browser_profiles_chrome.py

Lines changed: 56 additions & 0 deletions

@@ -0,0 +1,56 @@
+import asyncio
+import shutil
+from pathlib import Path
+from tempfile import TemporaryDirectory
+
+from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
+
+# Profile name to use (usually 'Default' for single profile setups)
+PROFILE_NAME = 'Default'
+
+# Paths to Chrome profiles in your system (example for Windows)
+# Use `chrome://version/` to find your profile path
+PROFILE_PATH = Path(Path.home(), 'AppData', 'Local', 'Google', 'Chrome', 'User Data')
+
+
+async def main() -> None:
+    # Create a temporary folder to copy the profile to
+    with TemporaryDirectory(prefix='crawlee-') as tmpdirname:
+        tmp_profile_dir = Path(tmpdirname)
+
+        # Copy the profile to a temporary folder
+        shutil.copytree(
+            PROFILE_PATH / PROFILE_NAME,
+            tmp_profile_dir / PROFILE_NAME,
+            dirs_exist_ok=True,
+        )
+
+        crawler = PlaywrightCrawler(
+            headless=False,
+            # Use chromium for Chrome compatibility
+            browser_type='chromium',
+            # Disable fingerprints to preserve profile identity
+            fingerprint_generator=None,
+            # Set user data directory to temp folder
+            user_data_dir=tmp_profile_dir,
+            browser_launch_options={
+                # Use installed Chrome browser
+                'channel': 'chrome',
+                # Slow down actions to mimic human behavior
+                'slow_mo': 200,
+                'args': [
+                    # Use the specified profile
+                    f'--profile-directory={PROFILE_NAME}',
+                ],
+            },
+        )
+
+        @crawler.router.default_handler
+        async def default_handler(context: PlaywrightCrawlingContext) -> None:
+            context.log.info(f'Visiting {context.request.url}')
+
+        await crawler.run(['https://crawlee.dev/'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
code_examples/using_browser_profiles_firefox.py

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+import asyncio
+from pathlib import Path
+
+from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
+
+# Replace this with your actual Firefox profile name
+# Find it at about:profiles in Firefox
+PROFILE_NAME = 'your-profile-name-here'
+
+# Paths to Firefox profiles in your system (example for Windows)
+# Use `about:profiles` to find your profile path
+PROFILE_PATH = Path(
+    Path.home(), 'AppData', 'Roaming', 'Mozilla', 'Firefox', 'Profiles', PROFILE_NAME
+)
+
+
+async def main() -> None:
+    crawler = PlaywrightCrawler(
+        # Use Firefox browser type
+        browser_type='firefox',
+        # Disable fingerprints to use the profile as is
+        fingerprint_generator=None,
+        headless=False,
+        # Path to your Firefox profile
+        user_data_dir=PROFILE_PATH,
+        browser_launch_options={
+            'args': [
+                # Required to avoid version conflicts
+                '--allow-downgrade'
+            ]
+        },
+    )
+
+    @crawler.router.default_handler
+    async def default_handler(context: PlaywrightCrawlingContext) -> None:
+        context.log.info(f'Visiting {context.request.url}')
+
+    await crawler.run(['https://crawlee.dev/'])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+---
+id: using_browser_profile
+title: Using browser profile
+---
+
+import ApiLink from '@site/src/components/ApiLink';
+
+import CodeBlock from '@theme/CodeBlock';
+
+import ChromeProfileExample from '!!raw-loader!./code_examples/using_browser_profiles_chrome.py';
+import FirefoxProfileExample from '!!raw-loader!./code_examples/using_browser_profiles_firefox.py';
+
+This example demonstrates how to run <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> using your local browser profile from [Chrome](https://www.google.com/intl/us/chrome/) or [Firefox](https://www.firefox.com/).
+
+Using browser profiles allows you to leverage existing login sessions, saved passwords, bookmarks, and other personalized browser data during crawling. This can be particularly useful for testing scenarios or when you need to access content that requires authentication.
+
+## Chrome browser
+
+To run <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> with your Chrome profile, you need to know the path to your profile files. You can find this information by entering `chrome://version/` as a URL in your Chrome browser. If you have multiple profiles, pay attention to the profile name - if you only have one profile, it's always `Default`.
+
+You also need to use the [`channel`](https://playwright.dev/python/docs/api/class-browsertype#browser-type-launch-option-channel) parameter in `browser_launch_options` to use the Chrome browser installed on your system instead of Playwright's Chromium.
+
+:::warning Profile access limitation
+Due to [Chrome's security policies](https://developer.chrome.com/blog/remote-debugging-port), automation cannot use your main browsing profile directly. The example copies your profile to a temporary location as a workaround.
+:::
+
+Make sure you don't have any running Chrome browser processes before running this code:
+
+<CodeBlock className="language-python" language="python">
+    {ChromeProfileExample}
+</CodeBlock>
+
+## Firefox browser
+
+To find the path to your Firefox profile, enter `about:profiles` as a URL in your Firefox browser. Unlike Chrome, you can use your standard profile path directly without copying it first.
+
+Make sure you don't have any running Firefox browser processes before running this code:
+
+<CodeBlock className="language-python" language="python">
+    {FirefoxProfileExample}
+</CodeBlock>

docs/upgrading/upgrading_to_v1.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -333,3 +333,7 @@ async def main() -> None:

     await crawler.run(['https://crawlee.dev/'])
 ```
+
+### New storage naming restrictions
+
+We've introduced naming restrictions for storages to ensure compatibility with Apify Platform requirements and to prevent potential conflicts. Storage names may include only letters (a-z, A-Z), digits (0-9), and hyphens (-), with hyphens allowed only in the middle of the name (for example, `my-storage-1`).
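The documented rule is easy to check before opening a storage. Below is a minimal sketch of an equivalent client-side check; the `STORAGE_NAME_PATTERN` regex and helper are illustrative only, not crawlee's internal validator:

import re

# Letters, digits, and hyphens; hyphens only between other characters.
# Illustrative pattern mirroring the documented rule, not crawlee's own code.
STORAGE_NAME_PATTERN = re.compile(r'^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$')


def is_valid_storage_name(name: str) -> bool:
    """Return True if `name` satisfies the documented naming rule."""
    return bool(STORAGE_NAME_PATTERN.fullmatch(name))


assert is_valid_storage_name('my-storage-1')
assert not is_valid_storage_name('-my-storage')  # leading hyphen rejected
assert not is_valid_storage_name('my_storage')  # underscores rejected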

pyproject.toml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "crawlee"
-version = "1.0.1"
+version = "1.0.3"
 description = "Crawlee for Python"
 authors = [{ name = "Apify Technologies s.r.o.", email = "[email protected]" }]
 license = { file = "LICENSE" }

@@ -107,7 +107,7 @@ dev = [
     "pytest-timeout~=2.4.0",
     "pytest-xdist~=3.8.0",
     "pytest~=8.4.0",
-    "ruff~=0.13.0",
+    "ruff~=0.14.0",
     "setuptools", # setuptools are used by pytest, but not explicitly required
     "types-beautifulsoup4~=4.12.0.20240229",
     "types-cachetools~=6.2.0.20250827",

src/crawlee/_request.py

Lines changed: 31 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -185,33 +185,44 @@ class Request(BaseModel):
     method: HttpMethod = 'GET'
     """HTTP request method."""

-    headers: Annotated[HttpHeaders, Field(default_factory=HttpHeaders)] = HttpHeaders()
-    """HTTP request headers."""
-
     payload: Annotated[
         HttpPayload | None,
         BeforeValidator(lambda v: v.encode() if isinstance(v, str) else v),
         PlainSerializer(lambda v: v.decode() if isinstance(v, bytes) else v),
     ] = None
     """HTTP request payload."""

-    user_data: Annotated[
-        dict[str, JsonSerializable],  # Internally, the model contains `UserData`, this is just for convenience
-        Field(alias='userData', default_factory=lambda: UserData()),
-        PlainValidator(user_data_adapter.validate_python),
-        PlainSerializer(
-            lambda instance: user_data_adapter.dump_python(
-                instance,
-                by_alias=True,
-                exclude_none=True,
-                exclude_unset=True,
-                exclude_defaults=True,
-            )
-        ),
-    ] = {}
-    """Custom user data assigned to the request. Use this to save any request related data to the
-    request's scope, keeping them accessible on retries, failures etc.
-    """
+    # Workaround for pydantic 2.12 and mypy type checking issue for Annotated with default_factory
+    if TYPE_CHECKING:
+        headers: HttpHeaders = HttpHeaders()
+        """HTTP request headers."""
+
+        user_data: dict[str, JsonSerializable] = {}
+        """Custom user data assigned to the request. Use this to save any request related data to the
+        request's scope, keeping them accessible on retries, failures etc.
+        """
+
+    else:
+        headers: Annotated[HttpHeaders, Field(default_factory=HttpHeaders)]
+        """HTTP request headers."""
+
+        user_data: Annotated[
+            dict[str, JsonSerializable],  # Internally, the model contains `UserData`, this is just for convenience
+            Field(alias='userData', default_factory=lambda: UserData()),
+            PlainValidator(user_data_adapter.validate_python),
+            PlainSerializer(
+                lambda instance: user_data_adapter.dump_python(
+                    instance,
+                    by_alias=True,
+                    exclude_none=True,
+                    exclude_unset=True,
+                    exclude_defaults=True,
+                )
+            ),
+        ]
+        """Custom user data assigned to the request. Use this to save any request related data to the
+        request's scope, keeping them accessible on retries, failures etc.
+        """

     retry_count: Annotated[int, Field(alias='retryCount')] = 0
     """Number of times the request has been retried."""

src/crawlee/_service_locator.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ def __init__(
     def get_configuration(self) -> Configuration:
         """Get the configuration."""
         if self._configuration is None:
-            logger.warning('No configuration set, implicitly creating and using default Configuration.')
+            logger.debug('No configuration set, implicitly creating and using default Configuration.')
             self._configuration = Configuration()

         return self._configuration

@@ -63,9 +63,9 @@ def set_configuration(self, configuration: Configuration) -> None:
     def get_event_manager(self) -> EventManager:
         """Get the event manager."""
         if self._event_manager is None:
-            logger.warning('No event manager set, implicitly creating and using default LocalEventManager.')
+            logger.debug('No event manager set, implicitly creating and using default LocalEventManager.')
             if self._configuration is None:
-                logger.warning(
+                logger.debug(
                     'Implicit creation of event manager will implicitly set configuration as side effect. '
                     'It is advised to explicitly first set the configuration instead.'
                 )

@@ -93,7 +93,7 @@ def set_event_manager(self, event_manager: EventManager) -> None:
     def get_storage_client(self) -> StorageClient:
         """Get the storage client."""
         if self._storage_client is None:
-            logger.warning('No storage client set, implicitly creating and using default FileSystemStorageClient.')
+            logger.debug('No storage client set, implicitly creating and using default FileSystemStorageClient.')
             if self._configuration is None:
                 logger.warning(
                     'Implicit creation of storage client will implicitly set configuration as side effect. '
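With these messages downgraded to `DEBUG`, they no longer appear under a default logging setup. If you still want to see them, raise verbosity through Python's standard `logging` module; a sketch, where the `'crawlee'` logger name is an assumption based on the package name:

import logging

# Show all DEBUG-level output, including the implicit-creation notices
# that this commit downgraded from WARNING.
logging.basicConfig(level=logging.DEBUG)

# Or narrow it to the library's logger only (name assumed from the package).
logging.getLogger('crawlee').setLevel(logging.DEBUG)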
