Releases · apify/crawlee-python
0.3.3
0.3.3 (2024-09-05)
🐛 Bug Fixes
- Deduplicate requests by unique key before submitting them to the queue (#499) (6a3e0e7) by @janbuchar
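As a rough illustration of deduplication by unique key, here is a minimal sketch in plain Python. The `Request` dataclass and `deduplicate` helper are simplified stand-ins for illustration, not Crawlee's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Simplified stand-in for a crawler request."""
    url: str
    unique_key: str


def deduplicate(requests: list) -> list:
    """Keep only the first request for each unique key, preserving order."""
    seen = set()
    result = []
    for request in requests:
        if request.unique_key not in seen:
            seen.add(request.unique_key)
            result.append(request)
    return result
```

Deduplicating before submission avoids redundant round trips to the request queue for requests that the queue would reject as duplicates anyway.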
0.3.2
0.3.2 (2024-09-04)
🐛 Bug Fixes
- Double incrementation of `item_count` (#443, closes #442) (cd9adf1) by @cadlagtrader
- Field alias in `BatchRequestsOperationResponse` (#485) (126a862) by @janbuchar
- JSON handling with Parsel (#490, closes #488) (ebf5755) by @janbuchar
0.3.1
0.3.0
0.3.0 (2024-08-27)
- See the upgrading guide.
🚀 Features
- Implement ParselCrawler that adds support for Parsel (#348, closes #335) (a3832e5) by @asymness
- Add support for filling a web form (#453, closes #305) (5a125b4) by @vdusek
🐛 Bug Fixes
- Remove indentation from statistics logging and print the data in tables (#322, closes #306) (359b515) by @TymeeK
- Remove redundant log, fix format (#408) (8d27e39) by @janbuchar
- Dequeue items from RequestQueue in the correct order (#411) (96fc33e) by @janbuchar
- Support relative URLs, and pass through values that are not URLs (#431, closes #417) (ccd8145) by @black7375
- Typo in `ProlongRequestLockResponse` (#458) (30ccc3a) by @janbuchar
- Add missing `__all__` to top-level `__init__.py` file (#463) (353a1ce) by @janbuchar
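The relative-URL fix above can be sketched with the standard library. The `resolve_link` helper below is hypothetical, shown only to illustrate the behavior of resolving relative hrefs against the page URL and skipping values that are not crawlable URLs:

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_link(base_url: str, href: str) -> Optional[str]:
    """Resolve a possibly relative href against the page URL.

    Returns None for values that are not crawlable URLs
    (e.g. 'mailto:' or 'javascript:' links) so callers can skip them.
    """
    absolute = urljoin(base_url, href)
    scheme = urlparse(absolute).scheme
    return absolute if scheme in ('http', 'https') else None
```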
0.2.1
0.2.0
0.1.2
0.1.2 (2024-07-30)
🐛 Bug Fixes
- Minor log fix (#341) (0688bf1)
- Also use error_handler for context pipeline errors (#331) (7a66445)
- Strip whitespace from href in enqueue_links (#346) (8a3174a)
- Warn instead of crashing when an empty dataset is being exported (#342) (22b95d1)
- Avoid GitHub rate limiting in project bootstrapping test (#364) (992f07f)
- Pass crawler configuration to storages (#375) (b2d3a52)
- Purge request queue on repeated crawler runs (#377) (7ad3d69)
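The `enqueue_links` whitespace fix can be illustrated with a small standard-library sketch; `normalize_href` is a hypothetical helper, not Crawlee's actual code:

```python
from urllib.parse import urljoin


def normalize_href(base_url: str, href: str) -> str:
    """Strip stray whitespace from an extracted href before resolving it.

    HTML in the wild often contains hrefs padded with spaces or newlines,
    which would otherwise produce broken absolute URLs.
    """
    return urljoin(base_url, href.strip())
```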
0.1.1
Features
- Support for proxy configuration in `PlaywrightCrawler`.
- Blocking detection in `PlaywrightCrawler`.
- Expose `crawler.log` to public.
Bug fixes
- Fix Pylance `reportPrivateImportUsage` errors by defining `__all__` in modules' `__init__.py`.
- Set HTTPX logging level to `WARNING` by default.
- Fix CLI behavior with existing project folders.
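Quieting the HTTPX logger amounts to standard Python logging configuration; a minimal sketch:

```python
import logging

# Quiet a chatty third-party logger without touching your own loggers.
# INFO-level messages from 'httpx' are suppressed, while warnings and
# errors still propagate as usual.
logging.getLogger('httpx').setLevel(logging.WARNING)
```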
0.1.0
- Crawlee is a web scraping and browser automation library.
- Launching Crawlee for Python blog post
Features
Why is Crawlee the preferred choice for web scraping and crawling?
Why use Crawlee instead of just a random HTTP library with an HTML parser?
- Unified interface for HTTP & headless browser crawling.
- Automatic parallel crawling based on available system resources.
- Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
- Automatic retries on errors or when you’re getting blocked.
- Integrated proxy rotation and session management.
- Configurable request routing - direct URLs to the appropriate handlers.
- Persistent queue for URLs to crawl.
- Pluggable storage of both tabular data and files.
- Robust error handling.
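As a loose illustration of label-based request routing, here is a simplified sketch of the pattern (this is not Crawlee's actual `Router` API; the class and method names are hypothetical):

```python
from typing import Callable, Dict, Optional


class Router:
    """Minimal label-based router: map request labels to handler functions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}
        self._default: Optional[Callable[[str], str]] = None

    def handler(self, label: str):
        """Register a handler for requests carrying the given label."""
        def decorator(func):
            self._handlers[label] = func
            return func
        return decorator

    def default_handler(self, func):
        """Register the fallback handler for unlabeled requests."""
        self._default = func
        return func

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        handler = self._handlers.get(label, self._default) if label else self._default
        if handler is None:
            raise LookupError(f'No handler for label {label!r}')
        return handler(url)


router = Router()


@router.handler('DETAIL')
def detail(url: str) -> str:
    return f'detail page: {url}'


@router.default_handler
def default(url: str) -> str:
    return f'listing page: {url}'
```

Routing by label keeps per-page-type scraping logic in separate functions instead of one large conditional inside a single handler.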
Why use Crawlee rather than Scrapy?
- Crawlee has out-of-the-box support for headless browser crawling (Playwright).
- Crawlee has a minimalistic & elegant interface - Set up your scraper with fewer than 10 lines of code.
- Complete type hint coverage.
- Based on the standard asyncio library.
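A minimal sketch of the asyncio concurrency model this builds on, with simulated I/O in place of real networking (the `handle` and `crawl` functions are illustrative, not Crawlee APIs):

```python
import asyncio


async def handle(url: str, results: list) -> None:
    """Stand-in for a request handler; a real crawler would fetch and parse here."""
    await asyncio.sleep(0)  # yield to the event loop, simulating I/O
    results.append(url)


async def crawl(urls: list) -> list:
    """Process all pending requests concurrently on a single event loop."""
    results: list = []
    await asyncio.gather(*(handle(url, results) for url in urls))
    return results


urls = ['https://example.com/a', 'https://example.com/b']
crawled = asyncio.run(crawl(urls))
```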