Releases · apify/crawlee-python
0.3.3
0.3.3 (2024-09-05)
🐛 Bug Fixes
- Deduplicate requests by unique key before submitting them to the queue (#499) (6a3e0e7) by @janbuchar
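As a rough illustration of deduplication by unique key, here is a minimal sketch in plain Python. The `Request` dataclass and `deduplicate` helper are simplified stand-ins for illustration, not Crawlee's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Simplified stand-in for a crawler request."""
    url: str
    unique_key: str


def deduplicate(requests: list) -> list:
    """Keep only the first request for each unique key, preserving order."""
    seen = set()
    result = []
    for request in requests:
        if request.unique_key not in seen:
            seen.add(request.unique_key)
            result.append(request)
    return result
```

Deduplicating before submission avoids redundant round trips to the request queue for requests that the queue would reject as duplicates anyway.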
0.3.2
0.3.2 (2024-09-04)
🐛 Bug Fixes
- Double incrementation of `item_count` (#443, closes #442) (cd9adf1) by @cadlagtrader
- Field alias in `BatchRequestsOperationResponse` (#485) (126a862) by @janbuchar
- JSON handling with Parsel (#490, closes #488) (ebf5755) by @janbuchar
0.3.1
0.3.0
0.3.0 (2024-08-27)
- See the upgrading guide.
🚀 Features
- Implement ParselCrawler that adds support for Parsel (#348, closes #335) (a3832e5) by @asymness
- Add support for filling a web form (#453, closes #305) (5a125b4) by @vdusek
🐛 Bug Fixes
- Remove indentation from statistics logging and print the data in tables (#322, closes #306) (359b515) by @TymeeK
- Remove redundant log, fix format (#408) (8d27e39) by @janbuchar
- Dequeue items from RequestQueue in the correct order (#411) (96fc33e) by @janbuchar
- Support relative URLs, and pass through values that are not URLs (#431, closes #417) (ccd8145) by @black7375
- Typo in `ProlongRequestLockResponse` (#458) (30ccc3a) by @janbuchar
- Add missing `__all__` to top-level `__init__.py` file (#463) (353a1ce) by @janbuchar
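The relative-URL fix above can be sketched with the standard library. The `resolve_link` helper below is hypothetical, shown only to illustrate the behavior of resolving relative hrefs against the page URL and skipping values that are not crawlable URLs:

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_link(base_url: str, href: str) -> Optional[str]:
    """Resolve a possibly relative href against the page URL.

    Returns None for values that are not crawlable URLs
    (e.g. 'mailto:' or 'javascript:' links) so callers can skip them.
    """
    absolute = urljoin(base_url, href)
    scheme = urlparse(absolute).scheme
    return absolute if scheme in ('http', 'https') else None
```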
0.2.1
0.2.0
0.1.2
0.1.2 (2024-07-30)
🐛 Bug Fixes
- Minor log fix (#341) (0688bf1)
- Also use error_handler for context pipeline errors (#331) (7a66445)
- Strip whitespace from href in enqueue_links (#346) (8a3174a)
- Warn instead of crashing when an empty dataset is being exported (#342) (22b95d1)
- Avoid GitHub rate limiting in project bootstrapping test (#364) (992f07f)
- Pass crawler configuration to storages (#375) (b2d3a52)
- Purge request queue on repeated crawler runs (#377) (7ad3d69)
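The `enqueue_links` whitespace fix can be illustrated with a small standard-library sketch; `normalize_href` is a hypothetical helper, not Crawlee's actual code:

```python
from urllib.parse import urljoin


def normalize_href(base_url: str, href: str) -> str:
    """Strip stray whitespace from an extracted href before resolving it.

    HTML in the wild often contains hrefs padded with spaces or newlines,
    which would otherwise produce broken absolute URLs.
    """
    return urljoin(base_url, href.strip())
```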
0.1.1
Features
- Support for proxy configuration in `PlaywrightCrawler`.
- Blocking detection in `PlaywrightCrawler`.
- Expose `crawler.log` to public.
Bug fixes
- Fix Pylance `reportPrivateImportUsage` errors by defining `__all__` in modules' `__init__.py`.
- Set HTTPX logging level to `WARNING` by default.
- Fix CLI behavior with existing project folders.
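Quieting the HTTPX logger amounts to standard Python logging configuration; a minimal sketch:

```python
import logging

# Quiet a chatty third-party logger without touching your own loggers.
# INFO-level messages from 'httpx' are suppressed, while warnings and
# errors still propagate as usual.
logging.getLogger('httpx').setLevel(logging.WARNING)
```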
0.1.0
- Crawlee is a web scraping and browser automation library.
- Launching Crawlee for Python blog post
Features
Why is Crawlee the preferred choice for web scraping and crawling?
Why use Crawlee instead of just a random HTTP library with an HTML parser?
- Unified interface for HTTP & headless browser crawling.
- Automatic parallel crawling based on available system resources.
- Written in Python with type hints - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
- Automatic retries on errors or when you’re getting blocked.
- Integrated proxy rotation and session management.
- Configurable request routing - direct URLs to the appropriate handlers.
- Persistent queue for URLs to crawl.
- Pluggable storage of both tabular data and files.
- Robust error handling.
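As a loose illustration of label-based request routing, here is a simplified sketch of the pattern (this is not Crawlee's actual `Router` API; the class and method names are hypothetical):

```python
from typing import Callable, Dict, Optional


class Router:
    """Minimal label-based router: map request labels to handler functions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}
        self._default: Optional[Callable[[str], str]] = None

    def handler(self, label: str):
        """Register a handler for requests carrying the given label."""
        def decorator(func):
            self._handlers[label] = func
            return func
        return decorator

    def default_handler(self, func):
        """Register the fallback handler for unlabeled requests."""
        self._default = func
        return func

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        handler = self._handlers.get(label, self._default) if label else self._default
        if handler is None:
            raise LookupError(f'No handler for label {label!r}')
        return handler(url)


router = Router()


@router.handler('DETAIL')
def detail(url: str) -> str:
    return f'detail page: {url}'


@router.default_handler
def default(url: str) -> str:
    return f'listing page: {url}'
```

Routing by label keeps per-page-type scraping logic in separate functions instead of one large conditional inside a single handler.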
Why use Crawlee rather than Scrapy?
- Crawlee has out-of-the-box support for headless browser crawling (Playwright).
- Crawlee has a minimalistic & elegant interface - Set up your scraper with fewer than 10 lines of code.
- Complete type hint coverage.
- Based on the standard asyncio library.
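A minimal sketch of the asyncio concurrency model this builds on, with simulated I/O in place of real networking (the `handle` and `crawl` functions are illustrative, not Crawlee APIs):

```python
import asyncio


async def handle(url: str, results: list) -> None:
    """Stand-in for a request handler; a real crawler would fetch and parse here."""
    await asyncio.sleep(0)  # yield to the event loop, simulating I/O
    results.append(url)


async def crawl(urls: list) -> list:
    """Process all pending requests concurrently on a single event loop."""
    results: list = []
    await asyncio.gather(*(handle(url, results) for url in urls))
    return results


urls = ['https://example.com/a', 'https://example.com/b']
crawled = asyncio.run(crawl(urls))
```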