
v1.1.0: Web Crawling, Content Parsing & Browser Automation


@Idanvilenski released this 01 Sep 14:31

New Features

🕷️ Web Crawling

  • crawl() function for discovering and scraping multiple pages from websites
  • Advanced filtering with regex patterns for URL inclusion/exclusion
  • Configurable crawl depth and sitemap handling
  • Custom output schema support
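To illustrate the include/exclude regex filtering described above, here is a stdlib-only sketch. The helper name `filter_urls` and its parameters are hypothetical, not the SDK's actual crawl() signature; it only demonstrates the filtering concept.

```python
import re

def filter_urls(urls, include=None, exclude=None):
    """Keep URLs that match at least one include pattern (if given)
    and no exclude pattern. Illustrative only; the SDK's crawl()
    accepts similar regex filters internally."""
    kept = []
    for url in urls:
        # Drop URLs that match none of the include patterns
        if include and not any(re.search(p, url) for p in include):
            continue
        # Drop URLs that match any exclude pattern
        if exclude and any(re.search(p, url) for p in exclude):
            continue
        kept.append(url)
    return kept

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/login",
    "https://example.com/blog/post-2",
]
filtered = filter_urls(urls, include=[r"/blog/"], exclude=[r"/login"])
print(filtered)
```

The same pattern pairs naturally with a depth limit: apply the filters to each page's discovered links before enqueueing them for the next crawl level.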

🔍 Content Parsing

  • parse_content() function for extracting useful data from API responses
  • Support for text extraction, link discovery, and image URL collection
  • Handles both JSON responses and raw HTML content
  • Structured data extraction from various content formats
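A stdlib-only sketch of the kind of extraction parse_content() performs on raw HTML: collecting text, link hrefs, and image srcs. The class below is illustrative; the SDK's actual output schema may differ.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect visible text, link targets, and image URLs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self.images = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Skip whitespace-only runs between tags
        if data.strip():
            self.text.append(data.strip())

parser = ContentExtractor()
parser.feed('<p>Hello</p><a href="/docs">Docs</a><img src="/logo.png">')
print(parser.text, parser.links, parser.images)
```

For JSON API responses the same idea applies without an HTML parser: walk the decoded structure and pull out the fields of interest.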

🌐 Browser Automation

  • connect_browser() function for Playwright/Selenium integration
  • WebSocket endpoint generation for scraping browser connections
  • Support for multiple browser automation tools (Playwright, Puppeteer, Selenium)
  • Seamless authentication with Bright Data's browser service
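The sketch below shows the general shape of what connect_browser() produces: an authenticated WebSocket endpoint a browser automation tool can attach to. The host, port, and credential format are placeholders, not the SDK's actual values.

```python
from urllib.parse import quote

def build_browser_endpoint(username: str, password: str,
                           host: str = "brd.superproxy.io",
                           port: int = 9222) -> str:
    """Assemble a wss:// endpoint with embedded credentials.
    Placeholder host/port; the SDK generates the real endpoint."""
    # URL-encode credentials so special characters survive in the URL
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"wss://{user}:{pwd}@{host}:{port}"

endpoint = build_browser_endpoint("customer-zone", "p@ss/word")
print(endpoint)

# A Playwright client would then attach to the remote browser over CDP:
#   browser = playwright.chromium.connect_over_cdp(endpoint)
```

Puppeteer (`puppeteer.connect({browserWSEndpoint: ...})`) and Selenium consume the same endpoint in their own connection APIs.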

Improvements

📡 Better Async Handling

  • Enhanced download_snapshot() with improved 202 status code handling
  • Friendly status messages instead of exceptions for pending snapshots
  • Better user experience for asynchronous data processing
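The new 202 behavior can be sketched as a polling loop: a pending snapshot prints a status message and retries instead of raising. The `fetch` callable and parameter names below are stand-ins for the SDK's internal HTTP call, not its real API.

```python
import time

def download_snapshot(fetch, snapshot_id, retries=5, delay=0.01):
    """Poll until a snapshot is ready. HTTP 202 means the snapshot is
    still being built, so report status and retry rather than raise."""
    for attempt in range(retries):
        status, body = fetch(snapshot_id)
        if status == 200:
            return body
        if status == 202:
            print(f"Snapshot {snapshot_id} not ready yet (attempt {attempt + 1})")
            time.sleep(delay)
            continue
        raise RuntimeError(f"Unexpected status {status}")
    return None

# Simulated server: the snapshot becomes ready on the third poll
responses = iter([(202, None), (202, None), (200, {"rows": 3})])
result = download_snapshot(lambda _id: next(responses), "s_abc123")
print(result)
```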

🔧 Robust Error Handling

  • Fixed zone creation error handling with proper exception propagation
  • Added retry logic for network failures and temporary errors
  • Improved zone management reliability
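The retry behavior described above can be sketched with exponential backoff over transient errors. The wrapper name, attempt count, and exception types are assumptions for illustration; the SDK's internals may differ.

```python
import time

def with_retries(call, attempts=3, base_delay=0.01,
                 retry_on=(ConnectionError, TimeoutError)):
    """Retry a call on transient network errors with exponential backoff;
    re-raise once the final attempt fails."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulate a flaky zone-creation call that succeeds on the third try
calls = {"n": 0}
def flaky_create_zone():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network failure")
    return "zone-created"

print(with_retries(flaky_create_zone))
```

Non-transient errors (anything outside `retry_on`) propagate immediately, which matches the fix above: genuine zone-creation failures raise proper exceptions instead of being swallowed.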

🐍 Python Support Update

  • Updated to support Python 3.8+ (removed Python 3.7)
  • Updated CI/CD pipeline for modern Python versions
  • Added BeautifulSoup4 as core dependency

Dependencies

  • Added: beautifulsoup4>=4.9.0 for content parsing
  • Updated: Python compatibility to >=3.8

Examples

New example files demonstrate the enhanced functionality:

  • examples/crawl_example.py - Web crawling usage
  • examples/browser_connection_example.py - Browser automation setup
  • examples/parse_content_example.py - Content parsing workflows