
v1.1.0: Web Crawling, Content Parsing & Browser Automation


@Idanvilenski released this 01 Sep 14:31

New Features

🕷️ Web Crawling

  • crawl() function for discovering and scraping multiple pages from websites
  • Advanced filtering with regex patterns for URL inclusion/exclusion
  • Configurable crawl depth and sitemap handling
  • Custom output schema support
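To illustrate the include/exclude regex filtering described above, here is a stdlib-only sketch. The helper name `filter_urls` and its parameters are hypothetical, not the SDK's actual crawl() signature; it only demonstrates the filtering concept.

```python
import re

def filter_urls(urls, include=None, exclude=None):
    """Keep URLs that match at least one include pattern (if given)
    and no exclude pattern. Illustrative only; the SDK's crawl()
    accepts similar regex filters internally."""
    kept = []
    for url in urls:
        # Drop URLs that match none of the include patterns
        if include and not any(re.search(p, url) for p in include):
            continue
        # Drop URLs that match any exclude pattern
        if exclude and any(re.search(p, url) for p in exclude):
            continue
        kept.append(url)
    return kept

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/login",
    "https://example.com/blog/post-2",
]
filtered = filter_urls(urls, include=[r"/blog/"], exclude=[r"/login"])
print(filtered)
```

The same pattern pairs naturally with a depth limit: apply the filters to each page's discovered links before enqueueing them for the next crawl level.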

🔍 Content Parsing

  • parse_content() function for extracting useful data from API responses
  • Support for text extraction, link discovery, and image URL collection
  • Handles both JSON responses and raw HTML content
  • Structured data extraction from various content formats
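A stdlib-only sketch of the kind of extraction parse_content() performs on raw HTML: collecting text, link hrefs, and image srcs. The class below is illustrative; the SDK's actual output schema may differ.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect visible text, link targets, and image URLs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self.images = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Skip whitespace-only runs between tags
        if data.strip():
            self.text.append(data.strip())

parser = ContentExtractor()
parser.feed('<p>Hello</p><a href="/docs">Docs</a><img src="/logo.png">')
print(parser.text, parser.links, parser.images)
```

For JSON API responses the same idea applies without an HTML parser: walk the decoded structure and pull out the fields of interest.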

🌐 Browser Automation

  • connect_browser() function for Playwright/Selenium integration
  • WebSocket endpoint generation for scraping browser connections
  • Support for multiple browser automation tools (Playwright, Puppeteer, Selenium)
  • Seamless authentication with Bright Data's browser service
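The sketch below shows the general shape of what connect_browser() produces: an authenticated WebSocket endpoint a browser automation tool can attach to. The host, port, and credential format are placeholders, not the SDK's actual values.

```python
from urllib.parse import quote

def build_browser_endpoint(username: str, password: str,
                           host: str = "brd.superproxy.io",
                           port: int = 9222) -> str:
    """Assemble a wss:// endpoint with embedded credentials.
    Placeholder host/port; the SDK generates the real endpoint."""
    # URL-encode credentials so special characters survive in the URL
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"wss://{user}:{pwd}@{host}:{port}"

endpoint = build_browser_endpoint("customer-zone", "p@ss/word")
print(endpoint)

# A Playwright client would then attach to the remote browser over CDP:
#   browser = playwright.chromium.connect_over_cdp(endpoint)
```

Puppeteer (`puppeteer.connect({browserWSEndpoint: ...})`) and Selenium consume the same endpoint in their own connection APIs.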

Improvements

📡 Better Async Handling

  • Enhanced download_snapshot() with improved 202 status code handling
  • Friendly status messages instead of exceptions for pending snapshots
  • Better user experience for asynchronous data processing
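The new 202 behavior can be sketched as a polling loop: a pending snapshot prints a status message and retries instead of raising. The `fetch` callable and parameter names below are stand-ins for the SDK's internal HTTP call, not its real API.

```python
import time

def download_snapshot(fetch, snapshot_id, retries=5, delay=0.01):
    """Poll until a snapshot is ready. HTTP 202 means the snapshot is
    still being built, so report status and retry rather than raise."""
    for attempt in range(retries):
        status, body = fetch(snapshot_id)
        if status == 200:
            return body
        if status == 202:
            print(f"Snapshot {snapshot_id} not ready yet (attempt {attempt + 1})")
            time.sleep(delay)
            continue
        raise RuntimeError(f"Unexpected status {status}")
    return None

# Simulated server: the snapshot becomes ready on the third poll
responses = iter([(202, None), (202, None), (200, {"rows": 3})])
result = download_snapshot(lambda _id: next(responses), "s_abc123")
print(result)
```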

🔧 Robust Error Handling

  • Fixed zone creation error handling with proper exception propagation
  • Added retry logic for network failures and temporary errors
  • Improved zone management reliability
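The retry behavior described above can be sketched with exponential backoff over transient errors. The wrapper name, attempt count, and exception types are assumptions for illustration; the SDK's internals may differ.

```python
import time

def with_retries(call, attempts=3, base_delay=0.01,
                 retry_on=(ConnectionError, TimeoutError)):
    """Retry a call on transient network errors with exponential backoff;
    re-raise once the final attempt fails."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulate a flaky zone-creation call that succeeds on the third try
calls = {"n": 0}
def flaky_create_zone():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network failure")
    return "zone-created"

print(with_retries(flaky_create_zone))
```

Non-transient errors (anything outside `retry_on`) propagate immediately, which matches the fix above: genuine zone-creation failures raise proper exceptions instead of being swallowed.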

🐍 Python Support Update

  • Updated to support Python 3.8+ (removed Python 3.7)
  • Updated CI/CD pipeline for modern Python versions
  • Added BeautifulSoup4 as core dependency

Dependencies

  • Added: beautifulsoup4>=4.9.0 for content parsing
  • Updated: Python compatibility to >=3.8

Examples

New example files demonstrate the enhanced functionality:

  • examples/crawl_example.py - Web crawling usage
  • examples/browser_connection_example.py - Browser automation setup
  • examples/parse_content_example.py - Content parsing workflows