ToufiqQureshi/chuscraper

🕷️ Chuscraper

Stealth-focused Web & Mobile automation framework powered by CDP and ADB
You Only Scrape Once: data extraction made smarter, faster, and more resilient.


🚀 What is Chuscraper?

Chuscraper is a Python web and mobile scraping library that uses CDP (Chrome DevTools Protocol) for web pages and ADB (Android Debug Bridge) for mobile apps. It extracts structured data, interacts with pages and screens, and automates workflows, with a heavy focus on anti-detection and stealth.

It configures standard Chromium instances to evade bot verification systems such as Cloudflare, Akamai, and DataDome, and it can also control native Android apps for data extraction.


🌟 Key Features

🕷️ Universal Crawler (New!)

Turn entire websites into LLM-ready data with a single command.

  • Sitemap & BFS: Supports both sitemap-based (fast) and BFS (deep) crawling strategies.
  • Streaming: Stream extracted data directly to your database as pages are crawled, so results do not have to be held in memory.
  • Multi-Format: Extract Markdown, HTML, and Text simultaneously.
  • Robust: Handles redirects, SPA link discovery, and concurrency automatically.
  • AI Extraction: Integrate OpenAI/LLMs to extract structured JSON data from any page using natural language prompts.
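
The BFS strategy and the streaming behaviour described above can be sketched in plain Python. This is a minimal illustration and not Chuscraper's implementation: `bfs_crawl`, `fetch_links`, and the in-memory `SITE` map are hypothetical names that stand in for real page fetches.

```python
from collections import deque
from typing import Callable, Iterable, Iterator
from urllib.parse import urljoin, urlparse


def bfs_crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield same-host URLs in breadth-first order, one at a time."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    yielded = 0
    while queue and yielded < max_pages:
        url = queue.popleft()
        yield url  # streaming: the caller can persist each page right away
        yielded += 1
        for href in fetch_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)


# Hypothetical in-memory "site" standing in for real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.test/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": ["/c"],
    "https://example.com/c": [],
}

order = list(bfs_crawl("https://example.com/", lambda u: SITE.get(u, [])))
print(order)
```

Because `bfs_crawl` is a generator, a caller can write each result to a database inside a `for` loop instead of collecting the whole crawl first.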

📱 Native Mobile App Scraping

Chuscraper now supports scraping native Android apps using ADB:

  • UI Automation: Tap, swipe, and type on any connected Android device (Real or Emulator).
  • XML Dumping: Extract the full UI hierarchy as XML to find elements by text, resource-id, or content-desc.
  • Background Execution: Run scripts without touching the device.
  • Zero-Setup: Just enable USB Debugging and connect. No Appium server required.
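
To make the XML dumping idea concrete, the sketch below parses a UI hierarchy in the format written by Android's `uiautomator dump` and looks up nodes by `text`, `resource-id`, or `content-desc`. The sample dump and the helper names `find_nodes` and `center` are made up for illustration; Chuscraper's own API may expose this differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down dump in the uiautomator hierarchy format.
DUMP = """<hierarchy rotation="0">
  <node index="0" text="" resource-id="" class="android.widget.FrameLayout" content-desc="" bounds="[0,0][1080,1920]">
    <node index="0" text="Enter destination" resource-id="com.hotel.app:id/city" class="android.widget.EditText" content-desc="" bounds="[40,200][1040,320]" />
    <node index="1" text="Search" resource-id="com.hotel.app:id/search_btn" class="android.widget.Button" content-desc="Search hotels" bounds="[40,360][1040,480]" />
  </node>
</hierarchy>"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def find_nodes(xml_text, **attrs):
    """Return nodes whose attributes equal every given value.

    Keyword names use underscores (resource_id) and are mapped to the
    XML attribute names (resource-id).
    """
    root = ET.fromstring(xml_text)
    wanted = {key.replace("_", "-"): value for key, value in attrs.items()}
    return [
        node
        for node in root.iter("node")
        if all(node.get(key) == value for key, value in wanted.items())
    ]


def center(node):
    """Return the midpoint of a node's bounds, a natural tap target."""
    x1, y1, x2, y2 = map(int, BOUNDS_RE.fullmatch(node.get("bounds")).groups())
    return ((x1 + x2) // 2, (y1 + y2) // 2)


button = find_nodes(DUMP, resource_id="com.hotel.app:id/search_btn")[0]
tap_point = center(button)
print(tap_point)
```

The resulting coordinates are what a tap command would need in order to hit the middle of the element.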

🕵️‍♂️ Dynamic Stealth & Fingerprinting (New!)

Chuscraper now includes an advanced Auto-Update and Fingerprint Rotation engine:

  • Auto-Update Chrome Version: Automatically detects your installed Chrome version and updates the User-Agent to match. No manual updates required!
  • Fingerprint Rotation: Randomizes hardware fingerprints (RAM, CPU, Screen Resolution) per session while strictly adhering to your host OS (Windows, macOS, Linux) to prevent OS mismatch detection.
  • Client Hints Sync: Automatically patches navigator.userAgentData to match the User-Agent string.
  • Advanced Stealth Patches: 6 core JS bypasses for WebDriver, Chrome Runtime, Canvas/WebGL noise, and iFrame leaks.
  • Modern Timezones: Automatically syncs browser timezone with IP location using modern IANA names.
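
The rotation rule above (randomize hardware values, but never contradict the host OS) can be illustrated with a short sketch. The value pools, the `make_fingerprint` helper, and the returned dictionary layout are assumptions for illustration only and do not reflect Chuscraper's internal data structures.

```python
import platform
import random

# Illustrative pools; realistic distributions would need measured data.
# Chrome caps navigator.deviceMemory at 8, so larger values would stand out.
DEVICE_MEMORY_GB = [4, 8]
CPU_CORES = [4, 6, 8, 12, 16]
SCREENS = [(1366, 768), (1536, 864), (1920, 1080), (2560, 1440)]

# platform.system() value -> (User-Agent platform token, navigator.platform)
OS_PROFILES = {
    "Windows": ("Windows NT 10.0; Win64; x64", "Win32"),
    "Darwin": ("Macintosh; Intel Mac OS X 10_15_7", "MacIntel"),
    "Linux": ("X11; Linux x86_64", "Linux x86_64"),
}


def make_fingerprint(chrome_major, os_name=None, rng=None):
    """Build a randomized profile that stays consistent with the host OS."""
    rng = rng or random.Random()
    os_name = os_name or platform.system()
    ua_token, nav_platform = OS_PROFILES[os_name]
    width, height = rng.choice(SCREENS)
    return {
        "user_agent": (
            f"Mozilla/5.0 ({ua_token}) AppleWebKit/537.36 (KHTML, like Gecko) "
            f"Chrome/{chrome_major}.0.0.0 Safari/537.36"
        ),
        "platform": nav_platform,
        "device_memory": rng.choice(DEVICE_MEMORY_GB),
        "hardware_concurrency": rng.choice(CPU_CORES),
        "screen": {"width": width, "height": height},
    }


fp = make_fingerprint(chrome_major=124, os_name="Linux", rng=random.Random(7))
print(fp["user_agent"])
```

Only the hardware fields vary between sessions; the OS-derived fields are looked up rather than randomized, which is what prevents an OS mismatch.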

⚡ Async + Fast

Built on async CDP, low overhead, no heavy browser bundles.

🔄 Advanced Selector & Extraction Engine (New!)

Chuscraper now includes a high-performance parsing engine:

  • Adaptive Selectors: Save and automatically relocate elements even if the DOM structure changes.
  • AI-Ready Extraction: One-click conversion of pages or elements to clean Markdown or normalized Text.
  • CSS & XPath Support: Unified API for high-speed selection.
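
One way to think about adaptive selectors is: store a small signature of the element when it is first found, then score candidates against that signature once the original selector stops matching. The sketch below shows that idea on plain dictionaries. The scoring scheme and the names `signature`, `similarity`, and `relocate` are assumptions chosen for illustration, not the library's actual algorithm.

```python
from difflib import SequenceMatcher


def signature(element):
    """Capture the parts of an element that tend to survive redesigns."""
    return {
        "tag": element["tag"],
        "classes": set(element.get("class", "").split()),
        "text": element.get("text", "").strip(),
    }


def similarity(saved, candidate):
    """Average of tag equality, class overlap (Jaccard), and text similarity."""
    cand = signature(candidate)
    tag_score = 1.0 if saved["tag"] == cand["tag"] else 0.0
    union = saved["classes"] | cand["classes"]
    class_score = len(saved["classes"] & cand["classes"]) / len(union) if union else 1.0
    text_score = SequenceMatcher(None, saved["text"], cand["text"]).ratio()
    return (tag_score + class_score + text_score) / 3


def relocate(saved, candidates, threshold=0.5):
    """Return the best-matching candidate, or None if nothing is close enough."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


# Signature saved from the old page layout.
saved = signature({"tag": "a", "class": "btn btn-primary buy", "text": "Add to cart"})

# Hypothetical elements from the redesigned page.
new_dom = [
    {"tag": "a", "class": "nav-link", "text": "Home"},
    {"tag": "button", "class": "btn buy-now", "text": "Add to cart"},
    {"tag": "a", "class": "btn btn-primary purchase", "text": "Add to basket"},
]

match = relocate(saved, new_dom)
print(match)
```

The threshold keeps the helper from returning an unrelated element when the original has been removed from the page entirely.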

🛠️ Hidden Gems (Undocumented Functions)

Chuscraper has several advanced functions that are often missed:

  • select_text(selector): Quickly get the inner text of an element in one line.
  • save_snapshot(filename): Save a full MHTML snapshot of the current page.
  • to_markdown() / to_text(): Convert any live Element directly to Markdown or plain text.
  • wait_for_ready_state(state): Wait specifically for loading, interactive, or complete document states.
  • mouse_drag(destination): Perform native drag-and-drop operations with human-like movement.
  • print_to_pdf(filename): Export the current page as a professional PDF.
  • get_all_urls(): Extract every link, image, and asset URL from the page in one call.
  • scroll_down(amount=25): Smoothly scroll down by a percentage of the page height.
  • human_click(selector) / human_type(selector, text): High-level aliases for ultra-realistic human behavior.
  • submit(selector): One-click form submission for forms or individual buttons.
  • activate() / bring_to_front(): Bring a background tab to the front for interaction.
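
As a rough picture of what a helper such as get_all_urls() has to do, the sketch below collects link, image, and asset URLs from static HTML using only the standard library. It is an offline approximation under stated assumptions: `URLCollector` and `collect_urls` are hypothetical names, and the real method works on the live page rather than on an HTML string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class URLCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes, in document order."""

    URL_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:  # skip duplicates
                    self.urls.append(absolute)


def collect_urls(html, base_url):
    parser = URLCollector(base_url)
    parser.feed(html)
    return parser.urls


SAMPLE = """
<html><head>
  <link href="/style.css" rel="stylesheet">
  <script src="https://cdn.example.net/app.js"></script>
</head><body>
  <a href="/docs">Docs</a>
  <img src="logo.png" />
  <a href="/docs">Docs again</a>
</body></html>
"""

urls = collect_urls(SAMPLE, "https://example.com/app/")
print(urls)
```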

🔄 Flexible Outputs

Supports JSON, CSV, Markdown, Excel, Pydantic, and more.


📦 Installation

pip install chuscraper

Tip

Use within a virtual environment to avoid conflicts.


Example: Advanced Mode (Elite Stealth + Human Interaction)

import asyncio
import chuscraper as zd

async def main():
    # 1. Launch with all-in-one start() helper
    async with await zd.start(
        headless=False,
        stealth=True,
        lang="en-US",
        retry_enabled=True
    ) as browser:
        page = browser.main_tab
        await page.goto("https://github.com/login")

        # 2. Use Ultra-Realistic Human Interactions
        # Automatically retries if element is loading/stale
        await page.human_type("#login_field", "jules_bot")
        await page.human_type("#password", "SecurePass123!")

        # 3. One-Click Form Submission
        await page.submit("form")

        # 4. Extract with Adaptive Selectors
        # 'adaptive=True' saves element metadata for resilient relocation
        results = await page.select_all(".repository-item", adaptive=True)
        
        for item in results:
            # 5. Get clean Markdown for LLMs instantly
            print(await item.to_markdown())

if __name__ == "__main__":
    asyncio.run(main())

Note

Chuscraper automatically cleans up Chrome processes and manages the local proxy lifecycle.


⚙️ Configuration Switches (Parameters)

Chuscraper is configured through keyword arguments to zd.start(). The available switches are:

🛠️ Core Switches

| Switch | Description | Default |
| --- | --- | --- |
| headless | Run without a visible window (True/False) | False |
| stealth | Master switch for advanced anti-detection (system fingerprints + JS bypasses) | False |
| stealth_domain | The domain used for cookie storage/loading in stealth mode | "" |
| user_data_dir | Path to save/load the browser profile (keeps logins/cookies) | Temp |
| proxy | Proxy URL (e.g. http://user:pass@host:port) | None |

🚀 Advanced Switches

| Switch | Description | Default |
| --- | --- | --- |
| browser_executable_path | Custom path to the Chrome/Brave binary (auto-detected if omitted) | Auto |
| browser | Browser selection: "auto", "chrome", "brave" | "auto" |
| browser_args | List of extra Chromium arguments | [] |
| sandbox | Set to False for Linux/Docker/root environments | True |
| lang | Browser locale/language (e.g., en-US, hi-IN) | en-US |
| user_agent | Manually override the User-Agent (not recommended with stealth=True) | Auto |
| disable_webrtc | Prevent IP leaks via WebRTC | True |
| disable_webgl | Disable WebGL (can reduce the detection surface in some setups) | False |
| timezone | Force a timezone (IANA format, e.g. Asia/Kolkata) | Auto/None |
| stealth_options | Dict for fine-grained stealth patches | Built-in defaults |
| retry_enabled | Enable retry helpers for unstable workflows | False |
| retry_timeout | Retry timeout in seconds | 10.0 |
| retry_count | Retry count | 3 |
| browser_connection_timeout | Wait between connection attempts | 0.25 |
| browser_connection_max_tries | Browser connection retries | 10 |

🕵️‍♂️ Granular Stealth Options

When stealth=True, you can fine-tune specific patches by passing a stealth_options dict:

await zd.start(stealth=True, stealth_options={
    "patch_webdriver": True,  # Hide WebDriver
    "patch_webgl": True,      # Spoof Graphics Card
    "patch_canvas": True,     # Add Canvas Noise
    "patch_audio": False      # Disable Audio Fingerprinting noise
})
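
Since stealth_options is described as overriding built-in defaults, the merge can be pictured as a dictionary overlay. The sketch below is only an assumption about how such a merge might behave: the default values shown are invented for illustration, and `resolve_stealth_options` is a hypothetical helper, not part of the library.

```python
# Assumed defaults for illustration only; the library's real defaults may differ.
DEFAULT_STEALTH_OPTIONS = {
    "patch_webdriver": True,
    "patch_webgl": True,
    "patch_canvas": True,
    "patch_audio": True,
}


def resolve_stealth_options(overrides=None):
    """Overlay user-supplied options on the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_STEALTH_OPTIONS)
    if unknown:
        # A typo such as "patch_webrdiver" would otherwise be silently ignored.
        raise ValueError(f"Unknown stealth option(s): {sorted(unknown)}")
    return {**DEFAULT_STEALTH_OPTIONS, **overrides}


options = resolve_stealth_options({"patch_audio": False})
print(options)
```

Validating the keys up front turns a misspelled option into an immediate error instead of a patch that silently stays at its default.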

📱 Mobile Scraping Example

Scrape data from any native Android app (e.g., Hotel/Flight apps):

import asyncio
from chuscraper.mobile import MobileDevice

async def main():
    # Connect to first available device
    device = await MobileDevice().connect()

    # Example: Searching for hotels
    city_input = await device.find_element(text="Enter destination")
    if city_input:
        await city_input.type("Goa")

    search_btn = await device.find_element(resource_id="com.hotel.app:id/search_btn")
    if search_btn:
        await search_btn.click()

    # Extract prices
    prices = await device.find_elements(resource_id="com.hotel.app:id/price_text")
    for price in prices:
        print(price.get_text())

if __name__ == "__main__":
    asyncio.run(main())

🛡️ Stealth & Anti-Detection Proof

We don't just claim to be stealthy; we prove it. Below are the results from top anti-bot detection suites, all passed with 100% "Human" status.

👉 View Full Visual Proofs & Screenshots Here

| Detection Suite | Result | Status |
| --- | --- | --- |
| SannySoft | No WebDriver detected | ✅ Pass |
| BrowserScan | 100% Trust Score | ✅ Pass |
| PixelScan | Consistent Fingerprint | ✅ Pass |
| IPHey | Software Clean (Green) | ✅ Pass |
| CreepJS | 0% Stealth / 0% Headless | ✅ Pass |
| Fingerprint.com | No Bot Detected | ✅ Pass |

🌍 Real-World Protection Bypass

We tested chuscraper against live websites protected by major security providers:

| Provider | Target | Result |
| --- | --- | --- |
| Cloudflare | Turnstile Demo | ✅ Solved Automatically |
| DataDome | Antoine Vastel Research | ✅ Accessed |
| Akamai | Nike Product Page | ✅ Bypassed |

📖 Documentation

Full technical guides are available in the docs/ folder.

Translations (Chinese, Japanese, etc.) coming soon.

💖 Support & Sponsorship

Chuscraper is an open-source project maintained by Toufiq Qureshi. If the library has helped you or your business, please consider supporting its development:

  • GitHub Sponsors: Sponsor me on GitHub
  • Corporate Sponsorship: If you are a Proxy Provider or Data Company, we offer featured placement in our documentation. Contact us for partnership opportunities.
  • Custom Scraping Solutions: Need a private, high-performance scraper? We offer professional consulting.

🛠️ Contributing

Want to contribute? Open an issue or send a pull request. Contributors of all levels are welcome! Please follow the CONTRIBUTING.md guidelines.


📜 License

Chuscraper is licensed under the AGPL-3.0 License. The AGPL requires that derivative works, including modified versions offered as a network service, also make their source code available under the same license, which protects the community and your freedom.

Made with ❤️ by Toufiq Qureshi

About

Stealth-native web & Android scraping framework powered by CDP and ADB with adaptive fingerprint rotation and bot-protection bypass.
