|
1 | | -🚀 **Looking for an even faster and simpler way to scrape at scale (only 5 lines of code)?** Check out our enhanced version at [**Chuscraper.com**](https://github.com/ToufiqQureshi/chuscraper)! 🚀 |
| 1 | +<p align="center"> |
| 2 | + <img src="https://i.ibb.co/HLyG7BBK/Chat-GPT-Image-Feb-16-2026-11-13-14-AM.png" alt="Chuscraper Logo" width="180" /> |
| 3 | +</p> |
2 | 4 |
|
3 | | ---- |
| 5 | +<h1 align="center">🕷️ Chuscraper</h1> |
| 6 | +<p align="center"> |
| 7 | + <strong>LLM + CDP powered undetectable web scraping & automation framework</strong><br/> |
| 8 | + You Only Scrape Once — data extraction made smarter, faster, and stealthier. |
| 9 | +</p> |
4 | 10 |
|
5 | | -# 🕷️ Chuscraper: You Only Scrape Once |
| 11 | +<p align="center"> |
| 12 | + <a href="https://pypi.org/project/chuscraper/"><img src="https://static.pepy.tech/personalized-badge/chuscraper?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads"/></a> |
| 13 | + <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge"/></a> |
| 14 | + <a href="https://github.com/ToufiqQureshi/chuscraper"><img src="https://img.shields.io/badge/GitHub-Trending-blue?style=for-the-badge&logo=github"/></a> |
| 15 | +</p> |
6 | 16 |
|
7 | | -[English](README.md) | [中文](docs/chinese.md) | [日本語](docs/japanese.md) |
8 | | -| [한국어](docs/korean.md) |
9 | | -| [Русский](docs/russian.md) | [Türkçe](docs/turkish.md) |
10 | | -| [Deutsch](docs/german.md) |
11 | | -| [Español](docs/spanish.md) |
12 | | -| [français](docs/french.md) |
13 | | -| [Português](docs/portuguese.md) |
| 17 | +--- |
14 | 18 |
|
15 | | -[](https://pepy.tech/projects/chuscraper) |
16 | | -[](https://github.com/ToufiqQureshi/chuscraper) |
17 | | -[](https://opensource.org/licenses/MIT) |
| 19 | +## 🚀 What is Chuscraper? |
| 20 | +Chuscraper is a Python web scraping & automation library that uses **CDP (Chrome DevTools Protocol)** and **LLMs** to extract structured data, interact with pages, and automate workflows — all while staying *stealthy and undetected*. |
18 | 21 |
|
19 | | -[](https://github.com/ToufiqQureshi/chuscraper) |
| 22 | +With AI-powered extraction, you tell it *what* to extract — it figures out *how*. |
20 | 23 |
|
21 | | -<p align="center"> |
22 | | -<a href="https://github.com/ToufiqQureshi/chuscraper" target="_blank"><img src="https://img.shields.io/badge/GitHub-Trending-blue?style=for-the-badge&logo=github" alt="Chuscraper | Trending" style="width: 250px; height: 55px;" width="250" height="55"/></a> |
23 | | -</p> |
| 24 | +--- |
24 | 25 |
|
25 | | -[Chuscraper](https://github.com/ToufiqQureshi/chuscraper) is a *web scraping* python library that uses LLM and direct CDP logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). |
| 26 | +## 🌟 Features |
26 | 27 |
|
27 | | -Just say which information you want to extract and the library will do it for you! |
| 28 | +### 🕵️‍♂️ Stealth & Anti-Detection
| 29 | +- Hides `navigator.webdriver` and rotates user agents
| 30 | +- Canvas/WebGL noise + hardware spoofing |
| 31 | +- Timezone & geolocation spoofing |
28 | 32 |
|
29 | | -<p align="center"> |
30 | | - <img src="docs/assets/official_logo.png" alt="Chuscraper Logo" width="400"> |
31 | | -</p> |
| 33 | +### 🤖 AI-Driven Data Extraction |
| 34 | +- **Semantic extraction** using LLMs |
| 35 | +- Converts HTML into structured JSON/Pydantic |
| 36 | + |
| 37 | +### 🧠 Autonomous Navigation |
| 38 | +- Intelligent pilot (`ai_pilot`) that clicks and types until the goal is achieved
32 | 39 |
|
| 40 | +### ⚡ Async + Fast |
| 41 | +Built on async CDP, low overhead, no heavy browser bundles. |
33 | 42 |
|
34 | | -## 🚀 Integrations |
35 | | -Chuscraper offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python, using LLM frameworks, or working with AI agents, we've got you covered with our comprehensive integration options. |
| 43 | +### 🔄 Flexible Outputs |
| 44 | +Supports JSON, CSV, Markdown, Excel, Pydantic, and more. |
36 | 45 |
|
37 | | -**Integrations**: |
38 | | -- **Providers**: OpenAI, Gemini (Native), Anthropic, Ollama |
39 | | -- **LLM Frameworks**: Langchain, Llama Index, Crew.ai, Agno |
40 | | -- **Output Protocols**: Pydantic, JSON, CSV, Markdown, Excel |
41 | | -- **Stealth**: Built-in Canvas/WebGL noise, Hardware spoofing, UA rotation. |
| 46 | +### 🌐 Integrations |
| 47 | +- LLM Providers: OpenAI, Gemini, Anthropic, Ollama |
| 48 | +- Frameworks: LangChain, LlamaIndex, Agno, Crew.ai |
42 | 49 |
|
43 | | -## 🚀 Quick install |
| 50 | +--- |
44 | 51 |
|
45 | | -The reference page for Chuscraper is available on the official page of PyPI: [pypi](https://pypi.org/project/chuscraper/). |
| 52 | +## 📦 Installation |
46 | 53 |
|
47 | 54 | ```bash |
48 | 55 | pip install chuscraper |
49 | 56 |
|
50 | | -# FOR AI CAPABILITIES |
| 57 | +# For AI Capabilities |
51 | 58 | pip install chuscraper[ai] |
52 | 59 | ``` |
53 | 60 |
|
54 | | -**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱 |
| 61 | +> [!TIP] |
| 62 | +> Use within a virtual environment to avoid conflicts. |
55 | 63 |
|
| 64 | +--- |
56 | 65 |
|
57 | | -## 💻 Usage |
58 | | -There are multiple standard scraping methods that can be used to extract information from a website (or local file). |
59 | | - |
60 | | -The most common one is the `ai_pilot`, which autonomously navigates and extracts information from a page given a user goal. |
61 | | - |
| 66 | +## 💻 Quick Start (Async) |
62 | 67 |
|
63 | 68 | ```python |
64 | 69 | import asyncio |
65 | 70 | from chuscraper import start |
66 | 71 |
|
67 | 72 | async def main():
68 | | -    # Start the stealth browser
69 | 73 |     browser = await start(headless=False)
70 | 74 |     page = await browser.get("https://www.makemytrip.com/")
71 | 75 |
|
72 | | -    # Define the goal
73 | | -    print("AI is starting to search...")
74 | | -    await page.ai_pilot("Search for hotels in Goa for next weekend")
75 | | -
| 76 | +    # Tell the AI pilot what goal to reach
| 77 | +    print("AI is navigating...")
| 78 | +    await page.ai_pilot("Search hotels in Goa for next weekend")
| 79 | +
76 | 80 |     # Extract structured data
77 | | -    result = await page.ai_extract("Extract first 3 hotels with prices")
78 | | -
| 81 | +    result = await page.ai_extract("Get the first 3 hotels with prices")
79 | 82 |     import json
80 | | -    print(json.dumps(result, indent=4))
| 83 | +    print(json.dumps(result, indent=2))
81 | 84 |
|
82 | 85 |     await browser.stop()
83 | 86 |
|
84 | 87 | if __name__ == "__main__":
85 | 88 |     asyncio.run(main())
86 | 89 | ``` |
87 | 90 |
|
88 | | -> [!NOTE] |
89 | | -> For OpenAI and other models you just need to pass the provider! |
90 | | -> ```python |
91 | | -> from chuscraper.ai.providers import OpenAIProvider |
92 | | -> provider = OpenAIProvider(api_key="YOUR_OPENAI_API_KEY") |
93 | | -> await page.ai_extract("Extract data", provider=provider) |
94 | | -> ``` |
95 | | -
|
| 91 | +--- |
96 | 92 |
|
97 | | -The output will be a structured dictionary like the following: |
| 93 | +## 🤖 AI Usage with Providers |
| 94 | +Example using **OpenAIProvider**: |
98 | 95 |
|
99 | 96 | ```python |
100 | | -{
101 | | -    "hotels": [
102 | | -        {
103 | | -            "name": "Taj Exotica Resort & Spa",
104 | | -            "price": "₹ 25,000",
105 | | -            "rating": "4.8"
106 | | -        },
107 | | -        {
108 | | -            "name": "Cygnett Inn",
109 | | -            "price": "₹ 4,500",
110 | | -            "rating": "4.2"
111 | | -        }
112 | | -    ]
113 | | -}
| 97 | +from chuscraper.ai.providers import OpenAIProvider |
| 98 | + |
| 99 | +provider = OpenAIProvider(api_key="YOUR_OPENAI_API_KEY") |
| 100 | +await page.ai_extract("Extract prices and listings", provider=provider) |
114 | 101 | ``` |
115 | 102 |
|
| 103 | +--- |
| 104 | + |
116 | 105 | ## 📖 Documentation |
117 | | -The documentation for Chuscraper can be found in the [docs/](docs/) folder. |
118 | | - |
119 | | -## 🤝 Contributing |
120 | | - |
121 | | -Feel free to contribute and join our community to discuss improvements and give us suggestions! |
122 | | - |
123 | | -Please see the [contributing guidelines](CONTRIBUTING.md). |
124 | | - |
125 | | -## 🔥 AI Methods |
126 | | - |
127 | | -| Method Name | Description | |
128 | | -|-------------------------|------------------------------------------------------------------------------------------------------------------| |
129 | | -| ai_pilot | Single-goal autonomous navigator that handles interaction (clicks, types) to reach a target. | |
130 | | -| ai_extract | Semantic data extractor that converts HTML content into structured JSON/Pydantic models. | |
131 | | -| ai_visual_extract | Multi-modal Vision scraper that extracts data directly from the rendered page screenshot. | |
132 | | -| ai_learn_selector | Self-healing tool that generates robust CSS/Xpath selectors for long-term automation. | |
133 | | -| ai_ask | Context-aware Q&A that answers questions based on the current page's content. | |
134 | | - |
135 | | -## 🎓 Citations |
136 | | -If you have used our library for research purposes please quote us with the following reference: |
137 | | -```text |
138 | | - @misc{chuscraper, |
139 | | - author = {Toufiq Qureshi}, |
140 | | - title = {Chuscraper}, |
141 | | - year = {2026}, |
142 | | - url = {https://github.com/ToufiqQureshi/chuscraper}, |
143 | | - note = {An undetectable & agentic python library for scraping leveraging CDP and LLMs} |
144 | | - } |
145 | | -``` |
| 106 | +Full docs available in the `docs/` folder: |
| 107 | + |
| 108 | +- [English](README.md) |
| 109 | +- [Chinese](docs/chinese.md) |
| 110 | +- [Japanese](docs/japanese.md) |
| 111 | +- [Korean](docs/korean.md) |
| 112 | +- [Russian](docs/russian.md) |
| 113 | +- [Turkish](docs/turkish.md) |
| 114 | +- [German](docs/german.md) |
| 115 | +- [Spanish](docs/spanish.md) |
| 116 | +- [French](docs/french.md) |
| 117 | +- [Portuguese](docs/portuguese.md) |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## 🛠️ Contributing |
| 122 | +Want to contribute? Open an issue or send a pull request — all levels welcome! Please follow the `CONTRIBUTING.md` guidelines. |
| 123 | + |
| 124 | +--- |
146 | 125 |
|
147 | 126 | ## 📜 License |
148 | | -Chuscraper is licensed under the MIT License. See the [LICENSE](LICENSE) file for more information. |
| 127 | +Chuscraper is licensed under the MIT License. |
149 | 128 |
|
150 | 129 | Made with ❤️ by [Toufiq Qureshi] |
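
The new "Flexible Outputs" section lists JSON and CSV among the supported formats, but the diff shows no export call. As a rough sketch that does not use Chuscraper's own exporters, a result shaped like the hotel dictionary from the removed example output could be written to CSV with the standard library alone. The `hotels_to_csv` helper and the assumption that `ai_extract` returns a dictionary with a `"hotels"` list are illustrative only, not part of the library's documented API.

```python
import csv
import io

# Sample result copied from the example output that this diff removes.
result = {
    "hotels": [
        {"name": "Taj Exotica Resort & Spa", "price": "₹ 25,000", "rating": "4.8"},
        {"name": "Cygnett Inn", "price": "₹ 4,500", "rating": "4.2"},
    ]
}


def hotels_to_csv(data: dict) -> str:
    """Hypothetical helper: serialise data["hotels"] to CSV text."""
    rows = data.get("hotels", [])
    if not rows:
        return ""
    buffer = io.StringIO()
    # Use the keys of the first row as the CSV header.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = hotels_to_csv(result)
print(csv_text)
```

Prices such as `₹ 25,000` contain a comma, so `csv.DictWriter` quotes them. This is one reason to prefer the `csv` module over joining fields by hand.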