Conversation
Co-authored-by: ToufiqQureshi <139612256+ToufiqQureshi@users.noreply.github.com>
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. There may be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when it is on, I will only act on comments that specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
…ompetitor analysis.
- Implements `chuscraper.spider.Crawler` for multi-page crawling
- Supports BFS traversal, depth/page limits, and concurrency
- Leverages the existing `Browser` and `Tab` core classes
- Includes `tests/test_crawler_manual.py` for verification
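The BFS traversal with depth/page limits and bounded concurrency described above can be sketched as follows. This is a minimal illustration, not the actual `Crawler` code: `fetch_links` stands in for the real `Browser`/`Tab` page fetch, and all parameter names are assumptions.

```python
import asyncio
from urllib.parse import urljoin, urldefrag

async def bfs_crawl(start_url, fetch_links, max_depth=2, max_pages=50, concurrency=5):
    """Breadth-first crawl with depth/page limits (illustrative sketch)."""
    queue = asyncio.Queue()
    await queue.put((start_url, 0))
    visited = {start_url}
    results = []
    sem = asyncio.Semaphore(concurrency)  # bound concurrent page fetches

    async def worker():
        while len(results) < max_pages:
            try:
                # Timing out on an empty queue lets idle workers exit cleanly.
                url, depth = await asyncio.wait_for(queue.get(), timeout=1)
            except asyncio.TimeoutError:
                return  # queue drained, nothing left to crawl
            async with sem:
                links = await fetch_links(url)
            results.append(url)
            if depth < max_depth:
                for link in links:
                    # Resolve relative links and drop fragments before dedup.
                    link = urldefrag(urljoin(url, link)).url
                    if link not in visited:
                        visited.add(link)
                        await queue.put((link, depth + 1))
            queue.task_done()

    await asyncio.gather(*[worker() for _ in range(concurrency)])
    return results
```

The timeout-based exit is one simple way to avoid the worker hang this PR fixes; the real implementation may drain the queue differently.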
- Allow domain variations (e.g. revmerito.com -> www.revmerito.com)
- Normalize URLs to remove fragments
- Add `prompt` and `schema` placeholders to `Crawler.run()`
- Fix `get_all_urls` error handling in the worker
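The two URL fixes above (fragment stripping and tolerating `www.` host variants) can be illustrated with small helpers. These names are hypothetical, not the Crawler's actual API:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lowercase the host and drop the #fragment so dedup keys match."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    return urlunsplit((scheme, netloc.lower(), path or "/", query, ""))

def same_domain(url_a: str, url_b: str) -> bool:
    """Treat example.com and www.example.com as the same site."""
    strip = lambda host: host.lower().removeprefix("www.")
    return strip(urlsplit(url_a).netloc) == strip(urlsplit(url_b).netloc)
```

Without fragment stripping, `page#top` and `page#bottom` would be queued as two distinct URLs; without the `www.` check, a redirect from the apex domain would look like an off-site link and be skipped.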
- Confirmed the fix for domain variations
- Verified with the manual test script
- Ready for AI integration in the next phase
- Confirmed the fix for domain variations (redirects to www/subdomains)
- Verified with the manual test script (`test_crawler_manual.py`)
- Added placeholders for AI extraction (prompt/schema)
- Cleaned up debug logging in the test script
- Implemented `Crawler` with robust queue draining to prevent hangs
- Added a JS-based link-extraction fallback for dynamic sites (e.g. revmerito.com)
- Handled redirects by updating the `visited` set and normalizing URLs
- Added `prompt` and `schema` placeholders for future AI integration
- Verified with `tests/test_crawler_manual.py` (crawls multiple pages successfully)
- Implemented correct queue draining in `_worker` to prevent hangs when `max_pages` is reached
- Added a robust fallback to JS-based link extraction if CDP fails
- Updated `test_crawler_manual.py` with `max_depth=3` to verify deeper crawling
- Added debug logging for the queue-empty state and link counts
- Verified successful crawling of `revmerito.com` (5 pages found)
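The hang described above is a classic `asyncio.Queue` pitfall: if `max_pages` is hit while URLs are still queued, anything awaiting `queue.join()` blocks forever unless the leftovers are marked done. A minimal sketch of the drain pattern (illustrative, not the real `_worker`):

```python
import asyncio

def drain_queue(queue: asyncio.Queue) -> int:
    """Pop and acknowledge every remaining item so queue.join() can return."""
    drained = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return drained
        queue.task_done()  # without this, join() waits on the skipped item
        drained += 1
```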
- Added a `_save_to_file` method to the `Crawler` class
- Updated `Crawler.run` to accept an `output_file` argument
- Support saving results in JSON, CSV, and JSONL formats
- Verified with `tests/test_crawler_manual.py` (saves data successfully)
- Added `formats` support to `Crawler.__init__` (default: `["markdown"]`)
- Implemented `_extract_content` to handle multiple formats
- Implemented `_save_to_file` for JSON, CSV, and JSONL output
- Updated `Crawler.run` to accept `output_file`
- Verified with `test_crawler_manual.py`
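The multi-format saving described above might look like the following sketch, which infers the format from the file extension. This is an assumption about the behavior, not the actual `_save_to_file` implementation:

```python
import csv
import json
from pathlib import Path

def save_results(results: list, output_file: str) -> None:
    """Write a list of row dicts as .json, .jsonl, or .csv (illustrative)."""
    path = Path(output_file)
    suffix = path.suffix.lower()
    if suffix == ".json":
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    elif suffix == ".jsonl":
        # One JSON object per line: streams well and appends cheaply.
        with path.open("w", encoding="utf-8") as fh:
            for row in results:
                fh.write(json.dumps(row) + "\n")
    elif suffix == ".csv":
        with path.open("w", encoding="utf-8", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(results[0].keys()))
            writer.writeheader()
            writer.writerows(results)
    else:
        raise ValueError(f"Unsupported format: {suffix}")
```

JSONL pairs naturally with the streaming callback mentioned later in this PR, since each crawled page can be appended without rewriting the whole file.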
- Added a dedicated comprehensive test script, `tests/test_crawler_comprehensive.py`
- Verified the single-page, multi-format, file-saving, and streaming scenarios
- Confirmed the fix for the test assertion logic
- The crawler is now production-ready for single-site extraction tasks
- Added the `subprocess.CREATE_NO_WINDOW` flag to `start_process` on Windows
- Prevents annoying CMD popups when launching Chrome/ADB
- Verified the logic against the standard Python `subprocess` documentation
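The flag above is Windows-only (it does not exist in the `subprocess` module on other platforms), so it has to be gated on the platform. A minimal sketch of the pattern:

```python
import subprocess
import sys

def popen_flags() -> int:
    """Return creationflags that hide the console window on Windows."""
    if sys.platform == "win32":
        return subprocess.CREATE_NO_WINDOW  # suppresses the CMD popup
    return 0  # creationflags are ignored/unsupported off Windows

# Usage sketch (command is illustrative):
# subprocess.Popen(["chrome.exe", "--headless"], creationflags=popen_flags())
```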
- Created `website/docs/universal_crawler.md` with a detailed usage guide
- Updated `README.md` to highlight the new crawler feature
- Updated `website/sidebars.js` to include the new doc page
- Ensured the information is accurate and up to date with recent code changes
- Rewrote `website/docs/ai_features.md` to explain the new AI Extraction integration with `Crawler`
- Updated `website/docs/universal_crawler.md` to link to the AI capabilities
- Updated `README.md` to highlight AI Extraction as a key feature
- Ensured all documentation reflects the current "Universal Crawler" + "AI" architecture
- Updated `website/docs/intro.md` to mention the Universal Crawler in the feature table
- Verified that `website/docs/ai_features.md`, `website/docs/universal_crawler.md`, and `README.md` are up to date and consistent with the new codebase
- Removed all outdated references to "AI Pilot"
- Documentation is now fully aligned with the production-ready Universal Crawler release
🚀 Major Release: Universal Crawler & AI

**1. Universal Crawler (`chuscraper.spider`)**
- Full-featured BFS/DFS web spider.
- **Sitemap support:** fast extraction from `sitemap.xml`.
- **Streaming:** `on_page_crawled` callback for infinite scaling.
- **Robust:** handles redirects, JS-heavy sites (SPAs), and concurrency.
- **Outputs:** JSON, CSV, JSONL, and Markdown files.

**2. AI Extraction (`chuscraper.ai`)**
- Turn any website into structured JSON using LLMs.
- `OpenAIExtractor` integrated with the Crawler.
- Usage: `crawler.run(prompt="Extract prices", extractor=ai)`

**3. Documentation Overhaul**
- Rewrote `website/docs/` to reflect the new architecture.
- Added guides for the Crawler and AI.
- Updated `README.md`.

**4. Fixes**
- Suppressed the CMD popup on Windows (`CREATE_NO_WINDOW`).
- Fixed a queue-draining race condition.

**5. Verification**
- Includes 3 comprehensive test scripts in `tests/`.
FINAL RELEASE CANDIDATE (v0.20)
This PR consolidates all features into a production-ready package:
1. **Universal Crawler (`chuscraper.spider`)**:
* **BFS Engine:** Robust navigation with queue draining & concurrency.
* **Redirects:** Handles complex cross-domain redirects (e.g. revmerito).
* **Fallback:** Uses JS for link discovery on SPA sites.
* **Sitemap:** Fast XML parsing for rapid discovery.
* **Streaming:** `on_page_crawled` callback processes pages as they arrive, without buffering results in memory.
2. **AI Extraction (`chuscraper.ai`)**:
* New `OpenAIExtractor` to turn page text into structured JSON via an LLM.
* Integrated directly into `crawler.run()`.
3. **Documentation (Complete Overhaul)**:
* Updated `intro.md` and `README.md`.
* Created `universal_crawler.md` and `ai_features.md`.
4. **Tests**:
* `tests/test_crawler_comprehensive.py`: Validates all modes.
Ready for Merge.
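The `crawler.run(prompt=..., extractor=...)` integration above implies an extractor interface along these lines. Everything here is an inferred sketch (the `Extractor` protocol, `run_with_extractor`, and the `(url, text)` page shape are assumptions, not the library's actual API):

```python
from typing import Protocol

class Extractor(Protocol):
    """Assumed shape: an async call that turns page text into a dict."""
    async def extract(self, text: str, prompt: str, schema=None) -> dict: ...

async def run_with_extractor(pages, extractor, prompt, schema=None):
    """Feed each crawled page through the extractor and collect results."""
    results = []
    for url, text in pages:  # pages: iterable of (url, text) pairs
        data = await extractor.extract(text, prompt, schema)
        results.append({"url": url, "data": data})
    return results
```

Keeping the extractor behind a small protocol like this is what lets `OpenAIExtractor` and the later `OllamaExtractor` plug into the same `run()` call.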
- Added `chuscraper.ai.OllamaExtractor` for local LLM inference
- Verified core browser stability with `tests/test_core_regression.py` (passed)
- Confirmed no regressions in existing modules
- `Crawler` is now fully integrated with both OpenAI and Ollama
- Implemented `OllamaExtractor._clean_json` to strip markdown blocks and filler text from LLM responses
- Verified with `tests/test_ollama_parser.py` (simulating dirty JSON output)
- Ensures reliable JSON parsing even when local models are "chatty"
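Local models often wrap their JSON in markdown fences and conversational filler. A hedged sketch of the cleanup idea (not the actual `_clean_json`): strip the fences, then parse only the outermost JSON object or array.

```python
import json
import re

def clean_json(raw: str):
    """Extract parseable JSON from a chatty LLM reply (illustrative)."""
    # Remove ``` / ```json fence markers, keeping their contents.
    text = re.sub(r"```(?:json)?", "", raw)
    # Locate the outermost JSON object or array, ignoring filler text.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON found in response")
    start = min(starts)
    end = max(text.rfind("}"), text.rfind("]"))
    return json.loads(text[start:end + 1])
```

This naive bracket-scan fails if the filler text itself contains braces after the JSON, so the real implementation may need to be more careful.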
- Added `chuscraper/ai/selectors.py` for LLM-based CSS selector generation
- Exposed via `chuscraper.ai`
- Isolated module; does not impact core stability
- Addresses the 2026 trend of "hybrid extraction" (AI for setup, selectors for execution)
- Converted the `SelectorGenerator` class to a `generate_selectors` async function in `chuscraper/ai/selectors.py`
- Removed unnecessary class instantiation for a better UX
- Verified with `tests/test_selector_gen.py` (mocked)
- Updated the `__init__.py` exports
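The function-based API above might look like the following sketch. The `llm` parameter is an injected async callable (mocked here, as in the PR's tests); the prompt wording and validation are assumptions, not the module's actual code:

```python
import json

async def generate_selectors(html: str, fields: list, llm) -> dict:
    """Ask an LLM to map field names to CSS selectors (illustrative)."""
    prompt = (
        "Given this HTML, return a JSON object mapping each field name "
        f"to a CSS selector. Fields: {fields}\n\nHTML:\n{html[:4000]}"
    )
    reply = await llm(prompt)
    selectors = json.loads(reply)
    # Fail loudly if the model skipped any requested field.
    missing = [f for f in fields if f not in selectors]
    if missing:
        raise ValueError(f"LLM response missing fields: {missing}")
    return selectors
```

This is the "hybrid extraction" split: the LLM runs once to produce selectors, then plain CSS selection handles every subsequent page at zero inference cost.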
- Added a `chuscraper.ai.Agent` class
- Implemented an `act(instruction)` method that uses an LLM to generate actions (click, type, scroll)
- Verified with mock tests
- Transforms Chuscraper from a scraper into an autonomous browser operator
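The `act(instruction)` loop above can be sketched as: ask the LLM for a single structured action, then dispatch it to the tab. All names, the action schema, and the tab methods are illustrative assumptions; the real `Agent` drives an actual chuscraper `Tab`:

```python
import json

class Agent:
    """Illustrative instruction-to-action dispatcher, not the real class."""

    def __init__(self, tab, llm):
        self.tab = tab  # object exposing click/type/scroll coroutines
        self.llm = llm  # async callable: prompt -> JSON action string

    async def act(self, instruction: str) -> dict:
        reply = await self.llm(f"Return one JSON action for: {instruction}")
        action = json.loads(reply)  # e.g. {"type": "click", "selector": "#btn"}
        if action["type"] == "click":
            await self.tab.click(action["selector"])
        elif action["type"] == "type":
            await self.tab.type(action["selector"], action["text"])
        elif action["type"] == "scroll":
            await self.tab.scroll(action.get("amount", 500))
        else:
            raise ValueError(f"Unknown action: {action['type']}")
        return action
```

Constraining the model to a closed set of action types is what makes the loop safe to run unattended: anything outside the whitelist raises instead of executing.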
Answered user query regarding chuscraper features. No code modifications were made.
PR created automatically by Jules for task 9547903955044020656 started by @ToufiqQureshi