
Commit 1dd953a

lurenss and claude committed
Remove mock data fallback from Amazon keyboard scraper
- Remove _scrape_with_mock() method and all mock data generation
- Update initialization to require a valid ScrapeGraph API client
- Simplify scrape_page() to return an empty list on API failure instead of falling back to mock data
- Remove unused hashlib import
- Remove use_api flag from scrape_all_pages() output
- Add CLAUDE.md documentation for future Claude Code instances

The scraper now requires a working API connection and will fail gracefully if the API has issues, continuing to attempt other pages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent cb18671 commit 1dd953a

2 files changed: +207, −111 lines

CLAUDE.md

Lines changed: 190 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a demonstration project integrating **ScrapeGraphAI SDK** with **Elasticsearch** for AI-powered marketplace product scraping and comparison. The project demonstrates how to scrape product data from marketplaces (Amazon, eBay, etc.), store it in Elasticsearch, and perform advanced searches and comparisons.

## Common Commands

### Environment Setup
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Mac/Linux
# venv\Scripts\activate   # On Windows

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env to add SCRAPEGRAPHAI_API_KEY or OPENAI_API_KEY
```

### Docker/Elasticsearch
```bash
# Start Elasticsearch and Kibana
# Use 'docker compose' (Docker CLI plugin) or 'docker-compose' (standalone)
docker compose up -d
# OR: docker-compose up -d

# Check Elasticsearch health
curl http://localhost:9200/_cluster/health

# Stop services
docker compose down
# OR: docker-compose down

# View logs
docker compose logs elasticsearch
docker compose logs kibana
# OR: docker-compose logs elasticsearch
```

### Running Examples
```bash
# Basic usage demonstration
python examples/basic_usage.py

# Product comparison across marketplaces
python examples/product_comparison.py

# Advanced search capabilities
python examples/advanced_search.py

# Interactive quickstart demo
python quickstart.py
```

### Testing
```bash
# Run all tests (custom test runner)
python run_tests.py

# Run individual test modules
python tests/test_config.py
python tests/test_models.py
python tests/test_scraper.py
```

## Architecture

### Core Components (4-Layer Pattern)

The architecture follows a clean separation of concerns with 4 main components in `src/scrapegraph_demo/`:

1. **Config** (`config.py`): Environment-based configuration management using `@dataclass` and `python-dotenv`. Loads settings from `.env` and provides connection URLs (see the sketch after this list).

2. **Models** (`models.py`): Pydantic v2 models for type-safe data handling:
   - `Product`: Represents a marketplace product with validation
   - `ProductComparison`: Provides comparison methods (cheapest, best-rated, grouping)

3. **ElasticsearchClient** (`elasticsearch_client.py`): Manages all Elasticsearch operations including index creation, product indexing (single/bulk), search with filters, aggregations, and statistics.

4. **MarketplaceScraper** (`scraper.py`): Handles web scraping using ScrapeGraphAI's `SmartScraperGraph`. Includes mock data fallback for testing without API keys.
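
As a rough illustration of the `Config` layer described in item 1, a dataclass built on `python-dotenv` might look like the sketch below. The field names, defaults, and property are illustrative assumptions, not the exact contents of `config.py`:

```python
# Hypothetical sketch of the Config layer; field names and defaults are
# illustrative assumptions, not the repository's exact implementation.
import os
from dataclasses import dataclass

from dotenv import load_dotenv


@dataclass
class Config:
    elasticsearch_host: str = "localhost"
    elasticsearch_port: int = 9200
    elasticsearch_scheme: str = "http"

    @classmethod
    def from_env(cls) -> "Config":
        load_dotenv()  # pull variables from .env into the process environment
        return cls(
            elasticsearch_host=os.getenv("ELASTICSEARCH_HOST", "localhost"),
            elasticsearch_port=int(os.getenv("ELASTICSEARCH_PORT", "9200")),
            elasticsearch_scheme=os.getenv("ELASTICSEARCH_SCHEME", "http"),
        )

    @property
    def elasticsearch_url(self) -> str:
        # Connection URL consumed by ElasticsearchClient
        return f"{self.elasticsearch_scheme}://{self.elasticsearch_host}:{self.elasticsearch_port}"
```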
### Key Architectural Patterns

#### Mock Data Fallback
The scraper implements a graceful fallback to mock data when ScrapeGraphAI is unavailable or API keys are missing. This enables:
- Testing without external dependencies
- Development without API keys
- Consistent demonstration data

Look for `_mock_scrape_product()` and `_mock_scrape_search_results()` methods in `scraper.py`.
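
A minimal sketch of the fallback dispatch, assuming a `try/except` around the ScrapeGraphAI call; only the `_mock_scrape_search_results()` name comes from the repository, while the attribute name and the `_scrape_with_scrapegraph()` wrapper are illustrative:

```python
# Illustrative only: the api_key attribute and _scrape_with_scrapegraph() are
# assumptions; see scraper.py for the actual implementation.
def scrape_search_results(self, query: str, marketplace: str, max_results: int = 10):
    if not getattr(self, "api_key", None):
        # No API key configured: serve deterministic mock data
        return self._mock_scrape_search_results(query, marketplace, max_results)
    try:
        return self._scrape_with_scrapegraph(query, marketplace, max_results)
    except Exception:
        # Scraping failed: degrade gracefully to mock data
        return self._mock_scrape_search_results(query, marketplace, max_results)
```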

#### Elasticsearch Index Design
The `marketplace_products` index uses a carefully designed mapping:
- **Keyword fields** for exact matching: `product_id`, `marketplace`, `category`, `availability`, `currency`
- **Text fields with keyword sub-fields** for flexible search: `name`, `description`, `brand`
- **Proper data types**: `float` for price and rating, `integer` for review_count, `date` for scraped_at
- **Object type** for nested specifications

This design optimizes search performance by using term queries for filters and multi_match for text search.
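
The bullets above translate roughly into a mapping like the sketch below (using the `elasticsearch` Python client; the actual mapping created by `elasticsearch_client.py` may differ in analyzers and settings):

```python
# Illustrative mapping mirroring the field types listed above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="marketplace_products",
    mappings={
        "properties": {
            "product_id": {"type": "keyword"},
            "marketplace": {"type": "keyword"},
            "category": {"type": "keyword"},
            "availability": {"type": "keyword"},
            "currency": {"type": "keyword"},
            "name": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "description": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "brand": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "price": {"type": "float"},
            "rating": {"type": "float"},
            "review_count": {"type": "integer"},
            "scraped_at": {"type": "date"},
            "specifications": {"type": "object"},
        }
    },
)
```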

#### Integration Pattern
The typical workflow combines scraping and indexing:
```python
config = Config.from_env()
scraper = MarketplaceScraper(config)
es_client = ElasticsearchClient(config)

# Scrape → Index → Search
products = scraper.scrape_search_results(query, marketplace, max_results)
es_client.index_products(products)  # Bulk indexing
results = es_client.search_products(query, filters...)
```

### Data Flow

1. **Configuration Loading**: `Config.from_env()` loads environment variables
2. **Scraping**: `MarketplaceScraper` uses ScrapeGraphAI to extract product data (or falls back to mock data)
3. **Validation**: Pydantic models validate and structure the data
4. **Indexing**: `ElasticsearchClient` stores products in Elasticsearch
5. **Search/Analysis**: Full-text search, filtering, aggregations, and comparisons

## Development Patterns

### Pydantic Models
All data models use Pydantic v2 for:
- Type validation and coercion
- JSON serialization via `model_dump(mode='json')`
- IDE autocomplete support
- Elasticsearch document conversion via `to_elasticsearch_doc()`
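
A trimmed-down sketch of what this implies for the `Product` model; the real model in `models.py` has more fields and validation, and only the method names above are taken from the docs:

```python
# Reduced sketch for illustration; the exact field set is an assumption.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    product_id: str
    name: str
    price: float = Field(ge=0)
    currency: str = "EUR"
    marketplace: str
    rating: Optional[float] = None
    scraped_at: datetime

    def to_elasticsearch_doc(self) -> dict:
        # mode="json" serialises datetime fields to ISO strings for Elasticsearch
        return self.model_dump(mode="json")
```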

### Error Handling
The codebase implements graceful degradation:
- Elasticsearch connection failures are caught and logged
- Scraping errors trigger mock data fallback
- Bulk indexing returns success/failure counts rather than raising exceptions
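
For example, a connection check in this style swallows the failure and reports it rather than crashing (a sketch under assumed method and attribute names, not the repository's code):

```python
# Assumed shape of the graceful-degradation check in ElasticsearchClient.
def is_connected(self) -> bool:
    try:
        return bool(self.client.ping())
    except Exception as exc:
        print(f"⚠ Elasticsearch unavailable: {exc}")
        return False  # callers skip indexing instead of raising
```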

### Environment Variables
Required environment variables (set in `.env`):
- `SCRAPEGRAPHAI_API_KEY` or `OPENAI_API_KEY` (one required for AI scraping)
- `ELASTICSEARCH_HOST`, `ELASTICSEARCH_PORT`, `ELASTICSEARCH_SCHEME` (optional, have defaults)
- `ELASTICSEARCH_USERNAME`, `ELASTICSEARCH_PASSWORD` (optional, for auth)

## Important Implementation Details

### Price Extraction
The scraper includes an `_extract_price()` utility that handles various price formats:
- Removes currency symbols ($, €, £, etc.)
- Handles comma/period number formats
- Extracts the first numeric value from strings
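
A hedged re-implementation of that behaviour, to show the intent; the repository's `_extract_price()` may handle more edge cases:

```python
import re
from typing import Optional

# Illustrative sketch; not the exact helper from scraper.py.
def _extract_price(raw: str) -> Optional[float]:
    cleaned = re.sub(r"[^\d.,]", "", raw)   # drop currency symbols and text
    if re.search(r",\d{2}$", cleaned):      # European format, e.g. "1.299,99"
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:                                   # US format, e.g. "1,299.99"
        cleaned = cleaned.replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)  # first numeric value
    return float(match.group()) if match else None
```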

### Product ID Extraction
`_extract_product_id()` extracts product IDs from marketplace URLs:
- Amazon: Looks for `/dp/` or `/gp/product/` patterns
- eBay: Extracts from `/itm/` pattern
- Falls back to URL hash for other marketplaces
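
Roughly, the logic described above amounts to the following sketch (the regexes and hash length are assumptions, not the exact implementation):

```python
import hashlib
import re

# Illustrative sketch of the URL parsing described above.
def _extract_product_id(url: str) -> str:
    amazon = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    if amazon:
        return amazon.group(1)
    ebay = re.search(r"/itm/(\d+)", url)
    if ebay:
        return ebay.group(1)
    # Other marketplaces: fall back to a stable hash of the URL
    return hashlib.md5(url.encode()).hexdigest()[:12]
```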

### Bulk Operations
For indexing multiple products, always use `index_products()` instead of looping over `index_product()`:
- More efficient (uses the Elasticsearch bulk API)
- Returns a tuple of (success_count, failed_count)
- Handles individual failures without stopping the entire operation
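
The difference is roughly the one sketched below with the standard `elasticsearch.helpers.bulk` API (the actual `index_products()` may differ in index-name handling and error reporting):

```python
from elasticsearch import helpers

# Assumed shape of index_products(); raise_on_error=False lets individual
# document failures accumulate instead of aborting the whole batch.
def index_products(self, products) -> tuple[int, int]:
    actions = (
        {
            "_index": "marketplace_products",
            "_id": product.product_id,
            "_source": product.to_elasticsearch_doc(),
        }
        for product in products
    )
    success, errors = helpers.bulk(self.client, actions, raise_on_error=False)
    return success, len(errors)
```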

### Test Infrastructure
Tests use a custom test runner (`run_tests.py`) rather than pytest. Tests are designed to run without:
- Elasticsearch running
- Web requests
- API keys

All tests use mock data and verify core functionality in isolation.

## Services Integration

- **Elasticsearch**: `localhost:9200` (via Docker)
- **Kibana**: `localhost:5601` (for data visualization)
- **ScrapeGraphAI**: External API (requires API key)

## Package Structure

The package is installable via `setup.py` and exports main components in `__init__.py`:
```python
from src.scrapegraph_demo import Config, ElasticsearchClient, MarketplaceScraper, Product
```

The version is defined in `__init__.py` as `__version__ = "0.1.0"`.

amazon_keyboard_scraper.py

Lines changed: 17 additions & 111 deletions
@@ -24,7 +24,6 @@
 import os
 import time
 import re
-import hashlib
 import traceback
 from typing import List, Dict, Any, Optional
 from datetime import datetime
@@ -52,25 +51,24 @@ def __init__(self):
         # Users can set SGAI_API_KEY environment variable to override the default
         api_key = os.environ.get('SGAI_API_KEY', self.DEFAULT_API_KEY)
         os.environ['SGAI_API_KEY'] = api_key
-
+
         # Load configuration
         self.config = Config.from_env()
-
+
         # Initialize Elasticsearch client
         self.es_client = ElasticsearchClient(self.config)
-
-        # Initialize ScrapeGraph client
+
+        # Initialize ScrapeGraph client (required)
         try:
             from scrapegraph_py import Client
             self.sg_client = Client(api_key=api_key)
-            self.use_api = True
             print("✓ ScrapeGraph API client initialized")
         except Exception as e:
-            print(f"⚠ Warning: Could not initialize ScrapeGraph API client: {e}")
-            print("  Will use mock data for demonstration")
-            self.sg_client = None
-            self.use_api = False
-
+            raise RuntimeError(
+                f"Failed to initialize ScrapeGraph API client: {e}\n"
+                f"Please ensure you have a valid SGAI_API_KEY set in your environment."
+            )
+
         # Statistics
         self.total_scraped = 0
         self.total_stored = 0
@@ -80,38 +78,26 @@ def scrape_page(self, page_num: int) -> List[Product]:
     def scrape_page(self, page_num: int) -> List[Product]:
         """
         Scrape a single page of Amazon search results
-
+
         Args:
             page_num: Page number to scrape (1-20)
-
+
         Returns:
-            List of Product objects
+            List of Product objects (empty list if scraping fails)
         """
         page_url = f"{self.AMAZON_BASE_URL}&page={page_num}"
         print(f"\n📄 Scraping page {page_num}/{self.TOTAL_PAGES}: {page_url}")
-
+
         products = []
-
+
         try:
-            if self.use_api and self.sg_client:
-                # Use ScrapeGraph API to scrape the page
-                try:
-                    products = self._scrape_with_api(page_url, page_num)
-                except Exception as api_error:
-                    print(f"  ⚠ API error: {str(api_error)}")
-                    print(f"  Falling back to mock data for page {page_num}")
-                    products = self._scrape_with_mock(page_url, page_num)
-            else:
-                # Use mock data for demonstration
-                products = self._scrape_with_mock(page_url, page_num)
-
+            products = self._scrape_with_api(page_url, page_num)
             print(f"✓ Found {len(products)} products on page {page_num}")
             self.total_scraped += len(products)
-
         except Exception as e:
             print(f"✗ Error scraping page {page_num}: {str(e)}")
             self.failed_pages.append({"page": page_num, "error": str(e)})
-
+
         return products

     def _scrape_with_api(self, page_url: str, page_num: int) -> List[Product]:
@@ -211,86 +197,7 @@ def _scrape_with_api(self, page_url: str, page_num: int) -> List[Product]:
             })

         return products
-
-    def _scrape_with_mock(self, page_url: str, page_num: int) -> List[Product]:
-        """Generate mock product data for demonstration"""
-        products = []
-        # Generate 10-15 products per page
-        num_products = 12 + (page_num % 4)
-
-        keyboard_types = [
-            "Mechanical Gaming Keyboard RGB",
-            "Wireless Bluetooth Keyboard",
-            "Ergonomic Split Keyboard",
-            "Compact 60% Mechanical Keyboard",
-            "Full-Size Office Keyboard",
-            "Gaming Keyboard with Wrist Rest",
-            "Backlit Mechanical Keyboard",
-            "Ultra-Thin Wireless Keyboard",
-            "Gaming Keyboard and Mouse Combo",
-            "Mechanical Keyboard TKL",
-            "RGB Gaming Keyboard 104 Keys",
-            "Portable Foldable Keyboard",
-            "Mechanical Keyboard Hot Swappable",
-            "Wireless Gaming Keyboard",
-            "Professional Typing Keyboard"
-        ]
-
-        brands = ["Logitech", "Razer", "Corsair", "HyperX", "SteelSeries",
-                  "Keychron", "Ducky", "ASUS", "Redragon", "Cooler Master"]
-
-        for i in range(num_products):
-            # Generate unique product ID
-            product_seed = f"{page_num}-{i}"
-            product_hash = hashlib.md5(product_seed.encode()).hexdigest()[:10].upper()
-            asin = f"B0{product_hash[:8]}"
-
-            # Select keyboard type and brand
-            keyboard_name = keyboard_types[(page_num * i) % len(keyboard_types)]
-            brand = brands[(page_num + i) % len(brands)]
-
-            # Generate realistic price (20-200 EUR)
-            base_price = 29.99 + (i * 8.5) + (page_num * 3)
-            price = round(base_price % 180 + 20, 2)
-
-            # Generate rating (3.5 - 5.0)
-            rating = round(3.5 + ((page_num + i) % 16) * 0.1, 1)
-            if rating > 5.0:
-                rating = 5.0
-
-            # Generate review count (10-5000)
-            review_count = 50 + (i * 150) + (page_num * 80)
-            if review_count > 5000:
-                review_count = review_count % 5000 + 100
-
-            # Prime availability (70% chance)
-            has_prime = ((page_num + i) % 10) < 7
-
-            product = Product(
-                product_id=asin,
-                name=f"{brand} {keyboard_name}",
-                price=price,
-                currency="EUR",
-                url=f"https://www.amazon.it/dp/{asin}",
-                marketplace=self.MARKETPLACE,
-                description=f"High-quality {keyboard_name.lower()} from {brand}",
-                brand=brand,
-                category="Keyboards",
-                rating=rating,
-                review_count=review_count,
-                availability="Prime" if has_prime else "Standard",
-                specifications={
-                    "prime_eligible": has_prime,
-                    "page_number": page_num,
-                    "keyboard_type": keyboard_name
-                },
-                scraped_at=datetime.utcnow()
-            )
-
-            products.append(product)
-
-        return products
-
+
     def store_products(self, products: List[Product]) -> int:
         """
         Store products in Elasticsearch
@@ -326,7 +233,6 @@ def scrape_all_pages(self):
         print(f"Target: {self.AMAZON_BASE_URL}")
         print(f"Pages to scrape: {self.TOTAL_PAGES}")
         print(f"Marketplace: {self.MARKETPLACE}")
-        print(f"Using API: {self.use_api}")
         print("="*70)

         start_time = time.time()

0 commit comments
