
Commit 1dd953a

lurenss and claude committed
Remove mock data fallback from Amazon keyboard scraper
- Remove _scrape_with_mock() method and all mock data generation
- Update initialization to require a valid ScrapeGraph API client
- Simplify scrape_page() to return an empty list on API failure instead of falling back to mock data
- Remove unused hashlib import
- Remove use_api flag from scrape_all_pages() output
- Add CLAUDE.md documentation for future Claude Code instances

The scraper now requires a working API connection and will fail gracefully if the API has issues, continuing to attempt other pages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent cb18671 commit 1dd953a

2 files changed: +207, −111 lines

CLAUDE.md

Lines changed: 190 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a demonstration project integrating **ScrapeGraphAI SDK** with **Elasticsearch** for AI-powered marketplace product scraping and comparison. The project demonstrates how to scrape product data from marketplaces (Amazon, eBay, etc.), store it in Elasticsearch, and perform advanced searches and comparisons.

## Common Commands

### Environment Setup
```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Mac/Linux
# venv\Scripts\activate   # On Windows

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env to add SCRAPEGRAPHAI_API_KEY or OPENAI_API_KEY
```

### Docker/Elasticsearch
```bash
# Start Elasticsearch and Kibana
# Use 'docker compose' (Docker CLI plugin) or 'docker-compose' (standalone)
docker compose up -d
# OR: docker-compose up -d

# Check Elasticsearch health
curl http://localhost:9200/_cluster/health

# Stop services
docker compose down
# OR: docker-compose down

# View logs
docker compose logs elasticsearch
docker compose logs kibana
# OR: docker-compose logs elasticsearch
```

### Running Examples
```bash
# Basic usage demonstration
python examples/basic_usage.py

# Product comparison across marketplaces
python examples/product_comparison.py

# Advanced search capabilities
python examples/advanced_search.py

# Interactive quickstart demo
python quickstart.py
```

### Testing
```bash
# Run all tests (custom test runner)
python run_tests.py

# Run individual test modules
python tests/test_config.py
python tests/test_models.py
python tests/test_scraper.py
```

## Architecture

### Core Components (4-Layer Pattern)

The architecture follows a clean separation of concerns with 4 main components in `src/scrapegraph_demo/`:

1. **Config** (`config.py`): Environment-based configuration management using `@dataclass` and `python-dotenv`. Loads settings from `.env` and provides connection URLs (see the sketch after this list).

2. **Models** (`models.py`): Pydantic v2 models for type-safe data handling:
   - `Product`: Represents a marketplace product with validation
   - `ProductComparison`: Provides comparison methods (cheapest, best-rated, grouping)

3. **ElasticsearchClient** (`elasticsearch_client.py`): Manages all Elasticsearch operations including index creation, product indexing (single/bulk), search with filters, aggregations, and statistics.

4. **MarketplaceScraper** (`scraper.py`): Handles web scraping using ScrapeGraphAI's `SmartScraperGraph`. Includes mock data fallback for testing without API keys.
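
As a rough illustration of the `Config` layer described in item 1, a dataclass built on `python-dotenv` might look like the sketch below. The field names, defaults, and property are illustrative assumptions, not the exact contents of `config.py`:

```python
# Hypothetical sketch of the Config layer; field names and defaults are
# illustrative assumptions, not the repository's exact implementation.
import os
from dataclasses import dataclass

from dotenv import load_dotenv


@dataclass
class Config:
    elasticsearch_host: str = "localhost"
    elasticsearch_port: int = 9200
    elasticsearch_scheme: str = "http"

    @classmethod
    def from_env(cls) -> "Config":
        load_dotenv()  # pull variables from .env into the process environment
        return cls(
            elasticsearch_host=os.getenv("ELASTICSEARCH_HOST", "localhost"),
            elasticsearch_port=int(os.getenv("ELASTICSEARCH_PORT", "9200")),
            elasticsearch_scheme=os.getenv("ELASTICSEARCH_SCHEME", "http"),
        )

    @property
    def elasticsearch_url(self) -> str:
        # Connection URL consumed by ElasticsearchClient
        return f"{self.elasticsearch_scheme}://{self.elasticsearch_host}:{self.elasticsearch_port}"
```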
### Key Architectural Patterns

#### Mock Data Fallback
The scraper implements a graceful fallback to mock data when ScrapeGraphAI is unavailable or API keys are missing. This enables:
- Testing without external dependencies
- Development without API keys
- Consistent demonstration data

Look for `_mock_scrape_product()` and `_mock_scrape_search_results()` methods in `scraper.py`.
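
A minimal sketch of the fallback dispatch, assuming a `try/except` around the ScrapeGraphAI call; only the `_mock_scrape_search_results()` name comes from the repository, while the attribute name and the `_scrape_with_scrapegraph()` wrapper are illustrative:

```python
# Illustrative only: the api_key attribute and _scrape_with_scrapegraph() are
# assumptions; see scraper.py for the actual implementation.
def scrape_search_results(self, query: str, marketplace: str, max_results: int = 10):
    if not getattr(self, "api_key", None):
        # No API key configured: serve deterministic mock data
        return self._mock_scrape_search_results(query, marketplace, max_results)
    try:
        return self._scrape_with_scrapegraph(query, marketplace, max_results)
    except Exception:
        # Scraping failed: degrade gracefully to mock data
        return self._mock_scrape_search_results(query, marketplace, max_results)
```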

#### Elasticsearch Index Design
The `marketplace_products` index uses a carefully designed mapping:
- **Keyword fields** for exact matching: `product_id`, `marketplace`, `category`, `availability`, `currency`
- **Text fields with keyword sub-fields** for flexible search: `name`, `description`, `brand`
- **Proper data types**: `float` for price and rating, `integer` for review_count, `date` for scraped_at
- **Object type** for nested specifications

This design optimizes search performance by using term queries for filters and multi_match for text search.
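
The bullets above translate roughly into a mapping like the sketch below (using the `elasticsearch` Python client; the actual mapping created by `elasticsearch_client.py` may differ in analyzers and settings):

```python
# Illustrative mapping mirroring the field types listed above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="marketplace_products",
    mappings={
        "properties": {
            "product_id": {"type": "keyword"},
            "marketplace": {"type": "keyword"},
            "category": {"type": "keyword"},
            "availability": {"type": "keyword"},
            "currency": {"type": "keyword"},
            "name": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "description": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "brand": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
            "price": {"type": "float"},
            "rating": {"type": "float"},
            "review_count": {"type": "integer"},
            "scraped_at": {"type": "date"},
            "specifications": {"type": "object"},
        }
    },
)
```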

#### Integration Pattern
The typical workflow combines scraping and indexing:
```python
config = Config.from_env()
scraper = MarketplaceScraper(config)
es_client = ElasticsearchClient(config)

# Scrape → Index → Search
products = scraper.scrape_search_results(query, marketplace, max_results)
es_client.index_products(products)  # Bulk indexing
results = es_client.search_products(query, filters...)
```

### Data Flow

1. **Configuration Loading**: `Config.from_env()` loads environment variables
2. **Scraping**: `MarketplaceScraper` uses ScrapeGraphAI to extract product data (or falls back to mock data)
3. **Validation**: Pydantic models validate and structure the data
4. **Indexing**: `ElasticsearchClient` stores products in Elasticsearch
5. **Search/Analysis**: Full-text search, filtering, aggregations, and comparisons

## Development Patterns

### Pydantic Models
All data models use Pydantic v2 for:
- Type validation and coercion
- JSON serialization via `model_dump(mode='json')`
- IDE autocomplete support
- Elasticsearch document conversion via `to_elasticsearch_doc()`
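
A trimmed-down sketch of what this implies for the `Product` model; the real model in `models.py` has more fields and validation, and only the method names above are taken from the docs:

```python
# Reduced sketch for illustration; the exact field set is an assumption.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field


class Product(BaseModel):
    product_id: str
    name: str
    price: float = Field(ge=0)
    currency: str = "EUR"
    marketplace: str
    rating: Optional[float] = None
    scraped_at: datetime

    def to_elasticsearch_doc(self) -> dict:
        # mode="json" serialises datetime fields to ISO strings for Elasticsearch
        return self.model_dump(mode="json")
```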

### Error Handling
The codebase implements graceful degradation:
- Elasticsearch connection failures are caught and logged
- Scraping errors trigger mock data fallback
- Bulk indexing returns success/failure counts rather than raising exceptions
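
For example, a connection check in this style swallows the failure and reports it rather than crashing (a sketch under assumed method and attribute names, not the repository's code):

```python
# Assumed shape of the graceful-degradation check in ElasticsearchClient.
def is_connected(self) -> bool:
    try:
        return bool(self.client.ping())
    except Exception as exc:
        print(f"⚠ Elasticsearch unavailable: {exc}")
        return False  # callers skip indexing instead of raising
```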

### Environment Variables
Required environment variables (set in `.env`):
- `SCRAPEGRAPHAI_API_KEY` or `OPENAI_API_KEY` (one required for AI scraping)
- `ELASTICSEARCH_HOST`, `ELASTICSEARCH_PORT`, `ELASTICSEARCH_SCHEME` (optional, have defaults)
- `ELASTICSEARCH_USERNAME`, `ELASTICSEARCH_PASSWORD` (optional, for auth)

## Important Implementation Details

### Price Extraction
The scraper includes an `_extract_price()` utility that handles various price formats:
- Removes currency symbols ($, €, £, etc.)
- Handles comma/period number formats
- Extracts the first numeric value from strings
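
A hedged re-implementation of that behaviour, to show the intent; the repository's `_extract_price()` may handle more edge cases:

```python
import re
from typing import Optional

# Illustrative sketch; not the exact helper from scraper.py.
def _extract_price(raw: str) -> Optional[float]:
    cleaned = re.sub(r"[^\d.,]", "", raw)   # drop currency symbols and text
    if re.search(r",\d{2}$", cleaned):      # European format, e.g. "1.299,99"
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:                                   # US format, e.g. "1,299.99"
        cleaned = cleaned.replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)  # first numeric value
    return float(match.group()) if match else None
```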

### Product ID Extraction
`_extract_product_id()` extracts product IDs from marketplace URLs:
- Amazon: Looks for `/dp/` or `/gp/product/` patterns
- eBay: Extracts from `/itm/` pattern
- Falls back to URL hash for other marketplaces
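
Roughly, the logic described above amounts to the following sketch (the regexes and hash length are assumptions, not the exact implementation):

```python
import hashlib
import re

# Illustrative sketch of the URL parsing described above.
def _extract_product_id(url: str) -> str:
    amazon = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    if amazon:
        return amazon.group(1)
    ebay = re.search(r"/itm/(\d+)", url)
    if ebay:
        return ebay.group(1)
    # Other marketplaces: fall back to a stable hash of the URL
    return hashlib.md5(url.encode()).hexdigest()[:12]
```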

### Bulk Operations
For indexing multiple products, always use `index_products()` instead of looping over `index_product()`:
- More efficient (uses the Elasticsearch bulk API)
- Returns a tuple of (success_count, failed_count)
- Handles individual failures without stopping the entire operation
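
The difference is roughly the one sketched below with the standard `elasticsearch.helpers.bulk` API (the actual `index_products()` may differ in index-name handling and error reporting):

```python
from elasticsearch import helpers

# Assumed shape of index_products(); raise_on_error=False lets individual
# document failures accumulate instead of aborting the whole batch.
def index_products(self, products) -> tuple[int, int]:
    actions = (
        {
            "_index": "marketplace_products",
            "_id": product.product_id,
            "_source": product.to_elasticsearch_doc(),
        }
        for product in products
    )
    success, errors = helpers.bulk(self.client, actions, raise_on_error=False)
    return success, len(errors)
```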

### Test Infrastructure
Tests use a custom test runner (`run_tests.py`) rather than pytest. Tests are designed to run without:
- Elasticsearch running
- Web requests
- API keys

All tests use mock data and verify core functionality in isolation.

## Services Integration

- **Elasticsearch**: `localhost:9200` (via Docker)
- **Kibana**: `localhost:5601` (for data visualization)
- **ScrapeGraphAI**: External API (requires API key)

## Package Structure

The package is installable via `setup.py` and exports main components in `__init__.py`:
```python
from src.scrapegraph_demo import Config, ElasticsearchClient, MarketplaceScraper, Product
```

The version is defined in `__init__.py` as `__version__ = "0.1.0"`.

amazon_keyboard_scraper.py

Lines changed: 17 additions & 111 deletions
@@ -24,7 +24,6 @@
 import os
 import time
 import re
-import hashlib
 import traceback
 from typing import List, Dict, Any, Optional
 from datetime import datetime
@@ -52,25 +51,24 @@ def __init__(self):
         # Users can set SGAI_API_KEY environment variable to override the default
         api_key = os.environ.get('SGAI_API_KEY', self.DEFAULT_API_KEY)
         os.environ['SGAI_API_KEY'] = api_key
-
+
         # Load configuration
         self.config = Config.from_env()
-
+
         # Initialize Elasticsearch client
         self.es_client = ElasticsearchClient(self.config)
-
-        # Initialize ScrapeGraph client
+
+        # Initialize ScrapeGraph client (required)
         try:
             from scrapegraph_py import Client
             self.sg_client = Client(api_key=api_key)
-            self.use_api = True
             print("✓ ScrapeGraph API client initialized")
         except Exception as e:
-            print(f"⚠ Warning: Could not initialize ScrapeGraph API client: {e}")
-            print("  Will use mock data for demonstration")
-            self.sg_client = None
-            self.use_api = False
-
+            raise RuntimeError(
+                f"Failed to initialize ScrapeGraph API client: {e}\n"
+                f"Please ensure you have a valid SGAI_API_KEY set in your environment."
+            )
+
         # Statistics
         self.total_scraped = 0
         self.total_stored = 0
@@ -80,38 +78,26 @@ def scrape_page(self, page_num: int) -> List[Product]:
     def scrape_page(self, page_num: int) -> List[Product]:
         """
         Scrape a single page of Amazon search results
-
+
         Args:
             page_num: Page number to scrape (1-20)
-
+
         Returns:
-            List of Product objects
+            List of Product objects (empty list if scraping fails)
         """
         page_url = f"{self.AMAZON_BASE_URL}&page={page_num}"
         print(f"\n📄 Scraping page {page_num}/{self.TOTAL_PAGES}: {page_url}")
-
+
         products = []
-
+
         try:
-            if self.use_api and self.sg_client:
-                # Use ScrapeGraph API to scrape the page
-                try:
-                    products = self._scrape_with_api(page_url, page_num)
-                except Exception as api_error:
-                    print(f"  ⚠ API error: {str(api_error)}")
-                    print(f"  Falling back to mock data for page {page_num}")
-                    products = self._scrape_with_mock(page_url, page_num)
-            else:
-                # Use mock data for demonstration
-                products = self._scrape_with_mock(page_url, page_num)
-
+            products = self._scrape_with_api(page_url, page_num)
             print(f"✓ Found {len(products)} products on page {page_num}")
             self.total_scraped += len(products)
-
         except Exception as e:
             print(f"✗ Error scraping page {page_num}: {str(e)}")
             self.failed_pages.append({"page": page_num, "error": str(e)})
-
+
         return products

     def _scrape_with_api(self, page_url: str, page_num: int) -> List[Product]:
@@ -211,86 +197,7 @@ def _scrape_with_api(self, page_url: str, page_num: int) -> List[Product]:
             })

         return products
-
-    def _scrape_with_mock(self, page_url: str, page_num: int) -> List[Product]:
-        """Generate mock product data for demonstration"""
-        products = []
-        # Generate 10-15 products per page
-        num_products = 12 + (page_num % 4)
-
-        keyboard_types = [
-            "Mechanical Gaming Keyboard RGB",
-            "Wireless Bluetooth Keyboard",
-            "Ergonomic Split Keyboard",
-            "Compact 60% Mechanical Keyboard",
-            "Full-Size Office Keyboard",
-            "Gaming Keyboard with Wrist Rest",
-            "Backlit Mechanical Keyboard",
-            "Ultra-Thin Wireless Keyboard",
-            "Gaming Keyboard and Mouse Combo",
-            "Mechanical Keyboard TKL",
-            "RGB Gaming Keyboard 104 Keys",
-            "Portable Foldable Keyboard",
-            "Mechanical Keyboard Hot Swappable",
-            "Wireless Gaming Keyboard",
-            "Professional Typing Keyboard"
-        ]
-
-        brands = ["Logitech", "Razer", "Corsair", "HyperX", "SteelSeries",
-                  "Keychron", "Ducky", "ASUS", "Redragon", "Cooler Master"]
-
-        for i in range(num_products):
-            # Generate unique product ID
-            product_seed = f"{page_num}-{i}"
-            product_hash = hashlib.md5(product_seed.encode()).hexdigest()[:10].upper()
-            asin = f"B0{product_hash[:8]}"
-
-            # Select keyboard type and brand
-            keyboard_name = keyboard_types[(page_num * i) % len(keyboard_types)]
-            brand = brands[(page_num + i) % len(brands)]
-
-            # Generate realistic price (20-200 EUR)
-            base_price = 29.99 + (i * 8.5) + (page_num * 3)
-            price = round(base_price % 180 + 20, 2)
-
-            # Generate rating (3.5 - 5.0)
-            rating = round(3.5 + ((page_num + i) % 16) * 0.1, 1)
-            if rating > 5.0:
-                rating = 5.0
-
-            # Generate review count (10-5000)
-            review_count = 50 + (i * 150) + (page_num * 80)
-            if review_count > 5000:
-                review_count = review_count % 5000 + 100
-
-            # Prime availability (70% chance)
-            has_prime = ((page_num + i) % 10) < 7
-
-            product = Product(
-                product_id=asin,
-                name=f"{brand} {keyboard_name}",
-                price=price,
-                currency="EUR",
-                url=f"https://www.amazon.it/dp/{asin}",
-                marketplace=self.MARKETPLACE,
-                description=f"High-quality {keyboard_name.lower()} from {brand}",
-                brand=brand,
-                category="Keyboards",
-                rating=rating,
-                review_count=review_count,
-                availability="Prime" if has_prime else "Standard",
-                specifications={
-                    "prime_eligible": has_prime,
-                    "page_number": page_num,
-                    "keyboard_type": keyboard_name
-                },
-                scraped_at=datetime.utcnow()
-            )
-
-            products.append(product)
-
-        return products
-
+
     def store_products(self, products: List[Product]) -> int:
         """
         Store products in Elasticsearch
@@ -326,7 +233,6 @@ def scrape_all_pages(self):
         print(f"Target: {self.AMAZON_BASE_URL}")
         print(f"Pages to scrape: {self.TOTAL_PAGES}")
         print(f"Marketplace: {self.MARKETPLACE}")
-        print(f"Using API: {self.use_api}")
         print("="*70)

         start_time = time.time()

0 commit comments
