Commit d64f6f1

Merge pull request #3 from ScrapeGraphAI/copilot/migrate-to-scrapegraph-py-sdk
Migrate from open-source scrapegraphai to API-based scrapegraph-py SDK
2 parents 6d4557e + 78b145f

File tree

11 files changed: +267 −97 lines changed

.env.example

Lines changed: 5 additions & 8 deletions

@@ -1,12 +1,9 @@
+# ScrapeGraphAI API Key (required for scrapegraph-py SDK)
+SGAI_API_KEY=your-scrapegraphai-api-key-here
+
 # Elasticsearch Configuration
 ELASTICSEARCH_HOST=localhost
 ELASTICSEARCH_PORT=9200
 ELASTICSEARCH_SCHEME=http
-ELASTICSEARCH_USERNAME=elastic
-ELASTICSEARCH_PASSWORD=changeme
-
-# ScrapeGraphAI Configuration
-SCRAPEGRAPHAI_API_KEY=your_api_key_here
-
-# Optional: OpenAI API Key for LLM functionality
-OPENAI_API_KEY=your_openai_api_key_here
+# ELASTICSEARCH_USERNAME=
+# ELASTICSEARCH_PASSWORD=
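
A quick way to sanity-check the renamed variables after editing `.env` (a sketch assuming `python-dotenv` is installed, which this diff does not confirm; the demo's own `Config.from_env()`, shown in the config.py diff below, reads plain environment variables via `os.getenv`):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # copy .env entries into the process environment

# SGAI_API_KEY replaces the old SCRAPEGRAPHAI_API_KEY / OPENAI_API_KEY pair
print("SGAI_API_KEY set:", bool(os.getenv("SGAI_API_KEY")))
print("Elasticsearch:", os.getenv("ELASTICSEARCH_HOST", "localhost"),
      os.getenv("ELASTICSEARCH_PORT", "9200"))
```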

README.md

Lines changed: 60 additions & 19 deletions

@@ -1,21 +1,25 @@
 # ScrapeGraphAI Elasticsearch Demo
 
-A comprehensive demo project showcasing the integration of **ScrapeGraphAI SDK** with **Elasticsearch** for intelligent marketplace product scraping, storage, and comparison.
+A comprehensive demo project showcasing the integration of **ScrapeGraphAI API (via scrapegraph-py SDK)** with **Elasticsearch** for intelligent marketplace product scraping, storage, and comparison.
+
+> **Note**: This demo uses the `scrapegraph-py` SDK which provides API-based scraping through ScrapeGraphAI's cloud service. This means simpler setup, no local LLM requirements, and managed infrastructure.
 
 ## 🚀 Features
 
-- **Web Scraping with ScrapeGraphAI**: Leverage AI-powered scraping to extract structured product data from marketplace websites
+- **Web Scraping with ScrapeGraphAI API**: Leverage cloud-based AI scraping to extract structured product data from marketplace websites
+- **Simple SDK Integration**: Use the `scrapegraph-py` SDK for easy API-based scraping
 - **Elasticsearch Integration**: Store and index product data for powerful search and analytics
 - **Multi-Marketplace Support**: Scrape and compare products across different marketplaces (Amazon, eBay, etc.)
 - **Product Comparison**: Advanced features to compare products by price, ratings, and specifications
 - **Flexible Search**: Full-text search with filters for marketplace, price range, and more
 - **Data Analytics**: Aggregations and statistics on product data
+- **No Local LLM Setup**: All AI processing happens in the cloud - just use your API key
 
 ## 📋 Prerequisites
 
 - Python 3.8 or higher
 - Docker and Docker Compose (for Elasticsearch)
-- OpenAI API key (optional, for AI-powered scraping)
+- ScrapeGraphAI API key (get one at [scrapegraphai.com](https://scrapegraphai.com))
 
 ## 🔧 Installation
 
@@ -48,11 +52,16 @@ pip install -r requirements.txt
 # Copy the example environment file
 cp .env.example .env
 
-# Edit .env and add your configuration
-# At minimum, you need to set:
-# - SCRAPEGRAPHAI_API_KEY or OPENAI_API_KEY
+# Edit .env and add your ScrapeGraphAI API key
+# Required: SGAI_API_KEY=your-api-key-here
 ```
 
+**Getting your API Key:**
+1. Visit [scrapegraphai.com](https://scrapegraphai.com)
+2. Sign up or log in to your account
+3. Navigate to your API settings
+4. Copy your API key and add it to `.env` as `SGAI_API_KEY`
+
 ### 4. Start Elasticsearch
 
 ```bash
@@ -117,14 +126,14 @@ This demonstrates:
 ```python
 from src.scrapegraph_demo import Config, ElasticsearchClient, MarketplaceScraper
 
-# Load configuration
+# Load configuration (reads SGAI_API_KEY from environment)
 config = Config.from_env()
 
 # Initialize clients
 es_client = ElasticsearchClient(config)
 scraper = MarketplaceScraper(config)
 
-# Scrape a product
+# Scrape a product using the SDK
 product = scraper.scrape_product(
     url="https://www.amazon.com/dp/PRODUCTID",
     marketplace="Amazon"
@@ -144,12 +153,16 @@ results = es_client.search_products(
 # Print results
 for product in results:
     print(f"{product.name} - ${product.price}")
+
+# Clean up
+scraper.close()
+es_client.close()
 ```
 
 ### Scraping Search Results
 
 ```python
-# Scrape multiple products from a search
+# Scrape multiple products from a search using the SDK
 products = scraper.scrape_search_results(
     search_query="wireless mouse",
     marketplace="Amazon",
@@ -159,6 +172,9 @@ products = scraper.scrape_search_results(
 # Bulk index
 success, failed = es_client.index_products(products)
 print(f"Indexed {success} products")
+
+# Don't forget to close the scraper
+scraper.close()
 ```
 
 ### Product Comparison
@@ -213,11 +229,12 @@ Manages all Elasticsearch operations:
 
 ### MarketplaceScraper
 
-Handles web scraping using ScrapeGraphAI:
-- Scrape individual product pages
-- Scrape search results
+Handles web scraping using ScrapeGraphAI API (via scrapegraph-py SDK):
+- Scrape individual product pages using cloud-based AI
+- Scrape search results with structured data extraction
 - Extract structured data (price, rating, specs, etc.)
 - Support for multiple marketplaces
+- Automatic fallback to mock data if API is unavailable
 
 ### Product Model
 
@@ -234,15 +251,14 @@ Pydantic model representing a marketplace product:
 
 | Variable | Description | Required | Default |
 |----------|-------------|----------|---------|
+| `SGAI_API_KEY` | ScrapeGraphAI API key | Yes* | - |
 | `ELASTICSEARCH_HOST` | Elasticsearch host | No | `localhost` |
 | `ELASTICSEARCH_PORT` | Elasticsearch port | No | `9200` |
 | `ELASTICSEARCH_SCHEME` | HTTP or HTTPS | No | `http` |
 | `ELASTICSEARCH_USERNAME` | Elasticsearch username | No | - |
 | `ELASTICSEARCH_PASSWORD` | Elasticsearch password | No | - |
-| `SCRAPEGRAPHAI_API_KEY` | ScrapeGraphAI API key | Yes* | - |
-| `OPENAI_API_KEY` | OpenAI API key | Yes* | - |
 
-*Either `SCRAPEGRAPHAI_API_KEY` or `OPENAI_API_KEY` is required for AI-powered scraping.
+*`SGAI_API_KEY` is required for API-based scraping. Without it, the demo will use mock data for testing.
 
 ## 📊 Elasticsearch Index
 
@@ -278,15 +294,23 @@ Use Kibana to:
 
 ## 🧪 Testing
 
-The project includes mock data functionality for testing without actual web scraping:
+Run the test suite:
+
+```bash
+python run_tests.py
+```
+
+The project includes mock data functionality for testing without API credits:
 
 ```python
-# The scraper automatically falls back to mock data if ScrapeGraphAI is unavailable
+# The scraper automatically falls back to mock data if API key is not set
 scraper = MarketplaceScraper(config)
 products = scraper.scrape_search_results("laptop", "Amazon", max_results=5)
 # Returns mock products for testing
 ```
 
+All tests use mock data and don't require an API key.
+
 ## 🤝 Contributing
 
 Contributions are welcome! Please feel free to submit a Pull Request.
@@ -297,9 +321,11 @@ This project is provided as-is for demonstration purposes.
 
 ## 🔗 Related Resources
 
-- [ScrapeGraphAI Documentation](https://scrapegraphai.com/docs)
+- [ScrapeGraphAI Website](https://scrapegraphai.com) - Get your API key
+- [ScrapeGraphAI SDK Documentation](https://github.com/ScrapeGraphAI/scrapegraph-sdk) - scrapegraph-py SDK reference
+- [ScrapeGraphAI API Documentation](https://scrapegraphai.com/docs) - API documentation
 - [Elasticsearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)
-- [ScrapeGraphAI GitHub](https://github.com/ScrapeGraphAI/Scrapegraph-ai)
+- [ScrapeGraphAI Open Source](https://github.com/ScrapeGraphAI/Scrapegraph-ai) - Original open-source library
 
 ## 💡 Use Cases
 
@@ -313,6 +339,21 @@ This demo can be adapted for various use cases:
 
 ## 🐛 Troubleshooting
 
+### ScrapeGraphAI API Issues
+
+```bash
+# Verify your API key is set
+echo $SGAI_API_KEY
+
+# Test the SDK
+python -c "from scrapegraph_py import Client; print('SDK installed correctly')"
+```
+
+**Common Issues:**
+- **"SGAI_API_KEY not set"**: Make sure you've added your API key to `.env`
+- **API credits exhausted**: Check your account at scrapegraphai.com
+- **Connection timeout**: Check your internet connection
+
 ### Elasticsearch Connection Issues
 
 ```bash

examples/advanced_search.py

Lines changed: 1 addition & 0 deletions

@@ -124,6 +124,7 @@ def main():
        print("Product not found")

    # Clean up
+   scraper.close()
    es_client.close()

    print("\n\n=== Advanced search demo completed! ===")

examples/basic_usage.py

Lines changed: 1 addition & 0 deletions

@@ -85,6 +85,7 @@ def main():

    # Clean up
    print("\n9. Closing connections...")
+   scraper.close()
    es_client.close()

    print("\n=== Demo completed successfully! ===")

examples/product_comparison.py

Lines changed: 1 addition & 0 deletions

@@ -119,6 +119,7 @@ def main():
        print(f"   Availability: {product.availability}")

    # Clean up
+   scraper.close()
    es_client.close()

    print("\n" + "=" * 60)

quickstart.py

Lines changed: 8 additions & 1 deletion

@@ -75,7 +75,11 @@ def main():
    print_step(3, "Initializing Marketplace Scraper")
    scraper = MarketplaceScraper(config)
    print("✓ Scraper initialized")
-   print("  Using mock data for demonstration")
+   if not config.sgai_api_key:
+       print("  Note: SGAI_API_KEY not set, using mock data for demonstration")
+       print("  To use real API scraping, set SGAI_API_KEY in your .env file")
+   else:
+       print("  Using ScrapeGraphAI SDK for scraping")
    wait_for_user()

    # Step 4: Scrape Products
@@ -220,6 +224,9 @@ def main():
    print("   - python examples/advanced_search.py")
    print()

+   # Clean up connections
+   scraper.close()
+
    if es_connected:
        print("  5. Access Kibana at http://localhost:5601 for data visualization")
    print()

requirements.txt

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
-# ScrapeGraphAI SDK
-scrapegraphai>=1.0.0
+# ScrapeGraphAI SDK (API-based)
+scrapegraph-py>=1.0.0
 
 # Elasticsearch
 elasticsearch>=8.0.0
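
This dependency swap is the heart of the migration: `scrapegraphai` ran scraping graphs against a local or user-supplied LLM, whereas `scrapegraph-py` is a thin client for the hosted API. A minimal smoke test for the new package; `Client` and `smartscraper` are the SDK's documented entry points, but treat the exact signature and response shape as a sketch to verify against the SDK docs:

```python
import os

from scrapegraph_py import Client

# Assumes SGAI_API_KEY is already exported (see .env.example above).
client = Client(api_key=os.environ["SGAI_API_KEY"])

# smartscraper: prompt-driven structured extraction from a live URL.
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the page title and the main heading",
)
print(response)  # structured result returned by the hosted service

client.close()  # mirrors the scraper.close() calls added across this PR
```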

src/scrapegraph_demo/config.py

Lines changed: 3 additions & 7 deletions

@@ -19,11 +19,8 @@ class Config:
    elasticsearch_username: Optional[str]
    elasticsearch_password: Optional[str]

-   # ScrapeGraphAI settings
-   scrapegraphai_api_key: Optional[str]
-
-   # OpenAI settings (optional)
-   openai_api_key: Optional[str]
+   # ScrapeGraphAI SDK settings
+   sgai_api_key: Optional[str]

    @classmethod
    def from_env(cls) -> "Config":
@@ -36,8 +33,7 @@ def from_env(cls) -> "Config":
            elasticsearch_scheme=os.getenv("ELASTICSEARCH_SCHEME", "http"),
            elasticsearch_username=os.getenv("ELASTICSEARCH_USERNAME"),
            elasticsearch_password=os.getenv("ELASTICSEARCH_PASSWORD"),
-           scrapegraphai_api_key=os.getenv("SCRAPEGRAPHAI_API_KEY"),
-           openai_api_key=os.getenv("OPENAI_API_KEY"),
+           sgai_api_key=os.getenv("SGAI_API_KEY"),
        )

    @property
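
The renamed `sgai_api_key` field is what the mock-data fallback described in the README presumably keys off. A hypothetical sketch of that decision; `make_sdk_client` is illustrative and not part of this diff:

```python
from typing import Optional

from scrapegraph_py import Client

def make_sdk_client(config) -> Optional[Client]:
    """Return a scrapegraph-py Client, or None to signal mock-data mode.

    Hypothetical helper: the real MarketplaceScraper internals are not
    part of this commit; only the config field it reads is.
    """
    if not config.sgai_api_key:  # field introduced by this commit
        return None  # caller falls back to mock products
    return Client(api_key=config.sgai_api_key)
```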
