[GRO-227]: add python version of integration to mongodb #22

Open: wants to merge 3 commits into base: master
13 changes: 8 additions & 5 deletions README.md
@@ -122,21 +122,24 @@ Enhance your Vercel applications with web-browsing capabilities. Build Generativ
- Available in Node.js, Python, and Stagehand implementations
- Production-ready with comprehensive examples

### 📊 Evaluation & Testing

#### [**Braintrust Integration**](./examples/integrations/braintrust/README.md)
Integrate Browserbase with Braintrust for evaluation and testing of AI agent performance in web environments. Monitor, measure, and improve your browser automation workflows.
### 📊 Data Storage, Searching and Analysis

#### [**MongoDB Integration**](./examples/integrations/mongodb/README.md)
**Intelligent Web Scraping & Data Storage** - Extract structured data from e-commerce websites using Stagehand and store it in MongoDB for analysis. Perfect for building data pipelines, market research, and competitive analysis workflows.
**Intelligent Web Scraping & Data Storage** - Extract semi-structured data from e-commerce websites using Stagehand and store it in MongoDB for analysis. Perfect for building data pipelines, market research, and competitive analysis workflows.

**Capabilities:**
- MongoDB's document model, together with features such as Vector Search and real-time stream processing, makes it a strong foundation for search and data pipelines
- AI-powered web scraping with Stagehand
- Structured data extraction with schema validation
- MongoDB storage for persistence and querying
- Built-in data analysis and reporting
- Robust error handling for production use

### 📊 Evaluation & Testing

#### [**Braintrust Integration**](./examples/integrations/braintrust/README.md)
Integrate Browserbase with Braintrust for evaluation and testing of AI agent performance in web environments. Monitor, measure, and improve your browser automation workflows.

## 🏗️ Monorepo Structure

```
246 changes: 175 additions & 71 deletions examples/integrations/mongodb/README.md
@@ -1,99 +1,203 @@
# Stagehand MongoDB Scraper
# Browserbase + Stagehand MongoDB Integration

A web scraping project that uses Stagehand to extract structured data from e-commerce websites and store it in MongoDB for analysis.
A comprehensive web scraping integration that uses Stagehand to extract structured data from e-commerce websites and store it in MongoDB for analysis. Available in both **Python** and **TypeScript**.

## Features
## 🚀 Choose Your Language

- **Web Scraping**: Uses Stagehand (built on Playwright) for intelligent web scraping
- **Data Extraction**: Extracts structured product data using AI-powered instructions
- **MongoDB Storage**: Stores scraped data in MongoDB for persistence and querying
- **Schema Validation**: Uses Zod for schema validation and TypeScript interfaces
- **Error Handling**: Robust error handling to prevent crashes during scraping
- **Data Analysis**: Built-in MongoDB queries for data analysis
<table>
<tr>
<td width="50%" valign="top">

## Prerequisites
### 🐍 **Python Version**
**`📁 python/`**

- Node.js 16 or higher
Perfect for data scientists and Python developers who want:
- **Rich terminal output** with beautiful tables and progress indicators
- **Pydantic models** for robust data validation
- **Async/await** support for high-performance scraping
- **pymongo** for MongoDB operations
- Simple single-file architecture

**[→ Get Started with Python](python/README.md)**

```bash
cd python/
pip install -r requirements.txt
python main.py
```

</td>
<td width="50%" valign="top">

### 📘 **TypeScript Version**
**`📁 typescript/`**

Ideal for JavaScript/Node.js developers who prefer:
- **Type safety** with full TypeScript support
- **Zod schemas** for runtime validation
- **Modern ES modules** and clean architecture
- **MongoDB native driver** with full typing
- Modular, well-structured codebase

**[→ Get Started with TypeScript](typescript/README.md)**

```bash
cd typescript/
npm install
npm start
```

</td>
</tr>
</table>

## 🌟 Features (Both Versions)

- **🌐 Intelligent Web Scraping**: Uses Stagehand's AI-powered extraction
- **🗄️ MongoDB Storage**: Persistent data storage with proper indexing
- **📊 Data Analysis**: Built-in queries and reporting
- **🛡️ Error Handling**: Robust error handling and recovery
- **⚡ Performance**: Optimized for speed and reliability
- **🔍 Schema Validation**: Type-safe data models
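
The "proper indexing" mentioned above is not spelled out in this README. As a rough sketch of what it could look like in the Python version, the snippet below declares index specs as data and applies them with pymongo's `create_index`. The field names (`url`, `category`, `rating`) and the helper `ensure_indexes` are assumptions for illustration, not taken from the PR.

```python
from typing import Any, Dict, List, Tuple

# (key spec, options) pairs. Field names are assumptions, not taken from the PR.
IndexSpec = Tuple[List[Tuple[str, int]], Dict[str, Any]]

PRODUCT_INDEXES: List[IndexSpec] = [
    ([("url", 1)], {"unique": True}),  # one document per product page
    ([("category", 1)], {}),           # supports "products by category"
    ([("rating", -1)], {}),            # supports "top-rated products"
]


def ensure_indexes(collection: Any, specs: List[IndexSpec] = PRODUCT_INDEXES) -> List[str]:
    """Create each index and return the index names.

    `collection` is expected to be a pymongo Collection. create_index is a
    no-op when an identical index already exists, so this can run on start-up.
    """
    return [collection.create_index(keys, **options) for keys, options in specs]
```

With a real database this would be called as `ensure_indexes(client[db_name]["products"])`; keeping the specs in a plain list makes them easy to review and to mirror in the TypeScript version.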

## 📋 What It Does

Both versions perform the same core functionality:

1. **🔌 Connect** to MongoDB and set up collections with proper indexes
2. **📊 Scrape** Amazon product listings using Stagehand's AI extraction
3. **🔍 Extract** detailed product information including:
   - Product names, prices, ratings
   - Categories, descriptions, specifications
   - Review counts and availability
4. **💾 Store** all data in MongoDB with validated schemas
5. **📈 Analyze** the data with built-in reporting:
   - Collection statistics
   - Products by category
   - Top-rated products
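
The analysis step above can be expressed as MongoDB aggregation pipelines. The following is a minimal sketch, assuming a `products` collection whose documents carry `name`, `category`, and `rating` fields; the function names are hypothetical and the actual queries in `main.py` may differ.

```python
from typing import Any, Dict, List

Pipeline = List[Dict[str, Any]]


def products_by_category_pipeline() -> Pipeline:
    """Count products per category, largest category first."""
    return [
        {"$group": {"_id": "$category", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def top_rated_pipeline(limit: int = 5) -> Pipeline:
    """Return the highest-rated products, skipping documents without a rating."""
    return [
        {"$match": {"rating": {"$ne": None}}},
        {"$sort": {"rating": -1}},
        {"$limit": limit},
        {"$project": {"_id": 0, "name": 1, "rating": 1, "category": 1}},
    ]


def run_report(mongo_uri: str, db_name: str) -> Dict[str, list]:
    """Run both pipelines against the assumed `products` collection."""
    # Imported lazily so the pipeline builders can be used without pymongo.
    from pymongo import MongoClient

    client = MongoClient(mongo_uri)
    try:
        products = client[db_name]["products"]
        return {
            "by_category": list(products.aggregate(products_by_category_pipeline())),
            "top_rated": list(products.aggregate(top_rated_pipeline())),
        }
    finally:
        client.close()
```

Building the pipelines as plain lists keeps them testable without a running database; only `run_report` needs a live connection.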

## 🛠️ Prerequisites

**For Both Versions:**
- MongoDB installed locally or MongoDB Atlas account
- Stagehand API key

## Installation
**Python Version:**
- Python 3.8+

**TypeScript Version:**
- Node.js 16+
- npm or pnpm

## 🚦 Quick Start

### Python Quick Start
```bash
# Navigate to Python version
cd examples/integrations/mongodb/python

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp env.example .env
# Edit .env with your MongoDB URI and Stagehand API key

# Run the scraper
python main.py
```

### TypeScript Quick Start
```bash
# Navigate to TypeScript version
cd examples/integrations/mongodb/typescript

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with your MongoDB URI and Stagehand API key

# Run the scraper
npm start
```

## 📊 Sample Output

Both versions provide rich, colorful output showing the scraping progress:

```
🤘 Welcome to Stagehand MongoDB Scraper!

🔌 Connecting to MongoDB...
✅ Connected to MongoDB
⚙️ Creating indexes...
✅ Index creation completed

📊 Starting to scrape product listing...
✅ Scraped 16 products from category: Laptops

1. Clone the repository:
```
git clone <repository-url>
cd stagehand-mongodb-scraper
```
📊 Scraping details for product 1/3: MacBook Pro M3
✅ Scraped detailed information for: MacBook Pro M3

2. Install dependencies:
```
npm install
```
📊 Running Data Analysis
┌─────────────────┬───────┐
│ Collection      │ Count │
├─────────────────┼───────┤
│ PRODUCTS        │ 19    │
│ PRODUCT_LISTS   │ 1     │
└─────────────────┴───────┘

3. Set up environment variables:
```
# Create a .env file with the following variables
MONGO_URI=mongodb://localhost:27017
DB_NAME=scraper_db
```
🎉 Scraping completed successfully!
```

## Usage
## 🏗️ Architecture

1. Start MongoDB locally:
```
mongod
```
Both versions follow the same architectural patterns:

2. Run the scraper:
```
npm start
```
- **MongoDB Manager**: Handles database connections, indexing, and operations
- **Product Scraper**: Manages web scraping using Stagehand
- **Data Models**: Structured schemas for products and product lists
- **Data Analyzer**: Provides insights and reporting on collected data
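
To make the "Data Models" component more concrete, here is a small sketch of a validated product model. It uses a standard-library dataclass as a stand-in for the Pydantic models the Python version describes, so it runs without extra dependencies; the field names and validation rules are assumptions, not the PR's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Product:
    """Stand-in for the validated product model (field names are assumed)."""

    name: str
    url: str
    price: Optional[float] = None
    rating: Optional[float] = None
    category: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject obviously bad records before they reach MongoDB.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.price is not None and self.price < 0:
            raise ValueError("price must be non-negative")
        if self.rating is not None and not 0 <= self.rating <= 5:
            raise ValueError("rating must be between 0 and 5")

    def to_document(self) -> dict:
        """Return a plain dict suitable for inserting into a collection."""
        return asdict(self)
```

Validating at construction time means a malformed extraction fails early, before it is written to the database.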

3. The script will:
   - Scrape product listings from Amazon
   - Extract detailed information for the first 3 products
   - Extract reviews for each product
   - Store all data in MongoDB
   - Run analysis queries on the collected data showing:
     - Collection counts
     - Products by category
     - Top-rated products
## 🔧 Configuration

## Project Structure
Both versions support:
- **Browserbase** cloud browsers for scalability
- **Environment-based** configuration
- **Flexible MongoDB** connection options
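
Environment-based configuration can be as simple as the sketch below. The variable names `MONGO_URI` and `DB_NAME` and their defaults come from the previous README's `.env` example; whether the new Python and TypeScript versions use the same names is an assumption, and `load_settings` is a hypothetical helper.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    db_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read connection settings from the environment, with local defaults."""
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        db_name=env.get("DB_NAME", "scraper_db"),
    )
```

Passing the mapping explicitly, for example `load_settings({"MONGO_URI": "mongodb://db:27017"})`, makes the loader easy to exercise in tests without touching the real environment.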

The project has a simple structure with a single file containing all functionality:
## 📚 Documentation

- `index.ts`: Contains the complete implementation including:
  - MongoDB connection and data operations
  - Schema definitions
  - Scraping functions
  - Data analysis
  - Main execution logic
- `stagehand.config.js`: Stagehand configuration
- `.env.example`: Example environment variables
- **[Python Version Documentation](python/README.md)** - Detailed Python setup and usage
- **[TypeScript Version Documentation](typescript/README.md)** - Complete TypeScript guide
- **[Stagehand Documentation](https://docs.stagehand.dev/)** - Learn more about Stagehand
- **[MongoDB Documentation](https://docs.mongodb.com/)** - MongoDB setup and operations

## Data Models
## 🤝 Contributing

The project uses the following data models:
Both versions are actively maintained, and contributions are welcome, including:
- Bug reports and feature requests
- Code improvements and optimizations
- Documentation enhancements
- Additional data analysis features

- **Product**: Individual product information
- **ProductList**: List of products from a category page
- **Review**: Product reviews
## 📄 License

## MongoDB Collections
MIT License - feel free to use in your projects!

Data is stored in the following MongoDB collections:
## 🙏 Acknowledgements

- **products**: Individual product information
- **product_lists**: Lists of products from category pages
- **reviews**: Product reviews
- **[Stagehand](https://docs.stagehand.dev/)** - AI-powered web scraping
- **[MongoDB](https://www.mongodb.com/)** - Flexible document database
- **[Pydantic](https://pydantic.dev/)** (Python) - Data validation
- **[Zod](https://zod.dev/)** (TypeScript) - Schema validation

## License
---

MIT
## 🤘 Ready to Start?

## Acknowledgements
Choose your preferred language and dive in:

- [Stagehand](https://docs.stagehand.dev/) for the powerful web scraping capabilities
- [MongoDB](https://www.mongodb.com/) for the flexible document database
- [Zod](https://zod.dev/) for runtime schema validation
**🐍 [Python Version →](python/README.md)** | **📘 [TypeScript Version →](typescript/README.md)**
2 changes: 2 additions & 0 deletions examples/integrations/mongodb/python/.gitignore
@@ -0,0 +1,2 @@
.env
/venv