Commit 936fd29

add mongodb integration (#2)
1 parent dadbc69 commit 936fd29

12 files changed, +3662 -0 lines changed

README.md

Lines changed: 11 additions & 0 deletions
@@ -66,6 +66,16 @@ Enhance your Vercel applications with web-browsing capabilities. Build Generativ
 #### [**Braintrust Integration**](./examples/integrations/braintrust/README.md)
 Integrate Browserbase with Braintrust for evaluation and testing of AI agent performance in web environments. Monitor, measure, and improve your browser automation workflows.

+#### [**MongoDB Integration**](./examples/integrations/mongodb/README.md)
+**Intelligent Web Scraping & Data Storage** - Extract structured data from e-commerce websites using Stagehand and store it in MongoDB for analysis. Perfect for building data pipelines, market research, and competitive analysis workflows.
+
+**Capabilities:**
+- AI-powered web scraping with Stagehand
+- Structured data extraction with schema validation
+- MongoDB storage for persistence and querying
+- Built-in data analysis and reporting
+- Robust error handling for production use
+
 ## 🏗️ Monorepo Structure

 ```
@@ -80,6 +90,7 @@ integrations/
 │ ├── langchain/ # LangChain framework integration
 │ ├── browser-use/ # Simplified browser automation
 │ ├── braintrust/ # Evaluation and testing tools
+│ ├── mongodb/ # MongoDB data extraction & storage
 │ └── agentkit/ # AgentKit implementations
 └── README.md # This file
 ```
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Stagehand Project

This is a project that uses Stagehand, which extends Playwright by adding `act`, `extract`, and `observe` to the Page class.

`Stagehand` is a class that provides config, a `StagehandPage` object via `stagehand.page`, and a `StagehandContext` object via `stagehand.context`.

`Page` is a class that extends the Playwright `Page` class and adds `act`, `extract`, and `observe` methods.
`Context` is a class that extends the Playwright `BrowserContext` class.

Use the following rules to write code for this project.

- To take an action on the page like "click the sign in button", use Stagehand `act` like this:

```typescript
await page.act("Click the sign in button");
```

- To plan an instruction before taking an action, use Stagehand `observe` to get the action to execute.

```typescript
const [action] = await page.observe("Click the sign in button");
```

- The result of `observe` is an array of `ObserveResult` objects that can directly be used as params for `act` like this:

```typescript
const [action] = await page.observe("Click the sign in button");
await page.act(action);
```

- When writing code that needs to extract data from the page, use Stagehand `extract`. Explicitly pass the following params by default:

```typescript
const { someValue } = await page.extract({
  instruction: "the instruction to execute", // The instruction to execute
  schema: z.object({
    someValue: z.string(),
  }), // The schema to extract
});
```

## Initialize

```typescript
import { Stagehand } from "@browserbasehq/stagehand";
import StagehandConfig from "./stagehand.config";

const stagehand = new Stagehand(StagehandConfig);
await stagehand.init();

const page = stagehand.page; // Playwright Page with act, extract, and observe methods
const context = stagehand.context; // Playwright BrowserContext
```

## Act

You can cache the results of `observe` and use them as params for `act` like this:

```typescript
const instruction = "Click the sign in button";
const cachedAction = await getCache(instruction);

if (cachedAction) {
  await page.act(cachedAction);
} else {
  try {
    const [action] = await page.observe(instruction);
    await setCache(instruction, action); // Cache the single ObserveResult that `act` expects
    await page.act(action);
  } catch (error) {
    // Fall back to the plain instruction if observing or running the observed action fails
    await page.act(instruction);
  }
}
```

Be sure to cache the results of `observe` and pass them to `act`; this guards against unexpected DOM changes. Using `act` without caching leads to less predictable behavior.

Each `act` instruction should be as atomic and specific as possible, e.g. "Click the sign in button" or "Type 'hello' into the search input".
AVOID actions that take more than one step, e.g. "Order me pizza" or "Type in the search bar and hit enter".
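
The `getCache` and `setCache` helpers used in the caching snippet are not defined in this file. A minimal in-memory sketch follows, assuming a cached action is a plain JSON-serializable object. The `CachedAction` shape and the `ActionCache` class are hypothetical names chosen for illustration and are not part of Stagehand; the real `ObserveResult` type may carry different fields.

```typescript
// Hypothetical shape of a cached observe result (assumption, not Stagehand's type).
interface CachedAction {
  selector: string;
  description: string;
  method?: string;
  arguments?: string[];
}

// In-memory cache keyed by the instruction text (illustrative only).
class ActionCache {
  private entries = new Map<string, CachedAction>();

  get(instruction: string): CachedAction | undefined {
    return this.entries.get(instruction);
  }

  set(instruction: string, action: CachedAction): void {
    this.entries.set(instruction, action);
  }

  // Serialize to a JSON string so the caller can persist the cache.
  serialize(): string {
    const plain: Record<string, CachedAction> = {};
    this.entries.forEach((action, instruction) => {
      plain[instruction] = action;
    });
    return JSON.stringify(plain);
  }

  // Rebuild a cache from a string produced by serialize().
  static deserialize(json: string): ActionCache {
    const restored = new ActionCache();
    const plain = JSON.parse(json) as Record<string, CachedAction>;
    Object.keys(plain).forEach((instruction) => {
      restored.set(instruction, plain[instruction]);
    });
    return restored;
  }
}

const sharedCache = new ActionCache();

// Async wrappers so call sites can keep using `await getCache(...)` / `await setCache(...)`.
async function getCache(instruction: string): Promise<CachedAction | undefined> {
  return sharedCache.get(instruction);
}

async function setCache(instruction: string, action: CachedAction): Promise<void> {
  sharedCache.set(instruction, action);
}
```

The string returned by `serialize()` could be written to a file between runs. Keying on the exact instruction text is a design assumption: two differently worded instructions for the same element would be cached separately.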

## Extract

If you are writing code that needs to extract data from the page, use Stagehand `extract`.

```typescript
const signInButtonText = await page.extract("extract the sign in button text");
```

You can also pass an options object with params such as a Zod output schema:

```typescript
const data = await page.extract({
  instruction: "extract the sign in button text",
  schema: z.object({
    text: z.string(),
  }),
});
```

`schema` is a Zod schema that describes the data you want to extract. To extract an array, make sure to pass in a single object that contains the array. The example below also sets the `useTextExtract` flag to use text extraction:

```typescript
const data = await page.extract({
  instruction: "extract the text inside all buttons",
  schema: z.object({
    text: z.array(z.string()),
  }),
  useTextExtract: true, // Set true for larger-scale extractions (multiple paragraphs), or set false for small extractions (name, birthday, etc)
});
```

## Agent

Use the `agent` method to autonomously execute larger tasks like "Get the stock price of NVDA".

```typescript
// Navigate to a website
await stagehand.page.goto("https://www.google.com");

const agent = stagehand.agent({
  // You can use either OpenAI or Anthropic
  provider: "openai",
  // The model to use (claude-3-7-sonnet-20250219 or claude-3-5-sonnet-20240620 for Anthropic)
  model: "computer-use-preview",

  // Customize the system prompt
  instructions: `You are a helpful assistant that can use a web browser.
Do not ask follow up questions, the user will trust your judgement.`,

  // Customize the API key
  options: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// Execute the agent
await agent.execute(
  "Apply for a library card at the San Francisco Public Library"
);
```
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# MongoDB Connection

# Local MongoDB instance
# MONGO_URI=mongodb://localhost:27017

# MongoDB Atlas connection string format:
# MONGO_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>?retryWrites=true&w=majority

# Database name
DB_NAME=scraper_db

BROWSERBASE_PROJECT_ID="YOUR_BROWSERBASE_PROJECT_ID"
BROWSERBASE_API_KEY="YOUR_BROWSERBASE_API_KEY"
OPENAI_API_KEY="THIS_IS_OPTIONAL_WITH_ANTHROPIC_KEY"
ANTHROPIC_API_KEY="THIS_IS_OPTIONAL_WITH_OPENAI_KEY"
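
The variables in this example file could be validated once at startup, before any browser session or database connection is opened. The sketch below is one way to do that under stated assumptions: `ScraperConfig`, `requireVar`, and `loadConfig` are hypothetical names, the fallback to a local MongoDB URI mirrors the commented-out example line, and preferring the OpenAI key when both keys are set is an arbitrary choice made for illustration.

```typescript
// Hypothetical config shape derived from the variable names in the example env file.
interface ScraperConfig {
  mongoUri: string;
  dbName: string;
  browserbaseProjectId: string;
  browserbaseApiKey: string;
  llmProvider: "openai" | "anthropic";
  llmApiKey: string;
}

type Env = Record<string, string | undefined>;

// Return the variable's value or fail loudly with its name.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): ScraperConfig {
  const openaiKey = env["OPENAI_API_KEY"];
  const anthropicKey = env["ANTHROPIC_API_KEY"];
  // The placeholders say each key is optional if the other is present.
  if (!openaiKey && !anthropicKey) {
    throw new Error("Set either OPENAI_API_KEY or ANTHROPIC_API_KEY");
  }
  return {
    // Assumption: default to the local instance from the commented-out example.
    mongoUri: env["MONGO_URI"] || "mongodb://localhost:27017",
    dbName: env["DB_NAME"] || "scraper_db",
    browserbaseProjectId: requireVar(env, "BROWSERBASE_PROJECT_ID"),
    browserbaseApiKey: requireVar(env, "BROWSERBASE_API_KEY"),
    // Assumption: prefer OpenAI when both keys are set.
    llmProvider: openaiKey ? "openai" : "anthropic",
    llmApiKey: (openaiKey || anthropicKey) as string,
  };
}
```

In a Node.js entry point this might be called as `loadConfig(process.env)` after the `.env` file has been loaded. Taking the environment as a parameter keeps the function easy to exercise without touching real credentials.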
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
.env
node_modules
tmp
downloads
.DS_Store
dist
cache.json

examples/integrations/mongodb/LICENSE

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
Copyright 2025 Browserbase, Inc

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Stagehand MongoDB Scraper

A web scraping project that uses Stagehand to extract structured data from e-commerce websites and store it in MongoDB for analysis.

## Features

- **Web Scraping**: Uses Stagehand (built on Playwright) for intelligent web scraping
- **Data Extraction**: Extracts structured product data using AI-powered instructions
- **MongoDB Storage**: Stores scraped data in MongoDB for persistence and querying
- **Schema Validation**: Uses Zod for schema validation and TypeScript interfaces
- **Error Handling**: Robust error handling to prevent crashes during scraping
- **Data Analysis**: Built-in MongoDB queries for data analysis

## Prerequisites

- Node.js 16 or higher
- MongoDB installed locally, or a MongoDB Atlas account
- A Browserbase API key and project ID, plus an OpenAI or Anthropic API key (see `.env.example`)
20+
## Installation
21+
22+
1. Clone the repository:
23+
```
24+
git clone <repository-url>
25+
cd stagehand-mongodb-scraper
26+
```
27+
28+
2. Install dependencies:
29+
```
30+
npm install
31+
```
32+
33+
3. Set up environment variables:
34+
```
35+
# Create a .env file with the following variables
36+
MONGO_URI=mongodb://localhost:27017
37+
DB_NAME=scraper_db
38+
```

## Usage

1. Start MongoDB locally:
   ```
   mongod
   ```

2. Run the scraper:
   ```
   npm start
   ```

3. The script will:
   - Scrape product listings from Amazon
   - Extract detailed information for the first 3 products
   - Extract reviews for each product
   - Store all data in MongoDB
   - Run analysis queries on the collected data showing:
     - Collection counts
     - Products by category
     - Top-rated products
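
The analysis step listed above can be illustrated without a database. The following is a sketch over plain in-memory arrays, not the project's actual MongoDB queries: `ProductRecord`, `countByCategory`, and `topRated` are hypothetical names, and the three fields are assumed for illustration only.

```typescript
// Hypothetical, trimmed-down product record (assumed fields).
interface ProductRecord {
  name: string;
  category: string;
  rating: number;
}

// Count products per category. In MongoDB this kind of grouping is
// typically expressed with an aggregation $group stage.
function countByCategory(products: ProductRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const product of products) {
    counts[product.category] = (counts[product.category] || 0) + 1;
  }
  return counts;
}

// Return the `limit` highest-rated products without mutating the input array.
function topRated(products: ProductRecord[], limit: number): ProductRecord[] {
  return products
    .slice()
    .sort((a, b) => b.rating - a.rating)
    .slice(0, limit);
}
```

Running the same logic inside the database avoids loading every document into memory; the in-memory version is shown only because it makes the intent of each report easy to see.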

## Project Structure

The project keeps all functionality in a single source file, alongside its configuration files:

- `index.ts`: Contains the complete implementation including:
  - MongoDB connection and data operations
  - Schema definitions
  - Scraping functions
  - Data analysis
  - Main execution logic
- `stagehand.config.js`: Stagehand configuration
- `.env.example`: Example environment variables

## Data Models

The project uses the following data models:

- **Product**: Individual product information
- **ProductList**: List of products from a category page
- **Review**: Product reviews

## MongoDB Collections

Data is stored in the following MongoDB collections:

- **products**: Individual product information
- **product_lists**: Lists of products from category pages
- **reviews**: Product reviews
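
The models and collections above are named but their fields are not listed here. The sketch below shows one way the three models could map onto the three collections. Every field is a hypothetical placeholder, and the hand-written `isProduct` guard stands in for the Zod schemas the project actually uses, purely to keep the example dependency-free.

```typescript
// Hypothetical field sets; only the model names come from the README.
interface Product {
  name: string;
  price: string;
  url: string;
  rating?: number;
}

interface ProductList {
  category: string;
  products: Product[];
}

interface Review {
  productUrl: string;
  rating: number;
  text: string;
}

// Collection names as listed in the README, keyed by model.
const COLLECTIONS = {
  product: "products",
  productList: "product_lists",
  review: "reviews",
} as const;

// Hand-written runtime check (a stand-in for a Zod schema, not the project's code).
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  return (
    typeof v["name"] === "string" &&
    typeof v["price"] === "string" &&
    typeof v["url"] === "string" &&
    (v["rating"] === undefined || typeof v["rating"] === "number")
  );
}
```

Checking a scraped object with a guard like this before inserting it keeps malformed extractions out of the `products` collection.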

## License

MIT

## Acknowledgements

- [Stagehand](https://docs.stagehand.dev/) for the powerful web scraping capabilities
- [MongoDB](https://www.mongodb.com/) for the flexible document database
- [Zod](https://zod.dev/) for runtime schema validation
