
Commit f0a4af2

Creates LeafyGreen AI RAG bot crawler (#2842)
* init crawler moved from private
* install
1 parent b239369 commit f0a4af2

29 files changed, +3304 −11 lines changed

pnpm-lock.yaml

Lines changed: 1991 additions & 11 deletions
Some generated files are not rendered by default.

tools/crawler/.env.example

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
MONGODB_USER=<YOUR_MONGODB_USER>
MONGODB_PASSWORD=<YOUR_MONGODB_PASSWORD>
MONGODB_PROJECT_URL=<YOUR_PROJECT_URL>
MONGODB_APP_NAME=LeafyGreenAI

# Used for vector embedding
AZURE_API_KEY1=<Key1>
AZURE_API_KEY2=<Key2>
AZURE_OPENAI_ENDPOINT=https://<your-env>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=text-embedding-3-small
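
The Azure variables above feed the vector-embedding step. As a rough sketch of the idea only (the crawler's actual wiring lives elsewhere in the package and is not shown in this diff), the `openai` dependency from `package.json` could be configured like this; the `apiVersion` value and the sample input are assumptions:

```ts
import { AzureOpenAI } from 'openai';

// Sketch, not the crawler's real code: build an Azure OpenAI client from the
// .env values above and request an embedding from the configured deployment.
const client = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,
  apiKey: process.env.AZURE_API_KEY1,
  deployment: process.env.AZURE_OPENAI_DEPLOYMENT, // text-embedding-3-small
  apiVersion: '2024-02-01', // assumed API version
});

const { data } = await client.embeddings.create({
  model: process.env.AZURE_OPENAI_DEPLOYMENT ?? 'text-embedding-3-small',
  input: 'Some crawled page text to embed',
});

console.log(data[0].embedding.length); // vector dimensionality
```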

tools/crawler/CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# @lg-tools/crawler

## 0.0.2

### Patch Changes

## Added

- Implemented prune command in CLI
- Added prune script in package.json
- Created CrawlerDocument interface for better type safety
- Implemented robots.txt checking functionality
- Added new utility function newURL for enhanced URL processing

## Changed

- Updated SOURCES in constants.ts to include additional URLs and collections
- Changed log color to green in processSingleUrl for better visibility
- Refactored crawler logic to improve URL processing
- Enhanced logging in recursive crawling functionality
- Improved URL processing with better logging

## Updated

- Refactored crawler constants for better organization
- Updated various log formats and display
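
The `CrawlerDocument` interface mentioned above is not part of this diff. A purely hypothetical shape, inferred only from the crawl, embedding, and prune behavior described elsewhere in this commit, might look like:

```ts
// Hypothetical sketch of a CrawlerDocument -- the real interface is defined in
// the crawler source and may differ. Fields are guessed from behavior in this
// commit: pages are embedded via Azure OpenAI, stored in MongoDB collections,
// and `prune` drops documents older than a given number of days.
interface CrawlerDocument {
  url: string; // page that was crawled
  content: string; // extracted text content
  embedding: number[]; // vector from text-embedding-3-small
  lastUpdated: Date; // timestamp that a prune step could compare against
}
```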

tools/crawler/README.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# LeafyGreen Crawler Tool

A CLI tool for crawling and analyzing website content for LeafyGreen AI.

## Overview

This tool crawls websites and stores the content in MongoDB collections for use with LeafyGreen AI systems. The crawler can process either specific URLs or use pre-configured website sources.

## Prerequisites

- Node.js (v16 or higher)
- Yarn package manager
- MongoDB Atlas account with connection details
- Environment variables properly configured

## Installation

```bash
# From the root of the leafygreen-ui-private repository
cd tools/crawler
yarn install
```

## Configuration

Create a `.env` file in the `tools/crawler` directory with the following variables:

```
MONGODB_USER=your_mongodb_user
MONGODB_PASSWORD=your_mongodb_password
MONGODB_PROJECT_URL=your_project_url
MONGODB_APP_NAME=your_app_name
```

### Default Sources

The crawler comes with pre-configured sources in `src/constants.ts`:

- MongoDB Design (https://mongodb.design)
- React Documentation (https://react.dev)
- MDN Web Docs (https://developer.mozilla.org)

To add or modify sources, edit the `SOURCES` array in `src/constants.ts`.
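
An added entry follows the same `{ url, collection }` shape as the existing sources; for example (the URL and collection name below are placeholders, not part of this commit):

```ts
// src/constants.ts (sketch): appending a hypothetical source.
// `url` is the crawl entry point; `collection` is the MongoDB collection
// (inside the `rag-sources` database) that receives the crawled documents.
export const SOURCES = [
  // ...existing entries...
  {
    url: 'https://web.dev/learn',
    collection: 'web-dev',
  },
] as const;
```
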
## Usage

### Building the Tool

```bash
yarn build
```

### Basic Usage

```bash
# Use the built version
yarn lg-crawler

# Or use the development version
yarn crawl
```

### Command Line Options

- `-v, --verbose`: Enable verbose output
- `-d, --depth <number>`: Set maximum crawl depth (default: 3)
- `--url <url>`: Specify a single URL to crawl
- `--dry-run`: Run crawler without inserting documents into MongoDB

### Examples

```bash
# Crawl all pre-configured sources with verbose output
yarn crawl --verbose

# Crawl a specific URL with a depth of 2
yarn crawl --url https://example.com --depth 2

# Test crawling without saving to MongoDB
yarn crawl --dry-run --verbose
```

## Development

### Project Structure

- `src/index.ts`: Main entry point and command-line interface
- `src/crawler.ts`: Core crawler implementation
- `src/constants.ts`: Configuration constants and source definitions
- `src/utils/`: Helper utilities for crawling and data processing

### Adding New Features

1. Make your code changes
2. Build the project: `yarn build`
3. Test your changes: `yarn crawl --dry-run --verbose`

### Running Tests

```bash
yarn test
```

## Troubleshooting

- **MongoDB Connection Issues**: Verify your `.env` file has the correct credentials
- **Crawling Errors**: Use the `--verbose` flag to get detailed logs
- **Rate Limiting**: Some websites may block the crawler if too many requests are made

## License

Apache-2.0

tools/crawler/bin/cli.js

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
#!/usr/bin/env node
require('../dist/cli.js');

tools/crawler/package.json

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
{
  "name": "@lg-tools/crawler",
  "version": "0.0.2",
  "description": "Crawler for MongoDB documentation and other sites",
  "type": "module",
  "main": "./dist/index.js",
  "module": "./dist/esm/index.js",
  "types": "./dist/types/index.d.ts",
  "bin": {
    "lg-crawler": "./bin/cli.js"
  },
  "scripts": {
    "build": "lg build-package",
    "tsc": "lg build-ts",
    "postbuild": "zip -r dist/lambda.zip dist/lambda.js node_modules package.json",
    "crawl": "tsx src/cli.ts crawl",
    "prune": "tsx src/cli.ts prune",
    "deploy": "bash scripts/deploy.sh"
  },
  "publishConfig": {
    "access": "public"
  },
  "keywords": [
    "mongodb",
    "ui",
    "kit",
    "components",
    "react",
    "uikit",
    "leafygreen",
    "crawler",
    "ai"
  ],
  "author": "",
  "license": "Apache-2.0",
  "dependencies": {
    "@azure/identity": "^4.9.1",
    "@langchain/community": "^0.3.42",
    "@langchain/core": "^0.3.42",
    "chalk": "4.1.2",
    "cheerio": "^1.0.0",
    "commander": "^13.1.0",
    "dotenv": "^16.5.0",
    "langchain": "^0.3.24",
    "lodash": "^4.17.21",
    "mongodb": "^6.16.0",
    "openai": "^4.97.0",
    "ora": "^8.2.0"
  },
  "devDependencies": {
    "@lg-tools/build": "workspace:^",
    "@lg-tools/meta": "workspace:^",
    "tsx": "^4.19.4"
  }
}

tools/crawler/rollup.config.mjs

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import { esmConfig, umdConfig } from '@lg-tools/build/config/rollup.config.mjs';

const cli = {
  ...umdConfig,
  input: ['./src/cli.ts'],
};

const lambda = {
  ...umdConfig,
  input: ['./src/lambda.ts'],
  external: [],
};

export default [esmConfig, umdConfig, cli, lambda];

tools/crawler/scripts/deploy.sh

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
#!/bin/bash

# Configuration
FUNCTION_NAME="ragCrawl"
ZIP_FILE="./dist/lambda.zip"

# Check if the zip file exists
if [ ! -f "$ZIP_FILE" ]; then
  echo "Error: $ZIP_FILE does not exist. Please build the Lambda package first."
  exit 1
fi

echo "Deploying $ZIP_FILE to Lambda function: $FUNCTION_NAME"


aws lambda update-function-code \
  --function-name $FUNCTION_NAME \
  --zip-file fileb://$ZIP_FILE


if [ $? -eq 0 ]; then
  echo "Successfully updated Lambda function: $FUNCTION_NAME"

  # Optional: Wait for function to be updated and then publish a new version
  echo "Waiting for function update to complete..."
  aws lambda wait function-updated --function-name $FUNCTION_NAME

  # Print the function details
  echo "Getting updated function details..."
  aws lambda get-function \
    --function-name $FUNCTION_NAME \
    --query 'Configuration.[FunctionName,Version,LastModified]'
else
  echo "Failed to update Lambda function"
  exit 1
fi

tools/crawler/src/cli.ts

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
import { Command } from 'commander';

import { crawl } from './crawler';
import { prune } from './prune';

const program = new Command();

// Initialize CLI program
program
  .name('lg-crawler')
  .description(
    'A CLI tool for crawling and analyzing website content for LeafyGreen AI',
  );

program
  .command('crawl')
  .description('Run the crawler')
  .option('-v, --verbose', 'Enable verbose output', false)
  .option('-d, --depth <number>', 'Maximum crawl depth', '3')
  .option(
    '--url <url>',
    'Specific URL to crawl. If not provided, the crawler will scan all URLs defined in the config.',
  )
  .option(
    '--dry-run',
    'Run crawler without inserting documents into MongoDB',
    false,
  )
  .action(crawl);

program
  .command('prune')
  .description(
    'Prune old documents from MongoDB collections used by LeafyGreen Crawler',
  )
  .option('-v, --verbose', 'Enable verbose output', false)
  .option(
    '--dry-run',
    'Run prune without deleting documents from MongoDB',
    false,
  )
  .option(
    '-d, --days <number>',
    'Keep documents newer than this many days',
    '7',
  )
  .action(prune);

// Parse the command line arguments
program.parse(process.argv);

export default program;
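
For reference, Commander passes each command's parsed options to its action handler; based on the flags registered above, the handlers receive roughly the following shape. This is a sketch only; the real option types belong to `./crawler` and `./prune`, which are not part of this diff:

```ts
// Sketch of the parsed-option shapes implied by the flags registered above.
// Commander camelCases multi-word flags (--dry-run -> dryRun) and keeps
// string-valued options such as --depth and --days as strings.
interface CrawlOptions {
  verbose: boolean;
  depth: string; // e.g. '3' (the default); convert with Number() before use
  url?: string; // absent when crawling all configured SOURCES
  dryRun: boolean;
}

interface PruneOptions {
  verbose: boolean;
  dryRun: boolean;
  days: string; // e.g. '7' (the default)
}
```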

tools/crawler/src/constants.ts

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
import dotenv from 'dotenv';
dotenv.config();

const {
  MONGODB_USER,
  MONGODB_PASSWORD,
  MONGODB_PROJECT_URL,
  MONGODB_APP_NAME,
} = process.env;

export const MDB_URI = `mongodb+srv://${MONGODB_USER}:${MONGODB_PASSWORD}@${MONGODB_PROJECT_URL}/?retryWrites=true&w=majority&appName=${MONGODB_APP_NAME}`;
export const MDB_DB = 'rag-sources' as const;

export const EMBEDDING_MODEL_NAME = 'text-embedding-3-small';

export const SOURCES = [
  {
    url: 'https://mongodb.design',
    collection: 'mongodb-dot-design',
  },
  {
    url: 'https://react.dev/reference/react',
    collection: 'react-dev',
  },
  {
    url: 'https://developer.mozilla.org/en-US/docs/Web',
    collection: 'mdn',
  },
  {
    url: 'https://css-tricks.com/category/articles',
    collection: 'css-tricks',
  },
  {
    url: 'https://www.nngroup.com/articles',
    collection: 'nn-group',
  },
  {
    url: 'https://www.w3.org/WAI/standards-guidelines/wcag',
    collection: 'wcag',
  },
  {
    url: 'https://atomicdesign.bradfrost.com/table-of-contents',
    collection: 'atomic-design',
  },
] as const;

/**
 * Allow the crawler to follow links to these domains
 * (with restricted depth)
 */
export const allowedDomains = [
  'https://www.mongodb.com',
  'https://github.com',
] as const;
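
The doc comment above describes `allowedDomains` as an allow-list for links that leave the configured sources. A minimal illustrative sketch of such a check (the crawler's actual link-filtering logic lives in its crawler source, which is not shown in this commit) might look like:

```ts
import { SOURCES, allowedDomains } from './constants';

// Illustrative sketch, not the crawler's actual code: a discovered link may be
// followed if it stays under one of the configured SOURCES entry points or if
// its origin is on the allow-list of additional domains.
export function mayFollow(href: string): boolean {
  const inSource = SOURCES.some(({ url }) => href.startsWith(url));
  const inAllowedDomain = allowedDomains.some(domain =>
    href.startsWith(domain),
  );
  return inSource || inAllowedDomain;
}
```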
