Commit 0577b19

Initial commit
0 parents  commit 0577b19

27 files changed

+12819
-0
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
.gcloudignore
2+
!emulator-params.env

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
## Version 0.1.0
2+
3+
Initial release of the extension.

POSTINSTALL.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
# Post-Installation Guide
2+
3+
After installing the extension, follow this guide to configure scraping tasks and manage extracted data. Below you'll find detailed instructions, document structures, and examples.
4+
5+
---
6+
7+
## **Setting Up a Task**
8+
Create a document in your tasks collection **`${param:SCRAPE_COLLECTION}`** to define a scraping task.
9+
10+
### **Task Document Structure**
11+
| Field | Type | Description |
12+
|-------------|------------------|-----------------------------------------------------------------------------|
13+
| `url` | string | **Required.** Target URL to scrape (e.g., `"https://example.com"`). |
14+
| `queries` | array of objects | **Required.** List of queries to extract data from the HTML content. |
15+
16+
### **1. `queries` Configuration**
17+
Each query in the `queries` array narrows down elements from the HTML. Queries execute **in sequence**, with each subsequent query applied to the results of the previous one.
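
The sequential narrowing described above can be sketched as follows. This is only an illustration under stated assumptions: the extension's real HTML parser is not shown here, so the sketch uses Python's standard-library `xml.etree.ElementTree` on a small well-formed snippet, and the helper names `select` and `run_queries` are hypothetical. Only the `id`, `class`, and `tag` query types are handled.

```python
import xml.etree.ElementTree as ET


def select(element, query):
    """Return the elements under `element` (itself included) that match one query."""
    if query["type"] == "tag":
        return list(element.iter(query["value"]))
    if query["type"] == "class":
        return [el for el in element.iter()
                if query["value"] in el.get("class", "").split()]
    if query["type"] == "id":
        return [el for el in element.iter() if el.get("id") == query["value"]]
    raise ValueError(f"query type not covered by this sketch: {query['type']}")


def run_queries(root, queries):
    """Apply each query to the results of the previous one."""
    elements = [root]
    for query in queries:
        elements = [match for el in elements for match in select(el, query)]
    return elements


# Hypothetical page: two links inside menu items, one link outside the menu.
page = ET.fromstring(
    "<html><body>"
    "<ul class='menu'>"
    "<li class='menu-item'><a href='/a'>A</a></li>"
    "<li class='menu-item'><a href='/b'>B</a></li>"
    "</ul>"
    "<a href='/outside'>Outside</a>"
    "</body></html>"
)
queries = [
    {"id": "items", "type": "class", "value": "menu-item"},
    {"id": "links", "type": "tag", "value": "a"},
]
hrefs = [el.get("href") for el in run_queries(page, queries)]
print(hrefs)  # ['/a', '/b']
```

Because the `tag` query runs only on the elements matched by the `class` query, the link outside the menu is not selected.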
18+
19+
#### **1.1. Query Object**
20+
| Field | Type | Description |
21+
|----------|--------|-----------------------------------------------------------------------------------------------|
22+
| `id` | string | **Required.** Unique identifier for the query. |
23+
| `type` | string | **Required.** Selector type. Supported values: `id`, `class`, `tag`, `attribute`, `text`, `xpath`, `selector`. |
24+
| `value` | string | **Required.** Value for the selector (see examples below). |
25+
| `target` | string (optional) | What to extract from the selected elements. Supported values: `html`, `inner`, `text`, `attribute`. Defaults to `html`. |
26+
| `attr` | string (optional) | Attribute name to extract when `target` is set to `attribute`. |
27+
28+
#### **1.2. Examples by Query Type**
29+
| Type | `value` Example | Description |
30+
|--------------|-------------------------------|--------------------------------------------------|
31+
| **`id`** | `"header"` | Select element with ID `#header`. |
32+
| **`class`** | `"menu-item"` | Select elements with class `.menu-item`. |
33+
| **`tag`** | `"a"` | Select all `<a>` tags. |
34+
| **`attribute`** | `"href"` or `"[data-role='button']"` | Select elements with the `href` attribute or matching `data-role="button"`. |
35+
| **`xpath`** | `"//div[@class='content']"` | Select elements using an XPath expression. |
36+
| **`selector`** | `"#header > h1"` | Select elements using a CSS selector. |
37+
38+
#### **1.3. Examples by Target Type**
39+
| Target | Description |
40+
|-----------------|-----------------------------------------------------------------------------------------------|
41+
| **`html`** | Extracts the HTML content of the selected elements. |
42+
| **`inner`** | Extracts the inner HTML content of the selected elements. |
43+
| **`text`** | Extracts the text content of the selected elements. |
44+
| **`attribute`** | Extracts the value of the specified attribute from the selected elements. |
45+
46+
47+
### **Example Task Document (Before Processing)**
48+
```json
49+
{
50+
"url": "https://example.com",
51+
"queries": [
52+
{
53+
"id": "title",
54+
"type": "xpath",
55+
"value": "//title",
56+
"target": "text"
57+
},
58+
{
59+
"id": "description",
60+
"type": "class",
61+
"value": "description"
62+
},
63+
{
64+
"id": "links",
65+
"type": "tag",
66+
"value": "a",
67+
"target": "attribute",
68+
"attr": "href"
69+
}
70+
]
71+
}
72+
```
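
Before writing a task document to the collection, it can help to check it against the field tables above. The following is a minimal sketch, not the extension's own validation: the function name `validate_task` is hypothetical, the sets of allowed values are assumed to be the union of the values listed in the tables, and treating `attr` as mandatory when `target` is `attribute` is likewise an assumption.

```python
# Assumption: allowed values are the union of those listed in the tables above,
# not a list taken from the extension's source code.
QUERY_TYPES = {"id", "class", "tag", "attribute", "text", "xpath", "selector"}
TARGETS = {"html", "inner", "text", "attribute"}


def validate_task(task):
    """Return a list of problems found in a task document (empty if none)."""
    problems = []
    if not isinstance(task.get("url"), str):
        problems.append("url: required string")
    queries = task.get("queries")
    if not isinstance(queries, list):
        problems.append("queries: required array of objects")
        return problems
    seen_ids = set()
    for i, query in enumerate(queries):
        if not isinstance(query, dict):
            problems.append(f"queries[{i}]: must be an object")
            continue
        for field in ("id", "type", "value"):
            if not isinstance(query.get(field), str):
                problems.append(f"queries[{i}].{field}: required string")
        query_id = query.get("id")
        if isinstance(query_id, str):
            if query_id in seen_ids:
                problems.append(f"queries[{i}].id: duplicate id")
            seen_ids.add(query_id)
        query_type = query.get("type")
        if isinstance(query_type, str) and query_type not in QUERY_TYPES:
            problems.append(f"queries[{i}].type: unsupported value")
        target = query.get("target", "html")  # `html` is the documented default
        if not isinstance(target, str) or target not in TARGETS:
            problems.append(f"queries[{i}].target: unsupported value")
        # Assumption: `attr` must be present whenever target is `attribute`.
        if target == "attribute" and not isinstance(query.get("attr"), str):
            problems.append(f"queries[{i}].attr: required when target is 'attribute'")
    return problems


task = {
    "url": "https://example.com",
    "queries": [
        {"id": "title", "type": "xpath", "value": "//title", "target": "text"},
        {"id": "description", "type": "class", "value": "description"},
        {"id": "links", "type": "tag", "value": "a",
         "target": "attribute", "attr": "href"},
    ],
}
print(validate_task(task))  # []
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a document at once.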
73+
74+
### **Example Data Document (After Processing)**
75+
```json
76+
{
77+
"url": "https://example.com",
78+
"queries": [
79+
{
80+
"id": "title",
81+
"type": "xpath",
82+
"value": "//title",
83+
"target": "text"
84+
},
85+
{
86+
"id": "description",
87+
"type": "class",
88+
"value": "description"
89+
},
90+
{
91+
"id": "links",
92+
"type": "tag",
93+
"value": "a",
94+
"target": "attribute",
95+
"attr": "href"
96+
}
97+
],
98+
"data": {
99+
"title": "Example Domain",
100+
"description": "<p>This domain is for use in illustrative examples...</p>",
101+
"links": ["https://www.iana.org/domains/example", "https://www.iana.org/domains/reserved"]
102+
},
103+
"timestamp": "2023-01-01T00:00:00Z"
104+
}
105+
```
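
Once a task has been processed, the results can be read back by query `id`. The sketch below works on a plain dictionary shaped like the example above; fetching the document from Firestore is left out, and the helper name `results_by_query` is hypothetical. It assumes, as the example suggests, that `data` is keyed by each query's `id`.

```python
def results_by_query(doc):
    """Map each query id to its extracted value, or None if no result is stored."""
    data = doc.get("data", {})
    return {query["id"]: data.get(query["id"]) for query in doc["queries"]}


# A processed document shaped like the example above (queries shortened).
processed = {
    "url": "https://example.com",
    "queries": [
        {"id": "title", "type": "xpath", "value": "//title", "target": "text"},
        {"id": "links", "type": "tag", "value": "a",
         "target": "attribute", "attr": "href"},
    ],
    "data": {
        "title": "Example Domain",
        "links": [
            "https://www.iana.org/domains/example",
            "https://www.iana.org/domains/reserved",
        ],
    },
}

results = results_by_query(processed)
print(results["title"])
for link in results["links"]:
    print(link)
```

A document that has not been processed yet has no `data` field, so every query id maps to `None` in that case.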

PREINSTALL.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
# Pre-Installation Guide
2+
3+
This guide will help you install and configure this extension in your Firebase project.
4+
5+
## Billing
6+
To install an extension, your project must be on the [Blaze (pay as you go) plan](https://firebase.google.com/pricing).
7+
8+
## Setup
9+
10+
### **Step 1: Install the Extension**
11+
12+
You can install this extension locally by running the following commands:
13+
14+
```bash
15+
git clone https://github.com/CorieW/firestore-web-scraper.git
16+
firebase ext:install ./firestore-web-scraper
17+
```
18+
19+
In the future, this extension could be published to the Firebase Extensions registry for easier installation.
20+
21+
### **Step 2: Configure the Extension**
22+
23+
After installing the extension, you need to set up its configuration in your Firebase project. The configuration includes the following parameters:
24+
25+
| Parameter | Description |
26+
|-----------------|-------------------------|
27+
| `scrapeCollection` | The collection in which scraping tasks are stored and processed. Each document in this collection should contain the details of the task to be performed. The same document will be updated with the results of the scraping task. |
28+
29+
30+
**Example Configuration:**
31+
32+
```json
33+
{
34+
"scrapeCollection": "tasks"
35+
}
36+
```
37+
38+
## GitHub Repository
39+
40+
The source code for this extension is available on GitHub: [firestore-web-scraper](https://github.com/CorieW/firestore-web-scraper).

README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
# Firestore Web Scraper
2+
3+
This extension is a web scraper configured through Firestore. Each scraping task is a document in a Firestore collection; the scraper fetches the page at the task's URL and extracts data according to the queries defined in that document.
4+
5+
## Usage
6+
7+
See `PREINSTALL.md` for installation and configuration, and `POSTINSTALL.md` for detailed instructions on defining scraping tasks.
8+
9+
## Installation
10+
11+
You can install this extension locally by running the following commands:
12+
13+
```bash
14+
git clone https://github.com/CorieW/firestore-web-scraper.git
15+
firebase ext:install ./firestore-web-scraper
16+
```
17+
18+
In the future, this extension could be published to the Firebase Extensions registry for easier installation.
19+
20+
## Contributing
21+
22+
Contributions are always welcome! If you have an idea for a new feature or a bug fix, please open an issue first to discuss the changes.

_emulator/.firebaserc

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
{
2+
"projects": {
3+
"default": "demo-test"
4+
},
5+
"targets": {}
6+
}

_emulator/.gitignore

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
# Logs
2+
logs
3+
*.log
4+
npm-debug.log*
5+
yarn-debug.log*
6+
yarn-error.log*
7+
firebase-debug.log*
8+
firebase-debug.*.log*
9+
10+
# Firebase cache
11+
.firebase/
12+
13+
# Firebase config
14+
15+
# Uncomment this if you'd like others to create their own Firebase project.
16+
# For a team working on the same Firebase project(s), it is recommended to leave
17+
# it commented so all members can deploy to the same project(s) in .firebaserc.
18+
# .firebaserc
19+
20+
# Runtime data
21+
pids
22+
*.pid
23+
*.seed
24+
*.pid.lock
25+
26+
# Directory for instrumented libs generated by jscoverage/JSCover
27+
lib-cov
28+
29+
# Coverage directory used by tools like istanbul
30+
coverage
31+
32+
# nyc test coverage
33+
.nyc_output
34+
35+
# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
36+
.grunt
37+
38+
# Bower dependency directory (https://bower.io/)
39+
bower_components
40+
41+
# node-waf configuration
42+
.lock-wscript
43+
44+
# Compiled binary addons (http://nodejs.org/api/addons.html)
45+
build/Release
46+
47+
# Dependency directories
48+
node_modules/
49+
50+
# Optional npm cache directory
51+
.npm
52+
53+
# Optional eslint cache
54+
.eslintcache
55+
56+
# Optional REPL history
57+
.node_repl_history
58+
59+
# Output of 'npm pack'
60+
*.tgz
61+
62+
# Yarn Integrity file
63+
.yarn-integrity
64+
65+
# dotenv environment variables file
66+
.env
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
SCRAPE_COLLECTION=tasks

_emulator/firebase.json

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
{
2+
"extensions": {
3+
"firestore-web-scraper": "../"
4+
},
5+
"storage": {
6+
"rules": "storage.rules"
7+
},
8+
"emulators": {
9+
"hub": {
10+
"port": 4000
11+
},
12+
"storage": {
13+
"port": 9199
14+
},
15+
"auth": {
16+
"port": 9099
17+
},
18+
"pubsub": {
19+
"port": 8085
20+
},
21+
"functions": {
22+
"port": 5001
23+
},
24+
"ui": {
25+
"enabled": true
26+
},
27+
"firestore": {
28+
"host": "127.0.0.1",
29+
"port": 8080
30+
}
31+
},
32+
"functions": {
33+
"port": 5002,
34+
"source": "functions"
35+
},
36+
"firestore": {
37+
"rules": "firestore.rules",
38+
"indexes": "firestore.indexes.json"
39+
}
40+
}

_emulator/firestore.indexes.json

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
{
2+
"indexes": [],
3+
"fieldOverrides": []
4+
}
