
Commit ece20cd

Upgrade to v2, and a bunch of improvements (#4)
* feat: upgraded to v2 functions
* feat: allow non-(default) firestore databases
* feat: added location parameter
* feat: improved validation
* fix: removed deprecated xmldom dependency, replaced with @xmldom/xmldom
* chore: audit dependencies
* chore: removed unnecessary packages
* chore: updated POSTINSTALL.md
* refactor: improved code
1 parent 0577b19 commit ece20cd

22 files changed: +1407 −1649 lines

.gitignore

Lines changed: 16 additions & 1 deletion
@@ -1,2 +1,17 @@
 .gcloudignore
-!emulator-params.env
+!emulator-params.env
+
+# Any files that are private
+project.private/
+
+# Environment variables
+*.env
+*.env.development.local
+*.env.test.local
+*.env.production.local
+*.env.development
+*.env.test
+*.env.production
+
+# Required for emulator
+_emulator/extensions/firestore-web-scraper.env.local

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -1,3 +1,14 @@
+## Version 0.2.0
+
+- feat: upgraded to v2 functions
+- feat: allow non-(default) firestore databases
+- feat: added location parameter
+- feat: improved validation
+- fix: removed deprecated `xmldom` dependency, replaced with `@xmldom/xmldom`
+- chore: audit dependencies
+- chore: removed unnecessary packages
+- chore: updated **POSTINSTALL.md**
+
 ## Version 0.1.0
 
 Initial release of the extension.
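
The move from `xmldom` to `@xmldom/xmldom` noted in this changelog is essentially a drop-in swap: the maintained `@xmldom/xmldom` fork exposes the same `DOMParser` API as the deprecated `xmldom` package, so the change is mostly an import rename. A minimal illustrative sketch (not taken from the extension's source):

```ts
// Illustrative only: swap the deprecated package for its maintained fork.
// Before: import { DOMParser } from "xmldom";
import { DOMParser } from "@xmldom/xmldom";

const doc = new DOMParser().parseFromString("<p>Hello</p>", "text/xml");
console.log(doc.documentElement?.textContent); // "Hello"
```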

POSTINSTALL.md

Lines changed: 136 additions & 48 deletions
@@ -2,58 +2,145 @@
 
 After installing the extension, follow this guide to configure scraping tasks and manage extracted data. Below you'll find detailed instructions, document structures, and examples.
 
+## Setting Up a Task
+Create a document in your tasks collection **`${param:SCRAPE_COLLECTION}`** to define a scraping task.
+
+---
+
+### Task Document Structure
+
+#### Required Fields:
+- **url** (string): Target URL to scrape (e.g., `"https://example.com"`)
+- **queries** (array of objects): List of queries to extract data from the HTML content
+
 ---
 
-## **Setting Up a Task**
-Create a document in your tasks collection **`${param:SCRAPE_COLLECTION}`** to define a scraping task.
-
-### **Task Document Structure**
-| Field | Type | Description |
-|-------------|------------------|-----------------------------------------------------------------------------|
-| `url` | string | **Required.** Target URL to scrape (e.g., `"https://example.com"`). |
-| `queries` | array of objects | **Required.** List of queries to extract data from the HTML content. |
-
-### **1. `queries` Configuration**
-Each query in the `queries` array narrows down elements from the HTML. Queries execute **in sequence**, with each subsequent query applied to the results of the previous one.
-
-#### **1.1. Query Object**
-| Field | Type | Description |
-|----------|--------|-----------------------------------------------------------------------------------------------|
-| `id` | string | **Required.** Unique identifier for the query. |
-| `type` | string | **Required.** Selector type. Supported values: `id`, `class`, `tag`, `attribute`, `text`, `xpath`. |
-| `value` | string | **Required.** Value for the selector (see examples below). |
-| `target` | string (optional) | What to extract from the selected elements. Supported values: `html`, `text`, `attribute`. `html` is set by default |
-| `attr` | string (optional) | Attribute name to extract when `target` is set to `attribute`. |
-
-#### **1.2. Examples by Query Type**
-| Type | `value` Example | Description |
-|--------------|-------------------------------|--------------------------------------------------|
-| **`id`** | `"header"` | Select element with ID `#header`. |
-| **`class`** | `"menu-item"` | Select elements with class `.menu-item`. |
-| **`tag`** | `"a"` | Select all `<a>` tags. |
-| **`attribute`** | `"href"` or `"[data-role='button']"` | Select elements with the `href` attribute or matching `data-role="button"`. |
-| **`xpath`** | `"//div[@class='content']"` | Select elements using an XPath expression. |
-| **`selector`** | `"#header > h1"` | Select elements using a CSS selector. |
-
-#### **1.3. Examples By Target Type**
-| Target | Description |
-|-----------------|-----------------------------------------------------------------------------------------------|
-| **`html`** | Extracts the HTML content of the selected elements. |
-| **`inner`** | Extracts the inner HTML content of the selected elements. |
-| **`text`** | Extracts the text content of the selected elements. |
-| **`attribute`** | Extracts the value of the specified attribute from the selected elements. |
-
-
-### **Example Task Document (Before Processing)**
+### Query Configuration
+Each query in the `queries` array narrows down specific elements from the HTML. Multiple queries can be used to extract different types of data from the same HTML.
+
+#### Query Object Fields:
+- **id** (string, required): Unique identifier for the query. Will be used as the key in the output `data` object.
+- **type** (string, required): Selector type. Supported values:
+  - `id`: Select by element ID
+  - `class`: Select by CSS class
+  - `tag`: Select by HTML tag
+  - `attribute`: Select by attribute
+  - `text`: Select by text content
+  - `selector`: Select using CSS selector
+- **value** (string, required): Value for the selector
+- **target** (string, optional): What to extract from selected elements
+  - `html`: Extract HTML content (default)
+  - `text`: Extract text content
+  - `attribute`: Extract attribute value
+- **attr** (string, optional): Attribute name to extract when `target` is set to `attribute`. Only allowed when `type` is `attribute`.
+
+---
+
+### Query Type Examples
+
+#### ID Selector
+```json
+{
+  "id": "header",
+  "type": "id",
+  "value": "header"
+}
+```
+Selects element with ID `#header`
+
+#### Class Selector
+```json
+{
+  "id": "menu",
+  "type": "class",
+  "value": "menu-item"
+}
+```
+Selects elements with class `.menu-item`
+
+#### Tag Selector
+```json
+{
+  "id": "links",
+  "type": "tag",
+  "value": "a"
+}
+```
+Selects all `<a>` tags
+
+#### Attribute Selector
+```json
+{
+  "id": "buttons",
+  "type": "attribute",
+  "value": "data-role",
+  "target": "attribute",
+  "attr": "data-role"
+}
+```
+Selects elements with matching attribute
+
+#### CSS Selector
+```json
+{
+  "id": "content",
+  "type": "selector",
+  "value": "div.content"
+}
+```
+Selects elements using CSS selector
+
+---
+
+### Target Type Examples
+
+#### HTML Target
+```json
+{
+  "id": "content",
+  "type": "class",
+  "value": "content",
+  "target": "html"
+}
+```
+Extracts the HTML content of selected elements
+
+#### Text Target
+```json
+{
+  "id": "title",
+  "type": "tag",
+  "value": "h1",
+  "target": "text"
+}
+```
+Extracts the text content of selected elements
+
+#### Attribute Target
+```json
+{
+  "id": "links",
+  "type": "tag",
+  "value": "a",
+  "target": "attribute",
+  "attr": "href"
+}
+```
+Extracts the value of specified attribute from selected elements
+
+---
+
+### Complete Example
+
+#### Task Document (Before Processing)
 ```json
 {
   "url": "https://example.com",
   "queries": [
     {
       "id": "title",
-      "type": "xpath",
-      "value": "//title",
-      "target": "text"
+      "type": "tag",
+      "value": "h1"
     },
     {
       "id": "description",
@@ -71,16 +158,17 @@
 }
 ```
 
-### **Example Data Document (After Processing)**
+Extracts the text content of the `<h1>` tag, the text content of the element with class `description`, and the value of the `href` attribute from all `<a>` tags.
+
+#### Result Document (After Processing)
 ```json
 {
   "url": "https://example.com",
   "queries": [
     {
       "id": "title",
-      "type": "xpath",
-      "value": "//title",
-      "target": "text"
+      "type": "tag",
+      "value": "h1"
     },
     {
       "id": "description",

_emulator/extensions/firestore-send-email.env.local

Lines changed: 0 additions & 1 deletion
This file was deleted.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+LOCATION=us
+DATABASE=(default)
+SCRAPE_COLLECTION=tasks

_emulator/firebase.json

Lines changed: 0 additions & 23 deletions
@@ -2,39 +2,16 @@
   "extensions": {
     "firestore-web-scraper": "../"
   },
-  "storage": {
-    "rules": "storage.rules"
-  },
   "emulators": {
-    "hub": {
-      "port": 4000
-    },
     "storage": {
       "port": 9199
     },
-    "auth": {
-      "port": 9099
-    },
-    "pubsub": {
-      "port": 8085
-    },
-    "functions": {
-      "port": 5001
-    },
     "ui": {
       "enabled": true
     },
     "firestore": {
       "host": "127.0.0.1",
       "port": 8080
     }
-  },
-  "functions": {
-    "port": 5002,
-    "source": "functions"
-  },
-  "firestore": {
-    "rules": "firestore.rules",
-    "indexes": "firestore.indexes.json"
   }
 }

_emulator/firestore.rules

Lines changed: 0 additions & 9 deletions
@@ -2,15 +2,6 @@ rules_version = '2';
 service cloud.firestore {
   match /databases/{database}/documents {
     match /{document=**} {
-      // This rule allows anyone with your database reference to view, edit,
-      // and delete all data in your database. It is useful for getting
-      // started, but it is configured to expire after 30 days because it
-      // leaves your app open to attackers. At that time, all client
-      // requests to your database will be denied.
-      //
-      // Make sure to write security rules for your app before that time, or
-      // else all client requests to your database will be denied until you
-      // update your rules.
       allow read, write;
     }
   }

extension.yaml

Lines changed: 27 additions & 5 deletions
@@ -1,5 +1,5 @@
 name: firestore-web-scraper
-version: 0.1.0
+version: 0.2.0
 specVersion: v1beta
 
 displayName: Web Scrape with Firestore
@@ -23,17 +23,39 @@ roles:
 
 resources:
   - name: processQueue
-    type: firebaseextensions.v1beta.function
+    type: firebaseextensions.v1beta.v2function
     description:
       Processes document changes in the specified Cloud Firestore collection,
       creating and performing web scraping tasks.
     properties:
-      runtime: nodejs20
+      sourceDirectory: functions
+      buildConfig:
+        runtime: nodejs22
       eventTrigger:
-        eventType: providers/cloud.firestore/eventTypes/document.create
-        resource: projects/${param:PROJECT_ID}/databases/(default)/documents/${param:SCRAPE_COLLECTION}/{id}
+        eventType: google.cloud.firestore.document.v1.created
+        triggerRegion: ${param:LOCATION}
+        eventFilters:
+          - attribute: database
+            value: ${param:DATABASE}
+          - attribute: document
+            value: ${param:SCRAPE_COLLECTION}/{documentId}
+            operator: match-path-pattern
 
 params:
+  - param: LOCATION
+    label: Location
+    description: The location of the Cloud Firestore database.
+    type: string
+    default: us-central1
+    required: true
+
+  - param: DATABASE
+    label: Database
+    description: The Firestore database to use. "(default)" is used for the default database.
+    type: string
+    default: (default)
+    required: true
+
   - param: SCRAPE_COLLECTION
     label: Scrape documents collection
     description: >-
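
The new `eventTrigger` above maps onto the 2nd-gen Firestore triggers of the `firebase-functions` SDK. As a rough sketch only (the extension's actual `processQueue` source is not part of this diff), the equivalent declaration would look roughly like this, with illustrative values in place of the `${param:...}` placeholders:

```ts
// Rough sketch, not the extension's source: a 2nd-gen Firestore trigger that
// matches the eventTrigger configured above (document created in the scrape
// collection of the chosen database, in the configured trigger region).
import { onDocumentCreated } from "firebase-functions/v2/firestore";

export const processQueue = onDocumentCreated(
  {
    document: "tasks/{documentId}", // ${param:SCRAPE_COLLECTION}/{documentId}
    database: "(default)",          // ${param:DATABASE}
    region: "us-central1",          // ${param:LOCATION}, the trigger region
  },
  async (event) => {
    const task = event.data?.data(); // the newly created task document
    if (!task || typeof task.url !== "string") return; // placeholder validation
    // ...fetch task.url, run the configured queries, write results back...
  },
);
```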
