After installing the extension, follow this guide to configure scraping tasks and manage extracted data. Below you'll find detailed instructions, document structures, and examples.

+## Setting Up a Task
+Create a document in your tasks collection **`${param:SCRAPE_COLLECTION}`** to define a scraping task.
+
+---
+
+### Task Document Structure
+
+#### Required Fields:
+- **url** (string): Target URL to scrape (e.g., `"https://example.com"`)
+- **queries** (array of objects): List of queries to extract data from the HTML content
+
 ---

-## **Setting Up a Task**
-Create a document in your tasks collection **`${param:SCRAPE_COLLECTION}`** to define a scraping task.
-| `url` | string | **Required.** Target URL to scrape (e.g., `"https://example.com"`). |
-| `queries` | array of objects | **Required.** List of queries to extract data from the HTML content. |
-
-### **1. `queries` Configuration**
-Each query in the `queries` array narrows down elements from the HTML. Queries execute **in sequence**, with each subsequent query applied to the results of the previous one.
-| **`html`** | Extracts the HTML content of the selected elements. |
-| **`inner`** | Extracts the inner HTML content of the selected elements. |
-| **`text`** | Extracts the text content of the selected elements. |
-| **`attribute`** | Extracts the value of the specified attribute from the selected elements. |
-
-
-### **Example Task Document (Before Processing)**
+### Query Configuration
+Each query in the `queries` array selects specific elements from the HTML. Multiple queries can be used to extract different types of data from the same HTML.
+
+#### Query Object Fields:
+- **id** (string, required): Unique identifier for the query. Used as the key in the output `data` object.
+- **value** (string, required): Value for the selector
+- **target** (string, optional): What to extract from the selected elements
+  - `html`: Extract HTML content (default)
+  - `text`: Extract text content
+  - `attribute`: Extract attribute value
+- **attr** (string, optional): Name of the attribute to extract. Only allowed when `target` is `attribute`.
+
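The field rules above can be checked before a task document is written. The following is a minimal sketch, not the extension's own validation: `validate_query` is a hypothetical helper, the allowed `type` values are inferred from the examples in this guide, and treating `attr` as required whenever `target` is `attribute` is an assumption based on the Attribute Target example.

```python
# Assumed value sets, inferred from the examples in this guide (not an official schema).
ALLOWED_TYPES = {"id", "class", "tag", "attribute", "selector"}
ALLOWED_TARGETS = {"html", "text", "attribute"}


def validate_query(query: dict) -> list[str]:
    """Return a list of problems found in a query object (empty if none)."""
    errors = []

    # id, type, and value are treated as required non-empty strings.
    for field in ("id", "type", "value"):
        if not isinstance(query.get(field), str) or not query[field]:
            errors.append(f"'{field}' must be a non-empty string")

    kind = query.get("type")
    if isinstance(kind, str) and kind and kind not in ALLOWED_TYPES:
        errors.append(f"unknown type: {kind!r}")

    # target is optional; html is the documented default.
    target = query.get("target", "html")
    if target not in ALLOWED_TARGETS:
        errors.append(f"unknown target: {target!r}")

    # Assumption: attr accompanies target == "attribute" and nothing else.
    if target == "attribute" and "attr" not in query:
        errors.append("'attr' is required when target is 'attribute'")
    if target != "attribute" and "attr" in query:
        errors.append("'attr' is only allowed when target is 'attribute'")

    return errors


if __name__ == "__main__":
    ok = {"id": "links", "type": "tag", "value": "a",
          "target": "attribute", "attr": "href"}
    bad = {"id": "title", "type": "tag", "value": "h1", "attr": "href"}
    print(validate_query(ok))
    print(validate_query(bad))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a task document at once.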
+---
+
+### Query Type Examples
+
+#### ID Selector
+```json
+{
+  "id": "header",
+  "type": "id",
+  "value": "header"
+}
+```
+Selects element with ID `#header`
+
+#### Class Selector
+```json
+{
+  "id": "menu",
+  "type": "class",
+  "value": "menu-item"
+}
+```
+Selects elements with class `.menu-item`
+
+#### Tag Selector
+```json
+{
+  "id": "links",
+  "type": "tag",
+  "value": "a"
+}
+```
+Selects all `<a>` tags
+
+#### Attribute Selector
+```json
+{
+  "id": "buttons",
+  "type": "attribute",
+  "value": "data-role",
+  "target": "attribute",
+  "attr": "data-role"
+}
+```
+Selects elements with matching attribute
+
+#### CSS Selector
+```json
+{
+  "id": "content",
+  "type": "selector",
+  "value": "div.content"
+}
+```
+Selects elements using CSS selector
+
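One way to read the five query types is as shorthand for CSS selectors. The sketch below makes that reading explicit; `to_css_selector` is a hypothetical helper, and the mapping (in particular `attribute` becoming a presence selector such as `[data-role]`) is an assumption drawn from the descriptions above, not the extension's confirmed behavior.

```python
def to_css_selector(query: dict) -> str:
    """Translate a query's type/value pair into an equivalent CSS selector."""
    kind, value = query["type"], query["value"]
    if kind == "id":
        return f"#{value}"       # "header"    -> "#header"
    if kind == "class":
        return f".{value}"       # "menu-item" -> ".menu-item"
    if kind == "tag":
        return value             # "a"         -> "a"
    if kind == "attribute":
        return f"[{value}]"      # "data-role" -> "[data-role]" (assumed: presence match)
    if kind == "selector":
        return value             # already a CSS selector
    raise ValueError(f"unknown query type: {kind!r}")


if __name__ == "__main__":
    for q in (
        {"id": "header", "type": "id", "value": "header"},
        {"id": "menu", "type": "class", "value": "menu-item"},
        {"id": "content", "type": "selector", "value": "div.content"},
    ):
        print(q["id"], "->", to_css_selector(q))
```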
+---
+
+### Target Type Examples
+
+#### HTML Target
+```json
+{
+  "id": "content",
+  "type": "class",
+  "value": "content",
+  "target": "html"
+}
+```
+Extracts the HTML content of selected elements
+
+#### Text Target
+```json
+{
+  "id": "title",
+  "type": "tag",
+  "value": "h1",
+  "target": "text"
+}
+```
+Extracts the text content of selected elements
+
+#### Attribute Target
+```json
+{
+  "id": "links",
+  "type": "tag",
+  "value": "a",
+  "target": "attribute",
+  "attr": "href"
+}
+```
+Extracts the value of specified attribute from selected elements
+
+---
+
+### Complete Example
+
+#### Task Document (Before Processing)
 ```json
 {
   "url": "https://example.com",
   "queries": [
     {
       "id": "title",
-      "type": "xpath",
-      "value": "//title",
-      "target": "text"
+      "type": "tag",
+      "value": "h1"
     },
     {
       "id": "description",
@@ -71,16 +158,17 @@ Each query in the `queries` array narrows down elements from the HTML. Queries e
 }
 ```

-### **Example Data Document (After Processing)**
+Extracts the text content of the `<h1>` tag, the text content of the element with class `description`, and the value of the `href` attribute from all `<a>` tags.