specs/crawler/common/schemas/action.yml
67 additions & 9 deletions
@@ -3,10 +3,14 @@ Action:
   description: |
     How to process crawled URLs.
 
+
     Each action defines:
 
+
     - The targeted subset of URLs it processes.
+
     - What information to extract from the web pages.
+
     - The Algolia indices where the extracted records will be stored.
 
     If a single web page matches several actions,
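To ground these three bullet points, here is a minimal sketch of one action in a crawler configuration. The index name, URLs, and selector are placeholder assumptions, not part of this schema; `pathsToMatch` and `recordExtractor` are the parameters described in the hunks below.

```js
// Sketch of a single crawler action (placeholder values throughout).
actions: [
  {
    indexName: 'docs',                             // Algolia index receiving the records
    pathsToMatch: ['https://example.com/docs/**'], // targeted subset of URLs
    recordExtractor: ({ url, $ }) => [
      // what to extract from each matching page
      { objectID: url.href, title: $('title').text() },
    ],
  },
],
```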
@@ -21,16 +25,23 @@ Action:
   discoveryPatterns:
     type: array
     description: |
-      Indicates _intermediary_ pages that the crawler should visit.
+      Which _intermediary_ web pages the crawler should visit.
+      Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages, _not_ their content.
+
 
-      For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
+      It functions similarly to the `pathsToMatch` parameter but without record extraction.
+
+
+      `discoveryPatterns` uses [micromatch](https://github.com/micromatch/micromatch) to support matching with wildcards, negation, and other features.
+      The crawler adds all matching URLs to its queue.
     items:
       $ref: '#/urlPattern'
   fileTypesToMatch:
     type: array
     description: |
       File types for crawling non-HTML documents.
 
+
       For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
     maxItems: 100
     items:
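As a sketch of the distinction this description draws, a configuration might pair `discoveryPatterns` with `pathsToMatch` like this (hostnames and paths are placeholder assumptions):

```js
// Listing pages are visited only to discover links to articles;
// no records are extracted from the listing pages themselves.
discoveryPatterns: ['https://example.com/blog/page/**'],
// Records are extracted only from pages matching pathsToMatch.
pathsToMatch: ['https://example.com/blog/posts/**'],
```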
@@ -59,6 +70,7 @@ Action:
     description: |
       URLs to which this action should apply.
 
+
       Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
     minItems: 1
     maxItems: 100
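To illustrate the micromatch semantics (wildcards and negation) referenced here, a small standalone check against the library the descriptions link to:

```js
const micromatch = require('micromatch');

// '**' matches any number of path segments.
micromatch.isMatch('https://example.com/docs/a/b', 'https://example.com/docs/**'); // true

// '!' negates a pattern: everything under /docs/ except /docs/internal/.
micromatch(
  ['https://example.com/docs/guide', 'https://example.com/docs/internal/x'],
  ['https://example.com/docs/**', '!https://example.com/docs/internal/**']
); // ['https://example.com/docs/guide']
```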
@@ -69,9 +81,12 @@ Action:
     type: object
     description: |
       Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
-      The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
 
-      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
+
+      The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
+
+
+      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
     properties:
       __type:
         $ref: '#/configurationRecordExtractorType'
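A minimal `recordExtractor` sketch: the Cheerio-style `$` argument and the returned record array follow the linked documentation, while the selectors are assumptions about a site's markup.

```js
recordExtractor: ({ url, $ }) => {
  // `$` is a Cheerio-like handle on the crawled page.
  return [
    {
      objectID: url.href,                   // unique record identifier
      title: $('head title').text().trim(), // page title
      content: $('article p').text(),       // main body text (selector is site-specific)
    },
  ];
},
```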
@@ -110,7 +125,8 @@ ActionSchedule:
   fileTypes:
     type: string
     description: |
-      Supported file type for indexing non-HTML documents.
+      Supported file types for indexing non-HTML documents.
+
 
       For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
     enum:
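In an action, the corresponding `fileTypesToMatch` array might look like the following sketch; the specific values are assumptions here, so check the enum this schema defines for the authoritative list.

```js
// Crawl PDF and Word documents in addition to HTML pages (sketch).
fileTypesToMatch: ['html', 'pdf', 'doc'],
```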
@@ -129,6 +145,7 @@ urlPattern:
   description: |
     Pattern for matching URLs.
 
+
     Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
@@ -140,7 +157,43 @@ hostnameAliases:
     Key-value pairs to replace matching hostnames found in a sitemap,
     on a page, in canonical links, or redirects.
 
-    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
+
+    During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.
+    This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).
+
+
+    For example, with this `hostnameAliases` mapping:
+
+    {
+      hostnameAliases: {
+        'dev.example.com': 'example.com'
+      }
+    }
+
+    1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.
+
+    1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.
+
+    1. The crawler follows the transformed URL (not the original).
+
+
+    **`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**
+    However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,
+    nor does it affect the `pathsToMatch` parameter or other configuration elements.
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
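To make the caveat in that last paragraph concrete, a sketch combining `startUrls` with the mapping above (hostnames are placeholders):

```js
// startUrls are used exactly as written, even though the alias rewrites
// dev.example.com links discovered *during* the crawl.
startUrls: ['https://dev.example.com/'], // NOT transformed by hostnameAliases
hostnameAliases: {
  'dev.example.com': 'example.com',      // applied to discovered links only
},
```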
@@ -153,12 +206,16 @@ pathAliases:
     '/foo': '/bar'
   description: |
     Key-value pairs to replace matching paths with new values.
+
 
     It doesn't replace:
-
+
+
     - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
+
     - Paths found in extracted text.
 
+
     The crawl continues from the _transformed_ URLs.
   additionalProperties:
     type: object
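A sketch of the nested shape this schema implies (hostname, then path prefix, then replacement; all values are placeholders):

```js
// Replace the '/foo' path prefix with '/bar' for URLs on example.com:
// 'https://example.com/foo/page' is crawled as 'https://example.com/bar/page'.
pathAliases: {
  'example.com': { '/foo': '/bar' },
},
```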
@@ -172,9 +229,10 @@ pathAliases:
 cache:
   type: object
   description: |
-    Whether the crawler should cache crawled pages.
+    Whether the crawler should cache crawled pages.
+
 
-    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
+    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
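As a sketch, the `cache` object commonly carries an `enabled` flag; that field name is an assumption here, so see the linked page for the exact shape.

```js
// Enable caching so subsequent crawls can skip unchanged pages (sketch;
// the `enabled` field is an assumption, not confirmed by this schema).
cache: { enabled: true },
```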