`specs/crawler/common/schemas/action.yml` — 8 additions, 66 deletions
```diff
@@ -3,14 +3,10 @@ Action:
   description: |
     How to process crawled URLs.
 
-
     Each action defines:
 
-
     - The targeted subset of URLs it processes.
-
     - What information to extract from the web pages.
-
     - The Algolia indices where the extracted records will be stored.
 
     If a single web page matches several actions,
```
```diff
@@ -26,22 +22,19 @@ Action:
     type: array
     description: |
       Which _intermediary_ web pages the crawler should visit.
-      Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages, _not_ their content.
-
-
+      Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages,
+      _not_ their content.
       It functions similarly to the `pathsToMatch` action but without record extraction.
 
-
-      Uses [micromatch](https://github.com/micromatch/micromatch) to match wildcards, negation, and other features.
-      The crawler adds all matching URLs to its queue.
+      `discoveryPatterns` uses [micromatch](https://github.com/micromatch/micromatch) to support matching with wildcards,
+      negation, and other features.
     items:
       $ref: '#/urlPattern'
   fileTypesToMatch:
     type: array
     description: |
       File types for crawling non-HTML documents.
 
-
       For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
     maxItems: 100
     items:
```
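The micromatch-style patterns referenced by `discoveryPatterns` and `pathsToMatch` support wildcards and `!`-prefixed negation. The crawler itself uses the JavaScript [micromatch](https://github.com/micromatch/micromatch) library; as a rough illustration only, the positive/negative matching logic can be approximated with Python's `fnmatch` (whose semantics differ in detail, e.g. `*` also crosses `/`):

```python
from fnmatch import fnmatchcase

def matches(url: str, patterns: list[str]) -> bool:
    """Rough micromatch-style check: a URL matches if it satisfies at
    least one positive pattern and no negated ('!') pattern."""
    positives = [p for p in patterns if not p.startswith("!")]
    negatives = [p[1:] for p in patterns if p.startswith("!")]
    return (any(fnmatchcase(url, p) for p in positives)
            and not any(fnmatchcase(url, p) for p in negatives))

print(matches("https://www.algolia.com/blog/post",
              ["https://www.algolia.com/**"]))                     # True
print(matches("https://www.algolia.com/blog/post",
              ["https://www.algolia.com/**", "!**/blog/**"]))      # False
```

The `matches` helper is hypothetical and only sketches the match-then-exclude idea; it is not crawler code.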
```diff
@@ -70,7 +63,6 @@ Action:
     description: |
       URLs to which this action should apply.
 
-
       Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
     minItems: 1
     maxItems: 100
```
```diff
@@ -82,11 +74,8 @@ Action:
     description: |
       Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
 
-
       The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
-
-
-      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+      For details, see the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
     properties:
       __type:
         $ref: '#/configurationRecordExtractorType'
```
```diff
@@ -127,7 +116,6 @@ fileTypes:
   description: |
     Supported file types for indexing non-HTML documents.
 
-
     For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
```
```diff
@@ -145,55 +133,14 @@ urlPattern:
   description: |
     Pattern for matching URLs.
 
-
     Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
 
 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
-  description: |
-    Key-value pairs to replace matching hostnames found in a sitemap,
-    on a page, in canonical links, or redirects.
-
-
-    During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.
-    This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).
-
-
-    For example, with this `hostnameAliases` mapping:
-
-        {
-          hostnameAliases: {
-            'dev.example.com': 'example.com'
-          }
-        }
-
-    1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.
-
-    1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.
-
-    1. The crawler follows the transformed URL (not the original).
-
-
-    **`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**
-
-
-    The crawler can discover URLs in places such as:
-
-    - Crawled pages
-    - Sitemaps
-    - [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)
-    - Redirects.
-
-
-    However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,
-    nor does it affect the `pathsToMatch` action or other configuration elements.
+  description: "Key-value pairs to replace matching hostnames found in a sitemap,\non a page, in canonical links, or redirects.\n\n\nDuring a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.\nThis helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).\n\n\nFor example, with this `hostnameAliases` mapping:\n\n    {\n      hostnameAliases: {\n        'dev.example.com': 'example.com'\n      }\n    }\n\n1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.\n\n1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.\n\n1. The crawler follows the transformed URL (not the original).\n\n\n**`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**\n\n\nThe crawler can discover URLs in places such as:\n\n\n- Crawled pages\n\n- Sitemaps\n\n- [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)\n\n- Redirects. \n\n\nHowever, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,\nnor does it affect the `pathsToMatch` action or other configuration elements.\n"
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
```
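The three-step transformation in the `hostnameAliases` description can be sketched in Python. This is an illustration only; `apply_hostname_aliases` is a hypothetical helper, not crawler code:

```python
from urllib.parse import urlsplit, urlunsplit

# Mapping from the schema's own example
HOSTNAME_ALIASES = {"dev.example.com": "example.com"}

def apply_hostname_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace the URL's hostname if it appears in the alias mapping."""
    parts = urlsplit(url)
    host = aliases.get(parts.hostname or "", parts.hostname)
    # Rebuild the netloc, preserving an explicit port if one was present
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))

print(apply_hostname_aliases("https://dev.example.com/solutions/voice-search/",
                             HOSTNAME_ALIASES))
# → https://example.com/solutions/voice-search/
```

As the description notes, only the URL is rewritten; any occurrence of `dev.example.com` in extracted page text would be left untouched.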
```diff
@@ -207,15 +154,11 @@ pathAliases:
   description: |
     Key-value pairs to replace matching paths with new values.
 
-
     It doesn't replace:
 
-
     - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
-
     - Paths found in extracted text.
 
-
     The crawl continues from the _transformed_ URLs.
 
 
```
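By analogy with hostname aliasing, the path replacement that `pathAliases` describes might look like the following sketch. The exact matching semantics are defined by the crawler, not here; this assumes a simple prefix-replacement mapping and a hypothetical `apply_path_aliases` helper:

```python
from urllib.parse import urlsplit, urlunsplit

def apply_path_aliases(url: str, aliases: dict[str, str]) -> str:
    """Replace the first matching path prefix with its new value
    (illustrative assumption about how the mapping is applied)."""
    parts = urlsplit(url)
    for old, new in aliases.items():
        if parts.path.startswith(old):
            return urlunsplit(parts._replace(path=new + parts.path[len(old):]))
    return url  # no alias matched; crawl continues from the original URL

print(apply_path_aliases("https://example.com/dev/blog/post", {"/dev": ""}))
# → https://example.com/blog/post
```

Consistent with the description above, such a transformation would apply to discovered URLs only, not to `startUrls`, `sitemaps`, `pathsToMatch`, or paths found in extracted text.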
```diff
@@ -237,10 +180,9 @@ pathAliases:
 cache:
   type: object
   description: |
-    Whether the crawler should cache crawled pages.
-
+    Whether the crawler should cache crawled pages.
 
-    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
+    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
```
0 commit comments