specs/crawler/common/schemas/action.yml (+5 −5)
@@ -21,9 +21,9 @@ Action:
   discoveryPatterns:
     type: array
     description: |
-      Indicates additional pages that the crawler should visit.
+      Indicates _intermediary_ pages that the crawler should visit.
 
-      For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
+      For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
     items:
       $ref: '#/urlPattern'
   fileTypesToMatch:
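The distinction between matched and intermediary pages can be sketched in a hypothetical crawler configuration. The `example.com` URLs and the `products` index name are illustrative assumptions, not values from the spec:

```yaml
# Sketch of an action using discoveryPatterns (hypothetical values).
# Category pages are visited only so their links can be followed;
# records are extracted solely from pages matching pathsToMatch.
actions:
  - indexName: products            # assumed index name
    pathsToMatch:
      - https://www.example.com/products/**
    discoveryPatterns:
      - https://www.example.com/categories/**
```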
@@ -71,7 +71,7 @@ Action:
       Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
       The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
 
-      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
     properties:
       __type:
         $ref: '#/configurationRecordExtractorType'
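As a rough sketch of the serialized shape this schema describes (the spec shows a `__type` discriminator), a `recordExtractor` in a configuration might look like the following. The `source` body and its field names are illustrative assumptions, not taken from the spec:

```yaml
# Hypothetical serialized recordExtractor; the source body is
# JavaScript that the Crawler evaluates for each crawled page.
recordExtractor:
  __type: function
  source: |
    ({ url, $ }) => {
      // '$' is assumed to be the Crawler's Cheerio-like page handle
      return [{ objectID: url.pathname, title: $('title').text() }];
    }
```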
@@ -140,7 +140,7 @@ hostnameAliases:
     Key-value pairs to replace matching hostnames found in a sitemap,
     on a page, in canonical links, or redirects.
 
-    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
+    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
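A minimal sketch of the key-value shape this schema describes; the hostnames are illustrative placeholders:

```yaml
hostnameAliases:
  # Replace the staging hostname with the production one wherever it
  # appears in sitemaps, pages, canonical links, or redirects.
  'dev.example.com': 'www.example.com'
```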
@@ -174,7 +174,7 @@ cache:
   description: |
     Whether the crawler should cache crawled pages.
 
-    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
+    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
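In a crawler configuration, the cache setting described above might be written as follows; this is a sketch assuming the object form with an `enabled` flag:

```yaml
cache:
  enabled: true  # reuse previously crawled pages when possible
```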
specs/crawler/common/schemas/configuration.yml (+34 −20)
@@ -18,7 +18,7 @@ Configuration:
     description: |
       Algolia API key for indexing the records.
 
-      For more information, see the [`apiKey` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/api-key/).
+      For more information, see the [`apiKey` documentation](https://www.algolia.com/doc/tools/crawler/apis/apikey/).
   appId:
     $ref: '../parameters.yml#/applicationID'
   exclusionPatterns:
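Put together, the credential parameters above sit at the top level of a configuration; the values here are placeholders:

```yaml
appId: YOUR_APP_ID            # placeholder application ID
apiKey: YOUR_CRAWLER_API_KEY  # placeholder key used for indexing records
```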
@@ -50,9 +50,8 @@ Configuration:
     type: array
     maxItems: 9999
     description: |
-      URLs from where to start crawling.
-
-      For more information, see the [`extraUrls` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/extra-urls/).
+      The Crawler treats `extraUrls` the same as `startUrls`.
+      Specify `extraUrls` if you want to differentiate between URLs you manually added to fix site crawling from those you initially specified in `startUrls`.
     items:
       type: string
   ignoreCanonicalTo:
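The relationship between `startUrls` and `extraUrls` described in the new wording can be sketched as follows; the URLs are illustrative:

```yaml
startUrls:
  - https://www.example.com/
extraUrls:
  # Crawled exactly like startUrls; kept separate only to mark
  # URLs added manually to fix gaps in site crawling.
  - https://www.example.com/orphaned-landing-page
```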
@@ -62,7 +61,7 @@ Configuration:
     description: |
       Whether to ignore the `nofollow` meta tag or link attribute.
 
-      For more information, see the [`ignoreNoFollowTo` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/ignore-no-follow-to/).
+      For more information, see the [`ignoreNoFollowTo` documentation](https://www.algolia.com/doc/tools/crawler/apis/ignorenofollowto/).
   ignoreNoIndex:
     type: boolean
     description: |
@@ -97,7 +96,9 @@ Configuration:
     description: |
       Crawler index settings.
 
-      For more information, see the [`initialIndexSettings` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/initial-index-settings/).
+      These index settings are only applied during the first crawl of an index.
+      Any subsequent changes won't be applied to the index.
+      Instead, make changes to your index settings in the [Algolia dashboard](https://dashboard.algolia.com/explorer/configuration/).
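Because these settings apply only on the first crawl of an index, they are typically keyed by index name, for example (a sketch; the index and attribute names are illustrative):

```yaml
initialIndexSettings:
  products:                 # assumed index name
    searchableAttributes:   # applied only on the index's first crawl
      - title
      - description
```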
@@ … @@
+      Change it to a low value, such as 100, for quick crawling tests.
+      Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures.
 
-      Setting `maxUrls` doesn't guarantee consistency between crawls
-      because the crawler processes URLs in parallel.
+      Because the Crawler works on many pages simultaneously, `maxUrls` doesn't guarantee finding the same pages each time it runs.
     minimum: 1
     maximum: 15000000
   rateLimit:
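The guidance above translates to a one-line setting; the value is the diff's own suggestion for quick tests:

```yaml
maxUrls: 100  # quick crawling test; raise (up to 15000000) for full crawls
```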
@@ -194,9 +197,12 @@ ignoreCanonicalTo:
   oneOf:
     - type: boolean
       description: |
-        Whether to ignore canonical redirects.
+        Determines if the crawler should extract records from a page with a [canonical URL](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behaviorr).
 
-        If true, canonical URLs for pages are ignored.
+        If ignoreCanonicalTo is set to:
+
+        - `true` all canonical URLs are ignored.
+        - One or more URL patterns, the crawler will ignore the canonical URL if it matches a pattern.
     - type: array
       description: |
         Canonical URLs or URL patterns to ignore.
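The two accepted forms described above can be sketched as follows; the URL pattern is illustrative:

```yaml
# Boolean form: ignore every canonical URL.
ignoreCanonicalTo: true

# Pattern form: ignore the canonical only when it matches a pattern.
# ignoreCanonicalTo:
#   - https://www.example.com/products/**
```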
@@ -209,9 +215,10 @@ ignoreCanonicalTo:
 
 renderJavaScript:
   description: |
-    Crawl JavaScript-rendered pages with a headless browser.
+    If `true`, use a Chrome headless browser to crawl pages.
 
-    For more information, see the [`renderJavaScript` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/render-java-script/).
+    Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages.
+    Use [micromatch](https://github.com/micromatch/micromatch) to define URL patterns, including negations and wildcards.
   oneOf:
     - type: boolean
       description: Whether to render all pages.
@@ -220,24 +227,29 @@ renderJavaScript:
       items:
         type: string
         description: URL or URL pattern to render.
-        example: https://www.example.com
+        example:
+          - http://www.mysite.com/dynamic-pages/**
     - title: headlessBrowserConfig
       type: object
       description: Configuration for rendering HTML.
       properties:
         enabled:
           type: boolean
-          description: Whether to render matching URLs.
+          description: Whether to enable JavaScript rendering.
+          example: true
         patterns:
           type: array
           description: URLs or URL patterns to render.
           items:
             type: string
+          example:
+            - http://www.mysite.com/dynamic-pages/**
         adBlock:
           type: boolean
+          default: false
           description: |
-            Whether to turn on the built-in adblocker.
-            This blocks most ads and tracking scripts but can break some sites.
+            Whether to use the Crawler's ad blocker.
+            It blocks most ads and tracking scripts but can break some sites.
         waitTime:
           $ref: '#/waitTime'
       required:
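Combining the `headlessBrowserConfig` fields shown in this hunk, the object form might look like the following sketch; the values reuse the diff's own examples:

```yaml
renderJavaScript:
  enabled: true
  patterns:
    - http://www.mysite.com/dynamic-pages/**
  adBlock: false      # default; true blocks most ads and trackers
  waitTime:
    min: 7000
    max: 15000
```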
@@ -246,7 +258,7 @@ renderJavaScript:
 
 requestOptions:
   type: object
-  description: Options to add to all HTTP requests made by the crawler.
+  description: Lets you add options to HTTP requests made by the crawler.
   properties:
     proxy:
       type: string
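A sketch of the `proxy` option under `requestOptions`; the proxy URL is a placeholder:

```yaml
requestOptions:
  proxy: http://10.0.0.1:8080  # placeholder proxy for crawler requests
```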
@@ -270,10 +282,12 @@ waitTime:
       type: number
       default: 0
       description: Minimum waiting time in milliseconds.
+      example: 7000
     max:
       type: number
       default: 20000
       description: Maximum waiting time in milliseconds.
+      example: 15000
 
 initialIndexSettings:
   type: object
@@ -450,5 +464,5 @@ schedule:
   description: |
     Schedule for running the crawl.
 
-    For more information, see the [`schedule` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
+    For more information, see the [`schedule` documentation](https://www.algolia.com/doc/tools/crawler/apis/schedule/).
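A schedule value, as a sketch; the natural-language expression format is an assumption based on Algolia's crawler documentation:

```yaml
schedule: every weekday at 12:00 pm  # assumed expression format
```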