Commit 231de9b

gazconroy and Gary Conroy authored
fix(specs): Update Crawler spec in line with doc site updates (#4508)
Co-authored-by: Gary Conroy <[email protected]>
1 parent 3340a9b commit 231de9b

3 files changed: 42 additions & 28 deletions

specs/crawler/common/schemas/action.yml

Lines changed: 5 additions & 5 deletions
@@ -21,9 +21,9 @@ Action:
 discoveryPatterns:
 type: array
 description: |
-Indicates additional pages that the crawler should visit.
+Indicates _intermediary_ pages that the crawler should visit.

-For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
+For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
 items:
 $ref: '#/urlPattern'
 fileTypesToMatch:
@@ -71,7 +71,7 @@ Action:
 Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
 The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.

-For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
 properties:
 __type:
 $ref: '#/configurationRecordExtractorType'
@@ -140,7 +140,7 @@ hostnameAliases:
 Key-value pairs to replace matching hostnames found in a sitemap,
 on a page, in canonical links, or redirects.

-For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
+For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
 additionalProperties:
 type: string
 description: Hostname that should be added in the records.
@@ -174,7 +174,7 @@ cache:
 description: |
 Whether the crawler should cache crawled pages.

-For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
+For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
 properties:
 enabled:
 type: boolean

specs/crawler/common/schemas/configuration.yml

Lines changed: 34 additions & 20 deletions
@@ -18,7 +18,7 @@ Configuration:
 description: |
 Algolia API key for indexing the records.

-For more information, see the [`apiKey` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/api-key/).
+For more information, see the [`apiKey` documentation](https://www.algolia.com/doc/tools/crawler/apis/apikey/).
 appId:
 $ref: '../parameters.yml#/applicationID'
 exclusionPatterns:
@@ -50,9 +50,8 @@ Configuration:
 type: array
 maxItems: 9999
 description: |
-URLs from where to start crawling.
-
-For more information, see the [`extraUrls` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/extra-urls/).
+The Crawler treats `extraUrls` the same as `startUrls`.
+Specify `extraUrls` if you want to differentiate between URLs you manually added to fix site crawling from those you initially specified in `startUrls`.
 items:
 type: string
 ignoreCanonicalTo:
@@ -62,7 +61,7 @@ Configuration:
 description: |
 Whether to ignore the `nofollow` meta tag or link attribute.

-For more information, see the [`ignoreNoFollowTo` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/ignore-no-follow-to/).
+For more information, see the [`ignoreNoFollowTo` documentation](https://www.algolia.com/doc/tools/crawler/apis/ignorenofollowto/).
 ignoreNoIndex:
 type: boolean
 description: |
@@ -97,7 +96,9 @@ Configuration:
 description: |
 Crawler index settings.

-For more information, see the [`initialIndexSettings` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/initial-index-settings/).
+These index settings are only applied during the first crawl of an index.
+Any subsequent changes won't be applied to the index.
+Instead, make changes to your index settings in the [Algolia dashboard](https://dashboard.algolia.com/explorer/configuration/).
 additionalProperties:
 $ref: '../../../common/schemas/IndexSettings.yml#/indexSettings'
 x-additionalPropertiesName: indexName
@@ -107,7 +108,7 @@ Configuration:
 description: |
 Function for extracting URLs from links on crawled pages.

-For more information, see the [`linkExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/link-extractor/).
+For more information, see the [`linkExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/linkextractor/).
 properties:
 __type:
 $ref: './action.yml#/configurationRecordExtractorType'
@@ -136,10 +137,12 @@ Configuration:
 maxUrls:
 type: number
 description: |
-Maximum number of crawled URLs.
+Limits the number of URLs your crawler processes.
+
+Change it to a low value, such as 100, for quick crawling tests.
+Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures.

-Setting `maxUrls` doesn't guarantee consistency between crawls
-because the crawler processes URLs in parallel.
+Because the Crawler works on many pages simultaneously, `maxUrls` doesn't guarantee finding the same pages each time it runs.
 minimum: 1
 maximum: 15000000
 rateLimit:
@@ -194,9 +197,12 @@ ignoreCanonicalTo:
 oneOf:
 - type: boolean
 description: |
-Whether to ignore canonical redirects.
+Determines if the crawler should extract records from a page with a [canonical URL](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behaviorr).

-If true, canonical URLs for pages are ignored.
+If ignoreCanonicalTo is set to:
+
+- `true` all canonical URLs are ignored.
+- One or more URL patterns, the crawler will ignore the canonical URL if it matches a pattern.
 - type: array
 description: |
 Canonical URLs or URL patterns to ignore.
@@ -209,9 +215,10 @@ ignoreCanonicalTo:

 renderJavaScript:
 description: |
-Crawl JavaScript-rendered pages with a headless browser.
+If `true`, use a Chrome headless browser to crawl pages.

-For more information, see the [`renderJavaScript` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/render-java-script/).
+Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages.
+Use [micromatch](https://github.com/micromatch/micromatch) to define URL patterns, including negations and wildcards.
 oneOf:
 - type: boolean
 description: Whether to render all pages.
@@ -220,24 +227,29 @@ renderJavaScript:
 items:
 type: string
 description: URL or URL pattern to render.
-example: https://www.example.com
+example:
+- http://www.mysite.com/dynamic-pages/**
 - title: headlessBrowserConfig
 type: object
 description: Configuration for rendering HTML.
 properties:
 enabled:
 type: boolean
-description: Whether to render matching URLs.
+description: Whether to enable JavaScript rendering.
+example: true
 patterns:
 type: array
 description: URLs or URL patterns to render.
 items:
 type: string
+example:
+- http://www.mysite.com/dynamic-pages/**
 adBlock:
 type: boolean
+default: false
 description: |
-Whether to turn on the built-in adblocker.
-This blocks most ads and tracking scripts but can break some sites.
+Whether to use the Crawler's ad blocker.
+It blocks most ads and tracking scripts but can break some sites.
 waitTime:
 $ref: '#/waitTime'
 required:
@@ -246,7 +258,7 @@ renderJavaScript:

 requestOptions:
 type: object
-description: Options to add to all HTTP requests made by the crawler.
+description: Lets you add options to HTTP requests made by the crawler.
 properties:
 proxy:
 type: string
@@ -270,10 +282,12 @@ waitTime:
 type: number
 default: 0
 description: Minimum waiting time in milliseconds.
+example: 7000
 max:
 type: number
 default: 20000
 description: Maximum waiting time in milliseconds.
+example: 15000

 initialIndexSettings:
 type: object
@@ -450,5 +464,5 @@ schedule:
 description: |
 Schedule for running the crawl.

-For more information, see the [`schedule` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
+For more information, see the [`schedule` documentation](https://www.algolia.com/doc/tools/crawler/apis/schedule/).
 example: every weekday at 12:00 pm
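
The updated descriptions in `configuration.yml` can be read together as one configuration. The fragment below is a minimal sketch rather than an official example: it writes a hypothetical crawler configuration as YAML (the Crawler editor itself may expect a different syntax), reuses the property names and example values from the schema changes above, and treats the `mysite.com` URLs as placeholders.

```yaml
# Hypothetical configuration fragment for illustration only.
# Property names and example values are taken from the schema diff above;
# the URLs are placeholders and the YAML form is an assumption.

# Low value for a quick crawling test (schema range: 1 to 15000000).
maxUrls: 100

# Ignore a page's canonical URL only when it matches one of these patterns.
# Setting this to `true` instead would ignore all canonical URLs.
ignoreCanonicalTo:
  - http://www.mysite.com/dynamic-pages/**

# Object form (headlessBrowserConfig): render only the listed patterns,
# because JavaScript rendering is slower than crawling plain HTML.
renderJavaScript:
  enabled: true
  patterns:
    - http://www.mysite.com/dynamic-pages/**
  adBlock: false      # schema default
  waitTime:
    min: 7000         # milliseconds, schema default is 0
    max: 15000        # milliseconds, schema default is 20000
```

For a full crawl, the `maxUrls` description suggests replacing the test value with a higher explicit limit.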

specs/crawler/spec.yml

Lines changed: 3 additions & 3 deletions
@@ -65,9 +65,9 @@ security:
 - BasicAuth: []
 tags:
 - name: actions
-x-displayName: Actions
+x-displayName: State
 description: |
-Actions change the state of crawlers, such as pausing and unpausing schedules or testing the crawler with specific URLs.
+Change the state of crawlers, such as pausing crawl schedules or testing the crawler with specific URLs.
 - name: config
 x-displayName: Configuration
 description: |
description: |
@@ -78,7 +78,7 @@ tags:
 It's easiest to make configuration changes on the [Crawler page](https://dashboard.algolia.com/crawler) in the Algolia dashboard.
 The editor has autocomplete and built-in validation so you can try your configuration changes before committing them.
 - name: crawlers
-x-displayName: Crawler
+x-displayName: Manage
 description: |
 A crawler is an object with a name and a [configuration](#tag/config).
 Use these endpoints to create, rename, and delete crawlers.
