Commit 409e0ec

algolia-bot, gazconroy, and Gary Conroy committed
fix(specs): Update Crawler spec in line with doc site updates (#4508) (generated) [skip ci]
Co-authored-by: gazconroy <[email protected]> Co-authored-by: Gary Conroy <[email protected]>
1 parent 231de9b commit 409e0ec

36 files changed: +232 -168 lines changed

docs/bundled/crawler.yml

Lines changed: 66 additions & 34 deletions
@@ -95,10 +95,10 @@ security:
 - BasicAuth: []
 tags:
 - name: actions
-x-displayName: Actions
+x-displayName: State
 description: >
-Actions change the state of crawlers, such as pausing and unpausing
-schedules or testing the crawler with specific URLs.
+Change the state of crawlers, such as pausing crawl schedules or testing
+the crawler with specific URLs.
 - name: config
 x-displayName: Configuration
 description: >
@@ -117,7 +117,7 @@ tags:
 The editor has autocomplete and built-in validation so you can try your
 configuration changes before committing them.
 - name: crawlers
-x-displayName: Crawler
+x-displayName: Manage
 description: |
 A crawler is an object with a name and a [configuration](#tag/config).
 Use these endpoints to create, rename, and delete crawlers.
@@ -817,7 +817,7 @@ components:


 For more information, see the [`cache`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
 properties:
 enabled:
 type: boolean
@@ -861,7 +861,7 @@ components:


 For more information, see the [`hostnameAliases`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
 additionalProperties:
 type: string
 description: Hostname that should be added in the records.
@@ -919,11 +919,11 @@ components:
 discoveryPatterns:
 type: array
 description: >
-Indicates additional pages that the crawler should visit.
+Indicates _intermediary_ pages that the crawler should visit.


 For more information, see the [`discoveryPatterns`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
 items:
 $ref: '#/components/schemas/urlPattern'
 fileTypesToMatch:
@@ -986,7 +986,7 @@ components:


 For details, consult the [`recordExtractor`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
 properties:
 __type:
 $ref: '#/components/schemas/configurationRecordExtractorType'
@@ -1017,10 +1017,19 @@ components:
 ignoreCanonicalTo:
 oneOf:
 - type: boolean
-description: |
-Whether to ignore canonical redirects.
+description: >
+Determines if the crawler should extract records from a page with a
+[canonical
+URL](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behaviorr).
+
+
+If ignoreCanonicalTo is set to:
+

-If true, canonical URLs for pages are ignored.
+- `true` all canonical URLs are ignored.
+
+- One or more URL patterns, the crawler will ignore the canonical
+URL if it matches a pattern.
 - type: array
 description: |
 Canonical URLs or URL patterns to ignore.
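
As a rough sketch of the two forms the `ignoreCanonicalTo` schema above allows (a boolean, or an array of canonical URLs or URL patterns), a configuration fragment might look like the following. The URL pattern is a hypothetical placeholder and does not come from the spec.

```yaml
# Form 1: boolean. With `true`, all canonical URLs are ignored.
ignoreCanonicalTo: true
---
# Form 2: array. The crawler ignores a canonical URL if it matches a pattern.
# The pattern below is a hypothetical placeholder.
ignoreCanonicalTo:
  - https://www.example.com/canonical/**
```
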
@@ -2702,10 +2711,12 @@ components:
 type: number
 default: 0
 description: Minimum waiting time in milliseconds.
+example: 7000
 max:
 type: number
 default: 20000
 description: Maximum waiting time in milliseconds.
+example: 15000
 browserRequest:
 type: object
 description: |
@@ -2807,11 +2818,15 @@ components:
 - $ref: '#/components/schemas/oauthRequest'
 renderJavaScript:
 description: >
-Crawl JavaScript-rendered pages with a headless browser.
+If `true`, use a Chrome headless browser to crawl pages.
+

+Because crawling JavaScript-based web pages is slower than crawling
+regular HTML pages, you can apply this setting to a specific list of
+pages.

-For more information, see the [`renderJavaScript`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/render-java-script/).
+Use [micromatch](https://github.com/micromatch/micromatch) to define URL
+patterns, including negations and wildcards.
 oneOf:
 - type: boolean
 description: Whether to render all pages.
@@ -2820,25 +2835,30 @@ components:
 items:
 type: string
 description: URL or URL pattern to render.
-example: https://www.example.com
+example:
+- http://www.mysite.com/dynamic-pages/**
 - title: headlessBrowserConfig
 type: object
 description: Configuration for rendering HTML.
 properties:
 enabled:
 type: boolean
-description: Whether to render matching URLs.
+description: Whether to enable JavaScript rendering.
+example: true
 patterns:
 type: array
 description: URLs or URL patterns to render.
 items:
 type: string
+example:
+- http://www.mysite.com/dynamic-pages/**
 adBlock:
 type: boolean
+default: false
 description: >
-Whether to turn on the built-in adblocker.
+Whether to use the Crawler's ad blocker.

-This blocks most ads and tracking scripts but can break some
+It blocks most ads and tracking scripts but can break some
 sites.
 waitTime:
 $ref: '#/components/schemas/waitTime'
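
To illustrate the three shapes that `renderJavaScript` accepts under the `oneOf` above (boolean, array of URL patterns, or a `headlessBrowserConfig` object), here is a hedged configuration sketch. The URL pattern and the wait times reuse the example values from this diff. The key name `min` is an assumption, because only `max` is visible in the `waitTime` hunk.

```yaml
# Form 1: boolean. Render all pages with the headless browser.
renderJavaScript: true
---
# Form 2: array. Render only pages that match these micromatch patterns.
renderJavaScript:
  - http://www.mysite.com/dynamic-pages/**
---
# Form 3: headlessBrowserConfig object.
renderJavaScript:
  enabled: true
  patterns:
    - http://www.mysite.com/dynamic-pages/**
  adBlock: false   # spec default
  waitTime:
    min: 7000      # milliseconds; key name `min` is assumed
    max: 15000     # milliseconds
```

Because rendering is slower than crawling plain HTML, the second and third forms keep it limited to the pages that need it.
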
@@ -2847,7 +2867,7 @@ components:
 - patterns
 requestOptions:
 type: object
-description: Options to add to all HTTP requests made by the crawler.
+description: Lets you add options to HTTP requests made by the crawler.
 properties:
 proxy:
 type: string
@@ -2898,7 +2918,7 @@ components:


 For more information, see the [`schedule`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/schedule/).
 example: every weekday at 12:00 pm
 Configuration:
 type: object
@@ -2922,7 +2942,7 @@ components:


 For more information, see the [`apiKey`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/api-key/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/apikey/).
 appId:
 $ref: '#/components/schemas/applicationID'
 exclusionPatterns:
@@ -2961,11 +2981,11 @@ components:
 type: array
 maxItems: 9999
 description: >
-URLs from where to start crawling.
-
+The Crawler treats `extraUrls` the same as `startUrls`.

-For more information, see the [`extraUrls`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/extra-urls/).
+Specify `extraUrls` if you want to differentiate between URLs you
+manually added to fix site crawling from those you initially
+specified in `startUrls`.
 items:
 type: string
 ignoreCanonicalTo:
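
The distinction the new `extraUrls` description draws can be sketched as below. Both URLs are hypothetical placeholders, and the sketch assumes `startUrls` is also an array of strings, which this diff doesn't show.

```yaml
# URLs specified when the crawler was first set up (hypothetical).
startUrls:
  - https://www.example.com/
# URLs added manually later to fix site crawling (hypothetical).
# The Crawler treats these the same as startUrls.
extraUrls:
  - https://www.example.com/unlinked-page
```
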
@@ -2977,7 +2997,7 @@ components:


 For more information, see the [`ignoreNoFollowTo`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/ignore-no-follow-to/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/ignorenofollowto/).
 ignoreNoIndex:
 type: boolean
 description: |
@@ -3022,8 +3042,13 @@ components:
 Crawler index settings.


-For more information, see the [`initialIndexSettings`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/initial-index-settings/).
+These index settings are only applied during the first crawl of an
+index.
+
+Any subsequent changes won't be applied to the index.
+
+Instead, make changes to your index settings in the [Algolia
+dashboard](https://dashboard.algolia.com/explorer/configuration/).
 additionalProperties:
 $ref: '#/components/schemas/indexSettings'
 x-additionalPropertiesName: indexName
@@ -3035,7 +3060,7 @@ components:


 For more information, see the [`linkExtractor`
-documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/link-extractor/).
+documentation](https://www.algolia.com/doc/tools/crawler/apis/linkextractor/).
 properties:
 __type:
 $ref: '#/components/schemas/configurationRecordExtractorType'
@@ -3067,11 +3092,18 @@ components:
 maximum: 100
 maxUrls:
 type: number
-description: |
-Maximum number of crawled URLs.
+description: >
+Limits the number of URLs your crawler processes.
+
+
+Change it to a low value, such as 100, for quick crawling tests.
+
+Change it to a higher explicit value for full crawls to prevent it
+from getting "lost" in complex site structures.
+

-Setting `maxUrls` doesn't guarantee consistency between crawls
-because the crawler processes URLs in parallel.
+Because the Crawler works on many pages simultaneously, `maxUrls`
+doesn't guarantee finding the same pages each time it runs.
 minimum: 1
 maximum: 15000000
 rateLimit:
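
The two uses of `maxUrls` that the updated description recommends can be sketched as follows. The value 100 comes from the description. The full-crawl value is a hypothetical illustration that only needs to stay within the schema's bounds of 1 to 15000000.

```yaml
# Quick crawling test: process a small number of URLs.
maxUrls: 100
---
# Full crawl: a higher explicit cap (hypothetical value) so the crawler
# doesn't get "lost" in complex site structures.
maxUrls: 50000
```

Because URLs are processed in parallel, two runs with the same cap may not cover the same set of pages.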

0 commit comments

Comments
 (0)