
Commit ad0ff0a

gazconroy (Gary Conroy) and shortcuts authored
fix(specs): Update Crawler spec (#4415)
Co-authored-by: Gary Conroy <[email protected]> Co-authored-by: Clément Vannicatte <[email protected]>
1 parent ab2bc48 commit ad0ff0a

File tree: 9 files changed (+125, -139 lines)


specs/crawler/common/parameters.yml

Lines changed: 10 additions & 4 deletions
@@ -17,7 +17,7 @@ TaskIdParameter:
 CrawlerVersionParameter:
   name: version
   in: path
-  description: The version of the targeted Crawler revision.
+  description: This crawler's version number.
   required: true
   schema:
     type: integer
@@ -88,7 +88,7 @@ UrlsCrawledGroup:
       description: Number of URLs with this status.
     readable:
       type: string
-      description: Readable representation of the reason for the status message.
+      description: Reason for this status.
   example:
     status: SKIPPED
     reason: forbidden_by_robotstxt
@@ -98,15 +98,21 @@ UrlsCrawledGroup:
 
 urlsCrawledGroupStatus:
   type: string
-  description: Status of crawling these URLs.
+  description: |
+    Crawled URL status.
+
+    For more information, see [Troubleshooting by crawl status](https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status/).
   enum:
     - DONE
     - SKIPPED
     - FAILED
 
 urlsCrawledGroupCategory:
   type: string
-  description: Step where the status information was generated.
+  description: |
+    Step where the status information was generated.
+
+    For more information, see [Troubleshooting by crawl status](https://www.algolia.com/doc/tools/crawler/troubleshooting/crawl-status/).
   enum:
     - fetch
     - extraction
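
For orientation, a single crawled-URLs group matching this schema could look like the sketch below. The enum values and the SKIPPED/forbidden_by_robotstxt pair come from the spec above; the remaining values are illustrative only.

# Hypothetical UrlsCrawledGroup object (values other than the enums are made up).
status: SKIPPED                      # urlsCrawledGroupStatus: DONE, SKIPPED, or FAILED
category: fetch                      # urlsCrawledGroupCategory: step that produced the status
reason: forbidden_by_robotstxt       # machine-readable reason, from the spec's example
readable: blocked by robots.txt      # illustrative human-readable reason
count: 12                            # number of URLs with this status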

specs/crawler/common/schemas/action.yml

Lines changed: 40 additions & 48 deletions
@@ -1,38 +1,37 @@
 Action:
   type: object
-  description: Instructions about how to process crawled URLs.
+  description: |
+    How to process crawled URLs.
+
+    Each action defines:
+
+    - The targeted subset of URLs it processes.
+    - What information to extract from the web pages.
+    - The Algolia indices where the extracted records will be stored.
+
+    If a single web page matches several actions,
+    one record is generated for each action.
   properties:
     autoGenerateObjectIDs:
       type: boolean
-      description: |
-        Whether to generate `objectID` properties for each extracted record.
-
-        If false, you must manually add `objectID` properties to the extracted records.
+      description: Whether to generate an `objectID` for records that don't have one.
       default: true
     cache:
       $ref: '#/cache'
     discoveryPatterns:
       type: array
       description: |
-        Patterns for additional pages to visit to find links without extracting records.
+        Indicates additional pages that the crawler should visit.
 
-        The crawler looks for matching pages and crawls them for links, but doesn't extract records from the (intermediate) pages themselves.
+        For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
       items:
         $ref: '#/urlPattern'
     fileTypesToMatch:
       type: array
       description: |
         File types for crawling non-HTML documents.
 
-        Non-HTML documents are first converted to HTML by an [Apache Tika](https://tika.apache.org/) server.
-
-        Crawling non-HTML documents has the following limitations:
-
-        - It's slower than crawling HTML documents.
-        - PDFs must include the used fonts.
-        - The produced HTML pages might not be semantic. This makes achieving good relevance more difficult.
-        - Natural language detection isn't supported.
-        - Extracted metadata might vary between files produced by different programs and versions.
+        For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
       maxItems: 100
       items:
         $ref: '#/fileTypes'
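
As a quick illustration of the properties in this hunk, a hypothetical fragment of a single action might set them like this (URLs and values are made up; `doc` is one of the spec's file types):

# Hypothetical fragment of one action; property names come from this schema.
autoGenerateObjectIDs: true               # add an `objectID` to records that lack one
discoveryPatterns:
  - https://www.example.com/sitemap/**    # visited for links only; no records are extracted
fileTypesToMatch:
  - doc                                   # also crawl non-HTML documents of this file type
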
@@ -47,8 +46,8 @@ Action:
       type: string
       maxLength: 256
       description: |
-        Index name where to store the extracted records from this action.
-        The name is combined with the prefix you specified in the `indexPrefix` option.
+        Reference to the index used to store the action's extracted records.
+        `indexName` is combined with the prefix you specified in `indexPrefix`.
       example: algolia_website
     name:
       type: string
@@ -57,24 +56,29 @@ Action:
       $ref: '#/pathAliases'
     pathsToMatch:
       type: array
-      description: Patterns for URLs to which this action should apply.
+      description: |
+        URLs to which this action should apply.
+
+        Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
       minItems: 1
       maxItems: 100
       items:
         $ref: '#/urlPattern'
     recordExtractor:
       title: recordExtractor
       type: object
-      description: Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
+      description: |
+        Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
+        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
+
+        For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
       properties:
         __type:
           $ref: '#/configurationRecordExtractorType'
         source:
           type: string
           description: |
-            JavaScript function (as a string) for extracting information from a crawled page and transforming it into Algolia records for indexing.
-            The [Crawler dashboard](https://crawler.algolia.com/admin) has an editor with autocomplete and validation,
-            which makes editing the `recordExtractor` property easier.
+            A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
     selectorsToMatch:
       type: array
       description: |
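
Combining the properties from the last two hunks, a hypothetical action could target pages with a micromatch pattern and extract one record per page. The `source` function body below is illustrative only, and `__type: function` is an assumed value rather than something taken from this diff:

# Hypothetical action showing indexName, pathsToMatch, and recordExtractor.
- indexName: algolia_website              # combined with the configured `indexPrefix`
  pathsToMatch:
    - https://www.algolia.com/**          # micromatch: wildcards and negation supported
  recordExtractor:
    __type: function                      # assumed configurationRecordExtractorType value
    source: |
      // JavaScript shipped as a string; must return one or more Algolia records.
      () => [{ objectID: 'example-page', title: 'Example page' }]
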
@@ -107,13 +111,8 @@ fileTypes:
   type: string
   description: |
     Supported file type for indexing non-HTML documents.
-    A single type can match multiple file formats:
-
-    - `doc`: `.doc`, `.docx`
-    - `ppt`: `.ppt`, `.pptx`
-    - `xls`: `.xls`, `.xlsx`
-
-    The `email` type supports crawling Microsoft Outlook mail message (`.msg`) documents.
+
+    For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
     - email
@@ -129,19 +128,19 @@ urlPattern:
   type: string
   description: |
     Pattern for matching URLs.
-    Wildcards and negations are supported via the [micromatch](https://github.com/micromatch/micromatch) library.
+
+    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
 
 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
   description: |
-    Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.
+    Key-value pairs to replace matching hostnames found in a sitemap,
+    on a page, in canonical links, or redirects.
 
-    The crawler continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace hostnames found in extracted text.
+    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
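
For example, micromatch patterns can combine wildcards with `!` negation, and `hostnameAliases` rewrites discovered hostnames so the crawl continues from the transformed URLs (the `dev.example.com` pair is the spec's own example; the negated pattern is hypothetical):

# urlPattern examples: micromatch wildcards and negation.
pathsToMatch:
  - https://www.algolia.com/**            # spec example: every page under this host
  - '!https://www.algolia.com/fr/**'      # hypothetical exclusion using micromatch negation
# hostnameAliases example: replace hostnames found while crawling.
hostnameAliases:
  'dev.example.com': 'example.com'
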
@@ -154,10 +153,13 @@ pathAliases:
     '/foo': '/bar'
   description: |
     Key-value pairs to replace matching paths with new values.
+
+    It doesn't replace:
+
+    - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
+    - Paths found in extracted text.
 
     The crawl continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace paths found in extracted text.
   additionalProperties:
     type: object
     description: Hostname for which matching paths should be replaced.
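
Read together with `additionalProperties`, the schema above maps a hostname to its path replacements. A small, hypothetical example (the `/foo`/`/bar` pair is the spec's own example value; the hostname is made up):

# Hypothetical pathAliases entry: hostname key, then path replacements.
pathAliases:
  'example.com':
    '/foo': '/bar'                        # on example.com, the crawl continues from /bar instead of /foo
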
@@ -172,17 +174,7 @@ cache:
   description: |
     Whether the crawler should cache crawled pages.
 
-    With caching, the crawler only crawls changed pages.
-    To detect changed pages, the crawler makes [HTTP conditional requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests) to your pages.
-    The crawler uses the `ETag` and `Last-Modified` response headers returned by your web server during the previous crawl.
-    The crawler sends this information in the `If-None-Match` and `If-Modified-Since` request headers.
-
-    If your web server responds with `304 Not Modified` to the conditional request, the crawler reuses the records from the previous crawl.
-
-    Caching is ignored in these cases:
-
-    - If your crawler configuration changed between two crawls.
-    - If `externalData` changed between two crawls.
+    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
   properties:
     enabled:
       type: boolean
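
The details this hunk delegates to the linked `cache` documentation amount to HTTP conditional requests; a minimal sketch of the setting itself, with that mechanism summarised in comments:

# Minimal cache setting for this schema. With caching enabled, the crawler sends
# `If-None-Match` / `If-Modified-Since` headers built from the `ETag` and
# `Last-Modified` values of the previous crawl; a `304 Not Modified` response
# lets it reuse the records from that crawl.
cache:
  enabled: true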
