specs/crawler/common/schemas/action.yml
+40 −48 (40 additions, 48 deletions)
@@ -1,38 +1,37 @@
 Action:
   type: object
-  description: Instructions about how to process crawled URLs.
+  description: |
+    How to process crawled URLs.
+
+    Each action defines:
+
+    - The targeted subset of URLs it processes.
+    - What information to extract from the web pages.
+    - The Algolia indices where the extracted records will be stored.
+
+    If a single web page matches several actions,
+    one record is generated for each action.
   properties:
     autoGenerateObjectIDs:
       type: boolean
-      description: |
-        Whether to generate `objectID` properties for each extracted record.
-
-        If false, you must manually add `objectID` properties to the extracted records.
+      description: Whether to generate an `objectID` for records that don't have one.
       default: true
     cache:
       $ref: '#/cache'
     discoveryPatterns:
       type: array
       description: |
-        Patterns for additional pages to visit to find links without extracting records.
+        Indicates additional pages that the crawler should visit.

-        The crawler looks for matching pages and crawls them for links, but doesn't extract records from the (intermediate) pages themselves.
+        For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/discovery-patterns/).
       items:
         $ref: '#/urlPattern'
     fileTypesToMatch:
       type: array
       description: |
         File types for crawling non-HTML documents.

-        Non-HTML documents are first converted to HTML by an [Apache Tika](https://tika.apache.org/) server.
-
-        Crawling non-HTML documents has the following limitations:
-
-        - It's slower than crawling HTML documents.
-        - PDFs must include the used fonts.
-        - The produced HTML pages might not be semantic. This makes achieving good relevance more difficult.
-        - Natural language detection isn't supported.
-        - Extracted metadata might vary between files produced by different programs and versions.
+        For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
       maxItems: 100
       items:
         $ref: '#/fileTypes'
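
For orientation, here's a rough sketch of how the properties described in this hunk fit together in a single action of a JavaScript crawler configuration. The index name, URL pattern, and selectors are placeholders, and the `recordExtractor` signature is assumed from the Crawler docs rather than taken from this spec:

```js
// Hypothetical action: all values are placeholders, shown only to
// illustrate how the schema properties above relate to each other.
const action = {
  indexName: 'algolia_website',                       // records go to this index (plus any indexPrefix)
  pathsToMatch: ['https://www.algolia.com/blog/**'],  // the subset of URLs this action processes
  autoGenerateObjectIDs: true,                        // add an objectID to records that don't have one
  recordExtractor: ({ url, $ }) => [
    // What to extract from each matching page.
    { url: url.href, title: $('title').text() },
  ],
};
```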
@@ -47,8 +46,8 @@ Action:
       type: string
       maxLength: 256
       description: |
-        Index name where to store the extracted records from this action.
-        The name is combined with the prefix you specified in the `indexPrefix` option.
+        Reference to the index used to store the action's extracted records.
+        `indexName` is combined with the prefix you specified in `indexPrefix`.
       example: algolia_website
     name:
       type: string
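
As a tiny worked example with made-up values: the prefix from the configuration is prepended to the action's `indexName` to form the index the records are written to.

```js
// Hypothetical values illustrating how indexPrefix and indexName combine.
const indexPrefix = 'crawler_';                 // set once for the whole configuration
const indexName = 'algolia_website';            // set per action
const fullIndexName = indexPrefix + indexName;  // 'crawler_algolia_website'
```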
@@ -57,24 +56,29 @@ Action:
       $ref: '#/pathAliases'
     pathsToMatch:
       type: array
-      description: Patterns for URLs to which this action should apply.
+      description: |
+        URLs to which this action should apply.
+
+        Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
       minItems: 1
       maxItems: 100
       items:
         $ref: '#/urlPattern'
     recordExtractor:
       title: recordExtractor
       type: object
-      description: Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
+      description: |
+        Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
+        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
+
+        For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
       properties:
         __type:
           $ref: '#/configurationRecordExtractorType'
         source:
           type: string
           description: |
-            JavaScript function (as a string) for extracting information from a crawled page and transforming it into Algolia records for indexing.
-            The [Crawler dashboard](https://crawler.algolia.com/admin) has an editor with autocomplete and validation,
-            which makes editing the `recordExtractor` property easier.
+            A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
     selectorsToMatch:
       type: array
       description: |
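
To make the `source` property concrete, here's a minimal sketch of such a function. It assumes the crawler passes the page `url` and a Cheerio-style `$` selector to the extractor, which is how the Crawler docs describe it; treat the exact parameters as an assumption here.

```js
// Hypothetical recordExtractor: returns one Algolia record per crawled page.
// `url` and the Cheerio-like `$` are assumed to be injected by the crawler.
function recordExtractor({ url, $ }) {
  return [
    {
      objectID: url.href, // optional if autoGenerateObjectIDs is true
      title: $('head title').text().trim(),
      description: $('meta[name="description"]').attr('content') || '',
    },
  ];
}
```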
@@ -107,13 +111,8 @@ fileTypes:
   type: string
   description: |
     Supported file type for indexing non-HTML documents.
-    A single type can match multiple file formats:
-
-    - `doc`: `.doc`, `.docx`
-    - `ppt`: `.ppt`, `.pptx`
-    - `xls`: `.xls`, `.xlsx`
-
-    The `email` type supports crawling Microsoft Outlook mail message (`.msg`) documents.
+
+    For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
     - email
@@ -129,19 +128,19 @@ urlPattern:
   type: string
   description: |
     Pattern for matching URLs.
-    Wildcards and negations are supported via the [micromatch](https://github.com/micromatch/micromatch) library.
+
+    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**

 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
   description: |
-    Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.
+    Key-value pairs to replace matching hostnames found in a sitemap,
+    on a page, in canonical links, or redirects.

-    The crawler continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace hostnames found in extracted text.
+    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/hostname-aliases/).
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
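
Two small sketches of how these settings are typically used (patterns and hostnames are placeholders): micromatch-style URL patterns, and a hostname mapping along the lines of the `dev.example.com` example above.

```js
// Hypothetical micromatch-style patterns: '**' matches any path,
// and a leading '!' negates a pattern.
const pathsToMatch = [
  'https://www.algolia.com/**',          // every page on the site
  '!https://www.algolia.com/search/**',  // except the search pages
];

// Hypothetical hostnameAliases mapping: URLs discovered on the staging
// hostname are rewritten to the production hostname before crawling continues.
const hostnameAliases = {
  'dev.example.com': 'example.com',
};
```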
@@ -154,10 +153,13 @@ pathAliases:
       '/foo': '/bar'
   description: |
     Key-value pairs to replace matching paths with new values.
+
+    It doesn't replace:
+
+    - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
+    - Paths found in extracted text.

     The crawl continues from the _transformed_ URLs.
-    The mapping doesn't transform URLs listed in the `startUrls`, `siteMaps`, `pathsToMatch`, and other settings.
-    The mapping also doesn't replace paths found in extracted text.
   additionalProperties:
     type: object
     description: Hostname for which matching paths should be replaced.
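
A sketch of the shape described by `additionalProperties` (hostname and paths are placeholders): the outer key is a hostname, and the inner object maps matching paths to their replacements.

```js
// Hypothetical pathAliases mapping: on example.com, paths starting with
// /foo are rewritten to /bar, and the crawl continues from the rewritten URL.
const pathAliases = {
  'example.com': {
    '/foo': '/bar',
  },
};
```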
@@ -172,17 +174,7 @@ cache:
   description: |
     Whether the crawler should cache crawled pages.

-    With caching, the crawler only crawls changed pages.
-    To detect changed pages, the crawler makes [HTTP conditional requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests) to your pages.
-    The crawler uses the `ETag` and `Last-Modified` response headers returned by your web server during the previous crawl.
-    The crawler sends this information in the `If-None-Match` and `If-Modified-Since` request headers.
-
-    If your web server responds with `304 Not Modified` to the conditional request, the crawler reuses the records from the previous crawl.
-
-    Caching is ignored in these cases:
-
-    - If your crawler configuration changed between two crawls.
-    - If `externalData` changed between two crawls.
+    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/cache/).
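
For background on the mechanism the removed description refers to (HTTP conditional requests with `ETag` and `Last-Modified`), here's a rough sketch of the exchange; this is illustrative only, not the crawler's own code.

```js
// Rough sketch of an HTTP conditional request: send the validators saved
// from the previous crawl and check whether the server answers 304.
async function isUnchanged(pageUrl, previousCrawl) {
  const response = await fetch(pageUrl, {
    headers: {
      'If-None-Match': previousCrawl.etag,             // ETag from the last crawl
      'If-Modified-Since': previousCrawl.lastModified, // Last-Modified from the last crawl
    },
  });
  // 304 Not Modified: the page hasn't changed, so previous records can be reused.
  return response.status === 304;
}
```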