Commit b20f303

Gary Conroy authored and committed
fix(specs): New Crawler API parameter - ignorePaginationAttributes
1 parent 1251112 commit b20f303

File tree

3 files changed: +282 -39 lines changed

specs/crawler/common/schemas/action.yml

Lines changed: 67 additions & 9 deletions
@@ -3,10 +3,14 @@ Action:
   description: |
     How to process crawled URLs.

+
     Each action defines:

+
     - The targeted subset of URLs it processes.
+
     - What information to extract from the web pages.
+
     - The Algolia indices where the extracted records will be stored.

     If a single web page matches several actions,
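The three parts of an action can be sketched as a minimal JavaScript configuration object. The index name, URL pattern, and extractor signature below are illustrative assumptions, not the crawler's exact API (the real `recordExtractor` receives richer arguments, such as a Cheerio-like `$` helper):

```javascript
// Hypothetical crawler action: names follow this spec's parameters,
// but the extractor input is simplified for illustration.
const action = {
  indexName: 'docs',                             // where extracted records are stored
  pathsToMatch: ['https://example.com/docs/**'], // the targeted subset of URLs
  // Turns one crawled page into an array of Algolia records.
  recordExtractor: ({ url, title }) => [{ objectID: url, title }],
};

// Simulate extraction for one matched page.
const records = action.recordExtractor({
  url: 'https://example.com/docs/getting-started',
  title: 'Getting started',
});
```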
@@ -21,16 +25,23 @@ Action:
   discoveryPatterns:
     type: array
     description: |
-      Indicates _intermediary_ pages that the crawler should visit.
+      Which _intermediary_ web pages the crawler should visit.
+      Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages, _not_ their content.
+

-      For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
+      It functions similarly to the `pathsToMatch` action but without record extraction.
+
+
+      `discoveryPatterns` uses [micromatch](https://github.com/micromatch/micromatch) to support matching with wildcards, negation, and other features.
+      The crawler adds all matching URLs to its queue.
     items:
       $ref: '#/urlPattern'
   fileTypesToMatch:
     type: array
     description: |
       File types for crawling non-HTML documents.

+
       For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
     maxItems: 100
     items:
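To illustrate how micromatch-style patterns select URLs, here is a minimal matching sketch covering only `*`, `**`, and a leading `!` for negation. micromatch itself supports much more (braces, extglobs, and so on), so treat this as an approximation rather than the crawler's actual matcher:

```javascript
// Convert a simplified glob to a RegExp: `**` matches across `/`,
// a single `*` stops at `/`. This is a sketch, not micromatch itself.
function globToRegExp(glob) {
  // Escape regex metacharacters, leaving the glob wildcards intact.
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const body = escaped
    .replace(/\*\*/g, '\u0000')   // protect `**` while rewriting `*`
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${body}$`);
}

// Last matching pattern wins; a pattern starting with `!` excludes matches.
function matchesPatterns(url, patterns) {
  let matched = false;
  for (const pattern of patterns) {
    const negated = pattern.startsWith('!');
    const re = globToRegExp(negated ? pattern.slice(1) : pattern);
    if (re.test(url)) matched = !negated;
  }
  return matched;
}

matchesPatterns('https://www.algolia.com/blog/post', ['https://www.algolia.com/**']); // → true
```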
@@ -59,6 +70,7 @@ Action:
     description: |
       URLs to which this action should apply.

+
       Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
     minItems: 1
     maxItems: 100
@@ -69,9 +81,12 @@ Action:
     type: object
     description: |
       Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
-      The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.

-      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
+
+      The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
+
+
+      For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
     properties:
       __type:
         $ref: '#/configurationRecordExtractorType'
@@ -110,7 +125,8 @@ ActionSchedule:
   fileTypes:
     type: string
     description: |
-      Supported file type for indexing non-HTML documents.
+      Supported file types for indexing non-HTML documents.
+

       For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
     enum:
@@ -129,6 +145,7 @@ urlPattern:
  description: |
    Pattern for matching URLs.

+
    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
  example: https://www.algolia.com/**

@@ -140,7 +157,43 @@ hostnameAliases:
    Key-value pairs to replace matching hostnames found in a sitemap,
    on a page, in canonical links, or redirects.

-    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
+
+    During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.
+    This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).
+
+
+    For example, with this `hostnameAliases` mapping:
+
+    {
+      hostnameAliases: {
+        'dev.example.com': 'example.com'
+      }
+    }
+
+    1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.
+
+    1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.
+
+    1. The crawler follows the transformed URL (not the original).
+
+
+    **`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**
+
+
+    The crawler can discover URLs in places such as:
+
+
+    - Crawled pages
+
+    - Sitemaps
+
+    - [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)
+
+    - Redirects.
+
+
+    However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,
+    nor does it affect the `pathsToMatch` action or other configuration elements.
  additionalProperties:
    type: string
    description: Hostname that should be added in the records.
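The hostname replacement in the example mapping above can be sketched with Node's built-in `URL` class. This is an illustration of the described behavior, not the crawler's internal implementation:

```javascript
// hostnameAliases mapping from the example above.
const hostnameAliases = { 'dev.example.com': 'example.com' };

// Replace the hostname of a URL if it appears in the alias map.
// Only the URL changes; extracted page text is untouched.
function applyHostnameAliases(rawUrl, aliases) {
  const url = new URL(rawUrl);
  if (aliases[url.hostname]) {
    url.hostname = aliases[url.hostname];
  }
  return url.toString();
}

applyHostnameAliases('https://dev.example.com/solutions/voice-search/', hostnameAliases);
// → 'https://example.com/solutions/voice-search/'
```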
@@ -153,12 +206,16 @@ pathAliases:
      '/foo': '/bar'
  description: |
    Key-value pairs to replace matching paths with new values.
+

    It doesn't replace:
-
+
+
    - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
+
    - Paths found in extracted text.

+
    The crawl continues from the _transformed_ URLs.
  additionalProperties:
    type: object
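A path replacement along these lines can be sketched as follows. The hostname-to-path-prefix shape is an assumption inferred from the schema (a map of objects), and the prefix-matching rule is illustrative; the crawler's actual matching may differ:

```javascript
// Assumed pathAliases shape: hostname → { pathPrefix: replacement }.
const pathAliases = { 'example.com': { '/foo': '/bar' } };

// Rewrite the first matching path prefix for the URL's hostname.
function applyPathAliases(rawUrl, aliases) {
  const url = new URL(rawUrl);
  const prefixes = aliases[url.hostname] || {};
  for (const [from, to] of Object.entries(prefixes)) {
    if (url.pathname.startsWith(from)) {
      url.pathname = to + url.pathname.slice(from.length);
      break;
    }
  }
  return url.toString();
}

applyPathAliases('https://example.com/foo/page', pathAliases);
// → 'https://example.com/bar/page'
```

The crawl would then continue from the transformed URL, matching the "It doesn't replace" caveats above: URLs in settings and paths in extracted text are left alone.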
@@ -172,9 +229,10 @@ pathAliases:
  cache:
    type: object
    description: |
-      Whether the crawler should cache crawled pages.
+      Whether the crawler should cache crawled pages.
+

-      For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
+      For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
    properties:
      enabled:
        type: boolean
