Commit c207593
fix: whitespace
1 parent 17bbe04 commit c207593

2 files changed: +43 −171 lines changed

specs/crawler/common/schemas/action.yml

Lines changed: 8 additions & 66 deletions
@@ -3,14 +3,10 @@ Action:
   description: |
     How to process crawled URLs.
 
-
     Each action defines:
 
-
     - The targeted subset of URLs it processes.
-
     - What information to extract from the web pages.
-
     - The Algolia indices where the extracted records will be stored.
 
     If a single web page matches several actions,
@@ -26,22 +22,19 @@ Action:
       type: array
       description: |
         Which _intermediary_ web pages the crawler should visit.
-        Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages, _not_ their content.
-
-
+        Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages,
+        _not_ their content.
         It functions similarly to the `pathsToMatch` action but without record extraction.
 
-
-        Uses [micromatch](https://github.com/micromatch/micromatch) to match wildcards, negation, and other features.
-        The crawler adds all matching URLs to its queue.
+        `discoveryPatterns` uses [micromatch](https://github.com/micromatch/micromatch) to support matching with wildcards,
+        negation, and other features.
       items:
         $ref: '#/urlPattern'
     fileTypesToMatch:
       type: array
       description: |
         File types for crawling non-HTML documents.
 
-
         For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
       maxItems: 100
       items:
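The glob-style matching described for `discoveryPatterns` and `pathsToMatch` can be approximated with Python's standard-library `fnmatch`. This is a hedged sketch: micromatch supports negation, braces, and richer `**` semantics that `fnmatch` lacks, and the URLs, pattern list, and `is_discovered` helper here are illustrative, not part of the crawler's API.

```python
from fnmatch import fnmatch

# Patterns as they might appear in `discoveryPatterns` or `pathsToMatch`.
# fnmatch's `*` matches any run of characters (including `/`), which loosely
# approximates micromatch's `**`; real micromatch semantics are richer.
patterns = ["https://www.algolia.com/*"]

def is_discovered(url: str, patterns: list[str]) -> bool:
    """Return True if a URL matches any configured pattern (approximation)."""
    return any(fnmatch(url, p) for p in patterns)

print(is_discovered("https://www.algolia.com/doc/tools/crawler/", patterns))  # True
print(is_discovered("https://example.com/blog/", patterns))                   # False
```

A URL that matches only a discovery pattern would be visited for its links but produce no records.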
@@ -70,7 +63,6 @@ Action:
       description: |
         URLs to which this action should apply.
 
-
         Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
       minItems: 1
       maxItems: 100
@@ -82,11 +74,8 @@ Action:
       description: |
         Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
 
-
         The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
-
-
-        For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
+        For details, see the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
       properties:
         __type:
           $ref: '#/configurationRecordExtractorType'
@@ -127,7 +116,6 @@ fileTypes:
   description: |
     Supported file types for indexing non-HTML documents.
 
-
     For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
@@ -145,55 +133,14 @@ urlPattern:
   description: |
     Pattern for matching URLs.
 
-
     Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
 
 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
-  description: |
-    Key-value pairs to replace matching hostnames found in a sitemap,
-    on a page, in canonical links, or redirects.
-
-
-    During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.
-    This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).
-
-
-    For example, with this `hostnameAliases` mapping:
-
-        {
-          hostnameAliases: {
-            'dev.example.com': 'example.com'
-          }
-        }
-
-    1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.
-
-    1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.
-
-    1. The crawler follows the transformed URL (not the original).
-
-
-    **`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**
-
-
-    The crawler can discover URLs in places such as:
-
-
-    - Crawled pages
-
-    - Sitemaps
-
-    - [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)
-
-    - Redirects.
-
-
-    However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,
-    nor does it affect the `pathsToMatch` action or other configuration elements.
+  description: "Key-value pairs to replace matching hostnames found in a sitemap,\non a page, in canonical links, or redirects.\n\n\nDuring a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.\nThis helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).\n\n\nFor example, with this `hostnameAliases` mapping:\n\n    {\n      hostnameAliases: {\n        'dev.example.com': 'example.com'\n      }\n    }\n\n1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.\n\n1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.\n\n1. The crawler follows the transformed URL (not the original).\n\n\n**`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**\n\n\nThe crawler can discover URLs in places such as:\n\n\n- Crawled pages\n\n- Sitemaps\n\n- [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)\n\n- Redirects.\n\n\nHowever, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,\nnor does it affect the `pathsToMatch` action or other configuration elements.\n"
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
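The three numbered steps in the `hostnameAliases` description can be sketched as a URL rewrite using Python's standard library. A minimal illustration only, not the crawler's implementation; the `apply_hostname_aliases` name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# The mapping from the spec's own example.
hostname_aliases = {"dev.example.com": "example.com"}

def apply_hostname_aliases(url: str, aliases: dict[str, str]) -> str:
    """Rewrite the hostname of a discovered URL; path, query, and fragment
    are untouched, and page text is never modified."""
    parts = urlsplit(url)
    host = aliases.get(parts.hostname, parts.hostname)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(apply_hostname_aliases("https://dev.example.com/solutions/voice-search/", hostname_aliases))
# https://example.com/solutions/voice-search/
```

As the spec notes, this kind of rewrite applies to discovered URLs (pages, sitemaps, canonical links, redirects), not to URLs set explicitly in `startUrls` or `sitemaps`.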
@@ -207,15 +154,11 @@ pathAliases:
   description: |
     Key-value pairs to replace matching paths with new values.
 
-
     It doesn't replace:
 
-
     - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
-
     - Paths found in extracted text.
 
-
     The crawl continues from the _transformed_ URLs.
 
 
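The `pathAliases` behavior described above can be sketched the same way as a prefix rewrite on the URL path. The nested `{hostname: {path prefix: replacement}}` shape and the helper name are assumptions for illustration; consult the `pathAliases` reference for the real structure.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping shape: {hostname: {path prefix: replacement}}.
path_aliases = {"example.com": {"/en/": "/"}}

def apply_path_aliases(url: str, aliases: dict) -> str:
    """Rewrite a matching path prefix; startUrls, sitemaps, pathsToMatch,
    and extracted text are NOT rewritten, per the description above."""
    parts = urlsplit(url)
    for prefix, replacement in aliases.get(parts.hostname, {}).items():
        if parts.path.startswith(prefix):
            new_path = replacement + parts.path[len(prefix):]
            return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))
    return url

print(apply_path_aliases("https://example.com/en/docs/", path_aliases))
# https://example.com/docs/
```

The crawl would then continue from the transformed URL, matching the "crawl continues from the _transformed_ URLs" note.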
@@ -237,10 +180,9 @@ pathAliases:
 cache:
   type: object
   description: |
-    Whether the crawler should cache crawled pages.
-
+    Whether the crawler should cache crawled pages.
 
-    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
+    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
   properties:
     enabled:
       type: boolean
