Document errored pages and Content Signals on /crawl endpoint (#29095)
* Document errored pages and Content Signals on /crawl endpoint
- Add 'Errored and blocked pages' subsection explaining how HTTP errors (402, 403, etc.) are surfaced in crawl results via metadata.status and metadata.html
- Add crawlPurposes parameter to the optional parameters table
- Add 'Content Signals' subsection under Crawler behavior explaining the three signal categories (search, ai-input, ai-train), enforcement behavior, and how to narrow declared purposes
- Add crawlPurposes to the all-optional-parameters example
- Add 'Crawl rejected by Content Signals' troubleshooting entry for the 400 Bad Request error
* Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx
Co-authored-by: Cameron Whiteside <35665916+CameronWhiteside@users.noreply.github.com>
* Address review: fix record field references and remove duplicate description
* Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx
---------
Co-authored-by: Cameron Whiteside <35665916+CameronWhiteside@users.noreply.github.com>
Co-authored-by: Caley Burton <caley@cloudflare.com>
src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx (68 additions, 0 deletions)
@@ -165,6 +165,21 @@ Example response:
}
```

### Errored and blocked pages

If a crawled page returns an HTTP error (such as `402`, `403`, or `500`), the record for that URL will have `"status": "errored"`.

This information is only available in the crawl results (step 2) — the [initiation response](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) only returns the job `id`. Because crawl jobs run asynchronously, the crawler does not fetch page content at initiation time.

To view only errored records, filter by `status=errored`:

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}?status=errored' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
```

The record's `metadata.status` field contains the HTTP status code returned by the origin server, and `metadata.html` contains the response body. This is useful for understanding site owners' intent when they block crawlers — for example, sites using [AI Crawl Control](https://blog.cloudflare.com/ai-crawl-control) may return a custom status code and message.
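Beyond the `status=errored` query filter, errored records can also be separated client-side once results are fetched. A minimal Python sketch, assuming a records array shaped like the fields described above (the sample data and the helper name `errored_records` are illustrative, not real API output):

```python
# Separate errored records from a crawl result set.
# The sample records below are illustrative, not real API output.

def errored_records(records):
    """Return only records whose crawl status is 'errored'."""
    return [r for r in records if r.get("status") == "errored"]

records = [
    {"url": "https://example.com/", "status": "completed"},
    {
        "url": "https://example.com/private",
        "status": "errored",
        # Origin's HTTP status code and response body, per the fields above.
        "metadata": {"status": 403, "html": "<h1>Forbidden</h1>"},
    },
]

for r in errored_records(records):
    print(r["url"], r["metadata"]["status"])  # prints: https://example.com/private 403
```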
## Cancel a crawl job

To cancel a crawl job that is currently in progress, use the job `id` you received:
@@ -194,6 +209,7 @@ The following optional parameters can be used in your crawl request, in addition
| `options.includeSubdomains` | Boolean | If true, follows links to subdomains of the starting URL (default is false). |
| `options.includePatterns` | Array of strings | Only visits URLs that match one of these wildcard patterns. Use `*` to match any characters except `/`, or `**` to match any characters including `/`. |
| `options.excludePatterns` | Array of strings | Does not visit URLs that match any of these wildcard patterns. Use `*` to match any characters except `/`, or `**` to match any characters including `/`. |
| `crawlPurposes` | Array of strings | Declares the intended use of crawled content for [Content Signals](https://contentsignals.org/) enforcement. Allowed values: `search`, `ai-input`, `ai-train`. Default is `["search", "ai-input", "ai-train"]`. If a target site's `robots.txt` includes a `Content-Signal` directive that sets any of your declared purposes to `no`, the crawl request will be rejected with a `400` error. Refer to [Content Signals](#content-signals) for details. |
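Since only three `crawlPurposes` values are accepted, a payload can be sanity-checked before sending. A minimal Python sketch (the helper name `validate_crawl_purposes` is hypothetical; the allowed values come from the table above):

```python
# Check a crawlPurposes value against the allowed Content Signals categories.
ALLOWED_PURPOSES = {"search", "ai-input", "ai-train"}

def validate_crawl_purposes(purposes):
    """Raise ValueError if any purpose is not an allowed value."""
    invalid = set(purposes) - ALLOWED_PURPOSES
    if invalid:
        raise ValueError(f"invalid crawlPurposes: {sorted(invalid)}")
    return list(purposes)

print(validate_crawl_purposes(["search", "ai-input"]))  # → ['search', 'ai-input']
```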
### Pattern behavior
@@ -228,6 +244,7 @@ curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.exampledocs.com/docs/",
    "crawlPurposes": ["search"],
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"],
@@ -439,6 +456,53 @@ The `/crawl` endpoint uses `CloudflareBrowserRenderingCrawler/1.0` as its User-A
For a full list of default User-Agent strings, refer to [Automatic request headers](/browser-rendering/reference/automatic-request-headers/#user-agent).

### Content Signals

The `/crawl` endpoint respects [Content Signals](https://contentsignals.org/) directives found in a target site's `robots.txt` file. Content Signals are a way for site owners to express preferences about how their content can be used by automated systems. For more background, refer to [Giving users choice with Cloudflare's new Content Signals Policy](https://blog.cloudflare.com/content-signals-policy/).

A site owner can include a `Content-Signal` directive in their `robots.txt` to allow or disallow specific categories of use:

- `search` — Building a search index and providing search results with links and excerpts.
- `ai-input` — Inputting content into AI models at query time (for example, retrieval-augmented generation or grounding).
- `ai-train` — Training or fine-tuning AI models.

For example, a `robots.txt` that allows search indexing but disallows AI training:

```txt title="robots.txt"
User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
```
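The directive format shown above can be read mechanically. A minimal Python sketch that extracts `Content-Signal` preferences from a robots.txt body (it handles only the comma-separated `name=yes|no` pairs shown above; anything beyond that is an assumption):

```python
# Parse Content-Signal directives out of a robots.txt body.
def parse_content_signals(robots_txt):
    """Return {signal_name: bool} from 'Content-Signal:' lines."""
    signals = {}
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() != "content-signal":
            continue
        # Directive value is a comma-separated list of name=yes|no pairs.
        for pair in value.split(","):
            name, _, setting = pair.partition("=")
            if name.strip():
                signals[name.strip()] = setting.strip().lower() == "yes"
    return signals

robots = """User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""
print(parse_content_signals(robots))  # → {'search': True, 'ai-train': False}
```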
#### How /crawl enforces Content Signals

By default, `/crawl` declares all three purposes: `["search", "ai-input", "ai-train"]`. If a target site sets any of those content signals to `no`, the crawl request will be rejected at initiation with a `400 Bad Request` error unless you explicitly narrow your declared purposes using the `crawlPurposes` parameter to exclude the disallowed use.

This means:

1. **Site has no Content Signals** — The crawl proceeds normally.
2. **Site has Content Signals, and all your declared purposes are allowed** — The crawl proceeds normally.
3. **Site sets a content signal to `no`, and that purpose is in your `crawlPurposes`** — The crawl request is rejected with a `400` error and the message `Crawl purpose(s) completely disallowed by Content-Signal directive`.

To crawl a site that disallows AI training but allows search, set `crawlPurposes` to only the purposes you need:

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "crawlPurposes": ["search"],
    "formats": ["markdown"]
  }'
```

In this example, because the operator declared only `search` as their purpose, the crawl will succeed even if the site sets `ai-train=no`.

:::note
Content Signals are trust-based. By setting `crawlPurposes`, you are declaring to the site owner how you intend to use the crawled content.
:::
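The three-case enforcement rule above can be sketched as a small decision function. This is an illustration of the documented behavior, not Cloudflare's implementation; the function name and return shape are made up for the example:

```python
# Decide whether a crawl request would pass Content Signals enforcement.
# Documented rule: reject with 400 if any declared purpose is set to 'no';
# sites with no signals, or no conflicting signals, proceed normally.
def check_purposes(declared, site_signals):
    """declared: list of purposes; site_signals: {signal: bool} from robots.txt."""
    disallowed = [p for p in declared if site_signals.get(p) is False]
    if disallowed:
        return (400, "Crawl purpose(s) completely disallowed by Content-Signal directive")
    return (200, "crawl proceeds")

site = {"search": True, "ai-train": False}  # i.e. search=yes, ai-train=no

print(check_purposes(["search", "ai-input", "ai-train"], site)[0])  # → 400 (default purposes)
print(check_purposes(["search"], site)[0])                          # → 200 (narrowed purposes)
```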
## Troubleshooting

### Crawl job returns no results or all URLs are skipped

@@ -449,6 +513,10 @@ If your crawl job completes but returns an empty records array, or all URLs show
- **Pattern filters too restrictive** — Your `includePatterns` may not match any URLs on the site. Try crawling without patterns first to confirm URLs are discoverable, then add patterns.
- **No links found** — The starting URL may not contain links. Try using `source: "sitemaps"`, increasing the `depth` parameter, or setting `includeSubdomains` or `includeExternalLinks` to `true`.

### Crawl rejected by Content Signals

If your crawl request returns a `400 Bad Request` with the message `Crawl purpose(s) completely disallowed by Content-Signal directive`, the target site's `robots.txt` includes a `Content-Signal` directive that disallows one or more of your declared `crawlPurposes`. To resolve this, check the site's `robots.txt` for `Content-Signal:` entries and set `crawlPurposes` to only the purposes you need. For example, if the site sets `ai-train=no` and you only need search indexing, use `"crawlPurposes": ["search"]`. Refer to [Content Signals](#content-signals) for details.

### Crawl job takes too long

If a crawl job remains in `running` status for an extended period: