Commit 0d72d53

Authored by kathayl, CameronWhiteside, and caley-b

Document errored pages and Content Signals on /crawl endpoint (#29095)

* Document errored pages and Content Signals on /crawl endpoint
  - Add "Errored and blocked pages" subsection explaining how HTTP errors (402, 403, etc.) are surfaced in crawl results via `metadata.status` and `metadata.html`
  - Add `crawlPurposes` parameter to the optional parameters table
  - Add "Content Signals" subsection under Crawler behavior explaining the three signal categories (search, ai-input, ai-train), enforcement behavior, and how to narrow declared purposes
  - Add `crawlPurposes` to the all-optional-parameters example
  - Add "Crawl rejected by Content Signals" troubleshooting entry for the 400 Bad Request error
* Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx
* Address review: fix record field references and remove duplicate description
* Update src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx

Co-authored-by: Cameron Whiteside <35665916+CameronWhiteside@users.noreply.github.com>
Co-authored-by: Caley Burton <caley@cloudflare.com>

1 parent 770fe91 commit 0d72d53

File tree

1 file changed, +68 -0 lines changed

src/content/docs/browser-rendering/rest-api/crawl-endpoint.mdx

Lines changed: 68 additions & 0 deletions
@@ -165,6 +165,21 @@ Example response:
}
```

### Errored and blocked pages

If a crawled page returns an HTTP error (such as `402`, `403`, or `500`), the record for that URL will have `"status": "errored"`.

This information is only available in the crawl results (step 2) — the [initiation response](/browser-rendering/rest-api/crawl-endpoint/#initiate-the-crawl-job) only returns the job `id`. Because crawl jobs run asynchronously, the crawler does not fetch page content at initiation time.

To view only errored records, filter by `status=errored`:

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}?status=errored' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
```

The record's `metadata.status` field contains the HTTP status code returned by the origin server, and `metadata.html` contains the response body. This is useful for understanding site owners' intent when they block crawlers — for example, sites using [AI Crawl Control](https://blog.cloudflare.com/ai-crawl-control) may return a custom status code and message.
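As a worked example, the `status=errored` filter and the `metadata` fields can be combined with `jq` to list blocked URLs alongside the origin status code. The sketch below runs against an inline sample payload because the exact response envelope is not shown on this page; the field layout (a `result` array of records with `metadata.status` and `metadata.html`) is assumed for illustration.

```bash
# Sketch only: the response envelope below is a guess for illustration;
# each errored record carries "status": "errored" plus metadata.status
# (origin HTTP code) and metadata.html (origin response body).
sample='{
  "result": [
    {"url": "https://example.com/a", "status": "completed"},
    {"url": "https://example.com/b", "status": "errored",
     "metadata": {"status": 403, "html": "<h1>Forbidden</h1>"}}
  ]
}'

# Print "<origin status> <url>" for every errored record.
errored=$(printf '%s' "$sample" |
  jq -r '.result[] | select(.status == "errored") | "\(.metadata.status) \(.url)"')
echo "$errored"
```

With a live job, you would pipe the output of the `curl` request above into the same `jq` filter.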
## Cancel a crawl job

To cancel a crawl job that is currently in progress, use the job `id` you received:
@@ -194,6 +209,7 @@ The following optional parameters can be used in your crawl request, in addition
| `options.includeSubdomains` | Boolean | If true, follows links to subdomains of the starting URL (default is false). |
| `options.includePatterns` | Array of strings | Only visits URLs that match one of these wildcard patterns. Use `*` to match any characters except `/`, or `**` to match any characters including `/`. |
| `options.excludePatterns` | Array of strings | Does not visit URLs that match any of these wildcard patterns. Use `*` to match any characters except `/`, or `**` to match any characters including `/`. |
| `crawlPurposes` | Array of strings | Declares the intended use of crawled content for [Content Signals](https://contentsignals.org/) enforcement. Allowed values: `search`, `ai-input`, `ai-train`. Default is `["search", "ai-input", "ai-train"]`. If a target site's `robots.txt` includes a `Content-Signal` directive that sets any of your declared purposes to `no`, the crawl request will be rejected with a `400` error. Refer to [Content Signals](#content-signals) for details. |

### Pattern behavior
@@ -228,6 +244,7 @@ curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.exampledocs.com/docs/",
    "crawlPurposes": ["search"],
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"],
@@ -439,6 +456,53 @@ The `/crawl` endpoint uses `CloudflareBrowserRenderingCrawler/1.0` as its User-A

For a full list of default User-Agent strings, refer to [Automatic request headers](/browser-rendering/reference/automatic-request-headers/#user-agent).

### Content Signals

The `/crawl` endpoint respects [Content Signals](https://contentsignals.org/) directives found in a target site's `robots.txt` file. Content Signals are a way for site owners to express preferences about how their content can be used by automated systems. For more background, refer to [Giving users choice with Cloudflare's new Content Signals Policy](https://blog.cloudflare.com/content-signals-policy/).

A site owner can include a `Content-Signal` directive in their `robots.txt` to allow or disallow specific categories of use:

- `search` — Building a search index and providing search results with links and excerpts.
- `ai-input` — Inputting content into AI models at query time (for example, retrieval-augmented generation or grounding).
- `ai-train` — Training or fine-tuning AI models.

For example, a `robots.txt` that allows search indexing but disallows AI training:

```txt title="robots.txt"
User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
```
#### How /crawl enforces Content Signals

By default, `/crawl` declares all three purposes: `["search", "ai-input", "ai-train"]`. If a target site sets any of those content signals to `no`, the crawl request will be rejected at initiation with a `400 Bad Request` error unless you explicitly narrow your declared purposes using the `crawlPurposes` parameter to exclude the disallowed use.

This means:

1. **Site has no Content Signals** — The crawl proceeds normally.
2. **Site has Content Signals, and all your declared purposes are allowed** — The crawl proceeds normally.
3. **Site sets a content signal to `no`, and that purpose is in your `crawlPurposes`** — The crawl request is rejected with a `400` error and the message `Crawl purpose(s) completely disallowed by Content-Signal directive`.
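The three cases above reduce to a simple rule: reject when any of your declared purposes appears with `=no` in the site's directive. A minimal shell sketch of that rule, illustrative only and not the service's actual implementation:

```bash
# Illustrative decision rule only, not Cloudflare's implementation:
# reject when any declared purpose is set to "no" in the directive.
directive='search=yes, ai-train=no'   # from the site's robots.txt
declared='search ai-input ai-train'   # the default crawlPurposes

verdict='allowed'
for purpose in $declared; do
  case ", ${directive}," in
    *" ${purpose}=no,"*) verdict='rejected' ;;
  esac
done
echo "$verdict"
```

Narrowing `declared` to `search` alone flips the verdict back to `allowed`, which is exactly what setting `crawlPurposes` does in the request below.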
To crawl a site that disallows AI training but allows search, set `crawlPurposes` to only the purposes you need:

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "crawlPurposes": ["search"],
    "formats": ["markdown"]
  }'
```

In this example, because the operator declared only `search` as their purpose, the crawl will succeed even if the site sets `ai-train=no`.

:::note
Content Signals are trust-based. By setting `crawlPurposes`, you are declaring to the site owner how you intend to use the crawled content.
:::

## Troubleshooting

### Crawl job returns no results or all URLs are skipped
@@ -449,6 +513,10 @@ If your crawl job completes but returns an empty records array, or all URLs show

- **Pattern filters too restrictive** — Your `includePatterns` may not match any URLs on the site. Try crawling without patterns first to confirm URLs are discoverable, then add patterns.
- **No links found** — The starting URL may not contain links. Try using `source: "sitemaps"`, increasing the `depth` parameter, or setting `includeSubdomains` or `includeExternalLinks` to `true`.

### Crawl rejected by Content Signals

If your crawl request returns a `400 Bad Request` with the message `Crawl purpose(s) completely disallowed by Content-Signal directive`, the target site's `robots.txt` includes a `Content-Signal` directive that disallows one or more of your declared `crawlPurposes`. To resolve this, check the site's `robots.txt` for `Content-Signal:` entries and set `crawlPurposes` to only the purposes you need. For example, if the site sets `ai-train=no` and you only need search indexing, use `"crawlPurposes": ["search"]`. Refer to [Content Signals](#content-signals) for details.
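A quick way to perform that check is to fetch the site's `robots.txt` and grep for the directive. In the sketch below the live `curl` form is shown as a comment (`example.com` is a placeholder) and the same filter is then applied to an inline sample:

```bash
# For a live site you would fetch robots.txt directly, for example:
#   curl -s 'https://example.com/robots.txt' | grep -i '^content-signal'
# Here the same filter is applied to a sample robots.txt:
robots='User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /'

signals=$(printf '%s\n' "$robots" | grep -i '^content-signal')
echo "$signals"
```

Any purpose listed with `=no` in the output must be left out of your `crawlPurposes` for the request to succeed.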
### Crawl job takes too long

If a crawl job remains in `running` status for an extended period:
