
Commit fcf9128

Cameron Whiteside and Oxyjun authored
AI Crawl Control: Add documentation and changelog for new robots.txt tab (#25966)
* AI Crawl Control: Added docs and changelog for new robots.txt tab
* AI Crawl Control: Improved changelog according to style guide
* Update src/content/docs/ai-crawl-control/features/track-robots-txt.mdx
* Update src/content/docs/ai-crawl-control/features/track-robots-txt.mdx
  Co-authored-by: Jun Lee <[email protected]>
* Update src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx

---------

Co-authored-by: Cameron Whiteside <[email protected]>
Co-authored-by: Jun Lee <[email protected]>
1 parent a774d72 commit fcf9128

File tree: 4 files changed, +190 −42 lines changed
src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
---
title: New Robots.txt tab for tracking crawler compliance
description: Monitor robots.txt file health, track crawler violations, and gain visibility into how AI crawlers interact with your directives.
date: 2025-10-21
---

AI Crawl Control now includes a **Robots.txt** tab that provides insights into how AI crawlers interact with your `robots.txt` files.

## What's new

The Robots.txt tab allows you to:

- Monitor the health status of `robots.txt` files across all your hostnames, including HTTP status codes, and identify hostnames that need a `robots.txt` file.
- Track the total number of requests to each `robots.txt` file, with breakdowns of successful versus unsuccessful requests.
- Check whether your `robots.txt` files contain [Content Signals](https://contentsignals.org/) directives for AI training, search, and AI input.
- Identify crawlers that request paths explicitly disallowed by your `robots.txt` directives, including the crawler name, operator, violated path, specific directive, and violation count.
- Filter `robots.txt` request data by crawler, operator, category, and custom time ranges.
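As a quick illustration of the success/failure breakdown above, outside of the dashboard: the docs in this commit count responses with status codes below 400 as successful and 400 or above as unsuccessful. The TypeScript sketch below applies the same threshold; the hostname is a placeholder.

```ts
// Classify a robots.txt response the same way the tab does: status codes
// below 400 count as successful, 400 and above as unsuccessful.
// "example.com" is a placeholder hostname.
async function checkRobotsTxt(hostname: string): Promise<void> {
  const url = `https://${hostname}/robots.txt`;
  const response = await fetch(url);
  const outcome = response.status < 400 ? "successful" : "unsuccessful";
  console.log(`${url} -> HTTP ${response.status} (${outcome})`);
}

checkRobotsTxt("example.com").catch(console.error);
```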
## Take action

When you identify non-compliant crawlers, you can:

- Block the crawler in the [Crawlers tab](/ai-crawl-control/features/manage-ai-crawlers/)
- Create custom [WAF rules](/waf/) for path-specific security
- Use [Redirect Rules](/rules/url-forwarding/) to guide crawlers to appropriate areas of your site

To get started, go to **AI Crawl Control** > **Robots.txt** in the Cloudflare dashboard. Learn more in the [Track robots.txt documentation](/ai-crawl-control/features/track-robots-txt/).

src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx

Lines changed: 45 additions & 19 deletions
@@ -7,7 +7,7 @@ sidebar:
   order: 2
 ---

-import { Steps, Tabs, TabItem, DashButton } from "~/components";
+import { Aside, Steps, Tabs, TabItem, DashButton } from "~/components";

 AI Crawl Control metrics provide you with insight on how AI crawlers are interacting with your website ([Cloudflare zone](/fundamentals/concepts/accounts-and-zones/#zones)).

@@ -24,29 +24,55 @@ You can find meaningful information across both **Crawlers** and **Metrics** tabs

 The **Crawlers** tab provides you with the following information:

-- Total number of requests to crawl your website from common AI crawlers
-- Number of requests made by each AI crawler
-- Number of `robots.txt` violations for each crawler
+| Metric                  | Description                                                              |
+| ----------------------- | ------------------------------------------------------------------------ |
+| **Total requests**      | Total number of requests to crawl your website from common AI crawlers. |
+| **Requests by crawler** | Number of requests made by each AI crawler.                              |

 ## View AI Crawl Control metrics

 The **Metrics** tab provides you with the following metrics to help you understand how AI crawlers are interacting with your website.

-| Metric | Description |
-| ------ | ----------- |
-| Total requests | The total number of requests to crawl your website, from all AI crawlers |
-| Allowed requests | The number of crawler requests that received a successful response from your site |
-| Unsuccessful requests | The number of crawler requests that failed (HTTP 4xx or 5xx) as a result of a blocked request, other security rules, or website errors such as a crawler attempting to access a non-existent page |
-| Overall popular paths | The most popular pages crawled by AI crawlers, from all AI crawlers |
-| Most active AI crawlers by operators | The AI crawler owners with the highest number of requests to access your site |
-| Request by AI crawlers | A graph which displays the number of crawl requests from each AI crawler |
-| Most popular paths by AI crawlers | The most popular pages crawled by AI crawlers, for each AI crawler |
-| Referrals | A graph which displays the number of visits that were directed to your site from each AI operator |
-| Referers | The list of referers who directed visits to your site |
-
-## Filter date range
-
-You can use the date filter to choose the period of time you wish to analyze.
+### Analyze referrer data
+
+<Aside type="note">
+This feature is available for customers on a paid plan.
+</Aside>
+
+Identify traffic sources with referrer analytics to understand discovery patterns and content popularity from AI operators.
+
+- View top referrers driving traffic to your site.
+- Understand discovery patterns and content popularity from AI operators.
+
+### Track crawler requests over time
+
+Visualize crawler activity patterns over time using the **Requests over time** chart. You can group data by different dimensions to get more specific insights:
+
+| Dimension       | Description                                                                                  |
+| --------------- | -------------------------------------------------------------------------------------------- |
+| **Crawler**     | Track activity from individual AI crawlers (like GPTBot, ClaudeBot, and Bytespider).         |
+| **Category**    | Analyze crawlers by their purpose or type.                                                   |
+| **Operator**    | Discover which companies (such as OpenAI, Anthropic, and ByteDance) are crawling your site.  |
+| **Host**        | Break down activity across multiple subdomains.                                              |
+| **Status Code** | Monitor HTTP response codes (200s, 300s, 400s, 500s) to crawlers.                            |
+
+### Understand what content is crawled
+
+The **Most popular paths** table shows you which pages on your site are most frequently requested by AI crawlers. This can help you understand what content is most popular with different AI models.
+
+| Column               | Description                                                              |
+| -------------------- | ------------------------------------------------------------------------ |
+| **Path**             | The path of the page on your website that was requested.                |
+| **Hostname**         | The hostname of the requested page.                                     |
+| **Crawler**          | The name of the AI crawler that made the request.                       |
+| **Operator**         | The company that operates the AI crawler.                               |
+| **Allowed requests** | The number of times the path was successfully requested by the crawler. |
+
+You can also filter the results by path or content type to narrow down your analysis.
+
+## Filter and export data
+
+You can use the date filter to choose the period of time you wish to analyze. To export your data, select **Download CSV**. The downloaded file will include all applied filters and groupings.

 <Tabs>
 <TabItem label="Free plans">
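As a hedged aside on the new **Filter and export data** section above: once you have a **Download CSV** export, you can post-process it yourself. The column names in this TypeScript sketch (`crawler`, `requests`) are assumptions for illustration, not a documented export schema; check the header row of a real export and adjust.

```ts
import { readFileSync } from "node:fs";

// Total requests per crawler from a "Download CSV" export.
// The column names "crawler" and "requests" are assumptions; check the
// header row of your export and adjust them.
function totalRequestsByCrawler(csvPath: string): Map<string, number> {
  const [headerLine, ...rows] = readFileSync(csvPath, "utf8").trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim().toLowerCase());
  const crawlerIndex = headers.indexOf("crawler");
  const requestsIndex = headers.indexOf("requests");

  const totals = new Map<string, number>();
  for (const row of rows) {
    // Naive comma split; use a CSV parser if fields can contain commas.
    const cols = row.split(",");
    const crawler = cols[crawlerIndex] ?? "unknown";
    const requests = Number(cols[requestsIndex]) || 0;
    totals.set(crawler, (totals.get(crawler) ?? 0) + requests);
  }
  return totals;
}

console.log(totalRequestsByCrawler("./ai-crawl-control-export.csv"));
```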
src/content/docs/ai-crawl-control/features/track-robots-txt.mdx

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
---
title: Track robots.txt
pcx_content_type: concept
sidebar:
  order: 6
---

import { Steps, GlossaryTooltip, DashButton } from "~/components";

The **Robots.txt** tab in AI Crawl Control provides insights into how AI crawlers interact with your <GlossaryTooltip term="robots.txt">`robots.txt`</GlossaryTooltip> files across your hostnames. You can monitor request patterns, verify file availability, and identify crawlers that violate your directives.

To access robots.txt insights:

1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain.
2. Go to **AI Crawl Control**.

   <DashButton url="/?to=/:account/:zone/ai" />

3. Go to the **Robots.txt** tab.

## Check managed robots.txt status

The status card at the top of the tab shows whether Cloudflare is managing your `robots.txt` file.

When enabled, Cloudflare includes directives that block common AI crawlers used for training and adds its [Content Signals Policy](/bots/additional-configurations/managed-robots-txt/#content-signals-policy) to your `robots.txt`. For more details on how Cloudflare manages your `robots.txt` file, refer to [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/).
## Filter robots.txt request data

You can apply filters at the top of the tab to narrow your analysis of robots.txt requests:

- Filter by specific crawler name (for example, Googlebot or specific AI bots).
- Filter by the entity running the crawler to understand direct licensing opportunities or existing agreements.
- Filter by general use cases (for example, AI training, general search, or AI assistant).
- Select a custom time frame for historical analysis.

The values in all tables and metrics will update according to your filters.

## Monitor robots.txt availability

The **Availability** table shows the historical request frequency and health status of `robots.txt` files across your hostnames over the selected time frame.

| Column          | Description |
| --------------- | ----------- |
| Path            | The specific hostname's `robots.txt` file being requested. Paths are listed from the most requested to the least. |
| Requests        | The total number of requests made to this path. Requests are broken down into:<br/>- **Successful:** HTTP status codes below 400 (including **200 OK** and redirects).<br/>- **Unsuccessful:** HTTP status codes of 400 or above. |
| Status          | The HTTP status code returned when the `robots.txt` file is requested. |
| Content Signals | An indicator showing whether the `robots.txt` file contains [Content Signals](https://contentsignals.org/) directives for usage in AI training, search, or AI input. |

From this table, you can take the following actions:

- Monitor for a high number of unsuccessful requests, which suggests that crawlers are having trouble accessing your `robots.txt` file.
- If the **Status** is `404 Not Found`, create a `robots.txt` file to provide clear directives.
- If the file exists, check for upstream WAF rules or other security settings that may be blocking access.
- If the **Content Signals** column indicates that signals are missing, add them to your `robots.txt` file. You can do this by following the [Content Signals](https://contentsignals.org/) instructions or by enabling [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/) to have Cloudflare manage them for you.
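For illustration, a TypeScript sketch of the kind of check the Availability table automates for each hostname: request `robots.txt`, record the HTTP status, and look for Content Signals. The `Content-Signal:` line format is assumed from [contentsignals.org](https://contentsignals.org/), and the hostnames are placeholders.

```ts
// Check one hostname's robots.txt: is it reachable, and does it contain
// Content Signals? The "Content-Signal:" line format is assumed from
// contentsignals.org; verify against the current spec before relying on it.
async function inspectRobotsTxt(hostname: string) {
  const response = await fetch(`https://${hostname}/robots.txt`);
  if (response.status >= 400) {
    return { hostname, status: response.status, contentSignals: false };
  }
  const body = await response.text();
  const contentSignals = body
    .split("\n")
    .some((line) => line.trim().toLowerCase().startsWith("content-signal:"));
  return { hostname, status: response.status, contentSignals };
}

// Placeholder hostnames; use the hostnames listed in the Availability table.
Promise.all(["example.com", "docs.example.com"].map(inspectRobotsTxt))
  .then((results) => console.table(results))
  .catch(console.error);
```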
## Track robots.txt violations

The **Violations** table identifies AI crawlers that have requested paths explicitly disallowed by your `robots.txt` file. This helps you identify non-compliant crawlers and take appropriate action.

:::note[How violations are calculated]

The Violations table identifies mismatches between your **current** `robots.txt` directives and past crawler requests. Because violations are not logged in real time, recently added or changed rules may cause previously legitimate requests to be flagged as violations.

For example, if you add a new `Disallow` rule, all past requests to that path will appear as violations, even though they were not violations at the time of the request.
:::
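A minimal sketch of the mismatch check this note describes, assuming simple prefix matching of `Disallow` rules against recorded request paths. Real `robots.txt` matching also involves `Allow` rules, wildcards, and per-user-agent groups, and the crawler names and paths below are hypothetical.

```ts
// Compare past crawler requests against the *current* Disallow directives,
// as the note above describes. This sketch only does prefix matching; real
// robots.txt matching also handles Allow rules, wildcards, and user-agent groups.
interface PastRequest {
  crawler: string;
  path: string;
}

function findViolations(disallowedPrefixes: string[], requests: PastRequest[]) {
  return requests.flatMap((request) =>
    disallowedPrefixes
      .filter((prefix) => request.path.startsWith(prefix))
      .map((prefix) => ({ ...request, directive: `Disallow: ${prefix}` })),
  );
}

// Hypothetical data: a newly added "Disallow: /drafts/" rule flags a request
// that happened before the rule existed.
const violations = findViolations(
  ["/drafts/", "/private/"],
  [
    { crawler: "ExampleBot", path: "/drafts/post-1" },
    { crawler: "ExampleBot", path: "/blog/post-2" },
  ],
);
console.log(violations); // one violation: /drafts/post-1 matched "Disallow: /drafts/"
```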
| Column     | Description |
| ---------- | ----------- |
| Crawler    | The name of the bot that violated your `robots.txt` directives. The operator of the crawler is listed directly beneath the crawler name. |
| Path       | The specific URL or path the crawler attempted to access that was disallowed by your `robots.txt` file. |
| Directive  | The exact line from your `robots.txt` file that disallowed access to the path. |
| Violations | The count of HTTP requests made to the disallowed path/directive pair within the selected time frame. |

When you identify crawlers violating your `robots.txt` directives, you have several options:

- Navigate to the [**Crawlers** tab](/ai-crawl-control/features/manage-ai-crawlers/) to permanently block the non-compliant crawler.
- Use [Cloudflare WAF](/waf/) to create path-specific security rules for the violating crawler.
- Use [Redirect Rules](/rules/url-forwarding/) to guide violating crawlers to an appropriate area of your site.

## Related resources

- [Manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/)
- [Analyze AI traffic](/ai-crawl-control/features/analyze-ai-traffic/)
- [Cloudflare WAF](/waf/)

src/content/docs/ai-crawl-control/index.mdx

Lines changed: 34 additions & 23 deletions
@@ -11,7 +11,15 @@ head:
 description: Monitor and control how AI services access your website content.
 ---

-import { Description, Feature, FeatureTable, Plan, LinkButton, RelatedProduct, Card } from "~/components";
+import {
+  Description,
+  Feature,
+  FeatureTable,
+  Plan,
+  LinkButton,
+  RelatedProduct,
+  Card,
+} from "~/components";

 <Plan type="all" />

@@ -53,6 +61,15 @@ With AI Crawl Control, you can:
 Gain insight into how AI crawlers are interacting with your pages.
 </Feature>

+<Feature
+  header="Track robots.txt"
+  href="/ai-crawl-control/features/track-robots-txt/"
+  cta="Track robots.txt"
+>
+  Track the health of `robots.txt` files and identify which crawlers are
+  violating your directives.
+</Feature>
+
 <Feature
   header="Pay Per Crawl"
   href="/ai-crawl-control/features/pay-per-crawl/what-is-pay-per-crawl/"
@@ -66,41 +83,35 @@ With AI Crawl Control, you can:
 ## Use cases

 <Card title="Publishers and content creators">
-Publishers and content creators can monitor which AI crawlers are accessing their articles and educational content. Set policies to allow beneficial crawlers while blocking others.
+  Publishers and content creators can monitor which AI crawlers are accessing
+  their articles and educational content. Set policies to allow beneficial
+  crawlers while blocking others.
 </Card>

 <Card title="E-commerce and business sites">
-E-commerce and business sites can identify AI crawler activity on product pages and business information. Control access to sensitive data like pricing and inventory.
+  E-commerce and business sites can identify AI crawler activity on product
+  pages and business information. Control access to sensitive data like pricing
+  and inventory.
 </Card>

 <Card title="Documentation sites">
-Documentation sites can track how AI crawlers are accessing their technical documentation. Gain insight into how AI crawlers are engaging with your site.
+  Documentation sites can track how AI crawlers are accessing their technical
+  documentation. Gain insight into how AI crawlers are engaging with your site.
 </Card>

 ---

 ## Related Products

-<RelatedProduct
-  header="Bots"
-  href="/bots/"
-  product="bots"
->
-  Identify and mitigate automated traffic to protect your domain from bad bots.
+<RelatedProduct header="Bots" href="/bots/" product="bots">
+  Identify and mitigate automated traffic to protect your domain from bad bots.
 </RelatedProduct>

-<RelatedProduct
-  header="Web Application Firewall"
-  href="/waf/"
-  product="waf"
->
-  Get automatic protection from vulnerabilities and the flexibility to create custom rules.
+<RelatedProduct header="Web Application Firewall" href="/waf/" product="waf">
+  Get automatic protection from vulnerabilities and the flexibility to create
+  custom rules.
 </RelatedProduct>

-<RelatedProduct
-  header="Analytics"
-  href="/analytics/"
-  product="analytics"
->
-  View and analyze traffic on your domain.
-</RelatedProduct>
+<RelatedProduct header="Analytics" href="/analytics/" product="analytics">
+  View and analyze traffic on your domain.
+</RelatedProduct>
