@@ -0,0 +1,27 @@
---
title: New Robots.txt tab for tracking crawler compliance
description: Monitor robots.txt file health, track crawler violations, and gain visibility into how AI crawlers interact with your directives.
date: 2025-10-21
---

AI Crawl Control now includes a **Robots.txt** tab that provides insights into how AI crawlers interact with your `robots.txt` files.

## What's new

The Robots.txt tab allows you to:

- Monitor the health status of `robots.txt` files across all your hostnames, including HTTP status codes, and identify hostnames that need a `robots.txt` file.
- Track the total number of requests to each `robots.txt` file, with breakdowns of successful versus unsuccessful requests.
- Check whether your `robots.txt` files contain [Content Signals](https://contentsignals.org/) directives for AI training, search, and AI input.
- Identify crawlers that request paths explicitly disallowed by your `robots.txt` directives, including the crawler name, operator, violated path, specific directive, and violation count.
- Filter `robots.txt` request data by crawler, operator, category, and custom time ranges.

## Take action

When you identify non-compliant crawlers, you can:

- Block the crawler in the [Crawlers tab](/ai-crawl-control/features/manage-ai-crawlers/)
- Create custom [WAF rules](/waf/) for path-specific security
- Use [Redirect Rules](/rules/url-forwarding/) to guide crawlers to appropriate areas of your site

To get started, go to **AI Crawl Control** > **Robots.txt** in the Cloudflare dashboard. Learn more in the [Track robots.txt documentation](/ai-crawl-control/features/track-robots-txt/).
64 changes: 45 additions & 19 deletions src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx
@@ -7,7 +7,7 @@ sidebar:
order: 2
---

import { Steps, Tabs, TabItem, DashButton } from "~/components";
import { Aside, Steps, Tabs, TabItem, DashButton } from "~/components";

AI Crawl Control metrics provide you with insight into how AI crawlers are interacting with your website ([Cloudflare zone](/fundamentals/concepts/accounts-and-zones/#zones)).

@@ -24,29 +24,55 @@ You can find meaningful information across both **Crawlers** and **Metrics** tab

The **Crawlers** tab provides you with the following information:

- Total number of requests to crawl your website from common AI crawlers
- Number of requests made by each AI crawler
- Number of `robots.txt` violations for each crawler
| Metric | Description |
| ----------------------- | ----------------------------------------------------------------------- |
| **Total requests** | Total number of requests to crawl your website from common AI crawlers. |
| **Requests by crawler** | Number of requests made by each AI crawler. |

## View AI Crawl Control metrics

The **Metrics** tab provides you with the following metrics to help you understand how AI crawlers are interacting with your website.

| Metric | Description |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Total requests | The total number of requests to crawl your website, from all AI crawlers. |
| Allowed requests | The number of crawler requests that received a successful response from your site. |
| Unsuccessful requests | The number of crawler requests that failed (HTTP 4xx or 5xx) because of a blocked request, other security rules, or website errors such as a crawler attempting to access a non-existent page. |
| Overall popular paths | The most popular pages crawled by AI crawlers, from all AI crawlers. |
| Most active AI crawlers by operators | The AI crawler owners with the highest number of requests to access your site. |
| Requests by AI crawlers | A graph that displays the number of crawl requests from each AI crawler. |
| Most popular paths by AI crawlers | The most popular pages crawled by AI crawlers, for each AI crawler. |
| Referrals | A graph that displays the number of visits directed to your site from each AI operator. |
| Referers | The list of referers that directed visits to your site. |

## Filter date range

You can use the date filter to choose the period of time you wish to analyze.
### Analyze referrer data

<Aside type="note">
This feature is available for customers on a paid plan.
</Aside>

Identify traffic sources with referrer analytics:

- View top referrers driving traffic to your site.
- Understand discovery patterns and content popularity from AI operators.

### Track crawler requests over time

Visualize crawler activity patterns over time using the **Requests over time** chart. You can group data by different dimensions to get more specific insights:

| Dimension | Description |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Crawler** | Track activity from individual AI crawlers (like GPTBot, ClaudeBot, and Bytespider). |
| **Category** | Analyze crawlers by their purpose or type. |
| **Operator** | Discover which companies (such as OpenAI, Anthropic, and ByteDance) are crawling your site. |
| **Host** | Break down activity across multiple subdomains. |
| **Status Code** | Monitor the HTTP response codes (200s, 300s, 400s, 500s) returned to crawlers. |

### Understand what content is crawled

The **Most popular paths** table shows you which pages on your site are most frequently requested by AI crawlers. This can help you understand what content is most popular with different AI models.

| Column | Description |
| -------------------- | ----------------------------------------------------------------------- |
| **Path** | The path of the page on your website that was requested. |
| **Hostname** | The hostname of the requested page. |
| **Crawler** | The name of the AI crawler that made the request. |
| **Operator** | The company that operates the AI crawler. |
| **Allowed requests** | The number of times the path was successfully requested by the crawler. |

You can also filter the results by path or content type to narrow down your analysis.

## Filter and export data

You can use the date filter to choose the period of time you wish to analyze. To export your data, select **Download CSV**. The downloaded file will include all applied filters and groupings.

<Tabs>
<TabItem label="Free plans">
84 changes: 84 additions & 0 deletions src/content/docs/ai-crawl-control/features/track-robots-txt.mdx
@@ -0,0 +1,84 @@
---
title: Track robots.txt
pcx_content_type: concept
sidebar:
order: 6
---

import { Steps, GlossaryTooltip, DashButton } from "~/components";

The **Robots.txt** tab in AI Crawl Control provides insights into how AI crawlers interact with your <GlossaryTooltip term="robots.txt">`robots.txt`</GlossaryTooltip> files across your hostnames. You can monitor request patterns, verify file availability, and identify crawlers that violate your directives.

To access robots.txt insights:

1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain.
2. Go to **AI Crawl Control**.

<DashButton url="/?to=/:account/:zone/ai" />

3. Go to the **Robots.txt** tab.

## Check managed robots.txt status

The status card at the top of the tab shows whether Cloudflare is managing your `robots.txt` file.

When enabled, Cloudflare will include directives to block common AI crawlers used for training and include its [Content Signals Policy](/bots/additional-configurations/managed-robots-txt/#content-signals-policy) in your `robots.txt`. For more details on how Cloudflare manages your `robots.txt` file, refer to [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/).

## Filter robots.txt request data

You can apply filters at the top of the tab to narrow your analysis of robots.txt requests:

- Filter by specific crawler name (for example, Googlebot or specific AI bots).
- Filter by the entity running the crawler to understand direct licensing opportunities or existing agreements.
- Filter by general use cases (for example, AI training, general search, or AI assistant).
- Select a custom time frame for historical analysis.

The values in all tables and metrics will update according to your filters.

## Monitor robots.txt availability

The **Availability** table shows the historical request frequency and health status of `robots.txt` files across your hostnames over the selected time frame.

| Column | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Path | The specific hostname's `robots.txt` file being requested. Paths are listed from the most requested to the least. |
| Requests | The total number of requests made to this path. Requests are broken down into:<br/>- **Successful:** HTTP status codes below 400 (including **200 OK** and redirects).<br/>- **Unsuccessful:** HTTP status codes of 400 or above. |
| Status | The HTTP status code returned when the `robots.txt` file is requested. |
| Content Signals | An indicator showing whether the `robots.txt` file contains [Content Signals](https://contentsignals.org/), which are directives that express preferences for AI training, search, and AI input. |

From this table, you can take the following actions:

- Monitor for a high number of unsuccessful requests, which suggests that crawlers are having trouble accessing your `robots.txt` file.
- If the **Status** is `404 Not Found`, create a `robots.txt` file to provide clear directives (a minimal example follows this list).
- If the file exists, check for upstream WAF rules or other security settings that may be blocking access.
- If the **Content Signals** column indicates that signals are missing, add them to your `robots.txt` file. You can do this by following the [Content Signals](https://contentsignals.org/) instructions or by enabling [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/) to have Cloudflare manage them for you.
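
If a hostname is missing a `robots.txt` file, or the file lacks Content Signals, a minimal sketch like the one below can serve as a starting point. The crawler name and path are placeholders, and the `Content-Signal` line is illustrative; refer to [Content Signals](https://contentsignals.org/) for the exact syntax.

```txt
# Block an AI training crawler from a sensitive section of the site.
User-agent: GPTBot
Disallow: /private/

# Allow all other crawlers.
User-agent: *
Allow: /

# Illustrative Content Signals line expressing usage preferences.
Content-Signal: search=yes, ai-train=no
```

Serve the file at the root of each hostname (for example, `https://example.com/robots.txt`) so that crawlers, and the Availability table, can find it.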

## Track robots.txt violations

The **Violations** table identifies AI crawlers that have requested paths explicitly disallowed by your `robots.txt` file. This helps you identify non-compliant crawlers and take appropriate action.

:::note[How violations are calculated]

The Violations table identifies mismatches between your **current** `robots.txt` directives and past crawler requests. Because violations are not logged in real time, recently added or changed rules may cause previously legitimate requests to be flagged as violations.

For example, if you add a new `Disallow` rule, all past requests to that path will appear as violations, even though they were not violations at the time of the request.
:::
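
Conceptually, this check resembles replaying logged requests against the directives in the current file. The following Python sketch illustrates the idea; it is not Cloudflare's implementation, and the `(user_agent, path)` log format is an assumption made for illustration.

```python
from urllib import robotparser

def count_violations(requests, robots_txt_url):
    """Count requests that the current robots.txt would disallow.

    `requests` is an iterable of (user_agent, path) pairs taken from crawler
    logs; the log format here is an assumption made for illustration.
    """
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_txt_url)
    parser.read()  # fetch and parse the *current* robots.txt

    violations = {}
    for user_agent, path in requests:
        # A request counts as a violation if the current rules disallow the
        # path, even if different rules were in place when it was made.
        if not parser.can_fetch(user_agent, path):
            violations[(user_agent, path)] = violations.get((user_agent, path), 0) + 1
    return violations
```

Because only the current file is consulted, tightening a rule also reclassifies earlier requests to that path, which is the behavior described in the note above.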

| Column | Description |
| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| Crawler | The name of the bot that violated your `robots.txt` directives. The operator of the crawler is listed directly beneath the crawler name. |
| Path | The specific URL or path the crawler attempted to access that was disallowed by your `robots.txt` file. |
| Directive | The exact line from your `robots.txt` file that disallowed access to the path. |
| Violations | The count of HTTP requests made to the disallowed path/directive pair within the selected time frame. |

When you identify crawlers violating your `robots.txt` directives, you have several options:

- Navigate to the [**Crawlers** tab](/ai-crawl-control/features/manage-ai-crawlers/) to permanently block the non-compliant crawler.
- Use [Cloudflare WAF](/waf/) to create path-specific security rules for the violating crawler (a sample rule expression follows this list).
- Use [Redirect Rules](/rules/url-forwarding/) to guide violating crawlers to an appropriate area of your site.
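
For example, a custom WAF rule that blocks a specific crawler on a disallowed path could use a filter expression like the sketch below, paired with a *Block* action. The crawler name and path are placeholders; adjust them to match the entries in the Violations table.

```txt
(http.user_agent contains "ExampleBot" and starts_with(http.request.uri.path, "/private/"))
```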

## Related resources

- [Manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/)
- [Analyze AI traffic](/ai-crawl-control/features/analyze-ai-traffic/)
- [Cloudflare WAF](/waf/)
57 changes: 34 additions & 23 deletions src/content/docs/ai-crawl-control/index.mdx
@@ -11,7 +11,7 @@ head:
description: Monitor and control how AI services access your website content.
---

import { Description, Feature, FeatureTable, Plan, LinkButton, RelatedProduct, Card } from "~/components";
import {
Description,
Feature,
FeatureTable,
Plan,
LinkButton,
RelatedProduct,
Card,
} from "~/components";

<Plan type="all" />

@@ -53,6 +61,15 @@ With AI Crawl Control, you can:
Gain insight into how AI crawlers are interacting with your pages.
</Feature>

<Feature
header="Track robots.txt"
href="/ai-crawl-control/features/track-robots-txt/"
cta="Track robots.txt"
>
Track the health of `robots.txt` files and identify which crawlers are
violating your directives.
</Feature>

<Feature
header="Pay Per Crawl"
href="/ai-crawl-control/features/pay-per-crawl/what-is-pay-per-crawl/"
@@ -66,41 +83,35 @@
## Use cases

<Card title="Publishers and content creators">
Publishers and content creators can monitor which AI crawlers are accessing their articles and educational content. Set policies to allow beneficial crawlers while blocking others.
Publishers and content creators can monitor which AI crawlers are accessing
their articles and educational content. Set policies to allow beneficial
crawlers while blocking others.
</Card>

<Card title="E-commerce and business sites">
E-commerce and business sites can identify AI crawler activity on product pages and business information. Control access to sensitive data like pricing and inventory.
E-commerce and business sites can identify AI crawler activity on product
pages and business information. Control access to sensitive data like pricing
and inventory.
</Card>

<Card title="Documentation sites">
Documentation sites can track how AI crawlers are accessing their technical documentation. Gain insight into how AI crawlers are engaging with your site.
Documentation sites can track how AI crawlers are accessing their technical
documentation. Gain insight into how AI crawlers are engaging with your site.
</Card>

---

## Related Products

<RelatedProduct
header="Bots"
href="/bots/"
product="bots"
>
Identify and mitigate automated traffic to protect your domain from bad bots.
<RelatedProduct header="Bots" href="/bots/" product="bots">
Identify and mitigate automated traffic to protect your domain from bad bots.
</RelatedProduct>

<RelatedProduct
header="Web Application Firewall"
href="/waf/"
product="waf"
>
Get automatic protection from vulnerabilities and the flexibility to create custom rules.
<RelatedProduct header="Web Application Firewall" href="/waf/" product="waf">
Get automatic protection from vulnerabilities and the flexibility to create
custom rules.
</RelatedProduct>

<RelatedProduct
header="Analytics"
href="/analytics/"
product="analytics"
>
View and analyze traffic on your domain.
</RelatedProduct>
<RelatedProduct header="Analytics" href="/analytics/" product="analytics">
View and analyze traffic on your domain.
</RelatedProduct>