AI Crawl Control: Add documentation and changelog for new robots.txt tab #25966
**Merged**: CameronWhiteside merged 5 commits into `cloudflare:production` from `CameronWhiteside:cwhiteside/robots-txt-new-tab` on Oct 24, 2025.
Changes shown are from 2 of the 5 commits.

Commits:
- `a02a9af` AI Crawl Control: Added docs and changelog for new robots.txt tab
- `c761638` AI Crawl Control: imporved changelog according to style guide
- `6fdd685` Update src/content/docs/ai-crawl-control/features/track-robots-txt.mdx
- `073eadb` Update src/content/docs/ai-crawl-control/features/track-robots-txt.mdx
- `fe8794a` Update src/content/changelog/ai-crawl-control/2025-10-21-track-robots…
`src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx` (27 additions, 0 deletions)
---
title: New Robots.txt tab for tracking crawler compliance
description: Monitor robots.txt file health, track crawler violations, and gain visibility into how AI crawlers interact with your directives.
date: 2025-10-21
---

AI Crawl Control now includes a **Robots.txt** tab that provides insights into how AI crawlers interact with your `robots.txt` files.

## What's new

The Robots.txt tab allows you to:

- Monitor the health status of `robots.txt` files across all your hostnames, including HTTP status codes, and identify hostnames that need a `robots.txt` file.
- Track the total number of requests to each `robots.txt` file, with breakdowns of allowed versus unsuccessful requests.
- Check whether your `robots.txt` files contain [Content Signals](https://contentsignals.org/) directives for AI training, search, and AI input.
- Identify crawlers that request paths explicitly disallowed by your `robots.txt` directives, including the crawler name, operator, violated path, specific directive, and violation count.
- Filter `robots.txt` request data by crawler, operator, category, and custom time ranges.

## Take action

When you identify non-compliant crawlers, you can:

- Block the crawler in the [Crawlers tab](/ai-crawl-control/features/manage-ai-crawlers/)
- Create custom [WAF rules](/waf/) for path-specific security
- Use [Redirect Rules](/rules/url-forwarding/) to guide crawlers to appropriate areas of your site

To get started, go to **AI Crawl Control** > **Robots.txt** in the Cloudflare dashboard. Learn more in the [Track robots.txt documentation](/ai-crawl-control/features/track-robots-txt/).
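For context on the Content Signals check described above, a `robots.txt` file carrying such directives might look roughly like the sketch below. The policy values and the disallowed path are placeholders, and the exact directive syntax is defined at [contentsignals.org](https://contentsignals.org/).

```txt
# Illustrative sketch only; refer to https://contentsignals.org/ for the exact syntax.
User-Agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /private/
Allow: /
```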
`src/content/docs/ai-crawl-control/features/track-robots-txt.mdx` (84 additions, 0 deletions)
---
title: Track robots.txt
pcx_content_type: concept
sidebar:
  order: 6
---

import { Steps, GlossaryTooltip, DashButton } from "~/components";

The **Robots.txt** tab in AI Crawl Control provides insights into how AI crawlers interact with your <GlossaryTooltip term="robots.txt">`robots.txt`</GlossaryTooltip> files across your hostnames. You can monitor request patterns, verify file availability, and identify crawlers that violate your directives.
To access robots.txt insights:

1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain.
2. Go to **AI Crawl Control**.

   <DashButton url="/?to=/:account/:zone/ai" />

3. Go to the **Robots.txt** tab.
## Check managed robots.txt status

The status card at the top of the tab shows whether Cloudflare is managing your `robots.txt` file.

When enabled, Cloudflare will include directives to block common AI crawlers used for training and include its [Content Signals Policy](/bots/additional-configurations/managed-robots-txt/#content-signals-policy) in your `robots.txt`. For more details on how Cloudflare manages your `robots.txt` file, refer to [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/).
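As a rough illustration, such managed additions resemble per-crawler blocking entries like the sketch below. The actual crawler list and policy text are maintained by Cloudflare and change over time; GPTBot and CCBot appear here only as examples of crawlers commonly associated with AI training.

```txt
# Illustrative only; the managed directives are maintained by Cloudflare.
User-Agent: GPTBot
Disallow: /

User-Agent: CCBot
Disallow: /
```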
## Filter robots.txt request data

You can apply filters at the top of the tab to narrow your analysis of robots.txt requests:

- Filter by specific crawler name (for example, Googlebot or specific AI bots).
- Filter by the entity running the crawler to understand direct licensing opportunities or existing agreements.
- Filter by general use cases (for example, AI training, general search, or AI assistant).
- Select a custom time frame for historical analysis.

The values in all tables and metrics will update according to your filters.
## Monitor robots.txt availability

The **Availability** table shows the historical request frequency and health status of `robots.txt` files across your hostnames over the selected time frame.

| Column | Description |
| --- | --- |
| Path | The specific hostname's `robots.txt` file being requested. Paths are organized by most requested first. |
| Requests | The total number of requests made to this path. Requests are broken down into:<br/>- **Allowed:** HTTP status codes below 400 (including **200 OK** and redirects).<br/>- **Unsuccessful:** HTTP status codes of 400 or above. |
| Status | The HTTP status code returned when the `robots.txt` file is requested. |
| Content Signals | An indicator showing whether the `robots.txt` file contains [Content Signals](https://contentsignals.org/) directives for usage in AI training, search, or AI input. |

From this table, you can take the following actions:

- Monitor for a high number of unsuccessful requests, which suggests that crawlers are having trouble accessing your `robots.txt` file.
- If the **Status** is `404 Not Found`, create a `robots.txt` file to provide clear directives.
- If the file exists, check for upstream WAF rules or other security settings that may be blocking access.
- If the **Content Signals** column indicates that signals are missing, add them to your `robots.txt` file. You can do this by following the [Content Signals](https://contentsignals.org/) instructions or by enabling [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/) to have Cloudflare manage them for you.
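To make the Allowed versus Unsuccessful split concrete, the minimal Python sketch below requests a `robots.txt` file and applies the same below-400 / 400-and-above threshold described in the table. The hostname is a placeholder, and the script is illustrative rather than part of the product.

```python
# Minimal sketch: fetch a robots.txt file and classify the response the way the
# Availability table does (status below 400 = allowed, 400 or above = unsuccessful).
from urllib import error, request

def check_robots_txt(hostname: str) -> dict:
    url = f"https://{hostname}/robots.txt"
    try:
        with request.urlopen(url, timeout=10) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except error.HTTPError as exc:  # 4xx/5xx responses raise HTTPError
        status, body = exc.code, ""
    return {
        "status": status,
        "allowed": status < 400,
        "has_content_signals": "content-signal" in body.lower(),
    }

# "example.com" is a placeholder hostname.
print(check_robots_txt("example.com"))
```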
## Track robots.txt violations

The **Violations** table identifies AI crawlers that have requested paths explicitly disallowed by your `robots.txt` file. This helps you identify non-compliant crawlers and take appropriate action.

:::note[How violations are calculated]

The Violations table identifies mismatches between your **current** `robots.txt` directives and past crawler requests. Because violations are not logged in real time, recently added or changed rules may cause previously legitimate requests to be flagged as violations.

For example, if you add a new `Disallow` rule, all past requests to that path will appear as violations, even though they were not violations at the time of the request.
:::
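Conceptually, each row in the table reflects a check along the lines of the sketch below, which tests a previously requested path against the directives currently in the file using Python's standard `urllib.robotparser`. The crawler name, directives, and path are made up for illustration.

```python
# Sketch of the idea behind a violation check: is a previously requested path
# disallowed by the *current* robots.txt directives?
from urllib.robotparser import RobotFileParser

current_robots_txt = """
User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(current_robots_txt.splitlines())

past_request_path = "https://example.com/private/report.html"
if not parser.can_fetch("ExampleBot", past_request_path):
    print("Counted as a violation under the current directives")
```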
| Column | Description |
| --- | --- |
| Crawler | The name of the bot that violated your `robots.txt` directives. The operator of the crawler is listed directly beneath the crawler name. |
| Path | The specific URL or path the crawler attempted to access that was disallowed by your `robots.txt` file. |
| Directive | The exact line from your `robots.txt` file that disallowed access to the path. |
| Violations | The count of HTTP requests made to the disallowed path/directive pair within the selected time frame. |
When you identify crawlers violating your `robots.txt` directives, you have several options:

- Navigate to the [**Crawlers** tab](/ai-crawl-control/features/manage-ai-crawlers/) to permanently block the non-compliant crawler.
- Use [Cloudflare WAF](/waf/) to create path-specific security rules for the violating crawler.
- Use [Redirect Rules](/rules/url-forwarding/) to guide violating crawlers to an appropriate area of your site.
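For the WAF option, a custom rule expression scoped to a disallowed path and a specific crawler might take a shape like the sketch below. The path and bot name are placeholders; `http.request.uri.path` and `http.user_agent` are standard fields in the Cloudflare Rules language, and you would pair the expression with a block or challenge action.

```txt
(http.request.uri.path contains "/private/") and (http.user_agent contains "ExampleBot")
```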
## Related resources

- [Manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/)
- [Analyze AI traffic](/ai-crawl-control/features/analyze-ai-traffic/)
- [Cloudflare WAF](/waf/)