diff --git a/src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx b/src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx new file mode 100644 index 00000000000000..2c0beb6cbb8848 --- /dev/null +++ b/src/content/changelog/ai-crawl-control/2025-10-21-track-robots-txt.mdx @@ -0,0 +1,27 @@ +--- +title: New Robots.txt tab for tracking crawler compliance +description: Monitor robots.txt file health, track crawler violations, and gain visibility into how AI crawlers interact with your directives. +date: 2025-10-21 +--- + +AI Crawl Control now includes a **Robots.txt** tab that provides insights into how AI crawlers interact with your `robots.txt` files. + +## What's new + +The Robots.txt tab allows you to: + +- Monitor the health status of `robots.txt` files across all your hostnames, including HTTP status codes, and identify hostnames that need a `robots.txt` file. +- Track the total number of requests to each `robots.txt` file, with breakdowns of successful versus unsuccessful requests. +- Check whether your `robots.txt` files contain [Content Signals](https://contentsignals.org/) directives for AI training, search, and AI input. +- Identify crawlers that request paths explicitly disallowed by your `robots.txt` directives, including the crawler name, operator, violated path, specific directive, and violation count. +- Filter `robots.txt` request data by crawler, operator, category, and custom time ranges. + +## Take action + +When you identify non-compliant crawlers, you can: + +- Block the crawler in the [Crawlers tab](/ai-crawl-control/features/manage-ai-crawlers/) +- Create custom [WAF rules](/waf/) for path-specific security +- Use [Redirect Rules](/rules/url-forwarding/) to guide crawlers to appropriate areas of your site + +To get started, go to **AI Crawl Control** > **Robots.txt** in the Cloudflare dashboard. Learn more in the [Track robots.txt documentation](/ai-crawl-control/features/track-robots-txt/). diff --git a/src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx b/src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx index c2acc01e59a49f..dbe402838e3d59 100644 --- a/src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx +++ b/src/content/docs/ai-crawl-control/features/analyze-ai-traffic.mdx @@ -7,7 +7,7 @@ sidebar: order: 2 --- -import { Steps, Tabs, TabItem, DashButton } from "~/components"; +import { Aside, Steps, Tabs, TabItem, DashButton } from "~/components"; AI Crawl Control metrics provide you with insight on how AI crawlers are interacting with your website ([Cloudflare zone](/fundamentals/concepts/accounts-and-zones/#zones)). @@ -24,29 +24,55 @@ You can find meaningful information across both **Crawlers** and **Metrics** tab The **Crawlers** tab provides you with the following information: -- Total number of requests to crawl your website from common AI crawlers -- Number of requests made by each AI crawler -- Number of `robots.txt` violations for each crawler +| Metric | Description | +| ----------------------- | ----------------------------------------------------------------------- | +| **Total requests** | Total number of requests to crawl your website from common AI crawlers. | +| **Requests by crawler** | Number of requests made by each AI crawler. | ## View AI Crawl Control metrics The **Metrics** tab provides you with the following metrics to help you understand how AI crawlers are interacting with your website. 
-| Metric | Description | -| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Total requests | The total number of requests to crawl your website, from all AI crawlers | -| Allowed requests | The number of crawler requests that received a successful response from your site | -| Unsuccessful requests | The number of crawler requests that failed (HTTP 4xx or 5xx) as a result of a blocked request, other security rules, or website errors such as a crawler attempting to access a non-existent page | -| Overall popular paths | The most popular pages crawled by AI crawlers, from all AI crawlers | -| Most active AI crawlers by operators | The AI crawler owners with the highest number of requests to access your site | -| Request by AI crawlers | A graph which displays the number of crawl requests from each AI crawler | -| Most popular paths by AI crawlers | The most popular pages crawled by AI crawlers, for each AI crawler | -| Referrals | A graph which displays the number of visits that were directed to your site from each AI operator | -| Referers | The list of referers who directed visits to your site | - -## Filter date range - -You can use the date filter to choose the period of time you wish to analyze. +### Analyze referrer data + + + +Identify traffic sources with referrer analytics: + +- View top referrers driving traffic to your site. +- Understand discovery patterns and content popularity from AI operators. + +### Track crawler requests over time + +Visualize crawler activity patterns over time using the **Requests over time** chart. You can group data by different dimensions to get more specific insights: + +| Dimension | Description | +| --------------- | ------------------------------------------------------------------------------------------- | +| **Crawler** | Track activity from individual AI crawlers (like GPTBot, ClaudeBot, and Bytespider). | +| **Category** | Analyze crawlers by their purpose or type. | +| **Operator** | Discover which companies (such as OpenAI, Anthropic, and ByteDance) are crawling your site. | +| **Host** | Break down activity across multiple subdomains. | +| **Status Code** | Monitor HTTP response codes (200s, 300s, 400s, 500s) returned to crawlers. | + +### Understand what content is crawled + +The **Most popular paths** table shows you which pages on your site are most frequently requested by AI crawlers. This can help you understand what content is most popular with different AI models. + +| Column | Description | +| -------------------- | ----------------------------------------------------------------------- | +| **Path** | The path of the page on your website that was requested. | +| **Hostname** | The hostname of the requested page. | +| **Crawler** | The name of the AI crawler that made the request. | +| **Operator** | The company that operates the AI crawler. | +| **Allowed requests** | The number of times the path was successfully requested by the crawler. | + +You can also filter the results by path or content type to narrow down your analysis. + +## Filter and export data + +You can use the date filter to choose the period of time you wish to analyze. To export your data, select **Download CSV**. The downloaded file will include all applied filters and groupings.
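If you plan to work with the exported CSV programmatically, a minimal sketch like the following can summarize requests per crawler. The column names (`Crawler`, `Requests`) and the file name are assumptions for illustration only; check the header row of your own download, since the export reflects whatever filters and groupings were applied in the dashboard.

```python
# Minimal sketch: summarize an AI Crawl Control CSV export by crawler.
# The column names below are assumed for illustration; adjust them to
# match the header row of your downloaded file.
import csv
from collections import Counter

def requests_by_crawler(path: str, crawler_col: str = "Crawler", count_col: str = "Requests") -> Counter:
    totals: Counter = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                totals[row[crawler_col]] += int(row[count_col])
            except (KeyError, ValueError):
                # Skip rows that do not match the assumed column names or value types.
                continue
    return totals

if __name__ == "__main__":
    # Hypothetical export file name; use the path of your own download.
    for crawler, total in requests_by_crawler("ai-crawl-control-export.csv").most_common(10):
        print(f"{crawler}: {total}")
```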
diff --git a/src/content/docs/ai-crawl-control/features/track-robots-txt.mdx b/src/content/docs/ai-crawl-control/features/track-robots-txt.mdx new file mode 100644 index 00000000000000..3a95a62f21f61e --- /dev/null +++ b/src/content/docs/ai-crawl-control/features/track-robots-txt.mdx @@ -0,0 +1,84 @@ +--- +title: Track robots.txt +pcx_content_type: concept +sidebar: + order: 6 +--- + +import { Steps, GlossaryTooltip, DashButton } from "~/components"; + +The **Robots.txt** tab in AI Crawl Control provides insights into how AI crawlers interact with your `robots.txt` files across your hostnames. You can monitor request patterns, verify file availability, and identify crawlers that violate your directives. + +To access robots.txt insights: + +1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain. +2. Go to **AI Crawl Control**. + + + +3. Go to the **Robots.txt** tab. + +## Check managed robots.txt status + +The status card at the top of the tab shows whether Cloudflare is managing your `robots.txt` file. + +When enabled, Cloudflare includes directives that block common AI crawlers used for training, as well as its [Content Signals Policy](/bots/additional-configurations/managed-robots-txt/#content-signals-policy), in your `robots.txt`. For more details on how Cloudflare manages your `robots.txt` file, refer to [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/). + +## Filter robots.txt request data + +You can apply filters at the top of the tab to narrow your analysis of robots.txt requests: + +- Filter by crawler name (for example, Googlebot or a specific AI bot). +- Filter by the entity running the crawler to understand direct licensing opportunities or existing agreements. +- Filter by general use cases (for example, AI training, general search, or AI assistant). +- Select a custom time frame for historical analysis. + +The values in all tables and metrics will update according to your filters. + +## Monitor robots.txt availability + +The **Availability** table shows the historical request frequency and health status of `robots.txt` files across your hostnames over the selected time frame. + +| Column | Description | | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | Path | The specific hostname's `robots.txt` file being requested. Paths are listed from the most requested to the least. | | Requests | The total number of requests made to this path. Requests are broken down into:
- **Successful:** HTTP status codes below 400 (including **200 OK** and redirects).
- **Unsuccessful:** HTTP status codes of 400 or above. | | Status | The HTTP status code returned when the `robots.txt` file is requested. | | Content Signals | An indicator showing whether the `robots.txt` file contains [Content Signals](https://contentsignals.org/) directives for AI training, search, and AI input. | + +From this table, you can take the following actions: + +- Monitor for a high number of unsuccessful requests, which suggests that crawlers are having trouble accessing your `robots.txt` file. + - If the **Status** is `404 Not Found`, create a `robots.txt` file to provide clear directives. + - If the file exists, check for upstream WAF rules or other security settings that may be blocking access. +- If the **Content Signals** column indicates that signals are missing, add them to your `robots.txt` file. You can do this by following the [Content Signals](https://contentsignals.org/) instructions or by enabling [Managed `robots.txt`](/bots/additional-configurations/managed-robots-txt/) to have Cloudflare manage them for you. + +## Track robots.txt violations + +The **Violations** table identifies AI crawlers that have requested paths explicitly disallowed by your `robots.txt` file. This helps you identify non-compliant crawlers and take appropriate action. + +:::note[How violations are calculated] + +The Violations table identifies mismatches between your **current** `robots.txt` directives and past crawler requests. Because violations are not logged in real time, recently added or changed rules may cause previously legitimate requests to be flagged as violations. + +For example, if you add a new `Disallow` rule, all past requests to that path will appear as violations, even though they were not violations at the time of the request. +::: + +| Column | Description | +| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------- | +| Crawler | The name of the bot that violated your `robots.txt` directives. The operator of the crawler is listed directly beneath the crawler name. | +| Path | The specific URL or path the crawler attempted to access that was disallowed by your `robots.txt` file. | +| Directive | The exact line from your `robots.txt` file that disallowed access to the path. | +| Violations | The count of HTTP requests made to the disallowed path/directive pair within the selected time frame. | + +When you identify crawlers violating your `robots.txt` directives, you have several options: + +- Navigate to the [**Crawlers** tab](/ai-crawl-control/features/manage-ai-crawlers/) to permanently block the non-compliant crawler. +- Use [Cloudflare WAF](/waf/) to create path-specific security rules for the violating crawler. +- Use [Redirect Rules](/rules/url-forwarding/) to guide violating crawlers to an appropriate area of your site. + +## Related resources + +- [Manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/) +- [Analyze AI traffic](/ai-crawl-control/features/analyze-ai-traffic/) +- [Cloudflare WAF](/waf/) diff --git a/src/content/docs/ai-crawl-control/index.mdx b/src/content/docs/ai-crawl-control/index.mdx index 1ca3b9890f8c03..16d6500ec0d9ca 100644 --- a/src/content/docs/ai-crawl-control/index.mdx +++ b/src/content/docs/ai-crawl-control/index.mdx @@ -11,7 +11,15 @@ head: description: Monitor and control how AI services access your website content.
--- -import { Description, Feature, FeatureTable, Plan, LinkButton, RelatedProduct, Card } from "~/components"; +import { + Description, + Feature, + FeatureTable, + Plan, + LinkButton, + RelatedProduct, + Card, +} from "~/components"; @@ -53,6 +61,15 @@ With AI Crawl Control, you can: Gain insight into how AI crawlers are interacting with your pages. + + Track the health of `robots.txt` files and identify which crawlers are + violating your directives. + + -Publishers and content creators can monitor which AI crawlers are accessing their articles and educational content. Set policies to allow beneficial crawlers while blocking others. + Publishers and content creators can monitor which AI crawlers are accessing + their articles and educational content. Set policies to allow beneficial + crawlers while blocking others. -E-commerce and business sites can identify AI crawler activity on product pages and business information. Control access to sensitive data like pricing and inventory. + E-commerce and business sites can identify AI crawler activity on product + pages and business information. Control access to sensitive data like pricing + and inventory. -Documentation sites can track how AI crawlers are accessing their technical documentation. Gain insight into how AI crawlers are engaging with your site. + Documentation sites can track how AI crawlers are accessing their technical + documentation. Gain insight into how AI crawlers are engaging with your site. --- ## Related Products - -Identify and mitigate automated traffic to protect your domain from bad bots. + + Identify and mitigate automated traffic to protect your domain from bad bots. - -Get automatic protection from vulnerabilities and the flexibility to create custom rules. + + Get automatic protection from vulnerabilities and the flexibility to create + custom rules. - -View and analyze traffic on your domain. - \ No newline at end of file + + View and analyze traffic on your domain. +
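As a complement to the dashboard views added in this change, the availability and Content Signals checks described in the Track robots.txt page can also be spot-checked outside the product. The sketch below is illustrative only and not part of AI Crawl Control: it requests `robots.txt` for a list of placeholder hostnames, buckets the response as successful (status below 400) or unsuccessful (400 or above), and looks for `Content-Signal` lines, the directive name published at contentsignals.org.

```python
# Illustrative sketch: spot-check robots.txt availability and Content Signals
# for a few hostnames, mirroring the Availability table described above.
# Hostnames are placeholders; the "Content-Signal" directive name follows
# the convention published at contentsignals.org.
import urllib.error
import urllib.request

HOSTNAMES = ["example.com", "docs.example.com"]  # placeholder hostnames

def check_robots(hostname: str) -> dict:
    url = f"https://{hostname}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status, body = resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        status, body = err.code, ""
    except (urllib.error.URLError, TimeoutError):
        status, body = None, ""

    has_signals = any(
        line.strip().lower().startswith("content-signal:")
        for line in body.splitlines()
    )
    return {
        "hostname": hostname,
        "status": status,
        # The dashboard counts status codes below 400 as successful.
        "successful": status is not None and status < 400,
        "content_signals": has_signals,
    }

if __name__ == "__main__":
    for host in HOSTNAMES:
        print(check_robots(host))
```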