Commit aba7498

csp
1 parent 09e953b commit aba7498

1 file changed

+81
-20
lines changed

src/content/docs/bots/additional-configurations/managed-robots-txt.mdx

Lines changed: 81 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -1,17 +1,19 @@
11
---
22
pcx_content_type: reference
3-
title: Direct AI crawlers with managed robots.txt
3+
title: Instruct AI crawlers with managed robots.txt
44
sidebar:
55
order: 10
66
label: Managed robots.txt
77
---
88

9-
import { Render, Tabs, TabItem, Steps } from "~/components";
9+
import { Render, Tabs, TabItem, Steps, DashButton } from "~/components";
1010

1111
Protect your website or application from AI crawlers by implementing a `robots.txt` file on your domain to direct AI bot operators on what content they can and cannot scrape for AI model training.
1212

1313
AI bots are expected to follow the `robots.txt` directives.
1414

15+
`robots.txt` files express your preferences. They do not prevent crawler operators from crawling your content at a technical level. Some crawler operators may disregard your `robots.txt` preferences and crawl your content regardless of what your `robots.txt` file says.
16+
1517
:::note
1618
Respecting `robots.txt` is voluntary. If you want to prevent crawling, use AI Crawl Control's [manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/) feature.
1719
:::
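
To illustrate what "following the `robots.txt` directives" means for a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules, the `ExampleSearchBot` user agent, the `example.com` URL, and the `is_allowed` helper are assumptions made for this illustration; they are not part of Cloudflare's implementation. A crawler that ignores `robots.txt` never runs a check like this, which is why the file expresses a preference rather than enforcing one.

```python
from urllib import robotparser

# Illustrative rules only; the user agents and paths are assumptions for this sketch.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Amazonbot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = SAMPLE_ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = robotparser.RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Amazonbot", "https://example.com/page"))
    print(is_allowed("ExampleSearchBot", "https://example.com/page"))
```

The check happens entirely on the crawler's side, so blocking at a technical level still requires a separate enforcement feature.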
@@ -38,19 +40,37 @@ Sitemap: https://www.crawlstop.com/sitemap.xml
3840
With the managed `robots.txt` enabled, Cloudflare will prepend our managed content before your original content, resulting in what you can view at https://www.crawlstop.com/robots.txt.
3941

4042
```txt title="Feature enabled"
41-
# NOTICE: The collection of content and other data on this
42-
# site through automated means, including any device, tool,
43-
# or process designed to data mine or scrape content, is
44-
# prohibited except (1) for the purpose of search engine indexing or
45-
# artificial intelligence retrieval augmented generation or (2) with express
46-
# written permission from this site’s operator.
47-
48-
# To request permission to license our intellectual
49-
# property and/or other materials, please contact this
50-
# site’s operator directly.
43+
# As a condition of accessing this website, you agree to abide by the
44+
# following content-signals:
45+
46+
# (a) If a content-signal = yes, you may collect content for the
47+
# corresponding use.
48+
# (b) If a content-signal = no, you may not collect content for the
49+
# corresponding use.
50+
# (c) If the website operator does not include a content signal for a
51+
# corresponding use, the website operator neither grants nor restricts
52+
# permission via content signal with respect to the corresponding use.
53+
54+
# The content signals and their meanings are:
55+
56+
# search: building a search index and providing search results (e.g., returning
57+
# hyperlinks and short excerpts from your website's contents). Search
58+
# does not include providing AI-generated search summaries.
59+
# ai-input: inputting content into one or more AI models (e.g., retrieval
60+
# augmented generation, grounding, or other real-time taking of
61+
# content for generative AI search answers).
62+
# ai-train: training or fine-tuning AI models.
63+
64+
# ANY RESTRICTIONS EXPRESSED VIA CONTENT-SIGNALS ARE EXPRESS RESERVATIONS OF
65+
# RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT
66+
# AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.
5167
5268
# BEGIN Cloudflare Managed content
5369
70+
User-Agent: *
71+
Content-signal: search=yes,ai-train=no
72+
Allow: /
73+
5474
User-agent: Amazonbot
5575
Disallow: /
5676
@@ -81,7 +101,6 @@ Disallow: /lp
81101
Disallow: /feedback
82102
Disallow: /langtest
83103
84-
85104
Sitemap: https://www.crawlstop.com/sitemap.xml
86105
```
87106

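
As a rough sketch of how a crawler might read the `Content-signal` line shown in the managed content above, the following Python helper collects the signals declared for each user-agent group. The function name `parse_content_signals` and its grouping logic are assumptions for illustration only; this is not an official parser, and the sample text is a shortened excerpt of the example file.

```python
SAMPLE = """\
# BEGIN Cloudflare Managed content

User-Agent: *
Content-signal: search=yes,ai-train=no
Allow: /

User-agent: Amazonbot
Disallow: /
"""


def parse_content_signals(robots_txt: str) -> dict[str, dict[str, bool]]:
    """Map each user-agent group to its content signals (hypothetical helper)."""
    signals: dict[str, dict[str, bool]] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a non-User-agent line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            continue

        in_rules = True
        if field != "content-signal":
            continue
        for pair in value.split(","):
            name, _, setting = pair.partition("=")
            name = name.strip().lower()
            if not name:
                continue
            for agent in current_agents:
                signals.setdefault(agent, {})[name] = setting.strip().lower() == "yes"

    return signals


if __name__ == "__main__":
    print(parse_content_signals(SAMPLE))
```

A signal that is absent from the result (here, `ai-input`) corresponds to case (c) of the policy text: the operator neither grants nor restricts permission for that use.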
@@ -99,20 +118,62 @@ To implement a `robots.txt` file on your domain:
99118
1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain.
100119
2. Go to **Security** > **Bots**.
101120
3. Select **Configure Bot Fight Mode**.
102-
4. Turn **Manage bot traffic with robots.txt** on.
121+
4. Turn **Instruct bot traffic with robots.txt** on.
103122
</Steps>
104123
</TabItem>
105124
<TabItem label="New dashboard" icon="rocket">
106125
<Steps>
107-
1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/login), and select your account and domain.
108-
2. Go to **Security** > **Settings**.
109-
3. Filter by **Bot traffic**.
110-
4. Go to **Instruct AI bot traffic with robots.txt**.
111-
5. Turn **Instruct AI bot traffic with robots.txt** on.
126+
1. In the Cloudflare dashboard, go to the Security Settings page.
127+
128+
<DashButton url="/?to=/:account/:zone/security/settings" />
129+
2. Filter by **Bot traffic**.
130+
3. Go to **Instruct AI bot traffic with robots.txt**.
131+
4. Turn **Instruct AI bot traffic with robots.txt** on.
112132
</Steps>
113133
</TabItem>
114134
</Tabs>
115135

136+
## Content Signals Policy
137+
138+
Free zones that do not have their own `robots.txt` file and do not use the managed `robots.txt` feature will display the Content Signals Policy when a crawler requests the `robots.txt` file for your zone.
139+
140+
This file only outlines the Content Signals framework. It does not express your preferences or rights associated with your content.
141+
142+
```txt title="Content Signals Policy"
143+
# As a condition of accessing this website, you agree to abide by the
144+
# following content-signals:
145+
146+
# (a) If a content-signal = yes, you may collect content for the
147+
# corresponding use.
148+
# (b) If a content-signal = no, you may not collect content for the
149+
# corresponding use.
150+
# (c) If the website operator does not include a content signal for a
151+
# corresponding use, the website operator neither grants nor restricts
152+
# permission via content signal with respect to the corresponding use.
153+
154+
# The content signals and their meanings are:
155+
156+
# search: building a search index and providing search results (e.g., returning
157+
# hyperlinks and short excerpts from your website's contents). Search
158+
# does not include providing AI-generated search summaries.
159+
# ai-input: inputting content into one or more AI models (e.g., retrieval
160+
# augmented generation, grounding, or other real-time taking of
161+
# content for generative AI search answers).
162+
# ai-train: training or fine-tuning AI models.
163+
164+
# ANY RESTRICTIONS EXPRESSED VIA CONTENT-SIGNALS ARE EXPRESS RESERVATIONS OF
165+
# RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT
166+
# AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.
167+
```
168+
169+
Cloudflare's Content Signals Policy is included by default in the `robots.txt` file when you turn on **Instruct AI bot traffic with robots.txt**.
170+
171+
If you would like to opt out of displaying the policy in your `robots.txt` file, you can uncheck **Display Content Signals Policy** under **Control AI Crawlers** in your zone's overview.
172+
173+
<DashButton url="/?to=/:account/:zone/security/overview" />
174+
175+
Alternatively, you can use [Security Settings](#implementation).
176+
116177
## Availability
117178

118-
Managed `robots.txt` for AI crawlers is available on all plans.
179+
Managed `robots.txt` for AI crawlers is available on all plans.
