Merged
Original file line number Diff line number Diff line change
@@ -15,4 +15,4 @@ import { DirectoryListing } from "~/components"

Refer to the following pages for more information on additional bot management configurations:

<DirectoryListing />
<DirectoryListing />
@@ -1,17 +1,19 @@
---
pcx_content_type: reference
title: Direct AI crawlers with managed robots.txt
title: Instruct AI crawlers with managed robots.txt
sidebar:
order: 10
label: Managed robots.txt
---

import { Render, Tabs, TabItem, Steps } from "~/components";
import { Render, Tabs, TabItem, Steps, DashButton } from "~/components";

Protect your website or application from AI crawlers by implementing a `robots.txt` file on your domain to direct AI bot operators on what content they can and cannot scrape for AI model training.

AI bots are expected to follow the `robots.txt` directives.

`robots.txt` files express your preferences. They do not prevent crawler operators from crawling your content at a technical level. Some crawler operators may disregard your `robots.txt` preferences and crawl your content regardless of what your `robots.txt` file says.

:::note
Respecting `robots.txt` is voluntary. If you want to prevent crawling, use AI Crawl Control's [manage AI crawlers](/ai-crawl-control/features/manage-ai-crawlers/) feature.
:::
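
As an illustration of what "following the `robots.txt` directives" means for a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules are taken from the managed content shown on this page; the helper name `is_allowed` and the `ExampleBrowser` user agent are assumptions made for the example. Nothing here enforces anything: a crawler that skips this check can still request the content.

```python
from urllib.robotparser import RobotFileParser

# Sample rules modeled on the managed robots.txt content in this page.
# The Content-signal line is not understood by urllib.robotparser and is
# silently ignored by it.
ROBOTS_TXT = """\
User-Agent: *
Content-signal: search=yes,ai-train=no
Allow: /

User-agent: Amazonbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; a compliant crawler would run a
    check like this before every request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Amazonbot", "ExampleBrowser"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://www.crawlstop.com/")
        print(f"{agent}: {'allowed' if verdict else 'disallowed'}")
```

The check runs entirely on the crawler's side, which is why `robots.txt` expresses a preference rather than a technical block.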
@@ -38,19 +40,37 @@ Sitemap: https://www.crawlstop.com/sitemap.xml
With the managed `robots.txt` enabled, Cloudflare will prepend our managed content before your original content, resulting in what you can view at https://www.crawlstop.com/robots.txt.

```txt title="Feature enabled"
# NOTICE: The collection of content and other data on this
# site through automated means, including any device, tool,
# or process designed to data mine or scrape content, is
# prohibited except (1) for the purpose of search engine indexing or
# artificial intelligence retrieval augmented generation or (2) with express
# written permission from this site’s operator.

# To request permission to license our intellectual
# property and/or other materials, please contact this
# site’s operator directly.
# As a condition of accessing this website, you agree to abide by the
# following content-signals:

# (a) If a content-signal = yes, you may collect content for the
# corresponding use.
# (b) If a content-signal = no, you may not collect content for the
# corresponding use.
# (c) If the website operator does not include a content signal for a
# corresponding use, the website operator neither grants nor restricts
# permission via content signal with respect to the corresponding use.

# The content signals and their meanings are:

# search: building a search index and providing search results (e.g., returning
# hyperlinks and short excerpts from your website's contents). Search
# does not include providing AI-generated search summaries.
# ai-input: inputting content into one or more AI models (e.g., retrieval
# augmented generation, grounding, or other real-time taking of
# content for generative AI search answers).
# ai-train: training or fine-tuning AI models.

# ANY RESTRICTIONS EXPRESSED VIA CONTENT-SIGNALS ARE EXPRESS RESERVATIONS OF
# RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT
# AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.

# BEGIN Cloudflare Managed content

User-Agent: *
Content-signal: search=yes,ai-train=no
Allow: /

User-agent: Amazonbot
Disallow: /

@@ -81,7 +101,6 @@ Disallow: /lp
Disallow: /feedback
Disallow: /langtest


Sitemap: https://www.crawlstop.com/sitemap.xml
```
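
The prepend behavior described above can be pictured with a short sketch. This is not Cloudflare's implementation, only an assumed model of the result: the managed block comes first and the zone's original content follows unchanged. The function name `prepend_managed_content` and the shortened sample strings are hypothetical.

```python
def prepend_managed_content(managed: str, original: str) -> str:
    """Return the managed block followed by the original robots.txt content.

    Illustrative model only; the real feature may format the result
    differently.
    """
    managed = managed.strip("\n")
    original = original.strip("\n")
    if not original:
        return managed + "\n"
    # Separate the two parts with one blank line so record groups stay distinct.
    return managed + "\n\n" + original + "\n"


# Shortened samples based on the excerpt above.
MANAGED = (
    "# BEGIN Cloudflare Managed content\n"
    "\n"
    "User-agent: Amazonbot\n"
    "Disallow: /\n"
)
ORIGINAL = "Sitemap: https://www.crawlstop.com/sitemap.xml\n"

combined = prepend_managed_content(MANAGED, ORIGINAL)

if __name__ == "__main__":
    print(combined, end="")
```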

@@ -99,20 +118,62 @@ To implement a `robots.txt` file on your domain:
1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/), and select your account and domain.
2. Go to **Security** > **Bots**.
3. Select **Configure Bot Fight Mode**.
4. Turn **Manage bot traffic with robots.txt** on.
4. Turn **Instruct bot traffic with robots.txt** on.
</Steps>
</TabItem>
<TabItem label="New dashboard" icon="rocket">
<Steps>
1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com/login), and select your account and domain.
2. Go to **Security** > **Settings**.
3. Filter by **Bot traffic**.
4. Go to **Instruct AI bot traffic with robots.txt**.
5. Turn **Instruct AI bot traffic with robots.txt** on.
1. In the Cloudflare dashboard, go to the Security Settings page.

<DashButton url="/?to=/:account/:zone/security/settings" />
2. Filter by **Bot traffic**.
3. Go to **Instruct AI bot traffic with robots.txt**.
4. Turn **Instruct AI bot traffic with robots.txt** on.
</Steps>
</TabItem>
</Tabs>

## Content Signals Policy

Free zones that do not have their own `robots.txt` file and do not use the managed `robots.txt` feature will display the Content Signals Policy when a crawler requests the `robots.txt` file for your zone.

This file only outlines the Content Signals framework. It does not express your preferences or rights associated with your content.

```txt title="Content Signals Policy"
# As a condition of accessing this website, you agree to abide by the
# following content-signals:

# (a) If a content-signal = yes, you may collect content for the
# corresponding use.
# (b) If a content-signal = no, you may not collect content for the
# corresponding use.
# (c) If the website operator does not include a content signal for a
# corresponding use, the website operator neither grants nor restricts
# permission via content signal with respect to the corresponding use.

# The content signals and their meanings are:

# search: building a search index and providing search results (e.g., returning
# hyperlinks and short excerpts from your website's contents). Search
# does not include providing AI-generated search summaries.
# ai-input: inputting content into one or more AI models (e.g., retrieval
# augmented generation, grounding, or other real-time taking of
# content for generative AI search answers).
# ai-train: training or fine-tuning AI models.

# ANY RESTRICTIONS EXPRESSED VIA CONTENT-SIGNALS ARE EXPRESS RESERVATIONS OF
# RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT
# AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.
```
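
To make clauses (a) through (c) concrete, the following sketch reads a `Content-signal` line such as `Content-signal: search=yes,ai-train=no` and reports the preference for a given use. It is an assumption-laden illustration, not a reference parser: the function names are hypothetical, and skipping malformed entries is a choice made for this example rather than something the policy text specifies.

```python
from typing import Dict, Optional


def parse_content_signals(line: str) -> Dict[str, bool]:
    """Parse one 'Content-signal: key=yes,key=no' line into a dict.

    Hypothetical helper. Entries that are not 'key=yes' or 'key=no' are
    skipped (an assumption for this sketch).
    """
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-signal line")
    signals: Dict[str, bool] = {}
    for item in value.split(","):
        key, sep, setting = item.partition("=")
        key, setting = key.strip().lower(), setting.strip().lower()
        if sep and key and setting in ("yes", "no"):
            signals[key] = setting == "yes"
    return signals


def may_collect(signals: Dict[str, bool], use: str) -> Optional[bool]:
    """Return True for (a), False for (b), and None for (c): no signal given."""
    return signals.get(use)


if __name__ == "__main__":
    signals = parse_content_signals("Content-signal: search=yes,ai-train=no")
    for use in ("search", "ai-input", "ai-train"):
        print(use, may_collect(signals, use))
```

Returning `None` for a missing signal mirrors clause (c): the operator neither grants nor restricts permission for that use.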

Cloudflare's Content Signals Policy is included by default in the `robots.txt` file when you turn on **Instruct AI bot traffic with robots.txt**.

If you would like to opt out of displaying the policy in your `robots.txt` file, you can uncheck **Display Content Signals Policy** under **Control AI Crawlers** in your zone's overview.

<DashButton url="/?to=/:account/:zone/security/overview" />

Alternatively, you can use [Security Settings](#implementation).

## Availability

Managed `robots.txt` for AI crawlers is available on all plans.
Managed `robots.txt` for AI crawlers is available on all plans.