src/content/docs/ai-search/configuration/data-source/website.mdx (23 additions, 2 deletions)
sidebar:
  order: 2
---

import { DashButton, Steps } from "~/components"

The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.

:::note
You can only crawl domains that you have onboarded onto the same Cloudflare account.
Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.
:::

:::caution[Bot protection may block crawling]
If you use Cloudflare products that control or restrict bot traffic, such as [Bot Management](/bots/), [Web Application Firewall (WAF)](/waf/), or [Turnstile](/turnstile/), the same rules will apply to the AI Search (AutoRAG) crawler. Make sure to configure an exception or an allowlist for the AutoRAG crawler in your settings.
:::
## How website crawling works
When you connect a domain, the crawler looks for your website’s sitemap to determine which pages to visit:
1. The crawler first checks `robots.txt` for listed sitemaps. If the file exists, it reads every sitemap referenced inside it.

Pages are visited according to the `<priority>` attribute set in the sitemaps, if this field is defined.
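
To make the flow above concrete, here is a minimal sketch of sitemap discovery and priority ordering. It is not AutoRAG's implementation: the function names, the regex-based XML parsing, and the default priority of `0.5` are assumptions made for illustration.

```ts
// Sketch of sitemap discovery via robots.txt and <priority>-based ordering.
// Illustrative only; not AutoRAG's actual crawler.

interface SitemapUrl {
  loc: string;
  priority: number; // assumed default of 0.5 when <priority> is not defined
}

// Read robots.txt and collect any "Sitemap:" directives listed in it.
async function discoverSitemaps(origin: string): Promise<string[]> {
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return [];
  const text = await res.text();
  return text
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("sitemap:"))
    .map((line) => line.slice("sitemap:".length).trim());
}

// Pull <url> entries out of a sitemap, keeping <loc> and <priority>.
async function readSitemap(sitemapUrl: string): Promise<SitemapUrl[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  return [...xml.matchAll(/<url>([\s\S]*?)<\/url>/g)].map(([, entry]) => ({
    loc: /<loc>(.*?)<\/loc>/.exec(entry)?.[1] ?? "",
    priority: Number(/<priority>(.*?)<\/priority>/.exec(entry)?.[1] ?? "0.5"),
  }));
}

// Return the crawl order: URLs from all discovered sitemaps, highest priority first.
async function crawlOrder(origin: string): Promise<string[]> {
  const sitemaps = await discoverSitemaps(origin);
  const urls = (await Promise.all(sitemaps.map(readSitemap))).flat();
  return urls.sort((a, b) => b.priority - a.priority).map((u) => u.loc);
}
```

Sitemap index files, robots directives, and crawl scheduling are deliberately omitted to keep the sketch short.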
## How to set WAF rules to allowlist the AutoRAG crawler
If you have Security rules configured to block bot activity, you can add a rule to allowlist AutoRAG's crawler bot using the steps below (an API-based sketch of the same rule follows the steps).
<Steps>
1. In the Cloudflare dashboard, go to the **Security rules** page of your account and domain.
2. To create a new empty rule, select **Create rule** > **Custom rules**.
3. Enter a descriptive name for the rule in **Rule name**, such as `Allow AutoRAG`.
4. Under **When incoming requests match**, use the **Field** drop-down list to choose _Bot Detection ID_. For **Operator**, select _equals_. For **Value**, enter `122933950`.
5. Under **Then take action**, in the **Choose action** dropdown, choose _Skip_.
6. Under **Place at**, set the rule order to _First_ in the **Select order** dropdown so that this rule is applied before any subsequent rules.
7. To save and deploy your rule, select **Deploy**.
</Steps>
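
If you manage your WAF configuration programmatically, the same allowlist rule can be sketched against the Cloudflare Rulesets API. Treat this as an assumption-laden sketch rather than a verified recipe: the entrypoint lookup and rule-creation endpoints are part of the public Rulesets API, but the expression used here for "Bot Detection ID equals `122933950`" is an assumption about how the dashboard rule translates (bot detection fields may also depend on your plan), and the environment variable names are placeholders.

```ts
// Sketch: create an "Allow AutoRAG" skip rule via the Cloudflare Rulesets API.
// Assumptions: the env var names and the expression for Bot Detection ID.

const ZONE_ID = process.env.CF_ZONE_ID!;     // placeholder: your zone ID
const API_TOKEN = process.env.CF_API_TOKEN!; // placeholder: token with zone WAF edit access

const API = "https://api.cloudflare.com/client/v4";
const headers = {
  Authorization: `Bearer ${API_TOKEN}`,
  "Content-Type": "application/json",
};

async function allowAutoRagCrawler(): Promise<void> {
  // 1. Look up the zone's entrypoint ruleset for the custom rules phase.
  const phase = "http_request_firewall_custom";
  const entrypoint = await fetch(
    `${API}/zones/${ZONE_ID}/rulesets/phases/${phase}/entrypoint`,
    { headers },
  ).then((r) => r.json());
  const rulesetId: string = entrypoint.result.id;

  // 2. Add a "skip" rule matching the AutoRAG crawler's bot detection ID.
  const rule = {
    description: "Allow AutoRAG",
    action: "skip",
    action_parameters: { ruleset: "current" }, // skip the remaining custom rules
    // Assumed translation of the dashboard's "Bot Detection ID equals 122933950".
    expression: "any(cf.bot_management.detection_ids[*] eq 122933950)",
    enabled: true,
  };
  const created = await fetch(
    `${API}/zones/${ZONE_ID}/rulesets/${rulesetId}/rules`,
    { method: "POST", headers, body: JSON.stringify(rule) },
  ).then((r) => r.json());

  console.log("Rule created:", created.success);
}

allowAutoRagCrawler().catch(console.error);
```

This sketch appends the rule to the end of the ruleset, whereas step 6 above places the rule first, so reorder it afterwards if other custom rules could match the crawler before it.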
## Parsing options
You can choose how pages are parsed during crawling: