
Commit 9ebb27c

aninibread and maxvp authored
AI Search Custom Headers (cloudflare#26519)
Co-authored-by: Max Phillips <[email protected]>
1 parent 97ed380 commit 9ebb27c

File tree

3 files changed: +84, -10 lines changed
38.6 KB image file (ai-search-extra-headers.png)
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+---
+title: AI Search support for crawling login protected website content
+description: Index websites behind login walls by adding custom authentication headers to AI Search's website crawler.
+products:
+  - ai-search
+date: 2025-11-14
+---
+
+[AI Search](/ai-search/) now supports [custom HTTP headers](/ai-search/configuration/data-source/website/#access-protected-content) for website crawling, solving a common problem where valuable content behind authentication or access controls could not be indexed.
+
+Previously, AI Search could only crawl publicly accessible pages, leaving knowledge bases, documentation, and other protected content out of your search results. With custom headers support, you can now include authentication credentials that allow the crawler to access this protected content.
+
+This is particularly useful for indexing content like:
+- **Internal documentation** behind corporate login systems
+- **Premium content** that requires user credentials to unlock
+- **Sites protected by Cloudflare Access** using service tokens
+
+To add custom headers when creating an AI Search instance, select **Parse options**. In the **Extra headers** section, you can add up to five custom headers per Website data source.
+
+![Custom headers configuration in AI Search](~/assets/images/ai-search/ai-search-extra-headers.png)
+
+For example, to crawl a site protected by [Cloudflare Access](/cloudflare-one/access-controls/), you can add service token credentials as custom headers:
+
+```
+CF-Access-Client-Id: your-token-id.access
+CF-Access-Client-Secret: your-token-secret
+```
+
+The crawler will automatically include these headers in all requests, allowing it to access protected pages that would otherwise be blocked.
+
+Learn more about [configuring custom headers for website crawling](/ai-search/configuration/data-source/website/#access-protected-content) in AI Search.

src/content/docs/ai-search/configuration/data-source/website.mdx

Lines changed: 53 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -5,23 +5,19 @@ sidebar:
order: 2
---

-import { DashButton, Steps } from "~/components"
+import { DashButton, Steps } from "~/components";

The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.

-:::note
-You can only crawl domains that you have onboarded onto the same Cloudflare account.
-
-Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.
-:::
+You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.

:::caution[Bot protection may block crawling]
If you use Cloudflare products that control or restrict bot traffic such as [Bot Management](/bots/), [Web Application Firewall (WAF)](/waf/), or [Turnstile](/turnstile/), the same rules will apply to the AI Search (AutoRAG) crawler. Make sure to configure an exception or an allow-list for the AutoRAG crawler in your settings.
:::

## How website crawling works

-When you connect a domain, the crawler looks for your websites sitemap to determine which pages to visit:
+When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:

1. The crawler first checks the `robots.txt` for listed sitemaps. If it exists, it reads all sitemaps existing inside.
2. If no `robots.txt` is found, the crawler first checks for a sitemap at `/sitemap.xml`.
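For reference, the discovery in step 1 above relies on the standard `Sitemap` directive in `robots.txt`. A minimal illustrative example; the URL is a placeholder:

```
# robots.txt at the site root
Sitemap: https://docs.example.com/sitemap.xml
```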
@@ -34,7 +30,7 @@ Pages are visited, according to the `<priority>` attribute set on the sitemaps,
If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.

<Steps>
-1. In the Cloudflare dashboard, go to the **Security rules** page of your account and domain.
+1. In the Cloudflare dashboard, go to the **Security rules** page.

   <DashButton url="/?to=/:account/:zone/security/security-rules" />

@@ -48,24 +44,71 @@ If you have Security rules configured to block bot activity, you can add a rule
</Steps>

## Parsing options
+
You can choose how pages are parsed during crawling:

- **Static sites**: Downloads the raw HTML for each page.
- **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that the [Browser Rendering](/browser-rendering/pricing/) limits and billing apply.

+## Access protected content
+
+If your website has pages that are behind authentication or only visible to logged-in users, you can configure custom HTTP headers to allow the AI Search crawler to access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.
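As an illustrative aside, a site that uses its own token or cookie based authentication rather than Cloudflare Access might need headers along these lines; the names and values are placeholders for whatever your authentication layer expects:

```
Authorization: Bearer your-api-token
Cookie: session=your-session-cookie
```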
+
+### Providing access to sites protected by Cloudflare Access
+
+To allow AI Search to crawl a site protected by [Cloudflare Access](/cloudflare-one/access-controls/), you need to create service token credentials and configure them as custom headers.
+
+Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.
+
+<Steps>
+
+1. In [Cloudflare One](https://one.dash.cloudflare.com/), [create a service token](/cloudflare-one/access-controls/service-credentials/service-tokens/#create-a-service-token). Once the Client ID and Client Secret are generated, save them for the next steps. For example, they can look like:
+
+   ```
+   CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access
+   CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+   ```
+
+2. [Create a policy](/cloudflare-one/access-controls/policies/policy-management/#create-a-policy) with the following configuration:
+   - Add an **Include** rule with **Selector** set to **Service token**.
+   - In **Value**, select the Service Token you created in step 1.
+3. [Add your self-hosted application to Access](/cloudflare-one/access-controls/applications/http-apps/self-hosted-public-app/) with the following configuration:
+   - In Access policies, click **Select existing policies**.
+   - Select the policy that you have just created and select **Confirm**.
+4. In the Cloudflare dashboard, go to the **AI Search** page.
+
+   <DashButton url="/?to=/:account/ai/ai-search" />
+
+5. Select **Create**.
+6. Select **Website** as your data source.
+7. Under **Parse options**, locate **Extra headers** and add the following two headers using your saved credentials:
+   - Header 1:
+     - **Key**: `CF-Access-Client-Id`
+     - **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access`
+   - Header 2:
+     - **Key**: `CF-Access-Client-Secret`
+     - **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`
+
+8. Complete the AI Search setup process to create your search instance.
+
+</Steps>
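Before running a crawl, it can help to confirm that the service token actually grants access. A minimal sketch, assuming a Node.js 18+ or Workers-style `fetch` environment; the URL and credential values are placeholders:

```ts
// Illustrative check only, independent of AI Search: request a protected page
// with the same service token headers the crawler will send.
async function checkServiceToken(): Promise<void> {
  const response = await fetch("https://docs.example.com/internal/", {
    headers: {
      "CF-Access-Client-Id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access",
      "CF-Access-Client-Secret":
        "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    },
    redirect: "manual", // surface an Access login redirect instead of following it
  });
  console.log(response.status); // 200 suggests the Service Auth policy matched
}

checkServiceToken();
```

A `200` response means Access served the page to the token; a redirect to the Access login page typically means the policy or credentials are not configured as expected.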
+
## Storage
+
During setup, AI Search creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.

:::note
-We recommend not to modify the bucket as it may distrupt the indexing flow and cause content to not be updated properly.
+We recommend not modifying the bucket as it may disrupt the indexing flow and cause content to not be updated properly.
:::

## Sync and updates
-During scheduled or manual [sync jobs](/ai-search/configuration/indexing/), the crawler will check for changes to the `<lastmod>` attribute in your sitemap. If it has been changed to a date occuring after the last sync date, then the page will be crawled, the updated version is stored in the R2 bucket, and automatically reindexed so that your search results always reflect the latest content.
+
+During scheduled or manual [sync jobs](/ai-search/configuration/indexing/), the crawler will check for changes to the `<lastmod>` attribute in your sitemap. If it has been changed to a date occurring after the last sync date, then the page will be crawled, the updated version is stored in the R2 bucket, and automatically reindexed so that your search results always reflect the latest content.

If the `<lastmod>` attribute is not defined, then AI Search will automatically crawl each link defined in the sitemap once a day.
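For reference, a minimal sitemap entry showing the two fields the crawler reads: `<priority>` for visit order and `<lastmod>` for change detection. The URL and date are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/internal/getting-started/</loc>
    <lastmod>2025-11-10</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```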

## Limits
+
The regular AI Search [limits](/ai-search/platform/limits-pricing/) apply when using the Website data source.

The crawler will download and index pages only up to the maximum object limit supported for an AI Search instance, and it processes the first set of pages it visits until that limit is reached. In addition, any files that are downloaded but exceed the file size limit will not be indexed.
