Commit 8c4e438

aninibread, Oxyjun authored and committed
websites and 50 max results (#24585)
* websites and 50 max results
* Apply suggestions from code review

Co-authored-by: Jun Lee <[email protected]>

* fixes
* add note

---------

Co-authored-by: Jun Lee <[email protected]>
1 parent 623926e commit 8c4e438

8 files changed, 69 additions and 5 deletions

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
---
2+
title: Data source
3+
pcx_content_type: how-to
4+
sidebar:
5+
order: 2
6+
---
7+
8+
AutoRAG can directly ingest data from the following sources:
9+
10+
| Data Source | Description |
11+
|---------------|-------------|
12+
| [Website](/autorag/configuration/data-source/website/) | Connect a domain you own to index website pages. |
13+
| [R2 Bucket](/autorag/configuration/data-source/r2/) | Connect a Cloudflare R2 bucket to index stored documents. |

src/content/docs/autorag/configuration/data-source.mdx renamed to src/content/docs/autorag/configuration/data-source/r2.mdx

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
---
2-
title: Data source
2+
title: R2
33
pcx_content_type: how-to
44
sidebar:
55
order: 2
66
---
77

88
import { Render } from "~/components";
99

10-
AutoRAG currently supports Cloudflare R2 as the data source for storing your knowledge base. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.
10+
You can use Cloudflare R2 to store data for indexing. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.
1111

1212
AutoRAG will automatically scan and process supported files stored in that bucket. Files that are unsupported or exceed the size limit will be skipped during indexing and logged as errors.
1313

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
---
2+
title: Website
3+
pcx_content_type: how-to
4+
sidebar:
5+
order: 2
6+
---
7+
8+
The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.
9+
10+
:::note
11+
You can only crawl domains that you have onboarded onto the same Cloudflare account.
12+
13+
Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.
14+
:::
15+
16+
## How website crawling works
17+
When you connect a domain, the crawler looks for your website’s sitemap to determine which pages to visit:
18+
19+
1. The crawler first checks `robots.txt` for listed sitemaps. If the file exists, the crawler reads every sitemap listed in it.
20+
2. If no `robots.txt` is found, the crawler checks for a sitemap at `/sitemap.xml`.
21+
3. If no sitemap is available, the domain cannot be crawled.
22+
23+
Pages are visited according to the `<priority>` value set in the sitemaps, if this field is defined.
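
The discovery order described above can be sketched roughly as follows. This is an illustrative sketch only, not AutoRAG's actual crawler: the function names `sitemaps_to_read` and `urls_by_priority` are hypothetical, the `https://` scheme is assumed, and higher `<priority>` values are assumed to be visited first. The `0.5` default for a missing `<priority>` comes from the sitemaps.org protocol. The text does not say what happens when `robots.txt` exists but lists no sitemap, so the sketch simply returns an empty list in that case.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_to_read(domain: str, robots_txt: Optional[str]) -> list[str]:
    """Hypothetical helper: pick the sitemap URLs to read for a domain.

    `robots_txt` is the file's text, or None if no robots.txt was found.
    """
    if robots_txt is not None:
        # robots.txt exists: collect every "Sitemap:" entry it lists.
        return re.findall(r"(?im)^[ \t]*sitemap:[ \t]*(\S+)", robots_txt)
    # No robots.txt: fall back to the conventional location (scheme assumed).
    return [f"https://{domain}/sitemap.xml"]


def urls_by_priority(sitemap_xml: str) -> list[str]:
    """Hypothetical helper: list page URLs, highest <priority> first (assumed order)."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # The sitemap protocol treats a missing <priority> as 0.5.
        priority = float(
            url.findtext("sm:priority", default="0.5", namespaces=SITEMAP_NS)
        )
        entries.append((priority, loc))
    # The sort is stable, so pages with equal priority keep their sitemap order.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [loc for _, loc in entries]
```

A caller would treat an empty result, or a sitemap that cannot be fetched, as the "domain cannot be crawled" case.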
24+
25+
## Parsing options
26+
You can choose how pages are parsed during crawling:
27+
28+
- **Static sites**: Downloads the raw HTML for each page.
29+
- **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that [Browser Rendering](/browser-rendering/platform/pricing/) limits and billing apply.
30+
31+
## Storage
32+
During setup, AutoRAG creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.
33+
34+
:::note
35+
We recommend that you do not modify the bucket, as doing so may disrupt the indexing flow and prevent content from being updated properly.
36+
:::
37+
38+
## Sync and updates
39+
During scheduled or manual [sync jobs](/autorag/configuration/indexing/), the crawler checks the `<lastmod>` attribute in your sitemap for changes. If it has changed to a date after the last sync date, the page is crawled again, the updated version is stored in the R2 bucket, and the page is automatically reindexed so that your search results always reflect the latest content.
40+
41+
If the `<lastmod>` attribute is not defined, then AutoRAG will automatically crawl each link defined in the sitemap once a day.
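
The recrawl decision described in this section can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the crawler's real logic: `should_recrawl` is a hypothetical name, and "once a day" is modeled here as "at least 24 hours since the last sync", which the text does not specify precisely.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_recrawl(
    lastmod: Optional[datetime], last_sync: datetime, now: datetime
) -> bool:
    """Hypothetical check: should a sitemap entry be crawled during this sync?"""
    if lastmod is not None:
        # <lastmod> is defined: recrawl only if the page changed after the last sync.
        return lastmod > last_sync
    # <lastmod> is missing: recrawl roughly once a day (assumed to mean 24 hours).
    return now - last_sync >= timedelta(days=1)
```

All three timestamps should be either timezone-aware or naive; Python raises a `TypeError` when the two kinds are compared.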
42+
43+
## Limits
44+
The regular AutoRAG [limits](/autorag/platform/limits-pricing/) apply when using the Website data source.
45+
46+
The crawler will download and index pages only up to the maximum object limit supported for an AutoRAG instance, and it processes the first set of pages it visits until that limit is reached. In addition, any files that are downloaded but exceed the file size limit will not be indexed.

src/content/docs/autorag/configuration/indexing.mdx

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ AutoRAG automatically indexes your data into vector embeddings optimized for sem
99

1010
## Jobs
1111

12-
AutoRAG automatically monitors your data source for updates and reindexes your content every few hours. During each cycle, new or modified files are reprocessed to keep your Vectorize index up to date.
12+
AutoRAG automatically monitors your data source for updates and reindexes your content every **6 hours**. During each cycle, new or modified files are reprocessed to keep your Vectorize index up to date.
1313

1414
You can monitor the status and history of all indexing activity in the Jobs tab, including real-time logs for each job to help you troubleshoot and verify successful syncs.
1515

src/content/docs/autorag/platform/limits-pricing.mdx

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ During the open beta, AutoRAG is **free to enable**. When you create an AutoRAG
1515
| [**Vectorize**](/vectorize/platform/pricing/) | Stores vector embeddings and powers semantic search |
1616
| [**Workers AI**](/workers-ai/platform/pricing/) | Handles image-to-Markdown conversion, embedding, query rewriting, and response generation |
1717
| [**AI Gateway**](/ai-gateway/reference/pricing/) | Monitors and controls model usage |
18+
| [**Browser Rendering**](/browser-rendering/platform/pricing/) | Loads dynamic JavaScript content during [website](/autorag/configuration/data-source/website/) crawling with the Render option |
1819

1920
For more information about how each resource is used within AutoRAG, reference [How AutoRAG works](/autorag/concepts/how-autorag-works/).
2021

src/content/partials/autorag/ai-search-api-params.mdx

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Rewrites the original query into a search optimized query to improve retrieval a
1818

1919
`max_num_results` <Type text="number" /> <MetaInfo text="optional" />
2020

21-
The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `20`.
21+
The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `50`.
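
As a rough illustration of the new range, a client could validate `max_num_results` before sending a request. The sketch below is an assumption-laden example, not part of the AutoRAG API: `build_search_params` is a hypothetical helper, and the `query` field name is assumed rather than taken from this diff. Only the default of `10` and the `1` to `50` range come from the text above.

```python
from typing import Any, Optional

DEFAULT_MAX_NUM_RESULTS = 10  # documented default
MIN_NUM_RESULTS = 1           # documented lower bound
MAX_NUM_RESULTS = 50          # documented upper bound (previously 20)


def build_search_params(
    query: str, max_num_results: Optional[int] = None
) -> dict[str, Any]:
    """Hypothetical helper: build request parameters with a validated result cap."""
    if max_num_results is None:
        max_num_results = DEFAULT_MAX_NUM_RESULTS
    if not MIN_NUM_RESULTS <= max_num_results <= MAX_NUM_RESULTS:
        raise ValueError(
            f"max_num_results must be between {MIN_NUM_RESULTS} "
            f"and {MAX_NUM_RESULTS}, got {max_num_results}"
        )
    # "query" is an assumed field name used only for illustration.
    return {"query": query, "max_num_results": max_num_results}
```

Rejecting out-of-range values on the client, rather than silently clamping them, keeps the caller aware of the limit; either choice would be reasonable.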
2222

2323
`ranking_options` <Type text="object" /> <MetaInfo text="optional" />
2424

src/content/partials/autorag/search-api-params.mdx

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Rewrites the original query into a search optimized query to improve retrieval a
1414

1515
`max_num_results` <Type text="number" /> <MetaInfo text="optional" />
1616

17-
The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `20`.
17+
The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `50`.
1818

1919
`ranking_options` <Type text="object" /> <MetaInfo text="optional" />
2020

src/content/release-notes/autorag.yaml

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@ link: "/autorag/platform/release-note/"
33
productName: AutoRAG
44
productLink: "/autorag/"
55
entries:
6+
- publish_date: "2025-08-20"
7+
title: Increased maximum query results to 50
8+
description: |-
9+
The maximum number of results returned from a query has been increased from **20** to **50**. This allows you to surface more relevant matches in a single request.
610
- publish_date: "2025-07-16"
711
title: Deleted files now removed from index on next sync
812
description: |-
