Commit fc27070

websites and 50 max results
1 parent 8d6fd15 commit fc27070

8 files changed

+57
-5
lines changed

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+---
+title: Data source
+pcx_content_type: how-to
+sidebar:
+  order: 2
+---
+
+You can have AutoRAG ingest data directly from the following sources:
+
+| Data Source | Description |
+|---------------|-------------|
+| [Website](/autorag/configuration/data-source/website/) | Connect a domain you own to index website pages. |
+| [R2 Bucket](/autorag/configuration/data-source/r2/) | Connect a Cloudflare R2 bucket to index stored documents. |

src/content/docs/autorag/configuration/data-source.mdx renamed to src/content/docs/autorag/configuration/data-source/r2.mdx

Lines changed: 2 additions & 2 deletions
@@ -1,13 +1,13 @@
 ---
-title: Data source
+title: R2
 pcx_content_type: how-to
 sidebar:
   order: 2
 ---

 import { Render } from "~/components";

-AutoRAG currently supports Cloudflare R2 as the data source for storing your knowledge base. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.
+You can use Cloudflare R2 to store data for indexing. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.

 AutoRAG will automatically scan and process supported files stored in that bucket. Files that are unsupported or exceed the size limit will be skipped during indexing and logged as errors.

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+---
+title: Website
+pcx_content_type: how-to
+sidebar:
+  order: 2
+---
+
+The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed. You can only crawl domains that are part of the **same Cloudflare account**.
+
+## How website crawling works
+When you connect a domain, the crawler looks for your site’s sitemap to determine which pages to visit:
+
+1. The crawler first checks for a sitemap at `/sitemap.xml`.
+2. If no sitemap is found, it checks `robots.txt` for listed sitemaps.
+3. If no sitemap is available, the domain cannot be crawled.
+
+Pages are visited in the order defined by your sitemap.
+
+## Parsing options
+You can choose how pages are parsed during crawling:
+
+- **Static sites**: Downloads the raw HTML for each page.
+- **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that the [Browser Rendering](/browser-rendering/platform/pricing/) limits and billing apply.
+
+## Storage
+During setup, AutoRAG creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.
+
+## Sync and updates
+During scheduled or manual [sync jobs](/autorag/configuration/indexing/) the crawler will check for changes on your website. If a page changes, the updated version is stored in the R2 bucket and reindexed automatically so that your search results always reflect the latest content.
+
+## Limits
+The regular AutoRAG [limits](/autorag/platform/limits-pricing/) apply when using the Website data source.
+
+The crawler will download and index pages only up to the maximum object limit supported for an AutoRAG instance, and it processes the first set of pages it visits until that limit is reached. In addition, any files that are downloaded but exceed the file size limit will not be indexed.
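
Reviewer note: the sitemap lookup order described in "How website crawling works" can be sketched as below. This is only an illustration of the three documented steps, not the crawler's actual implementation. The `discover_sitemaps` function and the injected `fetch` callable are hypothetical names introduced for this example; `fetch` is assumed to return the response body, or `None` when the URL does not exist.

```python
from typing import Callable, List, Optional
from urllib.parse import urljoin

# Hypothetical fetcher type: returns the body text, or None if the URL is missing.
Fetch = Callable[[str], Optional[str]]


def discover_sitemaps(origin: str, fetch: Fetch) -> List[str]:
    """Return sitemap URLs for `origin`, following the documented lookup order."""
    # Step 1: check for a sitemap at /sitemap.xml.
    default_sitemap = urljoin(origin, "/sitemap.xml")
    if fetch(default_sitemap) is not None:
        return [default_sitemap]

    # Step 2: fall back to "Sitemap:" entries listed in robots.txt.
    robots = fetch(urljoin(origin, "/robots.txt"))
    sitemaps: List[str] = []
    for line in (robots or "").splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())

    # Step 3: an empty list means no sitemap is available,
    # so the domain cannot be crawled.
    return sitemaps
```

Passing the fetcher in as an argument keeps the sketch free of network code, so the lookup order can be checked against a plain dictionary of URL to body.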

src/content/docs/autorag/configuration/indexing.mdx

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ AutoRAG automatically indexes your data into vector embeddings optimized for sem

 ## Jobs

-AutoRAG automatically monitors your data source for updates and reindexes your content every few hours. During each cycle, new or modified files are reprocessed to keep your Vectorize index up to date.
+AutoRAG automatically monitors your data source for updates and reindexes your content every **6 hours**. During each cycle, new or modified files are reprocessed to keep your Vectorize index up to date.

 You can monitor the status and history of all indexing activity in the Jobs tab, including real-time logs for each job to help you troubleshoot and verify successful syncs.

src/content/docs/autorag/platform/limits-pricing.mdx

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ During the open beta, AutoRAG is **free to enable**. When you create an AutoRAG
 | [**Vectorize**](/vectorize/platform/pricing/) | Stores vector embeddings and powers semantic search |
 | [**Workers AI**](/workers-ai/platform/pricing/) | Handles image-to-Markdown conversion, embedding, query rewriting, and response generation |
 | [**AI Gateway**](/ai-gateway/reference/pricing/) | Monitors and controls model usage |
+| [**Browser Rendering**](/browser-rendering/platform/pricing/) | Loads dynamic JavaScript content during [website](/autorag/configuration/data-source/website/) crawling with the Render option |

 For more information about how each resource is used within AutoRAG, reference [How AutoRAG works](/autorag/concepts/how-autorag-works/).

src/content/partials/autorag/ai-search-api-params.mdx

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Rewrites the original query into a search optimized query to improve retrieval a

 `max_num_results` <Type text="number" /> <MetaInfo text="optional" />

-The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `20`.
+The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `50`.

 `ranking_options` <Type text="object" /> <MetaInfo text="optional" />

src/content/partials/autorag/search-api-params.mdx

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Rewrites the original query into a search optimized query to improve retrieval a

 `max_num_results` <Type text="number" /> <MetaInfo text="optional" />

-The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `20`.
+The maximum number of results that can be returned from the Vectorize database. Defaults to `10`. Must be between `1` and `50`.

 `ranking_options` <Type text="object" /> <MetaInfo text="optional" />
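
Reviewer note: a caller could validate `max_num_results` against the new range before sending a request, roughly as sketched below. The default of `10` and the bounds `1` to `50` come from the parameter description in this diff. The `build_search_params` helper is hypothetical, and the `query` field name is an assumption, since only `max_num_results` and `ranking_options` appear in these hunks.

```python
from typing import Any, Dict, Optional

DEFAULT_MAX_NUM_RESULTS = 10  # documented default
MIN_NUM_RESULTS = 1           # documented lower bound
MAX_NUM_RESULTS = 50          # documented upper bound after this commit (was 20)


def build_search_params(query: str, max_num_results: Optional[int] = None) -> Dict[str, Any]:
    """Build a request body, rejecting out-of-range `max_num_results` values."""
    if max_num_results is None:
        max_num_results = DEFAULT_MAX_NUM_RESULTS
    if not MIN_NUM_RESULTS <= max_num_results <= MAX_NUM_RESULTS:
        raise ValueError(
            f"max_num_results must be between {MIN_NUM_RESULTS} "
            f"and {MAX_NUM_RESULTS}, got {max_num_results}"
        )
    # `query` is an assumed field name used only for illustration.
    return {"query": query, "max_num_results": max_num_results}
```

Raising an error instead of silently clamping is a design choice for this sketch; it surfaces callers that still assume the old limit or exceed the new one.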

src/content/release-notes/autorag.yaml

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@ link: "/autorag/platform/release-note/"
 productName: AutoRAG
 productLink: "/autorag/"
 entries:
+  - publish_date: "2025-08-20"
+    title: Increased maximum query results to 50
+    description: |-
+      The maximum number of results returned from a query has been increased from **20** to **50**. This allows you to surface more relevant matches in a single request.
   - publish_date: "2025-07-16"
     title: Deleted files now removed from index on next sync
     description: |-
