---
title: Website
pcx_content_type: how-to
sidebar:
  order: 2
---

The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.

:::note
You can only crawl domains that you have onboarded onto the same Cloudflare account.

Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.
:::

## How website crawling works
When you connect a domain, the crawler looks for your website’s sitemap to determine which pages to visit:

1. The crawler first checks `robots.txt` for listed sitemaps. If the file exists, the crawler reads every sitemap listed inside it.
2. If no `robots.txt` is found, the crawler falls back to checking for a sitemap at `/sitemap.xml`.
3. If no sitemap is available, the domain cannot be crawled.

Pages are visited in the order given by the `<priority>` value set on each URL in the sitemap, if this field is defined.
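
For reference, the discovery chain described above relies on the standard [sitemap protocol](https://www.sitemaps.org/protocol.html). A minimal `robots.txt` pointing at a sitemap, together with a sitemap entry carrying optional `<lastmod>` and `<priority>` values, looks like this (all URLs are placeholders):

```txt
# https://example.com/robots.txt
Sitemap: https://example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/getting-started</loc>
    <lastmod>2024-05-01</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```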

## Parsing options
You can choose how pages are parsed during crawling:

- **Static sites**: Downloads the raw HTML for each page.
- **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that the [Browser Rendering](/browser-rendering/platform/pricing/) limits and billing apply.

## Storage
During setup, AutoRAG creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.
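
A read-only way to confirm the bucket was created is to list the R2 buckets in your account with [Wrangler](/workers/wrangler/). The command below only lists buckets and does not change anything:

```sh
# List all R2 buckets in the current Cloudflare account,
# including the one AutoRAG created for crawled pages.
npx wrangler r2 bucket list
```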

:::note
We recommend that you do not modify the bucket, as doing so may disrupt the indexing flow and cause content to not be updated properly.
:::

## Sync and updates
During scheduled or manual [sync jobs](/autorag/configuration/indexing/), the crawler checks for changes to the `<lastmod>` attribute in your sitemap. If the `<lastmod>` date falls after the last sync date, the page is recrawled, the updated version is stored in the R2 bucket, and the content is automatically reindexed so that your search results always reflect the latest content.

If the `<lastmod>` attribute is not defined, then AutoRAG will automatically crawl each link defined in the sitemap once a day.
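
To make the update rule concrete, here is a minimal TypeScript sketch of the recrawl decision described in this section. It is an illustration only, not AutoRAG's implementation; the `SitemapEntry` shape and the `shouldRecrawl` helper are hypothetical names:

```ts
// Illustrative sketch only: models the recrawl rule described above,
// not AutoRAG's actual implementation.

interface SitemapEntry {
  loc: string; // page URL from <loc>
  lastmod?: Date; // optional <lastmod> value, if the sitemap defines it
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function shouldRecrawl(entry: SitemapEntry, lastSync: Date, now: Date): boolean {
  if (entry.lastmod !== undefined) {
    // <lastmod> is defined: recrawl only if the page changed after the last sync.
    return entry.lastmod.getTime() > lastSync.getTime();
  }
  // No <lastmod>: fall back to crawling the link once a day.
  return now.getTime() - lastSync.getTime() >= ONE_DAY_MS;
}
```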

## Limits
The regular AutoRAG [limits](/autorag/platform/limits-pricing/) apply when using the Website data source.

The crawler downloads and indexes pages only up to the maximum object limit supported for an AutoRAG instance, processing pages in the order they are visited until that limit is reached. In addition, any downloaded file that exceeds the file size limit will not be indexed.