---
title: Website
pcx_content_type: how-to
sidebar:
  order: 2
---

The Website data source allows you to connect a domain you own so its pages can be crawled, stored, and indexed.

:::note
You can only crawl domains that you have onboarded onto the same Cloudflare account.

Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.
:::

## How website crawling works
When you connect a domain, the crawler looks for your website’s sitemap to determine which pages to visit:

1. The crawler first checks `robots.txt` for listed sitemaps. If the file exists, the crawler reads every sitemap listed inside it.
2. If no `robots.txt` is found, the crawler falls back to checking for a sitemap at `/sitemap.xml`.
3. If no sitemap is available, the domain cannot be crawled.

Pages are visited in the order given by the `<priority>` value set on each URL in the sitemap, if this field is defined.
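
For reference, the discovery chain described above relies on the standard [sitemap protocol](https://www.sitemaps.org/protocol.html). A minimal `robots.txt` pointing at a sitemap, together with a sitemap entry carrying optional `<lastmod>` and `<priority>` values, looks like this (all URLs are placeholders):

```txt
# https://example.com/robots.txt
Sitemap: https://example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/getting-started</loc>
    <lastmod>2024-05-01</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```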

## Parsing options
You can choose how pages are parsed during crawling:

- **Static sites**: Downloads the raw HTML for each page.
- **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. Note that the [Browser Rendering](/browser-rendering/platform/pricing/) limits and billing apply.

## Storage
During setup, AutoRAG creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.
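
A read-only way to confirm the bucket was created is to list the R2 buckets in your account with [Wrangler](/workers/wrangler/). The command below only lists buckets and does not change anything:

```sh
# List all R2 buckets in the current Cloudflare account,
# including the one AutoRAG created for crawled pages.
npx wrangler r2 bucket list
```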

:::note
We recommend that you do not modify the bucket, as doing so may disrupt the indexing flow and cause content to not be updated properly.
:::

## Sync and updates
During scheduled or manual [sync jobs](/autorag/configuration/indexing/), the crawler checks for changes to the `<lastmod>` attribute in your sitemap. If the `<lastmod>` date falls after the last sync date, the page is recrawled, the updated version is stored in the R2 bucket, and the content is automatically reindexed so that your search results always reflect the latest content.

If the `<lastmod>` attribute is not defined, then AutoRAG will automatically crawl each link defined in the sitemap once a day.
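
To make the update rule concrete, here is a minimal TypeScript sketch of the recrawl decision described in this section. It is an illustration only, not AutoRAG's implementation; the `SitemapEntry` shape and the `shouldRecrawl` helper are hypothetical names:

```ts
// Illustrative sketch only: models the recrawl rule described above,
// not AutoRAG's actual implementation.

interface SitemapEntry {
  loc: string; // page URL from <loc>
  lastmod?: Date; // optional <lastmod> value, if the sitemap defines it
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function shouldRecrawl(entry: SitemapEntry, lastSync: Date, now: Date): boolean {
  if (entry.lastmod !== undefined) {
    // <lastmod> is defined: recrawl only if the page changed after the last sync.
    return entry.lastmod.getTime() > lastSync.getTime();
  }
  // No <lastmod>: fall back to crawling the link once a day.
  return now.getTime() - lastSync.getTime() >= ONE_DAY_MS;
}
```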

## Limits
The regular AutoRAG [limits](/autorag/platform/limits-pricing/) apply when using the Website data source.

The crawler downloads and indexes pages only up to the maximum object limit supported for an AutoRAG instance, processing pages in the order they are visited until that limit is reached. In addition, any downloaded file that exceeds the file size limit will not be indexed.