Commit 70fe8a8

committed: fixes
1 parent 7a246b1 commit 70fe8a8

File tree

1 file changed (+9 -5 lines changed)
  • src/content/docs/autorag/configuration/data-source
src/content/docs/autorag/configuration/data-source/website.mdx

Lines changed: 9 additions & 5 deletions
@@ -14,13 +14,13 @@ Refer to [Onboard a domain](/fundamentals/manage-domains/add-site/) for more inf
:::

## How website crawling works

-When you connect a domain, the crawler looks for your site’s sitemap to determine which pages to visit:
+When you connect a domain, the crawler looks for your website’s sitemap to determine which pages to visit:

-1. The crawler first checks for a sitemap at `/sitemap.xml`.
-2. If no sitemap is found, it checks `robots.txt` for listed sitemaps.
+1. The crawler first checks `robots.txt` for listed sitemaps. If the file exists, the crawler reads every sitemap listed inside.
+2. If no `robots.txt` is found, the crawler checks for a sitemap at `/sitemap.xml`.
3. If no sitemap is available, the domain cannot be crawled.
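
To illustrate step 1, a `robots.txt` that lists its sitemaps might look like this (the URLs are hypothetical):

```txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
```

Every `Sitemap:` line is read, so a site can expose several sitemaps this way.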

-Pages are visited in the order defined by your sitemap.
+Pages are visited according to the `<priority>` attribute set in the sitemap, if this field is defined.
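
A minimal sitemap defining `<priority>` might look like this (the URLs and values are hypothetical; the sitemaps protocol allows values from 0.0 to 1.0):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post-1</loc>
    <priority>0.5</priority>
  </url>
</urlset>
```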

## Parsing options

You can choose how pages are parsed during crawling:
@@ -31,8 +31,12 @@ You can choose how pages are parsed during crawling:
## Storage

During setup, AutoRAG creates a dedicated R2 bucket in your account to store the pages that have been crawled and downloaded as HTML files. This bucket is automatically managed and is used only for content discovered by the crawler. Any files or objects that you add directly to this bucket will not be indexed.

+We recommend not modifying the bucket, as doing so may disrupt the indexing flow and cause content to not be updated properly.

## Sync and updates
-During scheduled or manual [sync jobs](/autorag/configuration/indexing/), the crawler will check for changes on your website. If a page changes, the updated version is stored in the R2 bucket and automatically reindexed so that your search results always reflect the latest content.
+During scheduled or manual [sync jobs](/autorag/configuration/indexing/), the crawler checks the `<lastmod>` attribute in your sitemap for changes. If it has been set to a date after the last sync date, the page is recrawled, the updated version is stored in the R2 bucket, and the content is automatically reindexed so that your search results always reflect the latest content.
+
+If the `<lastmod>` attribute is not defined, AutoRAG automatically crawls each link defined in the sitemap once a day.
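
The `<lastmod>` comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions, not AutoRAG's actual implementation; the function name and signature are hypothetical:

```python
from datetime import date, datetime
from typing import Optional

def should_recrawl(lastmod: Optional[str], last_sync: date) -> bool:
    """Decide whether a sitemap URL needs recrawling (illustrative only).

    lastmod: the <lastmod> value from the sitemap (W3C date, e.g. "2024-05-01"),
             or None if the sitemap does not define it.
    last_sync: the date of the previous sync job.
    """
    if lastmod is None:
        # No <lastmod>: fall back to the once-a-day crawl schedule.
        return True
    modified = datetime.fromisoformat(lastmod).date()
    # Recrawl only if the page was modified after the last sync.
    return modified > last_sync
```

Setting accurate `<lastmod>` values lets sync jobs skip unchanged pages instead of recrawling everything.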

## Limits

The regular AutoRAG [limits](/autorag/platform/limits-pricing/) apply when using the Website data source.
