WIP Autorag #20865 (Merged)
Changes shown from 42 of 48 commits:
- `06728f9` New autrag product section (ToriLindsay)
- `9cdb04b` Added placeholder file to new autorag folder (ToriLindsay)
- `1505d7d` Added Overview and Get Started files with frontmatter (ToriLindsay)
- `02ee040` made autorag not capitalized (ToriLindsay)
- `cfa6a80` lowercase (ToriLindsay)
- `e272979` Delete src/content/docs/AutoRAG/autorag.mdx (ToriLindsay)
- `25a103a` Delete src/content/docs/AutoRAG/index.mdx (ToriLindsay)
- `699c9a7` Update src/content/docs/autorag/autorag.mdx (ToriLindsay)
- `21be7d9` Update src/content/docs/autorag/autorag.mdx (ToriLindsay)
- `894f3b0` Rename autorag.mdx to get-started.mdx (ToriLindsay)
- `334e359` removed externals and algolia from yaml (ToriLindsay)
- `ad39837` Update src/content/docs/autorag/index.mdx (ToriLindsay)
- `0654dc8` getting started + bindings (aninibread)
- `32cdddf` Update src/content/docs/autorag/get-started.mdx (aninibread)
- `b0ef444` Update src/content/docs/autorag/get-started.mdx (aninibread)
- `89a45f0` progress (aninibread)
- `450e81d` progress 2 (aninibread)
- `1d80199` progress 3 (aninibread)
- `cb24263` completed content (aninibread)
- `ccd78a8` mostly there (aninibread)
- `126a524` fix link (aninibread)
- `2de47d3` deep link and small fixes (aninibread)
- `47546fb` added response structure (aninibread)
- `65656a8` small fix (aninibread)
- `ba44681` fix (aninibread)
- `ba0ba8a` pricing (aninibread)
- `63817b6` Update how-autorag-works.mdx (kathayl)
- `21944eb` fix general structure and small issues (aninibread)
- `9167606` add references (aninibread)
- `56f13d5` added image fix links (aninibread)
- `b5f9c16` Update src/content/docs/autorag/platform/release-note.mdx (aninibread)
- `5c0840f` small fix (aninibread)
- `c5f720f` merged production (kodster28)
- `7b77385` remove extra link (kodster28)
- `ef10ff9` Apply suggestions from code review (aninibread)
- `18d0a49` Apply suggestions from code review (aninibread)
- `def5623` index edit (aninibread)
- `98ed732` small fixes (aninibread)
- `750a45e` Update src/content/docs/autorag/how-autorag-works.mdx (aninibread)
- `b6071aa` Update how-autorag-works.mdx (aninibread)
- `5601d41` edits for new content / structure (aninibread)
- `ead7340` add tutorials (aninibread)
- `01d7ec0` Update src/content/docs/autorag/platform/limits-pricing.mdx (kodster28)
- `37a641a` binding fix and changelog addition (aninibread)
- `52018fe` fix doc recommendation (aninibread)
- `7c3628f` better wording in cache (aninibread)
- `603a600` autorag changelog (aninibread)
- `c046361` final fixes (aninibread)
@@ -0,0 +1,42 @@ (new file)

---
pcx_content_type: concept
title: How AutoRAG works
sidebar:
  order: 2
---

AutoRAG simplifies the process of building and managing a Retrieval-Augmented Generation (RAG) pipeline. Instead of manually stitching together components and writing custom logic for indexing, retrieval, and generation, AutoRAG handles it all for you. It also continuously indexes your data to ensure responses stay accurate and up to date.

AutoRAG consists of two core processes:

- **Indexing:** An asynchronous background process that monitors your data source for changes and transforms your data into vector representations for search.
- **Querying:** A synchronous process triggered by user queries. It retrieves the most relevant content and generates context-aware responses using a large language model (LLM).

## How indexing works

Indexing begins automatically when you create an AutoRAG instance and connect a data source.

Here is what happens during indexing:

1. **Data ingestion:** AutoRAG reads from your connected data source.
2. **Markdown conversion:** AutoRAG uses [Workers AI's Markdown Conversion](/workers-ai/markdown-conversion/) to convert [supported data types](/autorag/configuration/data-source/) into structured Markdown, which ensures consistency across diverse file types. For images, Workers AI performs object detection followed by vision-to-language transformation to convert them into Markdown text.
3. **Chunking:** The extracted text is [chunked](/autorag/configuration/chunking/) into smaller pieces to improve retrieval granularity.
4. **Embedding:** Each chunk is embedded using Workers AI's embedding model to transform the content into vectors.
5. **Vector storage:** The resulting vectors, along with metadata like source location and file name, are stored in the [Vectorize](/vectorize/) database created on your Cloudflare account.

*(Diagram: the indexing flow.)*

## How querying works

Once indexing is complete, AutoRAG is ready to respond to end-user queries in real time.

Here is how the querying pipeline works:

1. **Receive query from AutoRAG API:** The query workflow begins when you send a request to either AutoRAG's [AI Search](/autorag/usage/rest-api/#ai-search) or [Search](/autorag/usage/rest-api/#search) endpoint.
2. **Query rewriting (optional):** AutoRAG can [rewrite the input query](/autorag/configuration/query-rewriting/) using one of Workers AI's LLMs, transforming the original query into a more effective search query to improve retrieval quality.
3. **Embedding the query:** The rewritten (or original) query is transformed into a vector via the same embedding model used to embed your data, so that it can be compared against your vectorized data to find the most relevant matches.
4. **Querying Vectorize index:** The query vector is [queried](/vectorize/best-practices/query-vectors/) against the stored vectors in the Vectorize database associated with your AutoRAG.
5. **Content retrieval:** Vectorize returns the metadata of the most relevant chunks, and the original content is retrieved from the R2 bucket. These are passed to a text-generation model.
6. **Response generation:** A text-generation model from Workers AI generates a response using the retrieved content and the original user query, combined via a [system prompt](/autorag/configuration/system-prompt/).

*(Diagram: the querying flow.)*
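The querying steps above can be sketched in miniature. This is an illustrative toy, not AutoRAG's implementation: the bag-of-words `embed` stands in for a Workers AI embedding model, and a plain list stands in for Vectorize. The key property it demonstrates is that documents and queries must pass through the same embedding so their vectors are comparable.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term counts. A real pipeline uses a
    # learned embedding model, but the same-model-for-both-sides rule holds.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_prompt(query, chunks, top_k=2):
    # Steps 3-5: embed the query, rank stored chunks by similarity,
    # and retrieve the top_k matches.
    qv = embed(query)
    retrieved = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:top_k]
    # Step 6: combine a system prompt, the retrieved context, and the
    # user's question into the final prompt for the generation model.
    return "\n".join(["Answer using only the context below.", *retrieved, f"Question: {query}"])

chunks = [
    "AutoRAG indexes data from R2.",
    "Vectorize stores the embeddings.",
    "Cats purr when happy.",
]
prompt = answer_prompt("Where are the embeddings stored?", chunks)
```

With `top_k=2`, the unrelated chunk is left out of the prompt, which is exactly how retrieval keeps the generation model grounded in relevant context only.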
@@ -0,0 +1,12 @@ (new file)

---
pcx_content_type: navigation
title: Concepts
sidebar:
  order: 3
  group:
    hideIndex: true
---

import { DirectoryListing } from "~/components";

<DirectoryListing />
@@ -0,0 +1,42 @@ (new file)

---
pcx_content_type: concept
title: What is RAG
sidebar:
  order: 1
---

Retrieval-Augmented Generation (RAG) is a method that enables a large language model (LLM) to answer questions using your own data, not just what it was trained on. It works by retrieving relevant content from a provided data source and passing it to the model to generate a grounded response.

## How RAG works

Here is a simplified overview of the RAG pipeline:

1. **Indexing:** Your content (for example, docs, wikis, product data) is ingested, split into smaller chunks, and transformed into vectors using an embedding model. These vectors are stored in a vector index.
2. **Retrieval:** When a user asks a question, it is also embedded and used to find the most relevant chunks in the vector database.
3. **Generation:** The retrieved content and the user's original question are combined into a single prompt via a system prompt. A text-generation model generates a response based on this context.

*(Diagram: the RAG pipeline.)*

:::note[How does AutoRAG work]
To understand how the AutoRAG pipeline works, see [How AutoRAG works](/autorag/concepts/how-autorag-works/).
:::

### Key concepts

- **Embedding:** Turning text into a vector so it can be searched by meaning, not keywords.
- **Vector index:** A database that lets you search embeddings by similarity (for example, Vectorize).
- **Chunking:** Splitting large text into smaller pieces for better indexing and retrieval.
- **System prompt:** A template that controls how the generation model uses the retrieved data and query to form a response.

## Why use RAG?

RAG can be used to:

- Provide accurate and up-to-date answers without fine-tuning a model.
- Control the source of truth by using your own data.
- Reduce hallucinations by grounding responses in real content.

RAG is ideal for building AI-powered apps like:

- AI assistants for your internal knowledge base
- Search experiences for your documents
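The indexing stage described in this file (ingest, chunk, store with metadata) can be sketched as a toy. This is not AutoRAG's implementation: the fixed-size word chunking stands in for real token-aware chunking, and the list of records stands in for a vector database such as Vectorize, where each vector is stored alongside metadata about its source.

```python
def chunk_words(text, chunk_size=5):
    # Step 1 (indexing): split a document into fixed-size word chunks.
    # A real pipeline counts tokens and prefers natural boundaries.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_index(docs, chunk_size=5):
    # Each record keeps the chunk text plus metadata (source document),
    # mirroring how a vector index stores vectors alongside metadata.
    # The embedding step is omitted here; see the key concepts above.
    index = []
    for doc_id, text in docs.items():
        for chunk in chunk_words(text, chunk_size):
            index.append({"doc": doc_id, "chunk": chunk})
    return index

docs = {"faq.md": "RAG retrieves relevant content and passes it to the model"}
index = build_index(docs)
```

The stored metadata is what later lets the retrieval step report *where* an answer came from, not just the matching text.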
@@ -0,0 +1,49 @@ (new file)

---
pcx_content_type: concept
title: Similarity cache
sidebar:
  order: 6
---

Similarity-based caching in AutoRAG lets you serve responses from Cloudflare's cache for queries that are similar to previous requests, rather than generating a new, unique response for every request. This speeds up response times and cuts costs by reusing answers for questions that are close in meaning.

## How it works

Unlike basic caching, which only reuses a response for an identical request, this is what happens when a request is received with similarity-based caching enabled:

1. AutoRAG checks whether a _similar_ prompt (based on your chosen threshold) has been answered before.
2. If a match is found, it returns the cached response instantly.
3. If no match is found, it generates a new response and caches it.

To see whether a response came from the cache, check the `cf-aig-cache-status` header: `HIT` for cached and `MISS` for new.

## What to consider when using similarity cache

Consider these behaviors when using similarity caching:

- **Volatile cache:** If two similar requests arrive at the same time, the first might not be cached in time for the second to use it, resulting in a `MISS`.
- **30-day cache:** Cached responses last 30 days, then expire automatically. Custom durations are not currently supported.
- **Data dependency:** Cached responses are tied to specific document chunks. If those chunks change or are deleted, the cache clears to keep answers fresh.

## How similarity matching works

Similarity caching in AutoRAG uses **MinHash with Locality-Sensitive Hashing (LSH)** to detect prompts that are lexically similar.

When a new prompt is received:

1. The prompt is broken into overlapping token sequences (called _shingles_), typically 2–3 words each.
2. These shingles are hashed into a compact fingerprint using the MinHash algorithm. Prompts with more overlapping shingles have more similar fingerprints.
3. Fingerprints are grouped into LSH buckets, which let AutoRAG quickly find past prompts that are likely to be similar without scanning every cached prompt.
4. If a prompt in the same bucket meets the configured similarity threshold, its cached response is reused.

## Choosing a threshold

The similarity threshold decides how close two prompts need to be to reuse a cached response. The available thresholds are:

| Threshold | Description | Example match |
| --- | --- | --- |
| Exact | Near-identical matches only | "What's the weather like today?" matches "What is the weather like today?" |
| Strong (default) | High semantic similarity | "What's the weather like today?" matches "How's the weather today?" |
| Broad | Moderate match, more hits | "What's the weather like today?" matches "Tell me today's weather" |
| Loose | Low similarity, maximum reuse | "What's the weather like today?" matches "Give me the forecast" |

Test these values to see which works best with your application.
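The MinHash fingerprinting described above (shingle, hash, compare) can be sketched as follows. The shingle size, number of hash functions, and choice of SHA-1 as the seeded hash are illustrative assumptions, not AutoRAG's internals, and the LSH banding step is omitted for brevity:

```python
import hashlib

def shingles(prompt, n=2):
    # Step 1: break the prompt into overlapping n-word token sequences.
    toks = prompt.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash(shingle_set, num_hashes=64):
    # Step 2: a compact fingerprint. For each of num_hashes seeded hash
    # functions, keep the minimum hash value over all shingles.
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_similarity(a, b):
    # The fraction of matching signature positions estimates the Jaccard
    # similarity of the two shingle sets: more shared shingles, more matches.
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

close = estimated_similarity("what's the weather like today", "what is the weather like today")
far = estimated_similarity("what's the weather like today", "give me the forecast")
```

A real LSH layer (step 3) additionally slices each signature into bands and buckets them, so only prompts sharing at least one bucket are compared at all. A threshold is then just a cutoff on the estimated similarity.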
@@ -0,0 +1,50 @@ (new file)

---
pcx_content_type: concept
title: Chunking
sidebar:
  order: 6
---

Chunking is the process of splitting large data into smaller segments before embedding them for search. AutoRAG uses **recursive chunking**, which breaks your content at natural boundaries (like paragraphs or sentences), then splits it further if the chunks are too large.

## What is recursive chunking

Recursive chunking tries to keep chunks meaningful by:

- **Splitting at natural boundaries:** like paragraphs, then sentences.
- **Checking the size:** if a chunk is too long (based on token count), it is split again into smaller parts.

This way, chunks are easy to embed and retrieve without cutting off thoughts mid-sentence.

## Chunking controls

AutoRAG exposes two parameters to help you control chunking behavior:

- **Chunk size:** The number of tokens per chunk.
  - Minimum: `64`
  - Maximum: `512`
- **Chunk overlap:** The percentage of overlapping tokens between adjacent chunks.
  - Minimum: `0%`
  - Maximum: `30%`

These settings apply during the indexing step, before your data is embedded and stored in Vectorize.

## Choosing chunk size and overlap

Chunking affects both how your content is retrieved and how much context is passed to the generation model. Try this external [chunk visualizer tool](https://huggingface.co/spaces/m-ric/chunk_visualizer) to see how different chunk settings behave.

For chunk size, consider how:

- **Smaller chunks** create more precise vector matches, but may split relevant ideas across multiple chunks.
- **Larger chunks** retain more context, but may dilute relevance and reduce retrieval precision.

For chunk overlap, consider how:

- **More overlap** helps preserve continuity across boundaries, especially in flowing or narrative content.
- **Less overlap** reduces indexing time and cost, but can miss context if key terms are split between chunks.

### Additional considerations

- **Vector index size:** Smaller chunk sizes produce more chunks and more total vectors. Refer to the [Vectorize limits](/vectorize/platform/limits/) to ensure your configuration stays within the maximum allowed vectors per index.
- **Generation model context window:** Generation models have a limited context window that must fit all retrieved chunks (`topK` × chunk size), the user query, and the model's output. Be careful with large chunks or high `topK` values to avoid context overflows.
- **Cost and performance:** Larger chunks and higher `topK` settings pass more tokens to the model, which can increase latency and cost. You can monitor this usage in [AI Gateway](/ai-gateway/).
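The recursive strategy described above (split at a coarse boundary first, then re-split any oversized piece at finer boundaries) can be sketched like this. Token counting is approximated by word count, and the separator list is illustrative; AutoRAG's actual tokenizer and boundary rules may differ:

```python
def recursive_chunk(text, max_tokens=64, separators=("\n\n", ". ", " ")):
    # If the text fits the budget, or no finer separator is left, emit as-is.
    if len(text.split()) <= max_tokens or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not split anything: try the next-finer one.
        return recursive_chunk(text, max_tokens, finer)
    chunks = []
    for piece in pieces:
        if len(piece.split()) <= max_tokens:
            chunks.append(piece)          # natural boundary produced a good chunk
        else:
            chunks.extend(recursive_chunk(piece, max_tokens, finer))
    return chunks

text = "one two three. four five six. seven eight nine"
chunks = recursive_chunk(text, max_tokens=4)
```

Note how paragraph breaks are tried before sentence breaks before single spaces, which is why chunks tend to end at natural boundaries rather than mid-sentence. In AutoRAG the budget corresponds to the chunk size setting (64–512 tokens).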
@@ -0,0 +1,27 @@ (new file)

---
title: Data source
pcx_content_type: how-to
sidebar:
  order: 2
---

import { Render } from "~/components";

AutoRAG currently supports Cloudflare R2 as the data source for storing your knowledge base. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.

AutoRAG automatically scans and processes supported files stored in that bucket. Files that are unsupported or exceed the size limit are skipped during indexing and logged as errors.

## File limits

AutoRAG has different file size limits depending on the file type:

- Up to **4 MB** for files that are already in plain text or Markdown.
- Up to **1 MB** for files that need to be converted into Markdown (like PDFs or other rich formats).

Files that exceed these limits are not indexed and show up in the error logs.

## File types

AutoRAG uses [Markdown Conversion](/workers-ai/markdown-conversion/) under the hood and accepts the same file types. The following table lists the supported formats:

<Render file="markdown-conversion-support" product="workers-ai" />
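The two limits above amount to a simple pre-flight check you could run over a bucket listing before indexing. The extension-to-category mapping here is a deliberately small, illustrative subset; the authoritative list of supported types is the Markdown Conversion table:

```python
PLAIN_TEXT_LIMIT = 4 * 1024 * 1024   # 4 MB: already plain text or Markdown
CONVERSION_LIMIT = 1 * 1024 * 1024   # 1 MB: needs conversion to Markdown
PLAIN_TEXT_EXTENSIONS = {"txt", "md", "markdown"}  # illustrative subset only

def within_size_limit(filename, size_bytes):
    # Files over the applicable limit are skipped at indexing time
    # and logged as errors, so it pays to check before uploading.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    limit = PLAIN_TEXT_LIMIT if ext in PLAIN_TEXT_EXTENSIONS else CONVERSION_LIMIT
    return size_bytes <= limit
```

A rich format like PDF gets the smaller 1 MB budget because it must pass through Markdown conversion first.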
@@ -0,0 +1,35 @@ (new file)

---
pcx_content_type: navigation
title: Configuration
sidebar:
  order: 5
---

import { MetaInfo, Type } from "~/components";

When creating an AutoRAG instance, you can customize how your RAG pipeline ingests, processes, and responds to data using a set of configuration options. Some settings can be updated after the instance is created, while others are fixed at creation time.

The table below lists all available configuration options:

| Configuration | Editable after creation | Description |
| --- | --- | --- |
| [Data source](/autorag/configuration/data-source/) | no | The source where your knowledge base is stored (for example, an R2 bucket) |
| [Chunk size](/autorag/configuration/chunking/) | yes | Number of tokens per chunk |
| [Chunk overlap](/autorag/configuration/chunking/) | yes | Number of overlapping tokens between chunks |
| [Embedding model](/autorag/configuration/models/) | no | Model used to generate vector embeddings |
| [Query rewrite](/autorag/configuration/query-rewriting/) | yes | Enable or disable query rewriting before retrieval |
| [Query rewrite model](/autorag/configuration/models/) | yes | Model used for query rewriting |
| [Query rewrite system prompt](/autorag/configuration/system-prompt/) | yes | Custom system prompt to guide query rewriting behavior |
| [Match threshold](/autorag/configuration/retrieval-configuration/) | yes | Minimum similarity score required for a vector match |
| [Maximum number of results](/autorag/configuration/retrieval-configuration/) | yes | Maximum number of vector matches returned (`top_k`) |
| [Generation model](/autorag/configuration/models/) | yes | Model used to generate the final response |
| [Generation system prompt](/autorag/configuration/system-prompt/) | yes | Custom system prompt to guide response generation |
| [Similarity caching](/autorag/configuration/cache/) | yes | Enable or disable caching of responses for similar (not just exact) prompts |
| [Similarity caching threshold](/autorag/configuration/cache/) | yes | How similar a new prompt must be to a previous one to reuse its cached response |
| [AI Gateway](/ai-gateway) | yes | AI Gateway for monitoring and controlling model usage |
| AutoRAG name | no | Name of your AutoRAG instance |
| Service API token | yes | API token granted to AutoRAG so it can configure resources on your account |

:::note[API token]
The Service API token is different from the AutoRAG API token that you can create to interact with your AutoRAG. The Service API token is only used by AutoRAG to get permission to configure resources on your account.
:::
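The editable-after-creation column above implies a validation rule: updates touching a fixed-at-creation setting must be rejected. A minimal sketch of that rule, using illustrative key names that are not AutoRAG API field identifiers:

```python
# Settings the table marks as not editable after creation.
# (Keys here are illustrative, not AutoRAG API field names.)
IMMUTABLE_SETTINGS = {"data_source", "embedding_model", "autorag_name"}

def validate_update(changes):
    # Reject any update that touches a fixed-at-creation setting;
    # everything else in the table may be changed after creation.
    blocked = IMMUTABLE_SETTINGS & set(changes)
    if blocked:
        raise ValueError(f"Not editable after creation: {sorted(blocked)}")
    return changes
```

Changing the data source or embedding model would invalidate every stored vector, which is the practical reason these settings are fixed once the instance exists.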
@@ -0,0 +1,37 @@ (new file)

---
pcx_content_type: concept
title: Indexing
sidebar:
  order: 4
---

AutoRAG automatically indexes your data into vector embeddings optimized for semantic search. Once a data source is connected, indexing runs continuously in the background to keep your knowledge base fresh and queryable.

## Jobs

AutoRAG automatically monitors your data source for updates and reindexes your content **every 4 hours**. During each cycle, only new or modified files are reprocessed to keep your Vectorize index up to date.

## Controls

You can control indexing behavior through the following actions on the dashboard:

- **Sync Index:** Force AutoRAG to scan your data source for new or modified files and initiate an indexing job to update the associated Vectorize index. A new indexing job can be initiated every 5 minutes.
- **Pause Indexing:** Temporarily stop all scheduled indexing checks and reprocessing. Useful for debugging or freezing your knowledge base.

## Performance

AutoRAG processes files in parallel for efficient indexing. The total time to index depends on the number and type of files in your data source.

Factors that affect performance include:

- Total number of files and their sizes
- File formats (for example, images take longer than plain text)
- Latency of the Workers AI models used for embedding and image processing

## Best practices

To ensure smooth and reliable indexing:

- Make sure your files are within the [size limits](/autorag/configuration/data-source/) and in a supported format to avoid being skipped.
- Keep your Service API token valid to prevent indexing failures.
- Regularly clean up outdated or unnecessary content in your knowledge base to avoid hitting [Vectorize index limits](/vectorize/platform/limits/).
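The two timing rules above (a 4-hour automatic cycle, and manual Sync Index at most once every 5 minutes) boil down to a simple cooldown check. A minimal sketch, assuming a caller that passes in the current time; this models the documented behavior, not AutoRAG's scheduler code:

```python
from datetime import datetime, timedelta

SYNC_COOLDOWN = timedelta(minutes=5)   # manual Sync Index: one job per 5 minutes
AUTO_REINDEX = timedelta(hours=4)      # automatic change-scan cadence

class SyncScheduler:
    def __init__(self):
        self.last_job = None

    def try_sync(self, now):
        # Allow a new indexing job only once the cooldown has elapsed.
        if self.last_job is not None and now - self.last_job < SYNC_COOLDOWN:
            return False
        self.last_job = now
        return True
```

Requests inside the cooldown window simply do not start a new job; the next automatic cycle (every `AUTO_REINDEX`) will pick up any changes regardless.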