WIP Autorag #20865

---
pcx_content_type: concept
title: Similarity cache
sidebar:
  order: 6
---

Similarity-based caching in AutoRAG lets you serve responses from Cloudflare’s cache for queries that are _similar enough_ to previous requests, not just exact matches. This speeds up response times and cuts costs by reusing answers for questions that are close in meaning.

## How it works

Unlike basic caching, which only serves results for identical requests, similarity caching compares prompts based on their content. When a request comes in:
1. AutoRAG checks if a _similar_ prompt (based on your chosen threshold) has been answered before.
2. If a match is found, it returns the cached response instantly.
3. If no match is found, it generates a new response and caches it.

To see if a response came from the cache, check the `cf-aig-cache-status` header: `HIT` for cached and `MISS` for new.
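
For example, you could inspect the header on a response from your AutoRAG endpoint. A minimal sketch (the endpoint path, placeholders, and token are illustrative assumptions, not defined on this page):

```ts
// Illustrative only: endpoint path and placeholders are assumptions.
const response = await fetch(
	"https://api.cloudflare.com/client/v4/accounts/{account_id}/autorag/rags/{rag_id}/ai-search",
	{
		method: "POST",
		headers: {
			Authorization: "Bearer {api_token}",
			"Content-Type": "application/json",
		},
		body: JSON.stringify({ query: "What's the weather like today?" }),
	},
);

// "HIT" = served from the similarity cache, "MISS" = freshly generated.
console.log(response.headers.get("cf-aig-cache-status"));
```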

## Cache behavior

- **Volatile cache**: If two similar requests hit at the same time, the first might not cache in time for the second to use it, resulting in a `MISS`.
- **30-day cache**: Cached responses last 30 days, then expire automatically. No custom durations for now.
- **Data dependency**: Cached responses are tied to specific document chunks. If those chunks change or get deleted, the cache clears to keep answers fresh.

## How similarity matching works

Similarity caching in AutoRAG uses **MinHash with Locality-Sensitive Hashing (LSH)** to detect prompts that are lexically similar.

When a new prompt is received:

1. The prompt is broken into overlapping token sequences (called _shingles_), typically 2–3 words each.
2. These shingles are hashed into a compact fingerprint using the MinHash algorithm. Prompts with more overlapping shingles will have more similar fingerprints.
3. Fingerprints are grouped into LSH buckets, which allow AutoRAG to quickly find past prompts that are likely to be similar without scanning every cached prompt.
4. If a prompt in the same bucket meets the configured similarity threshold, its cached response is reused.
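
As intuition for steps 1 and 2, here is a toy sketch of shingling and MinHash fingerprinting. This is not AutoRAG's implementation; the shingle size, hash function, and fingerprint length are arbitrary illustrative choices.

```ts
// Toy MinHash sketch: illustrates the idea, not AutoRAG's actual code.

// Step 1: split a prompt into overlapping word shingles (here, 2 words each).
function shingles(prompt: string, size = 2): string[] {
	const words = prompt.toLowerCase().split(/\s+/).filter(Boolean);
	const out: string[] = [];
	for (let i = 0; i + size <= words.length; i++) {
		out.push(words.slice(i, i + size).join(" "));
	}
	return out;
}

// A simple seeded 32-bit string hash (FNV-1a variant) so we can derive
// many independent hash functions from one routine.
function hash32(s: string, seed: number): number {
	let h = (0x811c9dc5 ^ seed) >>> 0;
	for (let i = 0; i < s.length; i++) {
		h ^= s.charCodeAt(i);
		h = Math.imul(h, 0x01000193) >>> 0;
	}
	return h >>> 0;
}

// Step 2: the MinHash fingerprint keeps, for each hash function, the
// minimum hash over all shingles. Similar shingle sets share many minima.
function minhash(prompt: string, numHashes = 64): number[] {
	const sh = shingles(prompt);
	return Array.from({ length: numHashes }, (_, seed) =>
		Math.min(...sh.map((s) => hash32(s, seed))),
	);
}

// Estimated similarity = fraction of matching fingerprint positions.
function similarity(a: number[], b: number[]): number {
	let same = 0;
	for (let i = 0; i < a.length; i++) if (a[i] === b[i]) same++;
	return same / a.length;
}

console.log(
	similarity(
		minhash("What's the weather like today?"),
		minhash("What is the weather like today?"),
	),
); // High for lexically similar prompts, low for unrelated ones.
```

LSH (step 3) then groups these fingerprints into buckets so only candidates in the same bucket need comparing, instead of every cached prompt.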

## Choosing a threshold

The similarity threshold decides how close two prompts need to be to reuse a cached response. Here’s what you can pick from:

| Threshold        | Description                 | Example match                                                                    |
| ---------------- | --------------------------- | -------------------------------------------------------------------------------- |
| Exact            | Near-identical matches only | "What’s the weather like today?" matches with "What is the weather like today?"  |
| Strong (default) | High semantic similarity    | "What’s the weather like today?" matches with "How’s the weather today?"         |
| Broad            | Moderate match, more hits   | "What’s the weather like today?" matches with "Tell me today’s weather"          |
| Loose            | Low similarity, max reuse   | "What’s the weather like today?" matches with "Give me the forecast"             |

Test these values to see which works best with your application.

---
pcx_content_type: concept
title: Chunking
sidebar:
  order: 6
---

Chunking is the process of splitting large data into smaller segments before embedding them for search. AutoRAG performs **fixed-size chunking** during indexing to make your content retrievable at the right level of granularity.

## Chunking controls

AutoRAG exposes two parameters to help you control chunking behavior:

- **Chunk size**: The number of tokens per chunk.
  - Minimum: `64`
  - Maximum: `512`
- **Chunk overlap**: The percentage of overlapping tokens between adjacent chunks.
  - Minimum: `0%`
  - Maximum: `30%`

These settings apply during the indexing step, before your data is embedded and stored in Vectorize.

## Example

Let’s say your document is tokenized as: `[The, quick, brown, fox, jumps, over, the, lazy, dog, ...]`

With **chunk size = 5** and **chunk overlap = 40%** (that is, 2 tokens), your chunks will look like:

- Chunk 1: `[The, quick, brown, fox, jumps]`
- Chunk 2: `[fox, jumps, over, the, lazy]`
- Chunk 3: `[the, lazy, dog, ...]`
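
The sliding-window arithmetic behind this example can be sketched in a few lines of TypeScript (illustrative only: words stand in for tokens here, while AutoRAG chunks real tokenizer output):

```ts
// Simplified fixed-size chunking with percentage overlap.
// Assumes overlap < 100% so the window always advances.
function chunk(tokens: string[], chunkSize: number, overlapPercent: number): string[][] {
	const overlap = Math.floor(chunkSize * (overlapPercent / 100));
	const step = chunkSize - overlap; // how far the window advances each time
	const chunks: string[][] = [];
	for (let start = 0; start < tokens.length; start += step) {
		chunks.push(tokens.slice(start, start + chunkSize));
		if (start + chunkSize >= tokens.length) break; // last window reached the end
	}
	return chunks;
}

const tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
console.log(chunk(tokens, 5, 40));
// [["The","quick","brown","fox","jumps"],
//  ["fox","jumps","over","the","lazy"],
//  ["the","lazy","dog"]]
```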

## Choosing chunk size and overlap

Chunking affects both how your content is retrieved and how much context is passed into the generation model.

For chunk size, consider how:

- **Smaller chunks** create more precise vector matches, but may split relevant ideas across multiple chunks.
- **Larger chunks** retain more context, but may dilute relevance and reduce retrieval precision.

For chunk overlap, consider how:

- **More overlap** helps preserve continuity across boundaries, especially in flowing or narrative content.
- **Less overlap** reduces indexing time and cost, but can miss context if key terms are split between chunks.

### Additional considerations

- **Vector index size:** Smaller chunk sizes produce more chunks and more total vectors. Refer to the [Vectorize limits](/vectorize/platform/limits/) to ensure your configuration stays within the maximum allowed vectors per index.
- **Generation model context window:** Generation models have a limited context window that must fit all retrieved chunks (`topK` × `chunk size`), the user query, and the model’s output. Be careful with large chunks or high topK values to avoid context overflows; see the sketch after this list.
- **Cost and performance:** Larger chunks and higher topK settings result in more tokens passed to the model, which can increase latency and cost. You can monitor this usage in [AI Gateway](/ai-gateway/).
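
As a back-of-the-envelope check for the context window consideration above (all numbers are illustrative; substitute your model's actual limits):

```ts
// Rough token-budget check with illustrative numbers.
const contextWindow = 8192; // generation model's context window (assumed)
const chunkSize = 512; // tokens per retrieved chunk
const topK = 20; // maximum number of results
const queryTokens = 200; // user query plus system prompt (estimate)
const outputTokens = 1024; // room reserved for the model's answer

const needed = topK * chunkSize + queryTokens + outputTokens; // 11,464
console.log(needed <= contextWindow); // false: lower chunk size or topK
```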

---
title: Data source
pcx_content_type: how-to
sidebar:
  order: 2
---

import { Render } from "~/components";

AutoRAG currently supports Cloudflare R2 as the data source for storing your knowledge base. To get started, [configure an R2 bucket](/r2/get-started/) containing your data.

AutoRAG will automatically scan and process supported files stored in that bucket. Files that are unsupported or exceed the size limit will be skipped during indexing and logged as errors.

## File limits

AutoRAG has different file size limits depending on the file type:

- Up to **4 MB** for files that are already in plain text or Markdown.
- Up to **1 MB** for files that need to be converted into Markdown (like PDFs or other rich formats).

Files that exceed these limits won’t be indexed and will show up in the error logs.

## File types

AutoRAG is powered by and accepts the same file types as [Markdown Conversion](/workers-ai/markdown-conversion/). The following table lists the supported formats:

<Render file="markdown-conversion-support" product="workers-ai" />

---
pcx_content_type: navigation
title: Configuration
sidebar:
  order: 5
---

import { MetaInfo, Type } from "~/components";

When creating an AutoRAG instance, you can customize how your RAG pipeline ingests, processes, and responds to data using a set of configuration options. Some settings can be updated after the instance is created, while others are fixed at creation time.

The table below lists all available configuration options:

| Configuration                                                                 | Editable after creation | Description                                                                                |
| ----------------------------------------------------------------------------- | ----------------------- | ------------------------------------------------------------------------------------------ |
| [Data source](/autorag/configuration/data-source/)                            | no                      | The source where your knowledge base is stored (for example, an R2 bucket)                 |
| [Chunk size](/autorag/configuration/chunking/)                                | yes                     | Number of tokens per chunk                                                                 |
| [Chunk overlap](/autorag/configuration/chunking/)                             | yes                     | Number of overlapping tokens between chunks                                                |
| [Embedding model](/autorag/configuration/models/)                             | no                      | Model used to generate vector embeddings                                                   |
| [Query rewrite](/autorag/configuration/query-rewriting/)                      | yes                     | Enable or disable query rewriting before retrieval                                         |
| [Query rewrite model](/autorag/configuration/models/)                         | yes                     | Model used for query rewriting                                                             |
| [Query rewrite system prompt](/autorag/configuration/system-prompt/)          | yes                     | Custom system prompt to guide query rewriting behavior                                     |
| [Match threshold](/autorag/configuration/retrieval-configuration/)            | yes                     | Minimum similarity score required for a vector match                                       |
| [Maximum number of results](/autorag/configuration/retrieval-configuration/)  | yes                     | Maximum number of vector matches returned (`top_k`)                                        |
| [Generation model](/autorag/configuration/models/)                            | yes                     | Model used to generate the final response                                                  |
| [Generation system prompt](/autorag/configuration/system-prompt/)             | yes                     | Custom system prompt to guide response generation                                          |
| [Similarity caching](/autorag/configuration/cache/)                           | yes                     | Enable or disable caching of responses for similar (not just exact) prompts                |
| [Similarity caching threshold](/autorag/configuration/cache/)                 | yes                     | Controls how similar a new prompt must be to a previous one to reuse its cached response   |
| [AI Gateway](/ai-gateway/)                                                    | yes                     | AI Gateway for monitoring and controlling model usage                                      |
| AutoRAG name                                                                  | no                      | Name of your AutoRAG instance                                                              |
| Service API token                                                             | yes                     | API token granted to AutoRAG to give it permission to configure resources on your account  |

:::note[API token]
The Service API token is different from the AutoRAG API token that you can create to interact with your AutoRAG. The Service API token is only used by AutoRAG to get permissions to configure resources on your account.
:::

---
pcx_content_type: concept
title: Indexing
sidebar:
  order: 4
---

AutoRAG automatically indexes your data into vector embeddings optimized for semantic search. Once a data source is connected, indexing runs continuously in the background to keep your knowledge base fresh and queryable.

## Jobs

AutoRAG automatically monitors your data source for updates and reindexes your content **every 4 hours**. During each cycle, only new or modified files are reprocessed to keep your Vectorize index up to date.

## Controls

You can control indexing behavior through the following actions on the dashboard:

- **Sync Index**: Force AutoRAG to scan your data source for new or modified files and initiate an indexing job to update the associated Vectorize index. A new indexing job can be initiated every 5 minutes.

Indexing time depends on several factors, including:

- File formats (for example, images take longer than plain text)

## Best practices

---
pcx_content_type: concept
title: Models
sidebar:
  order: 4
---

AutoRAG uses models at multiple steps of the RAG pipeline. You can configure which models are used, or let AutoRAG automatically select defaults optimized for general use.

## Models used

AutoRAG leverages Workers AI models in the following stages:

- **Image to Markdown conversion (if images are in data source)**: Converts image content to Markdown using object detection and captioning models.
- **Embedding**: Transforms your documents and queries into vector representations for semantic search.
- **Query rewriting (optional)**: Reformulates the user’s query to improve retrieval accuracy.
- **Generation**: Produces the final response from retrieved context.

## Model providers

AutoRAG currently only supports [Workers AI](/workers-ai/) as the model provider. Usage of models through AutoRAG contributes to your Workers AI usage and is billed as part of your account.

When configuring your AutoRAG instance, you can specify the exact model to use for each step of embedding, rewriting, and generation. You can find available models that can be used with AutoRAG in the **Settings** of your AutoRAG.

If you choose **Smart Default** in your model selection, then AutoRAG will select a Cloudflare recommended model. These defaults may change over time as Cloudflare evaluates and updates model choices. You can switch to explicit model configuration at any time by visiting **Settings**.

---
pcx_content_type: concept
title: Query rewriting
sidebar:
  order: 5
---

Query rewriting is an optional step in the AutoRAG pipeline that improves retrieval quality by transforming the original user query into a more effective search query.

Instead of embedding the raw user input directly, AutoRAG can use a large language model (LLM) to rewrite the query based on a system prompt. The rewritten query is then used to perform the vector search.

## Why use query rewriting?

The wording of a user’s question may not match how your documents are written. Query rewriting helps bridge this gap by:

- Rephrasing informal or vague queries into precise, information-dense terms
- Adding synonyms or related keywords
- Removing filler words or irrelevant details
- Incorporating domain-specific terminology

This leads to more relevant vector matches, which in turn improves the accuracy of the final generated response.

## Example

**Original query:** `how do i make this work when my api call keeps failing?`

**Rewritten query:** `API call failure troubleshooting authentication headers rate limiting network timeout 500 error`

In this example, the original query is conversational and vague. The rewritten version extracts the core problem (API call failure) and expands it with relevant technical terms and likely causes. These terms are much more likely to appear in documentation or logs, improving semantic matching during vector search.

## How it works

If query rewriting is enabled, AutoRAG performs the following:

1. Sends the **original user query** and the **query rewrite system prompt** to the configured LLM
2. Receives the **rewritten query** from the model
3. Embeds the rewritten query using the selected embedding model
4. Performs vector search in your AutoRAG’s Vectorize index

For details on how to guide model behavior during this step, see the [system prompt](/autorag/configuration/system-prompt/) documentation.
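
If you call AutoRAG from a Worker, rewriting can be toggled per request. A minimal sketch (the `env.AI` binding shape, instance name, and `rewrite_query` flag are assumptions here; see the [Workers binding](/autorag/usage/workers-binding/) docs for exact names):

```ts
// Sketch: the AI binding, instance name, and `rewrite_query` flag are
// assumptions for illustration, not confirmed by this page.
export default {
	async fetch(request: Request, env: { AI: any }): Promise<Response> {
		const result = await env.AI.autorag("my-autorag").aiSearch({
			query: "how do i make this work when my api call keeps failing?",
			rewrite_query: true, // rewrite before embedding and vector search
		});
		return Response.json(result);
	},
};
```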

---
pcx_content_type: concept
title: Retrieval configuration
sidebar:
  order: 5
---

AutoRAG allows you to configure how content is retrieved from your vector index and used to generate a final response. Two options control this behavior:

- **Match threshold**: Minimum similarity score required for a vector match to be considered relevant.
- **Maximum number of results**: Maximum number of top-matching results to return (`top_k`).

AutoRAG uses the [`query()`](/vectorize/best-practices/query-vectors/) method from [Vectorize](/vectorize/) to perform semantic search. This function compares the embedded query vector against the stored vectors in your index and returns the most similar results.

## Match threshold

The `match_threshold` sets the minimum similarity score (for example, cosine similarity) that a document chunk must meet to be included in the results. Threshold values range from `0` to `1`.

- A higher threshold means stricter filtering, returning only highly similar matches.
- A lower threshold allows broader matches, increasing recall but possibly reducing precision.

## Maximum number of results

This setting controls the number of top-matching chunks returned by Vectorize after filtering by similarity score. It corresponds to the `topK` parameter in `query()`. The maximum allowed value is 50.

- Use a higher value if you want to synthesize across multiple documents. However, providing more input to the model can increase latency and cost.
- Use a lower value if you prefer concise answers with minimal context.

## How they work together

AutoRAG's retrieval step follows this sequence:

1. Your query is embedded using the configured Workers AI model.
2. `query()` is called to search the Vectorize index, with `topK` set to the `maximum_number_of_results`.
3. Results are filtered using the `match_threshold`.
4. The filtered results are passed into the generation step as context.

If no results meet the threshold, AutoRAG will not generate a response.
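
The same sequence can be sketched with the Vectorize Workers binding. AutoRAG runs this internally; `env.VECTORIZE` and the pre-computed `queryVector` are stand-ins for your configured bindings and embedding call:

```ts
// Conceptual sketch of the retrieval sequence; not AutoRAG source code.
async function retrieve(
	env: { VECTORIZE: VectorizeIndex },
	queryVector: number[], // step 1: query embedded by the Workers AI model
	maxNumResults: number,
	matchThreshold: number,
) {
	// Step 2: search the index with topK = maximum number of results.
	const { matches } = await env.VECTORIZE.query(queryVector, {
		topK: maxNumResults,
	});
	// Step 3: keep only matches at or above the match threshold.
	const relevant = matches.filter((m) => m.score >= matchThreshold);
	// Step 4: `relevant` becomes the context for the generation step.
	return relevant;
}
```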

## Configuration

These values can be configured at the AutoRAG instance level or overridden on a per-request basis using the [REST API](/autorag/usage/rest-api/) or the [Workers binding](/autorag/usage/workers-binding/).

Use the parameters `match_threshold` and `max_num_results` to customize retrieval behavior per request.
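
For instance, a per-request override through the Workers binding might look like the following. This is a sketch: the parameter names mirror this page (`match_threshold`, `max_num_results`), but the exact call shape is an assumption; see the Workers binding docs for the definitive API.

```ts
// Sketch of a per-request override; the call shape is an assumption.
export default {
	async fetch(request: Request, env: { AI: any }): Promise<Response> {
		const result = await env.AI.autorag("my-autorag").aiSearch({
			query: "What is the refund policy?",
			match_threshold: 0.6, // drop weaker matches
			max_num_results: 5, // cap how many chunks reach generation
		});
		return Response.json(result);
	},
};
```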