<!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.-->

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Retrieval Augmented Generation (RAG) in Vespa using AWS Bedrock models

This sample application demonstrates an end-to-end Retrieval Augmented
Generation application in Vespa, leveraging [AWS Bedrock](https://aws.amazon.com/bedrock/) hosted models.

This sample application focuses on the generation part of RAG and builds upon
the [MS Marco passage
ranking](https://github.com/vespa-engine/sample-apps/tree/master/msmarco-ranking)
sample application. Refer to that application for more advanced forms of
retrieval, such as vector search and cross-encoder re-ranking. The generation
steps in this sample application happen after retrieval, so those techniques can
easily be used here as well. For the purposes of this sample application, we use
a simple example of [hybrid search and ranking](https://docs.vespa.ai/en/tutorials/hybrid-search.html#hybrid-ranking)
to demonstrate Vespa's capabilities.

For more details on retrieval augmented generation in Vespa, refer to
the [RAG in Vespa](https://docs.vespa.ai/en/llms-rag.html) documentation page.
For the general use of LLMs in Vespa, refer to [LLMs in
Vespa](https://docs.vespa.ai/en/llms-in-vespa.html).

## AWS setup

### Choose your model

This integration relies on the ability to invoke LLM endpoints from Vespa with an [OpenAI chat completions API](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html). At the time of writing, the only AWS Bedrock models that can be invoked with the OpenAI chat completions API are the OpenAI models `gpt-oss-20b` and `gpt-oss-120b`.

If you want to use another model, an alternative is to expose an OpenAI chat completions endpoint through a [Bedrock access gateway](https://github.com/aws-samples/bedrock-access-gateway). The same integration instructions apply after creating the endpoint.

### Choose your region

Availability of the models may vary per region. The format of the Bedrock runtime endpoint is as follows:

`https://bedrock-runtime.{region}.amazonaws.com`

You may want to colocate your model endpoint with the AWS region where
Vespa is deployed. By default, this application is deployed to the `dev` environment in the `aws-us-east-1` zone.
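
If you have the AWS CLI configured, one way to check which `gpt-oss` models are available in
a region is to list its foundation models. This command is a suggestion, not part of the
sample application; the JMESPath filter is just one way to narrow the output:
<pre>
$ aws bedrock list-foundation-models --region us-east-1 \
    --query "modelSummaries[?contains(modelId, 'gpt-oss')].modelId"
</pre>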

### Set up an AWS Bedrock API key

Create an [AWS Bedrock API key](https://docs.aws.amazon.com/bedrock/latest/userguide/api-keys.html).

### Test your endpoint

You can test your endpoint with curl:

<pre>
export AWS_BEARER_TOKEN_BEDROCK=ABSKQmVk....
curl -X POST https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
  -d '{
    "model": "openai.gpt-oss-20b-1:0",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you today?"
      }
    ]
}'
</pre>

Once this test completes successfully, you can proceed to the next step.
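
The RAG queries later in this guide stream tokens with server-sent events. To verify
streaming against the endpoint directly, a variant of the same test should work, assuming
the endpoint honors the standard OpenAI `stream` field (`-N` turns off curl's output
buffering so tokens appear as they arrive):
<pre>
curl -N -X POST https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
  -d '{
    "model": "openai.gpt-oss-20b-1:0",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Count to five."
      }
    ]
}'
</pre>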

## Vespa setup

The following is a quick-start recipe that uses a tiny slice of the
[MS Marco](https://microsoft.github.io/msmarco/) passage ranking dataset to showcase a RAG pattern leveraging AWS Bedrock models.

See the [MS Marco passage
ranking](https://github.com/vespa-engine/sample-apps/tree/master/msmarco-ranking) sample
application for instructions on downloading the entire dataset.

In the following, we deploy the sample application to Vespa Cloud.

Make sure that the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) is
installed and updated to the newest version:
<pre>
$ brew install vespa-cli
</pre>

Download this sample application:
<pre data-test="exec">
$ vespa clone aws-simple-rag bedrock-rag && cd bedrock-rag
</pre>

### Deploying to Vespa Cloud

Deploy the sample application to Vespa Cloud. Note that this application fits within the free quota, so it is free to try.

In the following section, we set the Vespa CLI target to the cloud.
Make sure you have created a tenant at
[console.vespa-cloud.com](https://console.vespa-cloud.com/), and make a note of the
tenant's name; it is used in the next steps. For more information, see the
Vespa Cloud [getting started](https://cloud.vespa.ai/en/getting-started) guide.

Add your AWS Bedrock API key to the Vespa secret store as described in
[Secret Management](https://cloud.vespa.ai/en/security/secret-store.html#secret-management).
Unless you already have one, create a new vault, and add your AWS Bedrock API key as a secret.

The `services.xml` file must refer to the newly added secret in the secret store.
Replace `<my-vault-name>` and `<my-secret-name>` below with your own values:

```xml
<secrets>
    <bedrock-api-key vault="<my-vault-name>" name="<my-secret-name>"/>
</secrets>
```

Configure the Vespa CLI. Replace `tenant-name` below with your tenant name.
We use the application name `aws-app` here, but you are free to choose your own
application name:
<pre>
$ vespa config set target cloud
$ vespa config set application tenant-name.aws-app
</pre>

Log in and add your public certificates to the application for data plane access:
<pre>
$ vespa auth login
$ vespa auth cert
</pre>

Grant the application access to the secret.
The application must exist before access can be granted in the Vespa Cloud Console,
and the easiest way to create it is to deploy it. This first deployment is expected to fail:

<pre>
$ vespa deploy --wait 900
</pre>

```
[09:47:43] warning Deployment failed: Invalid application: Vault 'my_vault' does not exist,
or application does not have access to it
```

At this point, open the Vespa Cloud Console
(the URL looks like https://console.vespa-cloud.com/tenant/mytenant/account/secrets)
and grant the application access to the secret.

Deploy the application again. It can take some time for all nodes to be provisioned:
<pre>
$ vespa deploy --wait 900
</pre>

The application should now be deployed!

### Feeding

Let's feed the documents:
<pre data-test="exec">
$ vespa feed ext/docs.jsonl
</pre>
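
Each line in `ext/docs.jsonl` is a feed operation in the
[Vespa JSON document format](https://docs.vespa.ai/en/reference/document-json-format.html).
Illustratively, a line might look roughly like this (the document ID and field names are
assumptions based on the MS Marco passage schema; inspect the file for the actual format):
<pre>
{"put": "id:msmarco:passage::0", "fields": {"id": 0, "text": "The Manhattan Project was a research and development undertaking ..."}}
</pre>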

### Querying: hybrid retrieval

Run a query first to check the retrieval. The query combines lexical matching
(`userInput`) with vector search (`nearestNeighbor`), scored by the `hybrid` rank profile:
<pre data-test="exec" data-test-assert-contains="Manhattan">
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid
</pre>


### RAG with AWS Bedrock

To test generation using the OpenAI client, post a query that runs the `bedrock` search chain:
<pre>
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid \
  searchChain=bedrock \
  format=sse \
  traceLevel=1 \
  timeout=60s
</pre>

Here, we specifically set the search chain to `bedrock`.
This invokes the
[RAGSearcher](https://github.com/vespa-engine/vespa/blob/master/container-search/src/main/java/ai/vespa/search/llm/RAGSearcher.java),
which is set up to use the
[OpenAI](https://github.com/vespa-engine/vespa/blob/master/model-integration/src/main/java/ai/vespa/llm/clients/OpenAI.java) client, as we are leveraging the [AWS Bedrock OpenAI chat completions API endpoint](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html).
Note that this requires the AWS Bedrock API key added to the secret store earlier.
We also add a timeout, as token generation can take some time.
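
For reference, the wiring behind the `bedrock` chain in `services.xml` looks roughly like the
sketch below. The element and config names follow the [LLMs in
Vespa](https://docs.vespa.ai/en/llms-in-vespa.html) documentation, but the endpoint and
component IDs here are assumptions; check this application's actual `services.xml`:

```xml
<container id="default" version="1.0">
    <secrets>
        <bedrock-api-key vault="my-vault" name="my-secret"/>
    </secrets>
    <!-- OpenAI-compatible client pointed at the Bedrock runtime endpoint -->
    <component id="openai" class="ai.vespa.llm.clients.OpenAI">
        <config name="ai.vespa.llm.clients.llm-client">
            <apiKeySecretName>bedrock-api-key</apiKeySecretName>
            <endpoint>https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions</endpoint>
        </config>
    </component>
    <search>
        <chain id="bedrock" inherits="vespa">
            <!-- Builds the prompt from the retrieved hits and calls the client above -->
            <searcher id="ai.vespa.search.llm.RAGSearcher">
                <config name="ai.vespa.search.llm.llm-searcher">
                    <providerId>openai</providerId>
                </config>
            </searcher>
        </chain>
    </search>
</container>
```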

### Structured output

You can also specify a structured output format for the LLM.
In the example below, we provide a JSON schema that forces the LLM to return the answer in three
different fields:

- `answer-short`: a short answer to the question
- `answer-short-french`: a translation of the short answer into French
- `answer-short-eli5`: an explanation of the answer as if the user were 5 years old

<pre data-test="exec" data-test-assert-contains="answer-short-french">
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid \
  searchChain=bedrock \
  format=sse \
  llm.json_schema="{\"type\":\"object\",\"properties\":{\"answer-short\":{\"type\":\"string\"},\"answer-short-french\":{\"type\":\"string\",\"description\":\"exact translation of short answer in French language\"},\"answer-short-eli5\":{\"type\":\"string\",\"description\":\"explain the answer like I am 5 years old\"}},\"required\":[\"answer-short\",\"answer-short-french\",\"answer-short-eli5\"],\"additionalProperties\":false}" \
  traceLevel=1 \
  timeout=60s
</pre>

The `llm.json_schema` parameter specifies the expected structure of the LLM output,
expressed in JSON Schema format.

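For readability, here is the same schema pretty-printed:

```json
{
  "type": "object",
  "properties": {
    "answer-short": { "type": "string" },
    "answer-short-french": {
      "type": "string",
      "description": "exact translation of short answer in French language"
    },
    "answer-short-eli5": {
      "type": "string",
      "description": "explain the answer like I am 5 years old"
    }
  },
  "required": ["answer-short", "answer-short-french", "answer-short-eli5"],
  "additionalProperties": false
}
```
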
## Query parameters

The parameters used in the queries above are:

- `query`: the query used both for retrieval and as the prompt question
- `hits`: the number of hits that Vespa should return in the retrieval stage
- `ranking`: the rank profile used to score hits in the retrieval stage
- `searchChain`: the search chain set up in `services.xml` that calls the
  generative process
- `format`: sets the response format to server-sent events, which streams the tokens
  as they are generated
- `traceLevel`: outputs some debug information, such as the actual prompt that
  was sent to the LLM and token timing
- `timeout`: gives the LLM enough time to finish generating tokens

For more information on how to customize the prompt, refer to the [RAG
in Vespa](https://docs.vespa.ai/en/llms-rag.html) documentation.


## Shut down and remove the RAG application

To remove the application and its data from Vespa Cloud:
<pre>
$ vespa destroy
</pre>