Commit a571c2b

Merge pull request #1807 from vespa-engine/zoharsan/aws-simple-rag
Zoharsan/aws simple rag
2 parents 5e71970 + fe1c3e2 commit a571c2b

File tree

7 files changed: +1366 -0 lines changed


examples/README.md

Lines changed: 4 additions & 0 deletions
@@ -142,6 +142,10 @@ For any questions, please register at the Vespa Slack and discuss in the general

[![logo](/assets/vespa-logomark-tiny.png) mcp-server-app](mcp-server-app) This simple sample app combines a job matching platform with an integrated MCP server.

### RAG with AWS Bedrock hosted LLM Models

[![logo](/assets/vespa-logomark-tiny.png) aws-simple-rag](aws-simple-rag) This simple RAG application is a flavor of [Retrieval Augmented Generation (RAG) in Vespa](../retrieval-augmented-generation), where the RAG application runs in Vespa but leverages LLMs hosted in AWS Bedrock through the OpenAI API to generate the final response.

----

### Operations
examples/aws-simple-rag/.vespaignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# This file excludes unnecessary files from the application package. See
# https://docs.vespa.ai/en/reference/vespaignore.html for more information.
.DS_Store
.gitignore
README.md
ext/

examples/aws-simple-rag/README.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
<!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.-->

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Retrieval Augmented Generation (RAG) in Vespa using AWS Bedrock models

This sample application demonstrates an end-to-end Retrieval Augmented
Generation application in Vespa, leveraging [AWS Bedrock](https://aws.amazon.com/bedrock/) hosted models.

It focuses on the generation part of RAG, and builds upon the
[MS Marco passage ranking](https://github.com/vespa-engine/sample-apps/tree/master/msmarco-ranking)
sample application. Please refer to that sample application for details on more
advanced forms of retrieval, such as vector search and cross-encoder
re-ranking. The generation steps in this sample application happen after
retrieval, so those techniques can easily be used here as well. For the
purposes of this sample application, we use a simple example of
[hybrid search and ranking](https://docs.vespa.ai/en/tutorials/hybrid-search.html#hybrid-ranking)
to demonstrate Vespa's capabilities.

For more details on retrieval augmented generation in Vespa, see the
[RAG in Vespa](https://docs.vespa.ai/en/llms-rag.html) documentation page.
For more on the general use of LLMs in Vespa, see
[LLMs in Vespa](https://docs.vespa.ai/en/llms-in-vespa.html).

## AWS Setup

### Choose your model

This integration relies on the ability to invoke LLM endpoints from Vespa with an [OpenAI chat completions API](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html). At the time of writing, the only AWS Bedrock models that can be invoked with the OpenAI Chat Completions API are the OpenAI models `gpt-oss-20b` and `gpt-oss-120b`.

If you want to use another model, you can instead expose an OpenAI chat completions endpoint through a [Bedrock access gateway](https://github.com/aws-samples/bedrock-access-gateway). The same integration instructions apply after creating the endpoint.

### Choose your region

Availability of the models may vary per region. The format of the Bedrock runtime endpoint is:

`https://bedrock-runtime.{region}.amazonaws.com`

You may want to colocate your model endpoint with the AWS region where
Vespa is deployed. By default, this application is deployed to the `dev` environment in the `aws-us-east-1` region.

### Set up an AWS Bedrock API key

Create an [AWS Bedrock API key](https://docs.aws.amazon.com/bedrock/latest/userguide/api-keys.html).

### Test your endpoint

You can test your endpoint with curl:
<pre>
export AWS_BEARER_TOKEN_BEDROCK=ABSKQmVk....
curl -X POST https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
  -d '{
    "model": "openai.gpt-oss-20b-1:0",
    "messages": [
      {
        "role": "user",
        "content": "Hello! How are you today?"
      }
    ]
  }'
</pre>
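
The curl call above can be reproduced from any HTTP client. The following is a minimal sketch using only the Python standard library; the region and model ID mirror the curl example and should be adjusted to your own Bedrock setup. It only builds the request; sending it requires a valid API key:

```python
import json
import os
import urllib.request

def build_request(region: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same chat-completions request the curl example sends."""
    url = f"https://bedrock-runtime.{region}.amazonaws.com/openai/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    # Token is read from the same environment variable as the curl example.
    token = os.environ.get("AWS_BEARER_TOKEN_BEDROCK", "")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

req = build_request("us-east-1", "openai.gpt-oss-20b-1:0", "Hello! How are you today?")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```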

Once this test completes successfully, you can proceed to the next step.

## Vespa setup

The following is a quick-start recipe that uses a tiny slice of the
[MS Marco](https://microsoft.github.io/msmarco/) passage ranking dataset to
showcase a RAG pattern leveraging AWS Bedrock models.

Please refer to the [MS Marco passage
ranking](https://github.com/vespa-engine/sample-apps/tree/master/msmarco-ranking)
sample application for instructions on downloading the entire dataset.

In the following we will deploy the sample application to Vespa Cloud.

Make sure that the [Vespa CLI](https://docs.vespa.ai/en/vespa-cli.html) is
installed, and update it to the newest version:
<pre>
$ brew install vespa-cli
</pre>

Download this sample application:
<pre data-test="exec">
$ vespa clone aws-simple-rag bedrock-rag && cd bedrock-rag
</pre>

### Deploying to Vespa Cloud

Deploy the sample application to Vespa Cloud. Note that this application fits
within the free quota, so it is free to try.

In the following we set the Vespa CLI target to the cloud.
Make sure you have created a tenant at
[console.vespa-cloud.com](https://console.vespa-cloud.com/), and make a note of
the tenant's name; it will be used in the next steps. For more information, see
the Vespa Cloud [getting started](https://cloud.vespa.ai/en/getting-started) guide.

Add your AWS Bedrock API key to the Vespa secret store as described in
[Secret Management](https://cloud.vespa.ai/en/security/secret-store.html#secret-management).
Unless you already have one, create a new vault, and add your AWS Bedrock API key as a secret.

The `services.xml` file must refer to the newly added secret in the secret store.
Replace `<my-vault-name>` and `<my-secret-name>` below with your own values:

```xml
<secrets>
    <bedrock-api-key vault="<my-vault-name>" name="<my-secret-name>"/>
</secrets>
```


Configure the Vespa client. Replace `tenant-name` below with your tenant name.
We use the application name `aws-app` here, but you are free to choose your own
application name:
<pre>
$ vespa config set target cloud
$ vespa config set application tenant-name.aws-app
</pre>

Log in and add your public certificates to the application for data plane access:
<pre>
$ vespa auth login
$ vespa auth cert
</pre>

Grant the application access to the secret.
The application must exist before access can be granted in the Vespa Cloud Console;
the easiest way is to deploy, which auto-creates the application.
This first deployment will fail:

<pre>
$ vespa deploy --wait 900
</pre>

```
[09:47:43] warning Deployment failed: Invalid application: Vault 'my_vault' does not exist,
or application does not have access to it
```

At this point, open the console
(the link looks like https://console.vespa-cloud.com/tenant/mytenant/account/secrets)
and grant access:

![edit application access dialog](ext/edit-app-access.png)

Deploy the application again. It can take some time for all nodes to be provisioned:
<pre>
$ vespa deploy --wait 900
</pre>

The application should now be deployed!

### Feeding

Let's feed the documents:
<pre data-test="exec">
$ vespa feed ext/docs.jsonl
</pre>
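
Each line of `ext/docs.jsonl` is a Vespa feed operation in the JSON document format. The sketch below only illustrates the shape of such a line; the document ID and field values are made up, and the field names assume the `passage` schema has `id` and `text` fields (check the application's schema and the files in `ext/` for the real layout):

```python
import json

# A single feed operation: "put" carries the document ID,
# "fields" carries the schema fields (names assumed for illustration).
doc = {
    "put": "id:msmarco:passage::7067032",
    "fields": {
        "id": 7067032,
        "text": "The Manhattan Project was a research and development undertaking ...",
    },
}
line = json.dumps(doc)
print(line)
```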

### Querying: Hybrid Retrieval

Run a query first to check the retrieval:
<pre data-test="exec" data-test-assert-contains="Manhattan">
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid
</pre>
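
Conceptually, a hybrid rank profile combines a lexical score such as bm25 with vector closeness. The exact formula lives in this application's rank profile; the sketch below only illustrates the idea, using Vespa's documented closeness definition of 1 / (1 + distance):

```python
import math

def closeness(query_vec, doc_vec):
    # Euclidean distance mapped to (0, 1]: identical vectors give 1.0.
    distance = math.sqrt(sum((q - d) ** 2 for q, d in zip(query_vec, doc_vec)))
    return 1.0 / (1.0 + distance)

def hybrid_score(bm25_score, query_vec, doc_vec):
    # Simple additive combination for illustration; real rank profiles
    # often weight or normalize the two terms.
    return bm25_score + closeness(query_vec, doc_vec)

print(hybrid_score(12.5, [1.0, 0.0], [1.0, 0.0]))  # 13.5
```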

### RAG with AWS Bedrock

To test generation using the OpenAI client, post a query that runs the `bedrock` search chain:
<pre>
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid \
  searchChain=bedrock \
  format=sse \
  traceLevel=1 \
  timeout=60s
</pre>
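
With `format=sse`, the generated answer is streamed as server-sent events rather than returned as one response. The sketch below shows how a client might accumulate tokens from such a stream; the sample events are illustrative stand-ins, not actual Vespa output:

```python
import json

# Illustrative SSE fragment: each event carries one generated token.
sample_stream = """\
event: token
data: {"token": "The"}

event: token
data: {"token": " Manhattan"}

event: token
data: {"token": " Project"}
"""

tokens = []
for line in sample_stream.splitlines():
    # SSE payload lines start with "data: "; everything else is framing.
    if line.startswith("data: "):
        tokens.append(json.loads(line[len("data: "):])["token"])

answer = "".join(tokens)
print(answer)  # The Manhattan Project
```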

Here, we specifically set the search chain to `bedrock`.
This calls the
[RAGSearcher](https://github.com/vespa-engine/vespa/blob/master/container-search/src/main/java/ai/vespa/search/llm/RAGSearcher.java),
which is set up to use the
[OpenAI](https://github.com/vespa-engine/vespa/blob/master/model-integration/src/main/java/ai/vespa/llm/clients/OpenAI.java) client, as we are leveraging the [AWS Bedrock OpenAI chat completions API endpoint](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-chat-completions.html).
Note that this requires the AWS Bedrock API key.
We also add a timeout, as token generation can take some time.


### Structured output

You can also specify a structured output format for the LLM.
In the example below, we provide a JSON schema to force the LLM to return the answer in three different formats:

- `answer-short`: a short answer to the question
- `answer-short-french`: a translation of the short answer into French
- `answer-short-eli5`: an explanation of the answer as if the user were 5 years old

<pre data-test="exec" data-test-assert-contains="answer-short-french">
$ vespa query \
  'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
  'query=What is the Manhattan Project' \
  'input.query(e)=embed(@query)' \
  hits=3 \
  language=en \
  ranking=hybrid \
  searchChain=bedrock \
  format=sse \
  llm.json_schema="{\"type\":\"object\",\"properties\":{\"answer-short\":{\"type\":\"string\"},\"answer-short-french\":{\"type\":\"string\",\"description\":\"exact translation of short answer in French language\"},\"answer-short-eli5\":{\"type\":\"string\",\"description\":\"explain the answer like I am 5 years old\"}},\"required\":[\"answer-short\",\"answer-short-french\",\"answer-short-eli5\"],\"additionalProperties\":false}" \
  traceLevel=1 \
  timeout=60s
</pre>
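
The escaped one-liner above is easier to maintain if the schema is built programmatically and serialized when issuing the query. The sketch below reconstructs the same schema and checks a hypothetical LLM response against its required keys:

```python
import json

# The same JSON Schema as in the query above, built as a Python dict.
schema = {
    "type": "object",
    "properties": {
        "answer-short": {"type": "string"},
        "answer-short-french": {
            "type": "string",
            "description": "exact translation of short answer in French language",
        },
        "answer-short-eli5": {
            "type": "string",
            "description": "explain the answer like I am 5 years old",
        },
    },
    "required": ["answer-short", "answer-short-french", "answer-short-eli5"],
    "additionalProperties": False,
}

# Hypothetical LLM response, for illustration only.
response = {
    "answer-short": "A WWII research effort that produced the first nuclear weapons.",
    "answer-short-french": "Un projet de recherche de la Seconde Guerre mondiale ...",
    "answer-short-eli5": "Scientists worked together to build a very powerful bomb.",
}

# Minimal check that all required keys are present.
missing = [k for k in schema["required"] if k not in response]
assert not missing

# Serialized form to pass as the llm.json_schema query parameter.
print(json.dumps(schema))
```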

The `llm.json_schema` parameter specifies the expected output structure of the LLM, defined in JSON Schema format.

## Query parameters

The parameters here are:

- `query`: the query used both for retrieval and as the prompt question
- `hits`: the number of hits that Vespa should return in the retrieval stage
- `searchChain`: the search chain set up in `services.xml` that calls the generative process
- `format`: sets the format to server-sent events, which streams the tokens as they are generated
- `traceLevel`: outputs some debug information, such as the actual prompt that was sent to the LLM and token timing

For more information on how to customize the prompt, please refer to the
[RAG in Vespa](https://docs.vespa.ai/en/llms-rag.html) documentation.


## Shutdown and remove the RAG application

To remove the application from Vespa Cloud:
<pre>
$ vespa destroy
</pre>
