
Commit 9370632

Add OpenSearch (#117)
Co-authored-by: Mendon Kissling <[email protected]>
1 parent e3d5ae5 commit 9370632

File tree

12 files changed: +248 −24 lines


SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@
  * [Milvus](configuration-resources/data-storage/milvus.md)
  * [Solr](configuration-resources/data-storage/solr.md)
  * [JDBC](configuration-resources/data-storage/jdbc.md)
+ * [OpenSearch](configuration-resources/data-storage/opensearch.md)

  ## Pipeline Agents

building-applications/configuration.md

Lines changed: 18 additions & 0 deletions
@@ -48,6 +48,8 @@ LangStream has built-in support for a few Databases and Vector databases (no nee
  [Apache Solr](configuration.md#apache-solr)
+ [OpenSearch](configuration.md#opensearch)

  #### Cassandra (with Vector support)

  ```yaml
@@ -128,6 +130,22 @@ configuration:

  </code></pre>
+
+ #### OpenSearch
+
+ <pre class="language-yaml"><code class="lang-yaml">configuration:
+   resources:
+     - type: "vector-database"
+       name: "OpenSearch"
+       configuration:
+         service: "opensearch"
+         username: "${secrets.opensearch.username}"
+         password: "${secrets.opensearch.password}"
+         host: "${secrets.opensearch.host}"
+         port: "${secrets.opensearch.port}"
+         index-name: "my-index-000"
+ </code></pre>

  ### Manifest
<table><thead><tr><th width="148">Root</th><th width="144">Node</th><th width="94">Type</th><th>Description</th></tr></thead><tbody><tr><td>configuration</td><td><br></td><td><br></td><td>Top level node</td></tr><tr><td><br></td><td>dependencies</td><td>object<br></td><td><p>A collection of artifacts that a pipeline step or resource may need to run. <a href="configuration.md#dependencies">Refer to the spec below.</a></p><p>Example collection:</p><ul><li>type: “xxx”<br>name: “xxx”<br>configuration:<br>…</li><li>type: “xxx”<br>name: “xxx”<br>configuration:<br>…</li></ul></td></tr><tr><td><br></td><td>resources</td><td><br>object</td><td><p>A collection of resources. <a href="configuration.md#dependencies">Refer to the spec below.</a></p><p>Example collection:</p><ul><li>type: “xxx”<br>name: “xxx”<br>sha: “xxx”<br>…</li><li>type: “xxx”<br>name: “xxx”<br>sha: “xxx”<br>…</li></ul></td></tr></tbody></table>

building-applications/vector-databases.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Vector databases are typically used as part of retrieval augmented generation (R
  * Provide more accurate, up-to-date, and context-aware responses
  * Extend the knowledge base of the LLM

- LangStream makes it easy to build applications using the [RAG pattern](../patterns/rag-pattern.md). It currently has native support for [DataStax Astra DB](https://www.datastax.com/products/vector-search), [Pinecone](https://www.pinecone.io/), [Milvus/Zilliz](https://milvus.io/) and [Apache Cassandra](https://cassandra.apache.org).
+ LangStream makes it easy to build applications using the [RAG pattern](../patterns/rag-pattern.md). It currently has native support for [DataStax Astra DB](https://www.datastax.com/products/vector-search), [Pinecone](https://www.pinecone.io/), [Milvus/Zilliz](https://milvus.io/), [OpenSearch](https://opensearch.org/docs/latest/) and [Apache Cassandra](https://cassandra.apache.org).

  When working with a vector database you will either be writing vector embeddings to a vector database or performing semantic similarity queries across the vectors in the database. Check out the [vector-db-sink agent](../pipeline-agents/input-and-output/vector-db-sink.md) for writing to vector databases and the [query-vector-db agent](../pipeline-agents/text-processors/query-vector-db.md) for querying.

configuration-resources/data-storage/README.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ resources:
  - [Milvus](./milvus.md)
  - [JDBC](./jdbc.md)
  - [Solr](./solr.md)
+ - [OpenSearch](./opensearch.md)

  ### Supporting a new service

configuration-resources/data-storage/astra.md

Lines changed: 7 additions & 3 deletions
@@ -27,7 +27,7 @@ Optional parameters:
  - environment: this is the environment provided by the Astra DB service; it can be PROD, STAGING or DEV, depending on the environment you are using (default is PROD; the other values are useful only for Astra developers)

- ## Handling the secure bundle zip file
+ ### Handling the secure bundle zip file

  The secure bundle is a file that contains some TLS certificates and endpoint information to connect to the Astra DB service.

@@ -51,7 +51,7 @@ it is not recommended to store secrets in a configuration file, but only referen

- ## Special assets for Astra
+ ### Special assets for Astra

  For "Vector Database" resources based on Astra, you can use special `assets` in your pipeline file: "astra-keyspace" and "cassandra-table".

@@ -85,4 +85,8 @@ With the "cassandra-table" asset you can create a table in your Astra DB instanc

  ### Reading and writing to Astra

- Astra is compatible with Cassandra, so you can use the same agents you use for Cassandra to read and write to Astra. See the documentation [here](./cassandra.md).
+ Astra is compatible with Cassandra, so you can use the same agents you use for Cassandra to read and write to Astra. See the documentation [here](./cassandra.md).
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_astra).

configuration-resources/data-storage/cassandra.md

Lines changed: 5 additions & 1 deletion
@@ -86,4 +86,8 @@ Set the table-name to the name of the table you want to write to.
  Set the keyspace to the name of the keyspace you want to write to.
  The mapping field is a comma-separated list of field mappings, in the form "field-name=expression". The expression is an expression that can reference the value of the current message, for instance "value.filename".

- Internally LangStream is using the DataStax Connector for Apache Kafka and Pulsar to write to Cassandra. You can find more information about the mapping parameters in the [documentation](https://docs.datastax.com/en/pulsar-connector/docs/cfgPulsarMapTopicTable.html).
+ Internally LangStream is using the DataStax Connector for Apache Kafka and Pulsar to write to Cassandra. You can find more information about the mapping parameters in the [documentation](https://docs.datastax.com/en/pulsar-connector/docs/cfgPulsarMapTopicTable.html).
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_cassandra).

configuration-resources/data-storage/jdbc.md

Lines changed: 8 additions & 3 deletions
@@ -1,6 +1,6 @@
  # JDBC

- #### Connecting to a JDBC Compliant Database
+ ### Connecting to a JDBC Compliant Database

  Connect to any JDBC-compliant database using the "datasource" resource type.

@@ -58,7 +58,7 @@ This is a sample .gitignore file to put at the root of your application director
  java/lib/*
  ```

- #### Querying a JDBC datasource
+ ### Querying a JDBC datasource

  You can query a JDBC datasource using the "query" or the "query-vector-db" agent in your pipeline.

@@ -104,7 +104,7 @@ assets:

  You can specify any number of statements in the "create-statements" and in the "delete-statements" sections, for instance to create indexes or other objects.

- #### Writing to a JDBC datasource
+ ### Writing to a JDBC datasource

  Use the "vector-db-sink" agent with the following parameters to write to a JDBC database:

@@ -133,3 +133,8 @@ Use the "vector-db-sink" agent with the following parameters to write to a JDBC
  ```

  Set the table-name to the name of the table you want to write to. Define the fields in the "fields" list. This works similarly to the ['compute' agent](../../pipeline-agents/data-transform/compute.md), and you can use the same syntax to define the fields. It is important that you tag the fields that are part of the primary key of the table with "primary-key: true". This is needed to correctly manage upserts and deletion from the table.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_jdbc).

configuration-resources/data-storage/milvus.md

Lines changed: 11 additions & 6 deletions
@@ -1,6 +1,6 @@
  # Milvus

- #### Connecting to Milvus.io
+ ### Connecting to Milvus.io

  To use Milvus.io as a vector database, create a "vector-database" resource in your configuration.yaml file.

@@ -46,7 +46,7 @@ The values for **write-mode**:
  * "upsert": use upserts
  * "delete-insert": delete the document and then insert it again

- #### Special assets for Milvus
+ ### Special assets for Milvus

  You can automatically create Collections and Indexes in Milvus as part of the deployment of your LangStream application.

@@ -115,11 +115,11 @@ As you can see in the "create-statements" section above, you can configure a num
  * load-collection: load the collection in Milvus, to allow searches
  * create-index: create a Vector index; this is required if you are using the "create-collection" command

- #### Querying Milvus
+ ### Querying Milvus

  When you use the "query-vector-db" agent to query Milvus, you can use the following parameters:

- ````yaml
+ ```yaml
  pipeline:
    - name: "lookup-related-documents"
      type: "query-vector-db"

@@ -135,7 +135,7 @@ pipeline:
      fields:
        - "value.question_embeddings"
      output-field: "value.related_documents"
- ``
+ ```

  As usual you can use the '?' symbol as a placeholder for the fields that you specify in the "fields" section.

@@ -161,6 +161,11 @@ When you use the "vector-db-sink" agent to write to Milvus, you can use the foll
        expression: "value.text"
      - name: "num_tokens"
        expression: "value.chunk_num_tokens"
- ````
+ ```

  Set the collection-name to the name of the collection you want to write to. Then you define the fields in the "fields" list. This works similarly to the ['compute' agent](../../pipeline-agents/data-transform/compute.md), and you can use the same syntax to define the fields.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_milvus).
configuration-resources/data-storage/opensearch.md

Lines changed: 180 additions & 0 deletions

# OpenSearch

LangStream supports using OpenSearch as a vector database.

Learn more about performing vector search with OpenSearch in the [official documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/).

> Only OpenSearch 2.x is officially supported.

### Connecting to OpenSearch

Create a `vector-database` resource in your configuration.yaml file. A single resource is bound to a single index.

```yaml
resources:
  - type: "vector-database"
    name: "OpenSearch"
    configuration:
      service: "opensearch"
      username: "${secrets.opensearch.username}"
      password: "${secrets.opensearch.password}"
      host: "${secrets.opensearch.host}"
      port: "${secrets.opensearch.port}"
      index-name: "my-index-000"
```
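The `${secrets.opensearch.*}` placeholders above are resolved from the application's secrets file. A minimal sketch of such a file — the exact layout should match your LangStream setup, and these key names and values are assumptions:

```yaml
secrets:
  - id: opensearch
    data:
      username: "admin"       # placeholder credentials
      password: "changeme"
      host: "localhost"
      port: 9200
```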
### Connecting to the AWS OpenSearch service

```yaml
resources:
  - type: "vector-database"
    name: "OpenSearch"
    configuration:
      service: "opensearch"
      username: "${secrets.opensearch.username}"
      password: "${secrets.opensearch.password}"
      host: "${secrets.opensearch.host}"
      region: "${secrets.opensearch.region}"
      index-name: "my-index-000"
```

- `username` is the AWS access key.
- `password` is the AWS secret key.
- `host` is the endpoint provided by AWS; for AWS OpenSearch Serverless it looks like `xxxx.<region>.aoss.amazonaws.com`.
- `region` is the AWS region. It has to match the one used in the endpoint.
#### Declare an index as an asset

To bind the application to the OpenSearch index creation at startup, you must use the `opensearch-index` asset type.

You can configure `settings` and `mappings` as you prefer. Other configuration fields are not supported.

This example mixes normal fields with vector fields. The `knn` plugin is required in the target OpenSearch instance.

```yaml
- name: "os-index"
  asset-type: "opensearch-index"
  creation-mode: create-if-not-exists
  config:
    datasource: "OpenSearch"
    settings: |
      {
        "index": {
          "knn": true,
          "knn.algo_param.ef_search": 100
        }
      }
    mappings: |
      {
        "properties": {
          "content": {
            "type": "text"
          },
          "embeddings": {
            "type": "knn_vector",
            "dimension": 1536
          }
        }
      }
```

Refer to the [settings](https://opensearch.org/docs/latest/im-plugin/index-settings/) documentation for the `settings` field, and to the [mappings](https://opensearch.org/docs/latest/field-types/index/) documentation for the `mappings` field.
#### Search

Use the `query-vector-db` agent with the following parameters to perform searches on the index created above:

```yaml
- name: "lookup-related-documents"
  type: "query-vector-db"
  configuration:
    datasource: "OpenSearch"
    query: |
      {
        "size": 1,
        "query": {
          "knn": {
            "embeddings": {
              "vector": ?,
              "k": 1
            }
          }
        }
      }
    fields:
      - "value.question_embeddings"
    output-field: "value.related_documents"
```

You can use the '?' symbol as a placeholder for the fields.

The `query` is the body sent to OpenSearch. Refer to the [documentation](https://opensearch.org/docs/latest/query-dsl/index/) to learn which parameters are supported. Note that the query is executed on the configured index. Multi-index queries are not supported, but you can declare multiple datasources and query different indexes in the same application.
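Conceptually, each `?` in the query body is replaced, in order, by the corresponding entry of `fields` evaluated against the current record. A rough Python sketch of that substitution — an illustration only, not LangStream's actual implementation:

```python
import json


def substitute_placeholders(query_template: str, values: list) -> str:
    """Replace each '?' in the template, in order, with the JSON-encoded value."""
    result = query_template
    for value in values:
        result = result.replace("?", json.dumps(value), 1)  # one '?' per field
    return result


# The field "value.question_embeddings" evaluated to a vector:
template = '{"size": 1, "query": {"knn": {"embeddings": {"vector": ?, "k": 1}}}}'
body = substitute_placeholders(template, [[0.1, 0.2, 0.3]])
print(body)  # the vector is now inlined as a JSON array
```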
The `output-field` will contain the query result. The result is an array with the following elements:

- `id`: the document ID
- `document`: the document source
- `score`: the document score
- `index`: the index name
For example, if you want to keep only one relevant field from the first result, use the `compute` agent after the search:

```yaml
- name: "lookup-related-documents"
  type: "query-vector-db"
  configuration:
    datasource: "OpenSearch"
    query: |
      {
        "size": 1,
        "query": {
          "match_all": {}
        }
      }
    output-field: "value.related_documents"
    only-first: true
- name: "Format response"
  type: compute
  configuration:
    fields:
      - name: "value"
        type: STRING
        expression: "value.related_documents.document.content"
```
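With `only-first: true`, `value.related_documents` holds a single result object rather than an array, which is why the expression above can read `document.content` directly. An illustrative shape of that object — all values here are hypothetical:

```yaml
id: "doc-42"
index: "my-index-000"
score: 0.93
document:
  content: "LangStream supports using OpenSearch as a vector database."
  embeddings: [0.12, -0.07, 0.31]
```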
### Indexing

Use the `vector-db-sink` agent to index data, with the following parameters:

```yaml
- name: "Write to OpenSearch"
  type: "vector-db-sink"
  input: chunks-topic
  configuration:
    datasource: "OpenSearch"
    bulk-parameters:
      timeout: 2m
    fields:
      - name: "id"
        expression: "fn:concat(value.filename, value.chunk_id)"
      - name: "embeddings"
        expression: "fn:toListOfFloat(value.embeddings_vector)"
      - name: "text"
        expression: "value.text"
```

All indexing is performed using the Bulk operation. You can customize the [bulk parameters](https://opensearch.org/docs/latest/api-reference/document-apis/bulk/#url-parameters) with the `bulk-parameters` property.

The request is flushed according to the `flush-interval` and `batch-size` parameters.
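As a sketch of how the flushing knobs might be set, assuming they sit alongside `datasource` in the agent configuration (placement, units, and defaults should be confirmed in the API reference):

```yaml
configuration:
  datasource: "OpenSearch"
  # assumed semantics: flush after 10 records, or after the interval elapses
  batch-size: 10
  flush-interval: 1000
```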
### Configuration

Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_opensearch).

configuration-resources/data-storage/pinecone.md

Lines changed: 6 additions & 1 deletion
@@ -71,4 +71,9 @@ pipeline:
  ```

  To write to Pinecone, define the values for the vector.id, vector.vector and vector.metadata fields.
- You can add as many vector.metadata fields as you want, but you need to specify the prefix "vector.metadata." for each field.
+ You can add as many vector.metadata fields as you want, but you need to specify the prefix "vector.metadata." for each field.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_pinecone).
