
Commit 9370632

Add OpenSearch (#117)
Co-authored-by: Mendon Kissling <[email protected]>
1 parent e3d5ae5 commit 9370632

File tree

12 files changed: +248 −24 lines


SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@
  * [Milvus](configuration-resources/data-storage/milvus.md)
  * [Solr](configuration-resources/data-storage/solr.md)
  * [JDBC](configuration-resources/data-storage/jdbc.md)
+ * [OpenSearch](configuration-resources/data-storage/opensearch.md)

  ## Pipeline Agents

building-applications/configuration.md

Lines changed: 18 additions & 0 deletions
@@ -48,6 +48,8 @@ LangStream has built-in support for a few Databases and Vector databases (no nee
  [Apache Solr](configuration.md#apache-solr)
+ [OpenSearch](configuration.md#opensearch)

  #### Cassandra (with Vector support)

  ```yaml
@@ -128,6 +130,22 @@ configuration:

  </code></pre>
+
+ #### OpenSearch
+
+ <pre class="language-yaml"><code class="lang-yaml">configuration:
+   resources:
+     - type: "vector-database"
+       name: "OpenSearch"
+       configuration:
+         service: "opensearch"
+         username: "${secrets.opensearch.username}"
+         password: "${secrets.opensearch.password}"
+         host: "${secrets.opensearch.host}"
+         port: "${secrets.opensearch.port}"
+         index-name: "my-index-000"
+ </code></pre>

  ### Manifest
<table><thead><tr><th width="148">Root</th><th width="144">Node</th><th width="94">Type</th><th>Description</th></tr></thead><tbody><tr><td>configuration</td><td><br></td><td><br></td><td>Top level node</td></tr><tr><td><br></td><td>dependencies</td><td>object<br></td><td><p>A collection of artifacts that a pipeline step or resource may need to run. <a href="configuration.md#dependencies">Refer to the spec below.</a></p><p>Example collection:</p><ul><li>type: “xxx”<br>name: “xxx”<br>configuration:<br>…</li><li>type: “xxx”<br>name: “xxx”<br>configuration:<br>…</li></ul></td></tr><tr><td><br></td><td>resources</td><td><br>object</td><td><p>A collection of resources. <a href="configuration.md#dependencies">Refer to the spec below.</a></p><p>Example collection:</p><ul><li>type: “xxx”<br>name: “xxx”<br>sha: “xxx”<br>…</li><li>type: “xxx”<br>name: “xxx”<br>sha: “xxx”<br>…</li></ul></td></tr></tbody></table>

building-applications/vector-databases.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Vector databases are typically used as part of retrieval augmented generation (R
  * Provide more accurate, up-to-date, and context-aware responses
  * Extend the knowledge base of the LLM

- LangStream makes it easy to build applications using the [RAG pattern](../patterns/rag-pattern.md). It currently has native support for [DataStax Astra DB](https://www.datastax.com/products/vector-search), [Pinecone](https://www.pinecone.io/), [Milvus/Zilliz](https://milvus.io/) and [Apache Cassandra](https://cassandra.apache.org).
+ LangStream makes it easy to build applications using the [RAG pattern](../patterns/rag-pattern.md). It currently has native support for [DataStax Astra DB](https://www.datastax.com/products/vector-search), [Pinecone](https://www.pinecone.io/), [Milvus/Zilliz](https://milvus.io/), [OpenSearch](https://opensearch.org/docs/latest/) and [Apache Cassandra](https://cassandra.apache.org).

  When working with a vector database you will either be writing vector embeddings to a vector database or performing semantic similarity queries across the vectors in the database. Check out the [vector-db-sink agent](../pipeline-agents/input-and-output/vector-db-sink.md) for writing to vector databases and the [query-vector-db agent](../pipeline-agents/text-processors/query-vector-db.md) for querying.

configuration-resources/data-storage/README.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ resources:
  - [Milvus](./milvus.md)
  - [JDBC](./jdbc.md)
  - [Solr](./solr.md)
+ - [OpenSearch](./opensearch.md)

  ### Supporting a new service

configuration-resources/data-storage/astra.md

Lines changed: 7 additions & 3 deletions
@@ -27,7 +27,7 @@ Optional parameters:
  - environment: this is the environment provided by the Astra DB service; it can be PROD, STAGING or DEV, depending on the environment you are using (default is PROD; the other values are useful only for Astra developers)

- ## Handling the secure bundle zip file
+ ### Handling the secure bundle zip file

  The secure bundle is a file that contains some TLS certificates and endpoint information to connect to the Astra DB service.

@@ -51,7 +51,7 @@ it is not recommended to store secrets in a configuration file, but only referen

- ## Special assets for Astra
+ ### Special assets for Astra

  For "Vector Database" resources based on Astra, you can use special `assets` in your pipeline file: "astra-keyspace" and "cassandra-table".

@@ -85,4 +85,8 @@ With the "cassandra-table" asset you can create a table in your Astra DB instanc

  ### Reading and writing to Astra

- Astra is compatible with Cassandra, so you can use the same agents you use for Cassandra to read and write to Astra. See the documentation [here](./cassandra.md).
+ Astra is compatible with Cassandra, so you can use the same agents you use for Cassandra to read and write to Astra. See the documentation [here](./cassandra.md).
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_astra).

configuration-resources/data-storage/cassandra.md

Lines changed: 5 additions & 1 deletion
@@ -86,4 +86,8 @@ Set the table-name to the name of the table you want to write to.
  Set the keyspace to the name of the keyspace you want to write to.
  The mapping field is a comma-separated list of field mappings, in the form "field-name=expression". The expression is an expression that can reference the value of the current message, for instance "value.filename".

- Internally LangStream is using the DataStax Connector for Apache Kafka and Pulsar to write to Cassandra. You can find more information about the mapping parameters in the [documentation](https://docs.datastax.com/en/pulsar-connector/docs/cfgPulsarMapTopicTable.html).
+ Internally LangStream is using the DataStax Connector for Apache Kafka and Pulsar to write to Cassandra. You can find more information about the mapping parameters in the [documentation](https://docs.datastax.com/en/pulsar-connector/docs/cfgPulsarMapTopicTable.html).
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_cassandra).

configuration-resources/data-storage/jdbc.md

Lines changed: 8 additions & 3 deletions
@@ -1,6 +1,6 @@
  # JDBC

- #### Connecting to a JDBC Compliant Database
+ ### Connecting to a JDBC Compliant Database

  Connect to any JDBC-compliant database using the "datasource" resource type.

@@ -58,7 +58,7 @@ This is a sample .gitignore file to put at the root of your application director
  java/lib/*
  ```

- #### Querying a JDBC datasource
+ ### Querying a JDBC datasource

  You can query a JDBC datasource using the "query" or the "query-vector-db" agent in your pipeline.

@@ -104,7 +104,7 @@ assets:

  You can specify any number of statements in the "create-statements" and in the "delete-statements" sections, for instance to create indexes or other objects.

- #### Writing to a JDBC datasource
+ ### Writing to a JDBC datasource

  Use the "vector-db-sink" agent with the following parameters to write to a JDBC database:

@@ -133,3 +133,8 @@ Use the "vector-db-sink" agent with the following parameters to write to a JDBC
  ```

  Set the table-name to the name of the table you want to write to. Define the fields in the "fields" list. This works similarly to the ['compute' agent](../../pipeline-agents/data-transform/compute.md), and you can use the same syntax to define the fields. It is important that you tag the fields that are part of the primary key of the table with "primary-key: true". This is needed to correctly manage upserts and deletion from the table.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_jdbc).

configuration-resources/data-storage/milvus.md

Lines changed: 11 additions & 6 deletions
@@ -1,6 +1,6 @@
  # Milvus

- #### Connecting to Milvus.io
+ ### Connecting to Milvus.io

  To use Milvus.io as a vector database, create a "vector-database" resource in your configuration.yaml file.

@@ -46,7 +46,7 @@ The values for **write-mode**:
  * "upsert": use upserts
  * "delete-insert": delete the document and then insert it again

- #### Special assets for Milvus
+ ### Special assets for Milvus

  You can automatically create Collections and Indexes in Milvus as part of the deployment of your LangStream application.

@@ -115,11 +115,11 @@ As you can see in the "create-statements" section above, you can configure a num
  * load-collection: load the collection in Milvus, to allow searches
  * create-index: create a Vector index; this is required if you are using the "create-collection" command

- #### Querying Milvus
+ ### Querying Milvus

  When you use the "query-vector-db" agent to query Milvus, you can use the following parameters:

- ````yaml
+ ```yaml
  pipeline:
    - name: "lookup-related-documents"
      type: "query-vector-db"

@@ -135,7 +135,7 @@ pipeline:
      fields:
        - "value.question_embeddings"
      output-field: "value.related_documents"
- ``
+ ```

  As usual you can use the '?' symbol as a placeholder for the fields that you specify in the "fields" section.

@@ -161,6 +161,11 @@ When you use the "vector-db-sink" agent to write to Milvus, you can use the foll
        expression: "value.text"
      - name: "num_tokens"
        expression: "value.chunk_num_tokens"
- ````
+ ```

  Set the collection-name to the name of the collection you want to write to. Then you define the fields in the "fields" list. This works similarly to the ['compute' agent](../../pipeline-agents/data-transform/compute.md), and you can use the same syntax to define the fields.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_milvus).
configuration-resources/data-storage/opensearch.md

Lines changed: 180 additions & 0 deletions

# OpenSearch

LangStream supports using OpenSearch as a vector database.

Learn more about performing vector search with OpenSearch in the [official documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/).

> Only OpenSearch 2.x is officially supported.

### Connecting to OpenSearch

Create a `vector-database` resource in your configuration.yaml file. A single resource is bound to a single index.

```yaml
resources:
  - type: "vector-database"
    name: "OpenSearch"
    configuration:
      service: "opensearch"
      username: "${secrets.opensearch.username}"
      password: "${secrets.opensearch.password}"
      host: "${secrets.opensearch.host}"
      port: "${secrets.opensearch.port}"
      index-name: "my-index-000"
```
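The `${secrets.opensearch.*}` placeholders above are resolved from the application's secrets file. A minimal sketch of such a file — the exact layout should match your LangStream setup, and these key names and values are assumptions:

```yaml
secrets:
  - id: opensearch
    data:
      username: "admin"       # placeholder credentials
      password: "changeme"
      host: "localhost"
      port: 9200
```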
### Connecting to the AWS OpenSearch service

```yaml
resources:
  - type: "vector-database"
    name: "OpenSearch"
    configuration:
      service: "opensearch"
      username: "${secrets.opensearch.username}"
      password: "${secrets.opensearch.password}"
      host: "${secrets.opensearch.host}"
      region: "${secrets.opensearch.region}"
      index-name: "my-index-000"
```

- `username` is the AWS access key.
- `password` is the AWS secret key.
- `host` is the endpoint provided by AWS; for AWS OpenSearch Serverless it looks like `xxxx.<region>.aoss.amazonaws.com`.
- `region` is the AWS region. It has to match the one used in the endpoint.
#### Declare an index as an asset

To bind the application to the OpenSearch index creation at startup, you must use the `opensearch-index` asset type.

You can configure `settings` and `mappings` as you prefer. Other configuration fields are not supported.

This example mixes normal fields with vector fields. The `knn` plugin is required in the target OpenSearch instance.

```yaml
- name: "os-index"
  asset-type: "opensearch-index"
  creation-mode: create-if-not-exists
  config:
    datasource: "OpenSearch"
    settings: |
      {
        "index": {
          "knn": true,
          "knn.algo_param.ef_search": 100
        }
      }
    mappings: |
      {
        "properties": {
          "content": {
            "type": "text"
          },
          "embeddings": {
            "type": "knn_vector",
            "dimension": 1536
          }
        }
      }
```

Refer to the [settings](https://opensearch.org/docs/latest/im-plugin/index-settings/) documentation for the `settings` field, and to the [mappings](https://opensearch.org/docs/latest/field-types/index/) documentation for the `mappings` field.
#### Search

Use the `query-vector-db` agent with the following parameters to perform searches on the index created above:

```yaml
- name: "lookup-related-documents"
  type: "query-vector-db"
  configuration:
    datasource: "OpenSearch"
    query: |
      {
        "size": 1,
        "query": {
          "knn": {
            "embeddings": {
              "vector": ?,
              "k": 1
            }
          }
        }
      }
    fields:
      - "value.question_embeddings"
    output-field: "value.related_documents"
```

You can use the '?' symbol as a placeholder for the fields.

The `query` is the body sent to OpenSearch. Refer to the [documentation](https://opensearch.org/docs/latest/query-dsl/index/) to learn which parameters are supported. Note that the query is executed on the configured index. Multi-index queries are not supported, but you can declare multiple datasources and query different indexes in the same application.
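Conceptually, each `?` in the query body is replaced, in order, by the corresponding entry of `fields` evaluated against the current record. A rough Python sketch of that substitution — an illustration only, not LangStream's actual implementation:

```python
import json


def substitute_placeholders(query_template: str, values: list) -> str:
    """Replace each '?' in the template, in order, with the JSON-encoded value."""
    result = query_template
    for value in values:
        result = result.replace("?", json.dumps(value), 1)  # one '?' per field
    return result


# The field "value.question_embeddings" evaluated to a vector:
template = '{"size": 1, "query": {"knn": {"embeddings": {"vector": ?, "k": 1}}}}'
body = substitute_placeholders(template, [[0.1, 0.2, 0.3]])
print(body)  # the vector is now inlined as a JSON array
```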
The `output-field` will contain the query result. The result is an array with the following elements:

- `id`: the document ID
- `document`: the document source
- `score`: the document score
- `index`: the index name
For example, if you want to keep only one relevant field from the first result, use the `compute` agent after the search:

```yaml
- name: "lookup-related-documents"
  type: "query-vector-db"
  configuration:
    datasource: "OpenSearch"
    query: |
      {
        "size": 1,
        "query": {
          "match_all": {}
        }
      }
    output-field: "value.related_documents"
    only-first: true
- name: "Format response"
  type: compute
  configuration:
    fields:
      - name: "value"
        type: STRING
        expression: "value.related_documents.document.content"
```
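With `only-first: true`, `value.related_documents` holds a single result object rather than an array, which is why the expression above can read `document.content` directly. An illustrative shape of that object — all values here are hypothetical:

```yaml
id: "doc-42"
index: "my-index-000"
score: 0.93
document:
  content: "LangStream supports using OpenSearch as a vector database."
  embeddings: [0.12, -0.07, 0.31]
```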
### Indexing

Use the `vector-db-sink` agent to index data, with the following parameters:

```yaml
- name: "Write to OpenSearch"
  type: "vector-db-sink"
  input: chunks-topic
  configuration:
    datasource: "OpenSearch"
    bulk-parameters:
      timeout: 2m
    fields:
      - name: "id"
        expression: "fn:concat(value.filename, value.chunk_id)"
      - name: "embeddings"
        expression: "fn:toListOfFloat(value.embeddings_vector)"
      - name: "text"
        expression: "value.text"
```

All indexing is performed using the Bulk operation. You can customize the [bulk parameters](https://opensearch.org/docs/latest/api-reference/document-apis/bulk/#url-parameters) with the `bulk-parameters` property.

The request is flushed according to the `flush-interval` and `batch-size` parameters.
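As a sketch of how the flushing knobs might be set, assuming they sit alongside `datasource` in the agent configuration (placement, units, and defaults should be confirmed in the API reference):

```yaml
configuration:
  datasource: "OpenSearch"
  # assumed semantics: flush after 10 records, or after the interval elapses
  batch-size: 10
  flush-interval: 1000
```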
### Configuration

Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_opensearch).

configuration-resources/data-storage/pinecone.md

Lines changed: 6 additions & 1 deletion
@@ -71,4 +71,9 @@ pipeline:
  ```

  To write to Pinecone, define the values for the vector.id, vector.vector and vector.metadata fields.
- You can add as many vector.metadata fields as you want, but you need to specify the prefix "vector.metadata." for each field.
+ You can add as many vector.metadata fields as you want, but you need to specify the prefix "vector.metadata." for each field.
+
+ ### Configuration
+
+ Check out the full configuration properties in the [API Reference page](../../building-applications/api-reference/resources.md#datasource_pinecone).
