131 changes: 124 additions & 7 deletions content/develop/clients/go/vecsearch.md
@@ -32,7 +32,9 @@ In the example below, we use the
[`huggingfaceembedder`](https://pkg.go.dev/github.com/henomis/[email protected]/embedder/huggingface)
package from the [`LinGoose`](https://pkg.go.dev/github.com/henomis/[email protected])
framework to generate vector embeddings to store and index with
Redis Query Engine.
Redis Query Engine. The code is first demonstrated for hash documents with a
separate section to explain the
[differences with JSON documents](#differences-with-json-documents).

## Initialize

@@ -80,10 +82,10 @@ the embeddings for this example are both available for free.

The `huggingfaceembedder` model outputs the embeddings as a
`[]float32` array. If you are storing your documents as
[hash]({{< relref "/develop/data-types/hashes" >}}) objects
(as we are in this example), then you must convert this array
to a `byte` string before adding it as a hash field. In this example,
we will use the function below to produce the `byte` string:
[hash]({{< relref "/develop/data-types/hashes" >}}) objects, then you
must convert this array to a `byte` string before adding it as a hash field.
The function shown below uses Go's [`binary`](https://pkg.go.dev/encoding/binary)
package to produce the `byte` string:

```go
func floatsToBytes(fs []float32) []byte {
	// Allocate four bytes for each float32 component.
	buf := make([]byte, len(fs)*4)

	// Write each component as a little-endian 32-bit value using the
	// standard math and encoding/binary packages.
	for i, f := range fs {
		u := math.Float32bits(f)
		binary.LittleEndian.PutUint32(buf[i*4:], u)
	}

	return buf
}
```

Note that if you are using [JSON]({{< relref "/develop/data-types/json" >}})
objects to store your documents instead of hashes, then you should store
the `[]float32` array directly without first converting it to a `byte`
string.
string (see [Differences with JSON documents](#differences-with-json-documents)
below).

## Create the index
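
The index creation code for hashes isn't shown here, but as a rough sketch it
follows the same pattern as the JSON index later on this page, using plain
field names rather than JSON paths (the index name `vector_idx` and the exact
options below are assumptions):

```go
// Sketch of a hash index over the "doc:" keyspace. OnHash selects hash
// documents, and the schema fields are addressed directly by name.
_, err = rdb.FTCreate(ctx,
	"vector_idx",
	&redis.FTCreateOptions{
		OnHash: true,
		Prefix: []any{"doc:"},
	},
	&redis.FieldSchema{
		FieldName: "content",
		FieldType: redis.SearchFieldTypeText,
	},
	&redis.FieldSchema{
		FieldName: "genre",
		FieldType: redis.SearchFieldTypeTag,
	},
	&redis.FieldSchema{
		FieldName: "embedding",
		FieldType: redis.SearchFieldTypeVector,
		VectorArgs: &redis.FTVectorArgs{
			HNSWOptions: &redis.FTHNSWOptions{
				Dim:            384,
				DistanceMetric: "L2",
				Type:           "FLOAT32",
			},
		},
	},
).Result()

if err != nil {
	panic(err)
}
```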

@@ -187,7 +190,7 @@ hf := huggingfaceembedder.New().
## Add data

You can now supply the data objects, which will be indexed automatically
when you add them with [`hset()`]({{< relref "/commands/hset" >}}), as long as
when you add them with [`HSet()`]({{< relref "/commands/hset" >}}), as long as
you use the `doc:` prefix specified in the index definition.
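
As a rough sketch of this step (the sentences, genre tags, and key names below
are assumptions based on the results shown later on this page), you can
generate the embeddings with a single `Embed()` call and then store each
document with `HSet()`:

```go
// Sketch: embed the example sentences and store each one as a hash document
// under the "doc:" prefix. Assumes rdb, ctx, hf, and floatsToBytes() from the
// earlier sections.
sentences := []string{
	"That is a very happy person",
	"That is a happy dog",
	"Today is a sunny day",
}
tags := []string{"persons", "pets", "weather"}

embeddings, err := hf.Embed(ctx, sentences)

if err != nil {
	panic(err)
}

for i, emb := range embeddings {
	_, err = rdb.HSet(ctx,
		fmt.Sprintf("doc:%v", i),
		map[string]any{
			"content":   sentences[i],
			"genre":     tags[i],
			"embedding": floatsToBytes(emb.ToFloat32()),
		},
	).Result()

	if err != nil {
		panic(err)
	}
}
```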

Use the `Embed()` method of `huggingfaceembedder`
@@ -310,6 +313,120 @@ As you would expect, the result for `doc:0` with the content text *"That is a very happy person"*
is the result that is most similar in meaning to the query text
*"That is a happy person"*.

## Differences with JSON documents

Indexing JSON documents is similar to hash indexing, but there are some
important differences. JSON allows much richer data modelling with nested fields, so
you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
to identify each field you want to index. However, you can declare a short alias for each
of these paths (using the `As` option) to avoid typing it in full for
every query. Also, you must set `OnJSON` to `true` when you create the index.

The code below shows these differences, but the index is otherwise very similar to
the one created previously for hashes:

```go
_, err = rdb.FTCreate(ctx,
"vector_json_idx",
&redis.FTCreateOptions{
OnJSON: true,
Prefix: []any{"jdoc:"},
},
&redis.FieldSchema{
FieldName: "$.content",
As: "content",
FieldType: redis.SearchFieldTypeText,
},
&redis.FieldSchema{
FieldName: "$.genre",
As: "genre",
FieldType: redis.SearchFieldTypeTag,
},
&redis.FieldSchema{
FieldName: "$.embedding",
As: "embedding",
FieldType: redis.SearchFieldTypeVector,
VectorArgs: &redis.FTVectorArgs{
HNSWOptions: &redis.FTHNSWOptions{
Dim: 384,
DistanceMetric: "L2",
Type: "FLOAT32",
},
},
},
).Result()
```

Use [`JSONSet()`]({{< relref "/commands/json.set" >}}) to add the data
instead of [`HSet()`]({{< relref "/commands/hset" >}}). The maps
that specify the fields have the same structure as the ones used for `HSet()`.

An important difference with JSON indexing is that the vectors are
specified using lists instead of binary strings. The loop below is similar
to the one used previously to add the hash data, but it doesn't use the
`floatsToBytes()` function to encode the `float32` array.

```go
for i, emb := range embeddings {
_, err = rdb.JSONSet(ctx,
fmt.Sprintf("jdoc:%v", i),
"$",
map[string]any{
"content": sentences[i],
"genre": tags[i],
"embedding": emb.ToFloat32(),
},
).Result()

if err != nil {
panic(err)
}
}
```

The query is almost identical to the one for the hash documents. This
demonstrates how the right choice of aliases for the JSON paths can
save you from having to write complex queries. Note that the vector
parameter for the query is still specified as a binary string (produced
with the `floatsToBytes()` function), even though the data for the
`embedding` field of the JSON was specified as an array.

```go
jsonQueryEmbedding, err := hf.Embed(ctx, []string{
"That is a happy person",
})

if err != nil {
panic(err)
}

jsonBuffer := floatsToBytes(jsonQueryEmbedding[0].ToFloat32())

jsonResults, err := rdb.FTSearchWithArgs(ctx,
"vector_json_idx",
"*=>[KNN 3 @embedding $vec AS vector_distance]",
&redis.FTSearchOptions{
Return: []redis.FTSearchReturn{
{FieldName: "vector_distance"},
{FieldName: "content"},
},
DialectVersion: 2,
Params: map[string]any{
"vec": jsonBuffer,
},
},
).Result()
```
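
To print the matches in the format shown below, a short loop over the
result's `Docs` is enough (a sketch; each document's returned attributes are
available as strings in its `Fields` map):

```go
// Print the ID, distance, and content of each match.
for _, doc := range jsonResults.Docs {
	fmt.Printf("ID: %v, Distance:%v, Content:'%v'\n",
		doc.ID, doc.Fields["vector_distance"], doc.Fields["content"])
}
```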

Apart from the `jdoc:` prefixes for the keys, the result from the JSON
query is the same as for hash:

```
ID: jdoc:0, Distance:0.114169843495, Content:'That is a very happy person'
ID: jdoc:1, Distance:0.610845327377, Content:'That is a happy dog'
ID: jdoc:2, Distance:1.48624765873, Content:'Today is a sunny day'
```

## Learn more

See
127 changes: 120 additions & 7 deletions content/develop/clients/redis-py/vecsearch.md
@@ -28,10 +28,12 @@ similarity of an embedding generated from some query text with embeddings stored
or JSON fields, Redis can retrieve documents that closely match the query in terms
of their meaning.

In the example below, we use the
The example below uses the
[`sentence-transformers`](https://pypi.org/project/sentence-transformers/)
library to generate vector embeddings to store and index with
Redis Query Engine.
Redis Query Engine. The code is first demonstrated for hash documents with a
separate section to explain the
[differences with JSON documents](#differences-with-json-documents).

## Initialize

@@ -50,6 +52,7 @@ from sentence_transformers import SentenceTransformer
from redis.commands.search.query import Query
from redis.commands.search.field import TextField, TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.json.path import Path

import numpy as np
import redis
@@ -86,7 +89,7 @@ except redis.exceptions.ResponseError:
pass
```

Next, we create the index.
Next, create the index.
The schema in the example below specifies hash objects for storage and includes
three fields: the text content to index, a
[tag]({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
@@ -127,10 +130,10 @@ Use the `model.encode()` method of `SentenceTransformer`
as shown below to create the embedding that represents the `content` field.
The `astype()` option that follows the `model.encode()` call specifies that
we want a vector of `float32` values. The `tobytes()` option encodes the
vector components together as a single binary string rather than the
default Python list of `float` values.
Use the binary string representation when you are indexing hash objects
(as we are here), but use the default list of `float` for JSON objects.
vector components together as a single binary string.
Use the binary string representation when you are indexing hashes
or running a query (but use a list of `float` for
[JSON documents](#differences-with-json-documents)).

```python
content = "That is a very happy person"

# Sketch of the hash write: the structure mirrors the JSON examples below,
# and doc:1 and doc:2 are added in the same way.
r.hset("doc:0", mapping={
    "content": content,
    "genre": "persons",
    "embedding": model.encode(content).astype(np.float32).tobytes(),
})
```

@@ -226,6 +229,116 @@ As you would expect, the result for `doc:0` with the content text *"That is a very happy person"*
is the result that is most similar in meaning to the query text
*"That is a happy person"*.

## Differences with JSON documents

Indexing JSON documents is similar to hash indexing, but there are some
important differences. JSON allows much richer data modelling with nested fields, so
you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
to identify each field you want to index. However, you can declare a short alias for each
of these paths (using the `as_name` keyword argument) to avoid typing it in full for
every query. Also, you must specify `IndexType.JSON` when you create the index.

The code below shows these differences, but the index is otherwise very similar to
the one created previously for hashes:

```py
schema = (
TextField("$.content", as_name="content"),
TagField("$.genre", as_name="genre"),
VectorField(
"$.embedding", "HNSW", {
"TYPE": "FLOAT32",
"DIM": 384,
"DISTANCE_METRIC": "L2"
},
as_name="embedding"
)
)

r.ft("vector_json_idx").create_index(
schema,
definition=IndexDefinition(
prefix=["jdoc:"], index_type=IndexType.JSON
)
)
```

Use [`json().set()`]({{< relref "/commands/json.set" >}}) to add the data
instead of [`hset()`]({{< relref "/commands/hset" >}}). The dictionaries
that specify the fields have the same structure as the ones used for `hset()`
but `json().set()` receives them in a positional argument instead of
the `mapping` keyword argument.

An important difference with JSON indexing is that the vectors are
specified using lists instead of binary strings. Generate the list
using the `tolist()` method instead of `tobytes()` as you would with a
hash.

```py
content = "That is a very happy person"

r.json().set("jdoc:0", Path.root_path(), {
"content": content,
"genre": "persons",
"embedding": model.encode(content).astype(np.float32).tolist(),
})

content = "That is a happy dog"

r.json().set("jdoc:1", Path.root_path(), {
"content": content,
"genre": "pets",
"embedding": model.encode(content).astype(np.float32).tolist(),
})

content = "Today is a sunny day"

r.json().set("jdoc:2", Path.root_path(), {
"content": content,
"genre": "weather",
"embedding": model.encode(content).astype(np.float32).tolist(),
})
```

The query is almost identical to the one for the hash documents. This
demonstrates how the right choice of aliases for the JSON paths can
save you from having to write complex queries. Note that the vector
parameter for the query is still specified as a binary string (using the
`tobytes()` method), even though the data for the `embedding` field of
the JSON was specified as a list.

```py
q = Query(
"*=>[KNN 3 @embedding $vec AS vector_distance]"
).return_field("vector_distance").return_field("content").dialect(2)

query_text = "That is a happy person"

res = r.ft("vector_json_idx").search(
q, query_params={
"vec": model.encode(query_text).astype(np.float32).tobytes()
}
)
```

Apart from the `jdoc:` prefixes for the keys, the result from the JSON
query is the same as for hash:

```
Result{
3 total,
docs: [
Document {
'id': 'jdoc:0',
'payload': None,
'vector_distance': '0.114169985056',
'content': 'That is a very happy person'
},
.
.
.
```

## Learn more

See