| 1 | +# Coherence AI Module |
| 2 | + |
| 3 | +This module provides Vector Database functionality on top of Oracle Coherence. |
| 4 | + |
| 5 | +## What is a Vector DB |
| 6 | + |
| 7 | +There are many online articles that describe in detail what a Vector DB is and what it does. |
| 8 | +Briefly, a Vector DB stores vectors, which can be thought of as arrays of numeric values. |
| 9 | +In the case of the Coherence AI module, vectors are stored with a key and optional metadata. |
| 10 | + |
| 11 | +As well as storing vectors, a Vector DB allows queries to be made against the vector data. |
| 12 | +These queries are typically operations such as nearest neighbour queries, often known as kNN queries. |
| 13 | +A kNN query will take a specific vector and then find a number (k) of nearest neighbours (NN) to that |
| 14 | +vector from the vectors stored in the DB. |
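To make the idea of a kNN query concrete, here is a minimal brute-force sketch in plain Java. It is not the Coherence AI implementation: the class `BruteForceKnn`, its method names, and the choice of squared Euclidean distance are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper for illustration; not part of the Coherence AI module. */
final class BruteForceKnn
    {
    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(float[] a, float[] b)
        {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            {
            double d = a[i] - b[i];
            sum += d * d;
            }
        return sum;
        }

    /** Returns the indices of the k stored vectors closest to the query, nearest first. */
    static List<Integer> nearest(float[][] stored, float[] query, int k)
        {
        // Max-heap on distance: the head is the farthest of the best k seen so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Integer i) -> squaredDistance(stored[i], query)).reversed());
        for (int i = 0; i < stored.length; i++)
            {
            heap.add(i);
            if (heap.size() > k)
                {
                heap.poll(); // drop the current farthest candidate
                }
            }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(i -> squaredDistance(stored[i], query)));
        return result;
        }
    }
```

Every stored vector is compared with the query, so the cost grows linearly with the number of vectors. The bounded heap keeps only k candidates in memory at any time.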
| 15 | + |
| 16 | +There are many different algorithms to determine how similar one vector is to another. |
| 17 | +The Coherence AI module implements some of these algorithms. |
| 18 | + |
| 19 | +## The Coherence AI VectorStore |
| 20 | + |
| 21 | +The `VectorStore` provides the entry point for vector store functionality. |
| 22 | +Coherence AI stores vectors as arrays of Java primitives, `double[]`, `float[]`, `int[]`, `long[]` and `short[]`. |
| 23 | +A `VectorStore` is generically typed using the vector primitive array type, the vector key type and the vector |
| 24 | +metadata type. |
| 25 | + |
| 26 | +```java |
| 27 | +/** |
| 28 | + * @param <V> the type of the vector (this will always be a primitive array type) |
| 29 | + * @param <K> the type of the key |
| 30 | + * @param <M> the type of the metadata |
| 31 | + */ |
| 32 | +public interface VectorStore<V, K, M> |
| 33 | + { |
| 34 | + } |
| 35 | +``` |
| 36 | + |
| 37 | +### Creating a VectorStore |
| 38 | + |
| 39 | +A `VectorStore` is backed by a Coherence `NamedMap` which stores the vector data. |
| 40 | +The `VectorStore` interface has various factory methods on it to create vector stores of different types. |
| 41 | + |
| 42 | +For example, a `VectorStore` to store `float[]` vectors using a `Long` as the key and a `String` as the |
| 43 | +metadata can be created like this: |
| 44 | + |
| 45 | +```java |
| 46 | +VectorStore<float[], Long, String> store = VectorStore.ofFloats("my-store"); |
| 47 | +``` |
| 48 | + |
| 49 | +The vector data will be stored in a `NamedMap` named `my-store` obtained from the default `Session`. |
| 50 | + |
| 51 | +It is possible to specify a `Session` when creating a vector store, in which case the `NamedMap` |
| 52 | +will be obtained from that `Session`: |
| 53 | + |
| 54 | +```java |
| 55 | +VectorStore<float[], Long, String> store = VectorStore.ofFloats("my-store", session); |
| 56 | +``` |
| 57 | + |
| 58 | +The local `VectorStore` instance is stateless, so applications can create multiple instances of a |
| 59 | +`VectorStore` over the same `NamedMap` without worrying about mismatched local state. |
| 60 | + |
| 61 | +### Add Vectors to a VectorStore |
| 62 | + |
| 63 | +There are a number of different methods to add data to a `VectorStore`. |
| 64 | +The simplest is to add a primitive array directly. |
| 65 | + |
| 66 | +For example, the following adds a `float[]` to a float `VectorStore` with the key `123L` and metadata `"foo"`: |
| 67 | +```java |
| 68 | +VectorStore<float[], Long, String> store = VectorStore.ofFloats("my-store"); |
| 69 | + |
| 70 | +float[] vector = new float[]{1.0f, 2.0f, 3.0f}; |
| 71 | + |
| 72 | +store.add(123L, vector, "foo"); |
| 73 | +``` |
| 74 | + |
| 75 | +Or the vector can be added without metadata, even though the store has a metadata type: |
| 76 | +```java |
| 77 | +store.add(123L, vector); |
| 78 | +``` |
| 79 | + |
| 80 | + |
| 81 | +The `VectorStore` has methods to add all types of primitive vector to a store. |
| 82 | +If the vector is not of the same type as the underlying store, it will be up-cast or down-cast to the correct type. |
| 83 | +It is up to the developer to ensure that this casting does not alter the data. |
| 84 | +For example, calling `store.addDoubles()` on a `VectorStore` of `float[]` will down-cast the `double` values |
| 85 | +to `float` values. This is fine if every value in the double vector can be represented exactly as a float. |
| 86 | +Otherwise the values are altered as they would be by any normal Java cast: precision is lost, and values outside the float range become infinity. |
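The effect of down-casting `double` values to `float` can be checked up front with a plain Java cast. The sketch below is illustrative only: `NarrowingSketch` and its methods are hypothetical helpers, and it assumes the store converts element by element with an ordinary `(float)` cast, as the text above suggests.

```java
/** Hypothetical helper for illustration; not part of the Coherence AI module. */
final class NarrowingSketch
    {
    /** Converts a double[] to a float[] with a plain cast per element. */
    static float[] toFloats(double[] values)
        {
        float[] result = new float[values.length];
        for (int i = 0; i < values.length; i++)
            {
            result[i] = (float) values[i];
            }
        return result;
        }

    /** Returns true if every element survives double -> float -> double unchanged. */
    static boolean isLossless(double[] values)
        {
        for (double v : values)
            {
            // Double.compare treats NaN as equal to NaN, unlike the != operator.
            if (Double.compare((double) (float) v, v) != 0)
                {
                return false;
                }
            }
        return true;
        }
    }
```

A value such as `0.5` passes the check, while `0.1` does not, because the nearest `float` to `0.1` differs from the nearest `double`. A value such as `1e40` is too large for a `float` and is cast to positive infinity.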
| 87 | + |
| 88 | +There are also methods to add vectors to a store in bulk, which can be more efficient than single calls. |
| 89 | + |
| 90 | +### Query a VectorStore |
| 91 | + |
| 92 | +The `VectorStore` has a `query` method to perform different types of query on the vectors. |
| 93 | +The `query` method takes a `SimilarityQuery` instance as its parameter, which defines the query to execute. |
| 94 | +The Coherence AI module will have a number of built-in queries for different kNN algorithms. |
| 95 | + |
| 96 | +For example, a Jaccard similarity query can be run on a store of `long[]` vectors like this: |
| 97 | + |
| 98 | +```java |
| 99 | +VectorStore<long[], Integer, Void> store = VectorStore.ofLongs("my-store"); // ofLongs is assumed by analogy with ofFloats |
| 100 | + |
| 101 | +long[] testVector = new long[]{1L, 2L, 3L}; |
| 102 | + |
| 103 | +Jaccard<long[]> query = Jaccard.forLongs(testVector).withMaxResults(100).build(); |
| 104 | + |
| 105 | +List<QueryResult<long[], Integer, Void>> results = store.query(query); |
| 106 | +``` |
| 107 | + |
| 108 | +The query will return the 100 nearest neighbours to the `testVector`, or fewer than 100 if there are not 100 vectors in the store. |
| 109 | + |
| 110 | +Another example uses a Cosine similarity query (also called an "Angular" query). |
| 111 | +The code is almost identical, but this time the vectors are |
| 112 | +`float[]` and the query created is a `Cosine` query. |
| 113 | + |
| 114 | +```java |
| 115 | +VectorStore<float[], Integer, Void> store = VectorStore.ofFloats("my-store"); |
| 116 | + |
| 117 | +float[] testVector = new float[]{0.1f, 0.2f, 0.3f}; |
| 118 | + |
| 119 | +Cosine<float[]> query = Cosine.forFloats(testVector).withMaxResults(100).build(); |
| 120 | + |
| 121 | +List<QueryResult<float[], Integer, Void>> results = store.query(query); |
| 122 | +``` |
| 123 | + |
| 124 | +### Metadata |
| 125 | + |
| 126 | +The metadata for a store can be any type that is serializable using the serializer configured for the |
| 127 | +underlying cache service. The metadata is also optional, so the metadata type can be set to `Void`. |
| 128 | + |
| 129 | +A `VectorStore` without metadata can be created by using `Void` for the metadata generic argument like this: |
| 130 | + |
| 131 | +```java |
| 132 | +VectorStore<float[], Long, Void> store = VectorStore.ofFloats("my-store", session); |
| 133 | +``` |
| 134 | + |
| 135 | +Vectors are then added without specifying metadata: |
| 136 | +```java |
| 137 | +float[] vector = new float[]{1.0f, 2.0f, 3.0f}; |
| 138 | +store.addFloats(123L, vector); |
| 139 | +``` |
| 140 | + |
| 141 | +## How Does Coherence Store Vectors |
| 142 | + |
| 143 | +Vectors and their optional metadata are stored in a single cache. |
| 144 | +The vector is stored as a Coherence `Binary` that wraps the memory representation of the underlying array. |
| 145 | +This is very different to how Coherence normally stores data. The array is not serialized and deserialized |
| 146 | +each time it is accessed; instead, Java buffers are used to treat the binary blob of data as the correct array type. |
| 147 | +This means accessing a vector is more efficient, at the cost of slightly higher memory usage. |
| 148 | +Given that most Vector DB usage is running queries and performing vector math on the arrays, being able to access them |
| 149 | +faster without the cost of serialization is seen as a good tradeoff. |
| 150 | + |
| 151 | +Storing the vectors this way allows them to be used directly with Java's primitive buffers, e.g. `FloatBuffer`, |
| 152 | +`LongBuffer` etc. These buffers wrap a portion of memory and access it as a primitive array. |
| 153 | +It is simple to go from a Coherence `Binary` to a primitive buffer without necessarily copying the underlying data. |
| 154 | +In the future it will be possible to switch to Java's `MemorySegment` model when that is out of preview. |
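The buffer approach can be sketched with the standard `java.nio` classes alone. In this sketch a plain `byte[]` stands in for the contents of a Coherence `Binary`, which is an assumption: how the module actually obtains the bytes from a `Binary` is not shown here, and `BufferViewSketch` is a hypothetical class.

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

/** Hypothetical helper for illustration; not part of the Coherence AI module. */
final class BufferViewSketch
    {
    /** Writes the floats into a new byte[] using ByteBuffer's default big-endian order. */
    static byte[] toBytes(float[] vector)
        {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        buffer.asFloatBuffer().put(vector);
        return buffer.array();
        }

    /** Returns a FloatBuffer view over the bytes; no float[] copy is made. */
    static FloatBuffer asFloats(byte[] bytes)
        {
        return ByteBuffer.wrap(bytes).asFloatBuffer();
        }
    }
```

Because the `FloatBuffer` is a view over the same bytes, reading an element decodes it in place rather than deserializing the whole array first.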
| 155 | + |
| 156 | +Metadata is stored as a decoration on the binary vector. This allows the metadata and vector to co-exist |
| 157 | +easily in the same cache entry. It also allows the metadata to be easily used in `Filter` queries |
| 158 | +to restrict vector searches. There is a cost of slightly more complex extraction of the metadata for queries, |
| 159 | +but this is hidden from the end-user. |
| 160 | + |
| 161 | +This way of storing vectors means it is not possible to access a vector cache like a normal cache. It would be impossible for any of the normal Coherence serializers to work with the cache values. The cache keys are serialized as normal, so they could be accessed, but the values are not. |
| 162 | + |
| 163 | +## Similarity Queries |
| 164 | + |
| 165 | +The current implementation contains slow brute force examples of kNN queries. |
| 166 | + |
| 167 | +### Cosine Similarity |
| 168 | + |
| 169 | +Cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval `[−1,1]`. |
| 170 | +For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in `[0,1]`. |
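The definition above translates directly into a few lines of Java. This is a minimal sketch of the formula only, not the code in `FloatBruteForceCosine`; the class `CosineSketch` and its method name are hypothetical.

```java
/** Hypothetical helper for illustration; not part of the Coherence AI module. */
final class CosineSketch
    {
    /** Dot product of a and b divided by the product of their lengths. */
    static double cosine(float[] a, float[] b)
        {
        if (a.length != b.length)
            {
            throw new IllegalArgumentException("vectors must have the same dimension");
            }
        double dot   = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++)
            {
            dot   += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
            }
        // The result is NaN if either vector is all zeros, which the definition excludes.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }
```

For instance, `{1, 2, 3}` and `{2, 4, 6}` are proportional and score approximately 1, while `{1, 0}` and `{0, 1}` are orthogonal and score 0.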
| 171 | + |
| 172 | +The `FloatBruteForceCosine` class is an implementation of cosine similarity that performs brute force calculations on `float` vector arrays. |
| 173 | + |
| 174 | +The aggregator is a "top _n_" type aggregator, so it returns the requested maximum number of nearest neighbours. |
| 175 | + |
| 176 | + |
| 177 | +The query is run like this: |
| 178 | + |
| 179 | +```java |
| 180 | +VectorStore<float[], Integer, Void> store = VectorStore.ofFloats("my-store"); |
| 181 | + |
| 182 | +float[] testVector = new float[]{0.1f, 0.2f, 0.3f}; |
| 183 | + |
| 184 | +Cosine<float[]> query = Cosine.forFloats(testVector).withMaxResults(100).build(); |
| 185 | + |
| 186 | +List<QueryResult<float[], Integer, Void>> results = store.query(query); |
| 187 | +``` |
| 188 | + |
| 189 | + |
| 190 | +### Jaccard Similarity |
| 191 | + |
| 192 | +A simple kNN query to build is a Jaccard similarity query. |
| 193 | + |
| 194 | +Jaccard Similarity is a measure of similarity between two asymmetric binary vectors or, equivalently, between two sets. It is a common proximity measurement used to compute the similarity of two items, such as two text documents. The index ranges from 0 to 1; the closer it is to 1, the more similar the two sets of data are. |
| 195 | + |
| 196 | +It is denoted by J and is also referred to as the Jaccard Index or Jaccard Coefficient. The related Jaccard Distance (or Jaccard Dissimilarity) is 1 - J. It is frequently used in Data Science and Machine Learning, in areas such as Text Mining, E-Commerce, and Recommendation Systems. |
| 197 | + |
| 198 | +It is calculated by the formula: |
| 199 | + |
| 200 | +Jaccard Similarity = (number of observations in both sets) / (number in either set) |
| 201 | + |
| 202 | +or mathematically, |
| 203 | + |
| 204 | +J(A, B) = |A∩B| / |A∪B| |
| 205 | + |
| 206 | +If two datasets share exactly the same members, their Jaccard Similarity Index will be 1, and if they have no members in common, it will be 0. |
| 207 | + |
| 208 | +Jaccard Similarity tells us what proportion of the features across the two sets are shared. |
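The formula J(A, B) = |A∩B| / |A∪B| can be sketched in plain Java as follows. The sketch assumes each `long[]` is treated as a set of element values; whether `LongBruteForceJaccard` interprets its vectors this way (rather than, say, as bit sets) is not stated here. The class `JaccardSketch` is hypothetical, and returning 1 for two empty sets is a convention chosen for this example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Coherence AI module. */
final class JaccardSketch
    {
    /** Size of the intersection divided by the size of the union. */
    static double jaccard(long[] a, long[] b)
        {
        Set<Long> setA = toSet(a);
        Set<Long> setB = toSet(b);
        if (setA.isEmpty() && setB.isEmpty())
            {
            return 1.0; // convention assumed here: two empty sets are identical
            }
        Set<Long> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        int unionSize = setA.size() + setB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
        }

    /** Copies the array into a set, discarding duplicate values. */
    private static Set<Long> toSet(long[] values)
        {
        Set<Long> set = new HashSet<>();
        for (long v : values)
            {
            set.add(v);
            }
        return set;
        }
    }
```

For `{1, 2, 3}` and `{2, 3, 4}` the intersection has 2 members and the union has 4, giving a similarity of 0.5.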
| 209 | + |
| 210 | +The Coherence AI `LongBruteForceJaccard` class is an implementation of this functionality. |
| 211 | +It performs the math above on `long[]` vectors. |
| 212 | + |
| 213 | +The actual query is run by the `SimilarityAggregator` that wraps the `LongBruteForceJaccard` operation. |
| 214 | +The aggregator is optionally run using a filter on the vector metadata. |
| 215 | + |
| 216 | +The aggregator is a "top _n_" type aggregator, so it returns the requested maximum number of nearest neighbours. |
| 217 | + |
| 218 | +The query is run like this: |
| 219 | + |
| 220 | +```java |
| 221 | +VectorStore<long[], Integer, Void> store = VectorStore.ofLongs("my-store"); // ofLongs is assumed by analogy with ofFloats |
| 222 | + |
| 223 | +long[] testVector = new long[]{1L, 2L, 3L}; |
| 224 | + |
| 225 | +Jaccard<long[]> query = Jaccard.forLongs(testVector).withMaxResults(100).build(); |
| 226 | + |
| 227 | +List<QueryResult<long[], Integer, Void>> results = store.query(query); |
| 228 | +``` |
| 229 | + |