Vectorize: Add Content to the Best Practices Documentation (#21913)

nagraham · sejoker · web-flow · commit da4b8bfcc733 · 2025-04-23T15:01:25.000Z
* Vectorize: Update Best Practices with a section on improving write Throughput

* Vectorize: Update Range Query docs to draw attention to prefix searching

* Vectorize: Add section on how to optimize Metadata Index values

* Update src/content/docs/vectorize/reference/metadata-filtering.mdx

---------

Co-authored-by: Yevgen Safronov <sejoker@gmail.com>
diff --git a/src/content/docs/vectorize/best-practices/insert-vectors.mdx b/src/content/docs/vectorize/best-practices/insert-vectors.mdx
@@ -47,6 +47,20 @@ For example, a vector embedding representing an image could include the path to
 { id: '1', values: [32.4, 74.1, 3.2, ...], metadata: { path: 'r2://bucket-name/path/to/image.png', format: 'png', category: 'profile_image' } }
 ```
 
+### Performance Tips When Filtering by Metadata
+
+When creating metadata indexes for a large Vectorize index, we encourage users to think ahead and plan how they will query for vectors with filters on this metadata.
+
+Carefully consider the cardinality of metadata values in relation to your queries. Cardinality is the level of uniqueness of data values within a set. Low cardinality means there are only a few unique values: for instance, the planets in the Solar System or the countries of the world. High cardinality means there are many unique values: UUIDv4 strings or timestamps with millisecond precision.
+
+High cardinality makes the equal (`$eq`) filter more selective, for example when you want to find the vectors associated with a single user's ID. At the other extreme, the filter does not help if all vectors have the same value, which is the lowest possible cardinality.
+
+High cardinality can also hurt range queries, which search across multiple unique metadata values. For example, an indexed metadata value using millisecond timestamps will see lower performance if the range spans long periods of time in which thousands of vectors with unique timestamps were written.
+
+Behind the scenes, Vectorize uses a reverse index to map values to vector ids. If the number of unique values in a particular range is too high, then that requires reading large portions of the index (a full index scan in the worst case). This would lead to memory issues, so Vectorize will degrade performance and the accuracy of the query in order to finish the request.
+
+One approach for high cardinality data is to create buckets so that more vectors are grouped under the same value. Continuing the millisecond timestamp example, imagine we typically filter with date ranges that have 5 minute granularity. We could index a timestamp that is rounded down to the most recent 5 minute boundary. This "windows" our metadata values into 5 minute increments, and we can still store the original millisecond timestamp as a separate non-indexed field.
+
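The bucketing idea for millisecond timestamps could be sketched as follows. This is a minimal illustration, not part of the Vectorize API: the field names `created_window` and `created_ms` and all three helper functions are hypothetical, and the sketch assumes a metadata index would be created on `created_window` only.

```typescript
// Width of each bucket: 5 minutes, expressed in milliseconds.
const WINDOW_MS = 5 * 60 * 1000;

// Round a millisecond timestamp down to the start of its 5 minute window.
function toWindowStart(timestampMs: number): number {
  return Math.floor(timestampMs / WINDOW_MS) * WINDOW_MS;
}

// Hypothetical metadata shape: `created_window` is the low-cardinality field
// intended for indexing; `created_ms` keeps the original, non-indexed value.
function buildMetadata(timestampMs: number): { created_window: number; created_ms: number } {
  return { created_window: toWindowStart(timestampMs), created_ms: timestampMs };
}

// Build a range filter over the windowed field. Bounds that are not aligned
// to a 5 minute boundary are rounded down, so the upper bound is exclusive
// at the start of the window containing `toMs`.
function windowRangeFilter(fromMs: number, toMs: number) {
  return {
    created_window: { $gte: toWindowStart(fromMs), $lt: toWindowStart(toMs) },
  };
}
```

With this layout, a one hour range touches at most 12 distinct indexed values, however many vectors were written during that hour.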
 ## Namespaces
 
 Namespaces provide a way to segment the vectors within your index. For example, by customer, merchant or store ID.
@@ -94,6 +108,16 @@ let matches = await env.TUTORIAL_INDEX.query(queryVector, {
 });
 ```
 
+## Improve Write Throughput
+
+One way to reduce the time to make updates visible in queries is to batch more vectors into fewer requests. This is important for write-heavy workloads. To see how many vectors you can write in a single request, please refer to the [Limits](/vectorize/platform/limits/) page.
+
+Vectorize writes changes immediately to a write-ahead log for durability. To make these writes visible for reads, an asynchronous job needs to read the current index files from R2, create an updated index, write the new index files back to R2, and commit the change. To keep the overhead of writes low and improve write throughput, Vectorize will combine multiple changes together into a single batch. It sets the maximum size of a batch to 200,000 total vectors or to 1,000 individual updates, whichever limit it hits first.
+
+For example, let's say we have 250,000 vectors we would like to insert into our index. We decide to insert them one at a time, calling the insert API 250,000 times. Because each call counts as one update, Vectorize will only process 1,000 vectors in each job, and will need to work through 250 total jobs. This could take an hour or more.
+
+The better approach is to batch our updates. For example, we can split our 250,000 vectors into 100 files, where each file has 2,500 vectors. We would call the insert HTTP API 100 times. Vectorize would update the index in only 2 or 3 jobs. All 250,000 vectors will be visible in queries within minutes.
+
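The batching approach could look roughly like the sketch below. The `VectorSink` interface is a hypothetical stand-in for an index binding that declares only an `insert` method, and `chunk` and `insertInBatches` are illustrative helpers. The batch size is left as a parameter; choose it according to the Limits page.

```typescript
// Minimal vector shape used for illustration.
interface VectorRecord {
  id: string;
  values: number[];
}

// Hypothetical stand-in for an index binding; only `insert` is declared.
interface VectorSink {
  insert(vectors: VectorRecord[]): Promise<unknown>;
}

// Split `items` into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Send one insert request per batch and return the number of requests made.
async function insertInBatches(
  index: VectorSink,
  vectors: VectorRecord[],
  batchSize: number,
): Promise<number> {
  let requests = 0;
  for (const batch of chunk(vectors, batchSize)) {
    await index.insert(batch);
    requests += 1;
  }
  return requests;
}
```

Called with 250,000 vectors and a batch size of 2,500, `insertInBatches` makes 100 requests, matching the example above.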
 ## Examples
 
 ### Workers API
diff --git a/src/content/docs/vectorize/reference/metadata-filtering.mdx b/src/content/docs/vectorize/reference/metadata-filtering.mdx
@@ -49,8 +49,8 @@ An optional `filter` property on `query()` method specifies metadata filters:
 - For `$eq` and `$ne`, `filter` object non-nested values can be `string`, `number`, `boolean`, or `null` values.
 - For `$in` and `$nin`, `filter` object values can be arrays of `string`, `number`, `boolean`, or `null` values.
 - Upper-bound range queries (i.e. `$lt` and `$lte`) can be combined with lower-bound range queries (i.e. `$gt` and `$gte`) within the same filter. Other combinations are not allowed.
-- For range queries (i.e. `$lt`, `$lte`, `$gt`, `$gte`), `filter` object non-nested values can be `string` or `number` values. Strings are ordered lexicographically. 
-- Range queries involving a large number of vectors (~10M and above) may experience reduced accuracy. 
+- For range queries (i.e. `$lt`, `$lte`, `$gt`, `$gte`), `filter` object non-nested values can be `string` or `number` values. Strings are ordered lexicographically.
+- Range queries involving a large number of vectors (~10M and above) may experience reduced accuracy.
 
 ### Namespace versus metadata filtering
 
@@ -92,7 +92,9 @@ Both [namespaces](/vectorize/best-practices/insert-vectors/#namespaces) and meta
 ```
 
 #### Range query involving strings
-Range queries can be used to implement prefix searching on string metadata fields.
+
+Range queries can implement **prefix searching** on string metadata fields. This behaves like a **starts_with** filter.
+
 For example, the following filter matches all values starting with "net":
 
 ```json