
Asymmetric quantization for computing KNN similarity score #14984

@huynmg

Description

Inspired by the asymmetric quantization technique used in BBQ, I'm exploring adopting the same idea for scalar quantization. The core idea is to quantize the query using more bits than were used to quantize document vectors at indexing time. This reduces information loss on the query side, leading to a better-approximated distance and ultimately higher recall. Importantly, we can achieve this without increasing the hot RAM needed to load document vectors.
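To make the idea concrete, here is a minimal sketch of what the asymmetric scoring path could look like. It assumes simple uniform (min/max) scalar quantization; the class and method names are illustrative, not existing Lucene APIs, and the per-vector corrective terms used in real scalar quantization are omitted for brevity.

```java
/**
 * Minimal sketch of asymmetric scalar quantization: documents are quantized
 * once at index time with few bits, while the query is quantized at search
 * time with more bits. Names are illustrative, not Lucene APIs.
 */
final class AsymmetricScalarQuantizer {

  /** Uniformly quantize a float vector to integers in [0, 2^bits - 1]. */
  static int[] quantize(float[] v, float min, float max, int bits) {
    int maxLevel = (1 << bits) - 1;
    float scale = maxLevel / (max - min);
    int[] q = new int[v.length];
    for (int i = 0; i < v.length; i++) {
      // Clamp to the quantization range, then round to the nearest level.
      float clamped = Math.max(min, Math.min(max, v[i]));
      q[i] = Math.round((clamped - min) * scale);
    }
    return q;
  }

  /**
   * Integer dot product between a higher-bit query (e.g. 7 bits) and a
   * lower-bit document (e.g. 4 bits). The document representation is
   * unchanged, so index size and hot RAM footprint stay the same.
   */
  static long asymmetricDot(int[] query7bit, int[] doc4bit) {
    long dot = 0;
    for (int i = 0; i < query7bit.length; i++) {
      dot += (long) query7bit[i] * doc4bit[i];
    }
    return dot;
  }
}
```

Turning the raw integer dot product back into an approximate float similarity still requires the same de-quantization corrections (scales and offsets) as symmetric scalar quantization; the only difference is that the query-side scale corresponds to 2^7 - 1 levels rather than 2^4 - 1.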

I made a quick prototype to test the idea and benchmarked it with the following settings:

  • Document vectors quantized to 4 bits
  • Query vectors quantized to 4, 7, and 15 bits
  • Dot product as the similarity score
  • Same parameters as the nightly benchmark on the Cohere 768-dimension dataset

4-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.468        2.926  24.651        8.425  8000000   100      50       32        100     4 bits   2682.55       2982.24            17        29879.53     29327.393     5889.893       HNSW

7-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.572        2.966  24.714        8.332  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW

15-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.574        3.098  26.096        8.423  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW

We can see that quantizing the query to 7 bits and using it to compute the dot product score yields about 10 points higher recall (0.572 vs. 0.468) compared to 4-bit query quantization, with almost the same latency. Increasing query quantization to 15 bits, i.e., representing each dimension with an int instead of a byte, provides only a marginal additional recall gain over 7 bits.

This asymmetric quantization technique can also be leveraged for reranking without incurring additional hot RAM costs. A typical reranking setup requires loading additional, higher-precision vectors (e.g., float vectors) into RAM to compute a reranking score. By scoring a higher-bit quantized query against the lower-bit indexed vectors, we can compute the reranking score without that memory overhead.
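As a rough illustration of that reranking flow (again with hypothetical names, building on the quantizer sketched above): collect candidates with the regular quantized search, then rescore them with a higher-bit query against the very same 4-bit document vectors already resident in RAM.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class AsymmetricReranker {

  /**
   * Rescore candidate doc ids with a higher-bit query against the same
   * low-bit document vectors already in RAM, returning the topK. Here
   * docVectors stands in for whatever provides access to the stored
   * quantized vectors; no float vectors are loaded.
   */
  static List<Integer> rerank(
      float[] query, float min, float max,
      List<Integer> candidates, int[][] docVectors, int topK) {
    // Re-quantize only the query, with more bits than the index uses.
    int[] q7 = AsymmetricScalarQuantizer.quantize(query, min, max, 7);
    return candidates.stream()
        .sorted(Comparator.comparingLong(
                (Integer doc) -> AsymmetricScalarQuantizer.asymmetricDot(q7, docVectors[doc]))
            .reversed())
        .limit(topK)
        .collect(Collectors.toList());
  }
}
```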

The recall/latency tradeoff looks promising, so I wanted to share these quick benchmark results and discuss whether the idea is worth pursuing.
