
Asymmetric quantization for computing KNN similarity score #14984

@huynmg

Description

Inspired by the asymmetric quantization technique used in BBQ, I'm exploring adopting the same idea for scalar quantization. The core idea is to quantize the query using more bits than were used to quantize document vectors at indexing time. This reduces information loss on the query side, leading to a better-approximated distance and ultimately higher recall. Importantly, we can achieve this without increasing the hot RAM needed to load document vectors.
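To make the idea concrete, here is a minimal sketch of what the asymmetric scoring path could look like. It assumes simple uniform (min/max) scalar quantization; the class and method names are illustrative, not existing Lucene APIs, and the per-vector corrective terms used in real scalar quantization are omitted for brevity.

```java
/**
 * Minimal sketch of asymmetric scalar quantization: documents are quantized
 * once at index time with few bits, while the query is quantized at search
 * time with more bits. Names are illustrative, not Lucene APIs.
 */
final class AsymmetricScalarQuantizer {

  /** Uniformly quantize a float vector to integers in [0, 2^bits - 1]. */
  static int[] quantize(float[] v, float min, float max, int bits) {
    int maxLevel = (1 << bits) - 1;
    float scale = maxLevel / (max - min);
    int[] q = new int[v.length];
    for (int i = 0; i < v.length; i++) {
      // Clamp to the quantization range, then round to the nearest level.
      float clamped = Math.max(min, Math.min(max, v[i]));
      q[i] = Math.round((clamped - min) * scale);
    }
    return q;
  }

  /**
   * Integer dot product between a higher-bit query (e.g. 7 bits) and a
   * lower-bit document (e.g. 4 bits). The document representation is
   * unchanged, so index size and hot RAM footprint stay the same.
   */
  static long asymmetricDot(int[] query7bit, int[] doc4bit) {
    long dot = 0;
    for (int i = 0; i < query7bit.length; i++) {
      dot += (long) query7bit[i] * doc4bit[i];
    }
    return dot;
  }
}
```

Turning the raw integer dot product back into an approximate float similarity still requires the same de-quantization corrections (scales and offsets) as symmetric scalar quantization; the only difference is that the query-side scale corresponds to 2^7 - 1 levels rather than 2^4 - 1.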

I made a quick prototype to test the idea and benchmarked it with the following settings:

  • Document vectors quantized to 4 bits
  • Query vectors quantized to 4, 7, and 15 bits
  • Dot product as the similarity score
  • Same parameters as the nightly benchmark on the Cohere 768-dimension dataset

4-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.468        2.926  24.651        8.425  8000000   100      50       32        100     4 bits   2682.55       2982.24            17        29879.53     29327.393     5889.893       HNSW

7-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.572        2.966  24.714        8.332  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW

15-bit query + 4-bit index

recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.574        3.098  26.096        8.423  8000000   100      50       32        100     4 bits      0.00      Infinity            17        29879.53     29327.393     5889.893       HNSW

We can see that quantizing the query to 7 bits and using it to compute the dot product score yields about 10 points higher recall (0.572 vs. 0.468) compared to 4-bit query quantization, with almost the same latency. Increasing query quantization to 15 bits, i.e., representing each dimension with an int instead of a byte, provides only a marginal additional recall gain over 7 bits.

This asymmetric quantization technique can also be leveraged for reranking without incurring additional hot RAM costs. A typical reranking setup requires loading additional, higher-precision vectors (e.g., float vectors) into RAM to compute a reranking score. By scoring a higher-bit quantized query against the lower-bit indexed vectors, we can compute the reranking score without that memory overhead.
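As a rough illustration of that reranking flow (again with hypothetical names, building on the quantizer sketched above): collect candidates with the regular quantized search, then rescore them with a higher-bit query against the very same 4-bit document vectors already resident in RAM.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class AsymmetricReranker {

  /**
   * Rescore candidate doc ids with a higher-bit query against the same
   * low-bit document vectors already in RAM, returning the topK. Here
   * docVectors stands in for whatever provides access to the stored
   * quantized vectors; no float vectors are loaded.
   */
  static List<Integer> rerank(
      float[] query, float min, float max,
      List<Integer> candidates, int[][] docVectors, int topK) {
    // Re-quantize only the query, with more bits than the index uses.
    int[] q7 = AsymmetricScalarQuantizer.quantize(query, min, max, 7);
    return candidates.stream()
        .sorted(Comparator.comparingLong(
                (Integer doc) -> AsymmetricScalarQuantizer.asymmetricDot(q7, docVectors[doc]))
            .reversed())
        .limit(topK)
        .collect(Collectors.toList());
  }
}
```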

The recall/latency tradeoff looks promising, so I wanted to share these quick benchmark results and discuss whether the idea is worth pursuing.
