With quantization techniques compressing vectors in memory further and further, recall will drop because of how much information is lost. However, with the current quantization support we already store the full-precision float vector values. With these, we could create a two-phase search process that oversamples vectors from the quantized index (i.e. retrieves r*k results), then lazily loads the full-precision vectors from disk for those r*k results and re-scores them (a rough sketch follows the issue list below). This has been discussed directly or indirectly in a couple of different places, but I figured it'd make sense to create a separate issue for it:
- Can we add configuration on dropping raw vectors from quantized formats after some period of time? #13251
- Expose flat vectors in "user space" #13468
- Should we explore DiskANN for aKNN vector search? #12615
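To make the two-phase flow concrete, here is a minimal, hypothetical sketch of oversample-then-re-score. The `QuantizedIndex` and `RawVectorReader` interfaces are placeholders standing in for the in-memory quantized ANN index and the on-disk full-precision vectors; they are not actual Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Sketch of two-phase vector search: oversample candidates from the quantized
 * (in-memory) index, then re-score them against full-precision vectors that
 * are lazily read from disk. All interfaces here are hypothetical placeholders.
 */
public final class OversampleAndRescore {

  /** Placeholder for the quantized ANN index kept resident in memory. */
  interface QuantizedIndex {
    int[] searchApproximate(float[] query, int n); // approximate top-n doc ids
  }

  /** Placeholder that lazily reads one full-precision vector from disk. */
  interface RawVectorReader {
    float[] vector(int docId);
  }

  record ScoredDoc(int docId, float score) {}

  static List<ScoredDoc> search(
      QuantizedIndex quantized, RawVectorReader rawVectors, float[] query, int k, int r) {
    // Phase 1: oversample r*k candidates from the quantized index.
    int[] candidates = quantized.searchApproximate(query, r * k);

    // Phase 2: re-score each candidate with its full-precision vector
    // (dot product here; a real implementation would use the field's similarity).
    List<ScoredDoc> rescored = new ArrayList<>(candidates.length);
    for (int docId : candidates) {
      float[] v = rawVectors.vector(docId); // lazy disk read
      float score = 0f;
      for (int i = 0; i < v.length; i++) {
        score += v[i] * query[i];
      }
      rescored.add(new ScoredDoc(docId, score));
    }

    // Keep only the top k after re-scoring.
    rescored.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
    return rescored.subList(0, Math.min(k, rescored.size()));
  }
}
```

In Lucene terms, phase 1 would roughly map to the existing quantized KNN search and phase 2 to reading the stored flat vectors for just those r*k candidates.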
This approach has been shown to work with binary compression techniques and some PQ variants in a few different places (https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259, https://huggingface.co/blog/embedding-quantization#binary-rescoring). We have also run some experiments in OpenSearch with IVFPQ and re-scoring, and it has shown pretty strong promise, assuming the disk IOPS/throughput are high enough (see opensearch-project/k-NN#1779 (comment)).
That being said, it can provide benefits similar to the DiskANN approach, since only the compressed index needs to stay resident in memory while the full-precision vectors remain on disk.
One challenge I foresee is that the approach requires the quantized ANN index to be resident in memory. However, I am unsure whether loading the full-precision vectors from disk will cause the page cache to evict the quantized ANN index, or whether the page cache will naturally adapt to the access pattern. If it is the former, we would probably need some way to pin the quantized ANN index and its vectors in memory.
In the future, it could probably be fine-tuned quite a bit with access-pattern optimizations and/or some kind of hierarchical quantization and re-scoring.