Skip to content

New API proposal #585

@marianotepper

Description

@marianotepper

Current status

The current JVector API allows too many combinatorial combinations, not all of which make sense. For example:

  • we can build a graph using Euclidean distance and search it using inner product similarity.
  • we can build a graph using PQ and search it using BQ.

Although good for quick exploration, this has some challenging consequences:

  • Inconsistent combinations may lead to undefined behavior.
  • Too much burden on the system integrating JVector. The code becomes complex and requires expert knowledge of JVector.
  • Different systems integrating JVector use it differently, but JVector was not truly designed as a general plug-and-play library for vector search. Its design is tightly driven by the LSMT model.
  • Testing becomes challenging because of the combinatorial number of configurations to test.
  • Benchmarking becomes challenging as results run on one JVector standalone configuration are not predictive of the results of the results obtained by the particular configuration used in the system integrating JVector.

Philosophy

JVector must embrace being a library for the LSMT model. As such, the API should provide the necessary functions to operate within this model. This means covering:

  • Immutable indices that can be searched
  • Mutable indices that support insertions, deletions, and possibly updates. Mutable indices are extensions of immutable ones, so they are searchable.
  • Merging immutable indices in compliance with Multi-Component LSM-trees

For users that want to use JVector but do not need the LSMT, the mutable indices should be sufficient.

Throughout this document, the underlying assumption is that VectorFloat is an exact representation (i.e., FP32 is sufficient).

Overview

This proposal aims to widely change the JVector API and do a deep refactoring in a number of key areas:

  • The indices will have the increased responsibility of handling the vectors
    • for search: a vector should be the input instead of ScoreProvider
    • for construction: a vector should be inserted directly, without going through a ScoreProvider
  • There will be a new way for merging existing indices
  • Indices should persist themselves instead of going through external writers

Indices

The indices have to own the primary and (optionally) secondary vectors. The primary will be used to carry the graph search, and the secondary vectors will be used to perform a final refinement of the results. the secondary vector representations are thus meant to have higher-precision than the primary ones, but do not necessarily need to be full precision (i.e., they can be quantized).


When inserting a new vector in a mutable index, a VectorFloat and an ordinal will be provided. The index will be responsible for:

  • storing (a potentially compressed version of) the provided vector as a primary vector
  • optionally, storing the provided vector as a secondary vector (either as is or compressed using a highly-accurate quantization)
  • modifying the graph so that the vector can be searched in subsequent searches.

This new behavior adds consistency between the different types of indices that are currently in JVector, fixing a few discrepancies between them:

  • The OnHeapGraphIndex does not own the primary nor the secondary vectors.
  • The OnDiskGraphIndex owns the secondary vectors but does not own the primary vectors if the graph is not fused.
  • The OnDiskGraphIndex owns the secondary vectors and owns the primary vectors if the graph if fused.

Having consistent ownership throughout the different indices will reduce the amount of special casing needed in the integrating systems.


The index also needs to own the the necessary data structures (e.g., SearchScoreProvider described later on in this document) needed to perform the search.
When searching an index, a query VectorFloat should be provided. The index will use its SearchScoreProvider in conjunction with its primary and secondary vectors to perform the search. As described

Mutable indices

Currently, JVector supports one type of mutable index, the OnHeapGraphIndex. As an additional and optional part of the proposal, we could add two new implementations as follows:

Index Mutable Graph Primary vectors Secondary vector
OnHeapGraphIndex yes in memory in memory in memory
HybridGraphIndex yes in memory on disk in memory
OnDiskGraphIndex yes on disk on disk on disk

The last two options are new. These are important as they would allow to disposition a vector from memory upon insertion without having to manually handle the insertInline feature of the graph writer. This greatly simplifies the graph construction calls in the integrating systems.

The remaining combinations for the locations of the graph, the primary vectors, and the secondary vectors are not part of this proposal but could be added in future versions if needed.

Vector representations

Right now, the library supports a mix of VectorFloat, and different instances of compressed/quantized vectors through the SearchScoreProvider that is meant to abstract away the differences between the underlying types.

The first element of this proposal is a new interface that clarifies the intent of a vector representation.

public interface VectorRepresentation extends Accountable {
    /**
     * @return true if the representation is exact, i.e., a full-precision (float32, FP32) vector 
     */
    boolean isExact();

    /**
     * @return this if the representation is exact, else a decoded version of the representation
     */
    VectorFloat<?> decode();

    interface Exact extends VectorRepresentation {
        default boolean isExact() {
            return true;
        }
    }

    interface Approximate extends VectorRepresentation {
        default boolean isExact() {
            return false;
        }
    }
}

Moving forward, all vector "types" should inherit from this interface. Having this common type enables a more clear definition of the ScoreFunction. Instead of instantiating a new object ScoreFunction each time we do a search, the proposed interface has a fixFirstArgument method that sets the query for a given search. Thus, the GraphSearcher can maintain and reuse the same ScoreFunction object across different searches.

public interface ScoreFunction<Vec extends VectorRepresentation> {
    /**
     * @return true if the ScoreFunction returns exact, full-resolution scores. That is, if the underlying VectorRepresentation is exact.
     */
    boolean isExact();


    void fixFirstArgument(Vec first);

    /**
     * @return the similarity to one other node. This method is stateful and requires calling fixFirstArgument.
     */
    float similarityTo(Vec other);

    /**
     * @return the similarity between vec1 and vec2. This method is stateless.
     */
    float similarityTo(Vec vec1, Vec vec2);

    interface Exact extends ScoreFunction<VectorRepresentation.Exact> {
        default boolean isExact() {
            return true;
        }
    }

    interface Approximate extends ScoreFunction<VectorRepresentation.Approximate> {
        default boolean isExact() {
            return false;
        }
    }
}

This definition includes the new method float similarityTo(Vec first, Vec other) that is meant to replace the diversityProvider and by passes the stateful query-aware similarity computations and treats both arguments as symmetric. This greatly simplifies the class structure during search but particularly, during graph construction.

The next key element of this proposal is a revamped SearchScoreProvider. This new version has explicit primary and secondary vector representations (normally used for the graph search and a final refinement, respectively) as templates. This adheres more closely to the use from our indices.

public interface SearchScoreProvider<Primary extends VectorRepresentation, Secondary extends VectorRepresentation> {

    ScoreFunction<Primary> primaryScoreFunction();

    ScoreFunction<Secondary> secondaryScoreFunction();

    /**
     * Convenience function to avoid instantiating a ScoreFunction<Primary>
     * @return true if the primary representations are exact
     */
    boolean isPrimaryExact();

    /**
     * Convenience function to avoid instantiating a ScoreFunction<Secondary>
     * @return true if the secondary representations are exact
     */
    boolean isSecondaryExact();
}

As mentioned above, the index should own this SearchScoreProvider. That said, we should provide mechanisms to override the combination of primary/secondary vectors for rapid prototyping, but this can be a 'behind the scene' add-on instead of a full feature of the API.

Merging indices

In the LSMT model, we need to progressively merge indices. Right now, JVector offers no capabilities to perform these operations. Right now, this is handled by creating an OnHeapGraphIndex (the only mutable index currently in JVector) and incrementally building this new index by adding the vectors from all input indices.

This approach has two main problems:

  • Building a graph incrementally means that we need to add backward edges if we want it to be navigable. This process needs to be properly synchronized for concurrent insertions. It also requires doing two rounds of pruning for each node (done by the cleanup routine in JVector).
  • The spirit of the LSMT is that we can very efficiently merge lexical indices. The current setup just starts from scratch, not taking advantage of the work already done to build the set of indices to be merged.

A better approach is to proceed as follows. For each node in any of the input indices, in parallel:

  1. search all the input indices to get its ANNs.
  2. prune the combined ANNs
  3. write the pruned neighbors to disk.

These operations can be fully done in parallel because using the input indices overrides the problem back edges due to the insertion order. It requires no synchronization and only one round of pruning. Additionally, we take advantage of the input indices. the latency of step 1 can be further reduced by using a multi-index search approach.

A candidate implementation is dicussed in #580.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions