-
Notifications
You must be signed in to change notification settings - Fork 91
vectors - design doc draft #9317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Amit Prinz Setter <[email protected]>
e258949 to
9a8c413
Compare
…tion details options Signed-off-by: Amit Prinz Setter <[email protected]>
Signed-off-by: Amit Prinz Setter <[email protected]>
9a8c413 to
de76727
Compare
| 4. Hopefully will be able to reuse current bucket/object implementation with some adaptation. | ||
| 1. Eg add "bucket content" field to indicate whether bucket is object or vector instead of creating a new “vector bucket” entity. | ||
|
|
||
| ### Terminology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add also the noobaa terminology here for a reader that is less familiar with it -
NB vs FS bucketspace
NB vs FS namespace
refer to https://github.com/noobaa/noobaa-core/blob/master/docs/bucket-types.md
docs/design/vectors.md
Outdated
| ### CRDs | ||
| We need to specify, at least in Lance client case, two kinds of independent parameters: | ||
| -A storage connection - Can be an S3 account (endpoint url, secret id, secret key), or a FS based. | ||
| This can be the alread-existing CRDs Backingstore and NamespaceStore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be based on a BucketClass, and not on the underlying BackingStore/NSStore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC BucketClass has a bunch of stuff relevant to objects but not for the vector use case, eg tiers and mirroring?
I just need a directory mounted on the endpoints, or an s3 connection credentials.
docs/design/vectors.md
Outdated
| We need to specify, at least in Lance client case, two kinds of independent parameters: | ||
| -A storage connection - Can be an S3 account (endpoint url, secret id, secret key), or a FS based. | ||
| This can be the alread-existing CRDs Backingstore and NamespaceStore. | ||
| For FS we can also use a pvc directly, without NamespaceStore? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is eventually what has to happen, because the DB itself will need to be initialized with the configuration of the storage which is either a FS path or an S3 connection.
Signed-off-by: Amit Prinz Setter <[email protected]>
Signed-off-by: Amit Prinz Setter <[email protected]>
… add custom header options, fix typos Signed-off-by: Amit Prinz Setter <[email protected]>
| A2[Inference/Agent] | ||
| B[S3 Vector Endpoint] | ||
| B1[BucketSpace] | ||
| C{{Vector Plugin System}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| C{{Vector Plugin System}} | |
| C{{VectorSDK}} |
| E2-.->G | ||
| ``` | ||
|
|
||
| ### Vector Bucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Vector Bucket | |
| ### Operations on Vector Buckets |
| NB shall add "deleted" field to the relevant db row. | ||
| FS shall delete the relevant json file. | ||
|
|
||
| For NB, actual deletion will be implemeted in standard "DB Cleaner" BG worker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirically we can check how aws responds to deleting non-empty index. The doc might not be complete.
|
|
||
| For NB, actual deletion will be implemeted in standard "DB Cleaner" BG worker. | ||
|
|
||
| ### Vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Vector | |
| ### Operations on Vectors |
| -translate keys into Lance "ids in" filter. | ||
| -execute delete | ||
|
|
||
| ### Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ### Index | |
| ### Operations on Vector Indexes |
docs/design/vectors.md
Outdated
| In Lance case, the technical difficulty here is to translate find a place for the Lance parameters in the AWS request body type. | ||
| Since | ||
| -Lance index and S3 vectors index are essentially different AND | ||
| -S3 vectors' CreateIndex request type is restricted, | ||
| we can utilize the generic Tags field to put all data. | ||
|
|
||
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_CreateIndex.html | ||
| https://lancedb.github.io/lancedb/js/interfaces/IndexOptions/ | ||
|
|
||
| Other than that, Index API is a simple CRUD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now we should add configs to pick the index type (HNSW / IVF) and configs to other default parameters in addition to the ones sent by S3 API (dataType, dimensions, distanceMetric).
Can we call createIndex eachtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is what I found on when to use LanceDB without an Index -
Brute-force (no-index) search is suitable in specific scenarios:
- Small datasets: For datasets with up to around 100,000 records, a vector index is usually not necessary as the latency is acceptable (often under 100ms).
- Guaranteed 100% recall: When you need to find the exact nearest neighbors without any approximation.
- Immediate searchability of new data: When new data is added to a table, it is immediately searchable via brute-force even before a background indexing job has run.
I think these are not our goals in S3 Vectors, so by default I would define an index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S3 CreateVectorIndex should be translated to Lance createTable + createIndex.
| G[S3 Backend] | ||
|
|
||
| A2-->|Query|B | ||
| A1-->|Ingest|B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| A1-->|Ingest|B | |
| A1-->|Put|B |
| 3. Use account-level default (similar to account's default resource). | ||
| This enables account-level granularity control (not per-bucket as needed). | ||
| Can be combine as a default fall-back with above "Pure s3-compatible" option. | ||
|
|
||
| 4. New actions in operator cli (similar to OB in ODF, manage_nsfs in NSFS). Allows control on parameter names and values. Eg | ||
| nb vector-bucket create --vector-storage=NS1 --vector-storage-path='/vectors1' --vector-plugin Lance | ||
| We can add a new "VectorBucket" CRD or add fields to existing bucket's CRD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two should be the first options to implement.
| #### Query | ||
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_QueryVectors.html | ||
|
|
||
| In Lance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing "filter" "returnMetadata"
docs/design/vectors.md
Outdated
| In Lance case, the technical difficulty here is to translate find a place for the Lance parameters in the AWS request body type. | ||
| Since | ||
| -Lance index and S3 vectors index are essentially different AND | ||
| -S3 vectors' CreateIndex request type is restricted, | ||
| we can utilize the generic Tags field to put all data. | ||
|
|
||
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_CreateIndex.html | ||
| https://lancedb.github.io/lancedb/js/interfaces/IndexOptions/ | ||
|
|
||
| Other than that, Index API is a simple CRUD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S3 CreateVectorIndex should be translated to Lance createTable + createIndex.
| https://aws.amazon.com/s3/features/vectors/ | ||
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations_Amazon_S3_Vectors.html | ||
|
|
||
| #### LanceDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to mention the gap that lance metadata is columns with schema, whereas s3vectors metadata is a json (free schema), and we need to decide which solution to use to map between those.
| VectorPlugin is abstract, with a concrete implementation per vector backend (Lance, Davinci, etc). | ||
| VectorPlugin translates parameters to vector backend api and calls appropriate apis on vector backend client. | ||
|
|
||
| ## Vector storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ## Vector storage | |
| ## Vector bucket configurations |
Signed-off-by: Amit Prinz Setter <[email protected]>
No description provided.