Skip to content

Conversation

@alphaprinz
Copy link
Contributor

No description provided.

@coderabbitai
Copy link

coderabbitai bot commented Nov 26, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Amit Prinz Setter <[email protected]>
4. Hopefully will be able to reuse current bucket/object implementation with some adaptation.
1. Eg add "bucket content" field to indicate whether bucket is object or vector instead of creating a new “vector bucket” entity.

### Terminology
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add also the noobaa terminology here for a reader that is less familiar with it -
NB vs FS bucketspace
NB vs FS namespace
refer to https://github.com/noobaa/noobaa-core/blob/master/docs/bucket-types.md

### CRDs
We need to specify, at least in Lance client case, two kinds of independent parameters:
-A storage connection - Can be an S3 account (endpoint url, secret id, secret key), or a FS based.
This can be the alread-existing CRDs Backingstore and NamespaceStore.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be based on a BucketClass, and not on the underlying BackingStore/NSStore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC BucketClass has a bunch of stuff relevant to objects but not for the vector use case, eg tiers and mirroring?
I just need a directory mounted on the endpoints, or an s3 connection credentials.

We need to specify, at least in Lance client case, two kinds of independent parameters:
-A storage connection - Can be an S3 account (endpoint url, secret id, secret key), or a FS based.
This can be the alread-existing CRDs Backingstore and NamespaceStore.
For FS we can also use a pvc directly, without NamespaceStore?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is eventually what has to happen, because the DB itself will need to be initialized with the configuration of the storage which is either a FS path or an S3 connection.

A2[Inference/Agent]
B[S3 Vector Endpoint]
B1[BucketSpace]
C{{Vector Plugin System}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
C{{Vector Plugin System}}
C{{VectorSDK}}

E2-.->G
```

### Vector Bucket
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Vector Bucket
### Operations on Vector Buckets

NB shall add "deleted" field to the relevant db row.
FS shall delete the relevant json file.

For NB, actual deletion will be implemeted in standard "DB Cleaner" BG worker.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically we can check how aws responds to deleting non-empty index. The doc might not be complete.


For NB, actual deletion will be implemeted in standard "DB Cleaner" BG worker.

### Vector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Vector
### Operations on Vectors

-translate keys into Lance "ids in" filter.
-execute delete

### Index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Index
### Operations on Vector Indexes

Comment on lines 156 to 165
In Lance case, the technical difficulty here is to translate find a place for the Lance parameters in the AWS request body type.
Since
-Lance index and S3 vectors index are essentially different AND
-S3 vectors' CreateIndex request type is restricted,
we can utilize the generic Tags field to put all data.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_CreateIndex.html
https://lancedb.github.io/lancedb/js/interfaces/IndexOptions/

Other than that, Index API is a simple CRUD.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we should add configs to pick the index type (HNSW / IVF) and configs to other default parameters in addition to the ones sent by S3 API (dataType, dimensions, distanceMetric).

Can we call createIndex eachtime?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is what I found on when to use LanceDB without an Index -

Brute-force (no-index) search is suitable in specific scenarios:

  1. Small datasets: For datasets with up to around 100,000 records, a vector index is usually not necessary as the latency is acceptable (often under 100ms).
  2. Guaranteed 100% recall: When you need to find the exact nearest neighbors without any approximation.
  3. Immediate searchability of new data: When new data is added to a table, it is immediately searchable via brute-force even before a background indexing job has run.

I think these are not our goals in S3 Vectors, so by default I would define an index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S3 CreateVectorIndex should be translated to Lance createTable + createIndex.

G[S3 Backend]

A2-->|Query|B
A1-->|Ingest|B
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A1-->|Ingest|B
A1-->|Put|B

Comment on lines +248 to +254
3. Use account-level default (similar to account's default resource).
This enables account-level granularity control (not per-bucket as needed).
Can be combine as a default fall-back with above "Pure s3-compatible" option.

4. New actions in operator cli (similar to OB in ODF, manage_nsfs in NSFS). Allows control on parameter names and values. Eg
nb vector-bucket create --vector-storage=NS1 --vector-storage-path='/vectors1' --vector-plugin Lance
We can add a new "VectorBucket" CRD or add fields to existing bucket's CRD.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two should be the first options to implement.

#### Query
https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_QueryVectors.html

In Lance:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing "filter" "returnMetadata"

Comment on lines 156 to 165
In Lance case, the technical difficulty here is to translate find a place for the Lance parameters in the AWS request body type.
Since
-Lance index and S3 vectors index are essentially different AND
-S3 vectors' CreateIndex request type is restricted,
we can utilize the generic Tags field to put all data.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3VectorBuckets_CreateIndex.html
https://lancedb.github.io/lancedb/js/interfaces/IndexOptions/

Other than that, Index API is a simple CRUD.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S3 CreateVectorIndex should be translated to Lance createTable + createIndex.

https://aws.amazon.com/s3/features/vectors/
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Operations_Amazon_S3_Vectors.html

#### LanceDB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to mention the gap that lance metadata is columns with schema, whereas s3vectors metadata is a json (free schema), and we need to decide which solution to use to map between those.

VectorPlugin is abstract, with a concrete implementation per vector backend (Lance, Davinci, etc).
VectorPlugin translates parameters to vector backend api and calls appropriate apis on vector backend client.

## Vector storage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Vector storage
## Vector bucket configurations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants