@singhmanas1

This script generates billion-scale synthetic data. The procedure is as follows.
Dataset Generation-

  1. Generate centroids using k-means over 1M randomly generated vectors.
  2. Create a Gaussian blob for each centroid using varying standard deviation.

Ground Truth Generation-

  3. Randomly sample query vectors from the dataset (from each cluster) and remove these IDs from the dataset.
  4. Use cuVS brute force to identify the 4 closest centroids for each query.
  5. Do a brute-force search over those 4 clusters to identify the top-k ground truth.
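
The five steps above can be sketched at toy scale as follows. This is only an illustration under stated assumptions, not the script from this PR: it runs on the CPU in pure Python, uses a plain Lloyd's k-means and Python brute force in place of cuVS, and every function name, the sizes, and the standard-deviation range are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means (stand-in for a GPU k-means); returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a bucket is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def make_dataset(centroids, per_cluster, rng):
    """Step 2: one Gaussian blob per centroid, each with its own std dev."""
    vectors, labels = [], []
    for c, center in enumerate(centroids):
        sigma = rng.uniform(0.01, 0.05)  # assumed range, not taken from the PR
        for _ in range(per_cluster):
            vectors.append([rng.gauss(m, sigma) for m in center])
            labels.append(c)
    return vectors, labels

def split_queries(vectors, labels, queries_per_cluster, rng):
    """Step 3: sample query IDs from every cluster and remove them from the dataset."""
    by_cluster = {}
    for i, c in enumerate(labels):
        by_cluster.setdefault(c, []).append(i)
    query_ids = set()
    for ids in by_cluster.values():
        query_ids.update(rng.sample(ids, queries_per_cluster))
    queries = [vectors[i] for i in sorted(query_ids)]
    keep = [i for i in range(len(vectors)) if i not in query_ids]
    return queries, [vectors[i] for i in keep], [labels[i] for i in keep]

def ground_truth(queries, base, base_labels, centroids, n_probe, top_k):
    """Steps 4-5: brute-force top-k restricted to the n_probe nearest clusters.

    Returned IDs index into `base`, i.e. the dataset after query removal.
    """
    results = []
    for q in queries:
        order = sorted(range(len(centroids)), key=lambda c: sq_dist(q, centroids[c]))
        probe = set(order[:n_probe])
        candidates = [i for i, c in enumerate(base_labels) if c in probe]
        candidates.sort(key=lambda i: sq_dist(q, base[i]))
        results.append(candidates[:top_k])
    return results

rng = random.Random(0)
dim, n_clusters, per_cluster = 8, 16, 50  # toy sizes, chosen for illustration
seed_points = [[rng.random() for _ in range(dim)] for _ in range(1000)]
centroids = kmeans(seed_points, n_clusters, iters=5, rng=rng)  # step 1
vectors, labels = make_dataset(centroids, per_cluster, rng)
queries, base, base_labels = split_queries(vectors, labels, 2, rng)
gt = ground_truth(queries, base, base_labels, centroids, n_probe=4, top_k=10)
print(len(base), len(queries), len(gt[0]))  # 768 32 10
```

Restricting the exact search to the 4 nearest clusters keeps the ground-truth cost far below a full scan. The result is exact only when the true neighbours fall inside those clusters, which is more likely when the blobs are well separated.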

User Configurations-

  1. Total vectors
  2. Total vectors per cluster
  3. Vector dim and dtype
  4. Top-k (for ground truth)
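
As a rough illustration of how these four settings relate, the sketch below groups them in a small config object and derives the cluster count and raw dataset size from them. The class name, field names, defaults, and the rule that the total must be a multiple of the per-cluster count are all assumptions made for this example. The example values for vectors per cluster and dimension are also assumed; only the 10B total appears in this conversation.

```python
from dataclasses import dataclass

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1, "uint8": 1}

@dataclass(frozen=True)
class GenConfig:
    """Hypothetical container for the user-facing settings."""
    total_vectors: int
    vectors_per_cluster: int
    dim: int
    dtype: str = "float32"
    top_k: int = 10  # number of ground-truth neighbours per query

    def __post_init__(self):
        if self.dtype not in DTYPE_BYTES:
            raise ValueError(f"unsupported dtype: {self.dtype}")
        # Assumption for this sketch: clusters are all the same size.
        if self.total_vectors % self.vectors_per_cluster:
            raise ValueError("total_vectors must be a multiple of vectors_per_cluster")

    @property
    def n_clusters(self) -> int:
        """Number of centroids / Gaussian blobs to generate."""
        return self.total_vectors // self.vectors_per_cluster

    @property
    def dataset_bytes(self) -> int:
        """Raw size of the vectors alone, excluding any file headers."""
        return self.total_vectors * self.dim * DTYPE_BYTES[self.dtype]

# Assumed example values: 1M vectors per cluster, 128 dimensions, float32.
cfg = GenConfig(total_vectors=10_000_000_000, vectors_per_cluster=1_000_000, dim=128)
print(cfg.n_clusters, cfg.dataset_bytes / 1e12)  # clusters, size in TB
```

With these assumed values the run would need 10,000 centroids and about 5.12 TB of raw vector storage, which suggests why generation would need to proceed cluster by cluster rather than holding everything in memory at once.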


@singhmanas1
Author

Time to generate 10B vectors with 1 L4 GPU (AWS g6.16xlarge):
[Screenshot, 2025-12-04 at 3:22:14 AM]

@aamijar added the "non-breaking" label (Introduces a non-breaking change) on Dec 5, 2025
@aamijar moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board on Dec 5, 2025
@aamijar added the "improvement" label (Improves an existing functionality) on Dec 5, 2025