Script to generate billion scale synthetic data #1612

singhmanas1 · 2025-12-03T11:49:54Z

This script generates billion-scale synthetic data. The procedure is as follows:
Dataset Generation-

Generate centroids using k-means over randomly generated 1M vectors.
Create a Gaussian blob for each centroid using a varying standard deviation.
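
The two dataset-generation steps above can be sketched roughly as follows. This is a small NumPy illustration, not the code from this PR: the helper names (`kmeans`, `generate_blobs`), the uniform range used to vary the standard deviation, and the tiny demo sizes are all assumptions made for the example. A real billion-scale run would use a GPU k-means and write each cluster to disk instead of holding everything in memory.

```python
import numpy as np


def kmeans(points, n_clusters, n_iters=10, seed=0):
    """Plain Lloyd's k-means; returns an (n_clusters, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Squared distances via ||p||^2 - 2 p.c + ||c||^2
        d2 = (
            (points**2).sum(1, keepdims=True)
            - 2.0 * points @ centroids.T
            + (centroids**2).sum(1)
        )
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = points[labels == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    return centroids


def generate_blobs(centroids, vectors_per_cluster, std_range=(0.5, 2.0),
                   dtype=np.float32, seed=0):
    """Yield (cluster_id, blob) pairs, one Gaussian blob per centroid.

    The per-cluster standard deviation is drawn uniformly from std_range
    (an assumption; the PR only says the standard deviation varies).
    """
    rng = np.random.default_rng(seed)
    dim = centroids.shape[1]
    stds = rng.uniform(*std_range, size=len(centroids))
    for cluster_id, (center, std) in enumerate(zip(centroids, stds)):
        blob = rng.normal(center, std, size=(vectors_per_cluster, dim))
        yield cluster_id, blob.astype(dtype)


# Tiny demo: 2,000 random seed vectors instead of 1M, 4 clusters of 100 vectors.
seed_points = np.random.default_rng(0).standard_normal((2000, 8)).astype(np.float32)
centroids = kmeans(seed_points, n_clusters=4)
dataset = np.concatenate([blob for _, blob in generate_blobs(centroids, 100)])
```

Yielding one cluster at a time keeps peak memory proportional to a single cluster, which is the property that matters once the total no longer fits in RAM.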

Ground Truth Generation-

Randomly sample query vectors from the dataset (from each cluster) and remove their IDs from the dataset.
Use cuVS brute-force search to identify the 4 closest centroids for each query.
Run a brute-force search over those 4 clusters to identify the top-k ground truth.
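
As a rough illustration of the ground-truth steps above, the sketch below uses NumPy brute force as a stand-in for cuVS, so it does not show the cuVS API. The function names (`sq_dists`, `ground_truth`), the `n_probe` parameter name, and the demo data are hypothetical. It also assumes the probed clusters together contain at least `k` vectors.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return (a**2).sum(1, keepdims=True) - 2.0 * a @ b.T + (b**2).sum(1)


def ground_truth(dataset, labels, centroids, queries_per_cluster, k,
                 n_probe=4, seed=0):
    rng = np.random.default_rng(seed)

    # 1. Sample query IDs from every cluster, then remove them from the dataset.
    query_ids = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), queries_per_cluster, replace=False)
        for c in range(len(centroids))
    ])
    queries = dataset[query_ids]
    keep = np.ones(len(dataset), dtype=bool)
    keep[query_ids] = False
    base, base_labels = dataset[keep], labels[keep]

    # 2. For each query, find the n_probe closest centroids.
    probe = np.argsort(sq_dists(queries, centroids), axis=1)[:, :n_probe]

    # 3. Brute-force search restricted to vectors in the probed clusters.
    neighbors = np.empty((len(queries), k), dtype=np.int64)
    for i, q in enumerate(queries):
        cand = np.flatnonzero(np.isin(base_labels, probe[i]))
        d = sq_dists(q[None, :], base[cand])[0]
        neighbors[i] = cand[np.argsort(d)[:k]]  # IDs index into `base`
    return queries, base, neighbors


# Tiny demo: 4 well-separated clusters of 50 vectors each.
rng = np.random.default_rng(1)
centroids = rng.standard_normal((4, 8)) * 10
labels = np.repeat(np.arange(4), 50)
dataset = centroids[labels] + rng.standard_normal((200, 8))
queries, base, neighbors = ground_truth(
    dataset, labels, centroids, queries_per_cluster=2, k=5
)
```

Restricting the search to a few nearby clusters is what keeps ground-truth generation tractable at this scale. The result matches a full brute-force search only when a query's true neighbors all lie in the probed clusters.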

User Configurations-

Total vectors
Total vectors per cluster
Vector dim and dtype
Top-k (for ground truth)
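
To show how these settings relate to each other, here is a hypothetical configuration object. The class name, field names, defaults, and the divisibility check are assumptions made for this sketch and may not match the script's actual variables. The numbers in the demo are illustrative only.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class SyntheticDataConfig:
    """Hypothetical container for the user-facing settings listed above."""
    total_vectors: int
    vectors_per_cluster: int
    dim: int
    dtype: str = "float32"
    top_k: int = 10

    def __post_init__(self):
        # Assumption of this sketch: clusters are all the same size.
        if self.total_vectors % self.vectors_per_cluster != 0:
            raise ValueError("total_vectors must be a multiple of vectors_per_cluster")

    @property
    def n_clusters(self) -> int:
        return self.total_vectors // self.vectors_per_cluster

    @property
    def dataset_bytes(self) -> int:
        # Raw size of the vectors alone, excluding any file headers.
        return self.total_vectors * self.dim * np.dtype(self.dtype).itemsize


# Illustrative values: 1B float32 vectors of dimension 128, 1M per cluster.
cfg = SyntheticDataConfig(
    total_vectors=1_000_000_000, vectors_per_cluster=1_000_000, dim=128
)
print(cfg.n_clusters, cfg.dataset_bytes / 1e9, "GB")
```

Computing the raw dataset size up front is a cheap way to check disk capacity before launching a long generation run.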

review-notebook-app · 2025-12-03T11:49:58Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

singhmanas1 · 2025-12-04T11:27:37Z

Time to generate 10B vectors with 1 L4 (AWS g6.16xlarge)

Script to generate billion scale synthetic data

3073620

github-project-automation bot added this to Vector Search, ML, & Data Mining Release Board Dec 3, 2025

github-project-automation bot moved this to Todo in Vector Search, ML, & Data Mining Release Board Dec 3, 2025

singhmanas1 added 2 commits December 3, 2025 03:55

Modified overall flow and instructions

4dab50d

Updated print statements for ground truth

1946293

aamijar assigned singhmanas1 Dec 5, 2025

aamijar added the non-breaking Introduces a non-breaking change label Dec 5, 2025

aamijar moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board Dec 5, 2025

aamijar added the improvement Improves an existing functionality label Dec 5, 2025

Created Further modules

957b3df
