[SDG] Add support for dense and sparse vectors in synthetic data generation. #982

IanHoang · 2025-11-05T21:06:31Z

Description

This PR adds support for generating synthetic dense vectors and sparse vectors in OpenSearch Benchmark.

The following has been added:

Two new generators have been added, and the Pydantic models have been modified, to support generating knn_vectors (dense vectors in OpenSearch) and sparse vectors.
The Pydantic models have also received small enhancements that make it easier to add support for future mapping field types.

Documentation will be added in a separate PR. It will show users how to use this feature, from basic to advanced configurations, and how to verify that the generated data follows the expected distribution.

Issues Resolved

#981

Testing

Generated test documents with knn_vectors and sparse vectors
Generated an entire corpus with both
Ran scripts to confirm that distributions created are accurate
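
For reviewers who want to repeat a check of this kind, here is a minimal sketch. It is not the script used for this PR: `summarize` and `is_unit_length` are hypothetical helpers that use only the standard library. The idea is to compare the mean and spread of generated values against what the configuration asks for, and to confirm that vectors generated with normalization have an L2 norm of 1.

```python
import math
import statistics


def summarize(values):
    """Return (mean, population standard deviation) of a list of floats."""
    return statistics.fmean(values), statistics.pstdev(values)


def is_unit_length(vector, tol=1e-6):
    """Return True if the vector's L2 norm is within tol of 1.0."""
    norm = math.sqrt(sum(x * x for x in vector))
    return math.isclose(norm, 1.0, abs_tol=tol)


if __name__ == "__main__":
    # Hypothetical values standing in for one dimension across many generated documents.
    mean, std = summarize([0.12, 0.08, 0.11, 0.09, 0.10])
    print(f"mean={mean:.4f} std={std:.4f}")
    print(is_unit_length([0.6, 0.8]))
```

In practice the inputs would be read from the generated corpus rather than hard-coded.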

Example dense vectors generated

  "embedding_small": [
    0.5774467022710263,
    -0.8122808264515302,
    -0.9433050469559874
  ],
  "embedding_medium": [
    -0.09282607742135292,
    0.11205520855155038,
    -0.015302718833961362,
    0.13413846806568544,
    -0.10256925456890523,
    0.05507756403574536,
    -0.016291547532004385,
    0.10501385023442619,
    -0.07995607642893765,
    0.050937655681775315,
    -0.10521870350426658,
    0.09023463946852839,
    -0.042059970512528126,
    0.14183890032503618,
    -0.10401004224261967,
    0.06885905240898517,
    -0.04703317322533611,
    0.13484244839871967,
    -0.09587562605753583,
    0.05053255364747053,
    -0.0655384534526021,
    0.11210001623206463,
    -0.046876997474924166,
    0.137374785402781,
    -0.06880274649931827,
    0.07124468674743567,
    -0.07271733849171332
  ]

Example sparse vectors

  "sparse_embedding_default": {
    "1000": 0.7519,
    "1100": 0.4866,
    "1200": 0.3432,
    "1300": 0.4616,
    "1400": 0.1253,
    "1500": 0.361,
    "1600": 0.421,
    "1700": 0.028,
    "1800": 0.1804,
    "1900": 0.2676
  },
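
To illustrate how the sparse vector parameters relate to output like the example above, here is a rough sketch. It is not the generator added in this PR: the function name `sketch_sparse_vector` is hypothetical, and both the uniform weight distribution and the rounding to four decimal places are assumptions inferred from the example output. The parameter names match the example config.

```python
import random


def sketch_sparse_vector(num_tokens=10, min_weight=0.01, max_weight=1.0,
                         token_id_start=1000, token_id_step=100, rng=None):
    """Build a {token_id: weight} mapping with evenly spaced token IDs.

    Assumption: weights are drawn uniformly from [min_weight, max_weight]
    and rounded to 4 decimal places, as the example output suggests.
    """
    rng = rng or random.Random()
    return {
        str(token_id_start + i * token_id_step): round(rng.uniform(min_weight, max_weight), 4)
        for i in range(num_tokens)
    }


if __name__ == "__main__":
    # Seeded only so that repeated runs print the same vector.
    print(sketch_sparse_vector(rng=random.Random(0)))
```

With the defaults shown, the keys run from "1000" to "1900" in steps of 100, which matches the token IDs in the example.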

Example config

generator_overrides:
    knn_vector:
      dimension: 128 
    sparse_vector:
      num_tokens: 10        
      min_weight: 0.01     
      max_weight: 1.0     
      token_id_start: 1000 
      token_id_step: 100    

field_overrides:
    # Small vector (3D) - Simple random generation
    embedding_small:
      generator: generate_knn_vector
      params:
        dimension: 3 

    # Medium vector (128D) - Sample-based with Gaussian noise
    embedding_medium:
      generator: generate_knn_vector
      params:
        dimension: 128
        # Provide sample base vectors to add noise to
        sample_vectors:
          - [0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3]
          - [-0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35]
          - [0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2]
        noise_factor: 0.05      # standard deviation for gaussian noise
        distribution_type: gaussian  # can be gaussian or uniform
        normalize: true         # Normalize after adding noise (useful for cosine similarity datasets)
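
The sample-based mode configured above can be pictured with the following sketch. This is an illustration under stated assumptions, not the code in this PR. The function `sketch_knn_vector` is hypothetical. It assumes that one base vector is picked at random, that noise is added per dimension (Gaussian with standard deviation `noise_factor`, or uniform in `[-noise_factor, noise_factor]`), and that `normalize` rescales the result to unit L2 norm. Without `sample_vectors` it falls back to uniform values in `[-1, 1]`, which is also an assumption.

```python
import math
import random


def sketch_knn_vector(dimension, sample_vectors=None, noise_factor=0.05,
                      distribution_type="gaussian", normalize=False, rng=None):
    """Return a list of `dimension` floats (hypothetical sketch)."""
    rng = rng or random.Random()
    if not sample_vectors:
        # Assumed fallback: simple random generation in [-1, 1].
        vector = [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
    else:
        base = rng.choice(sample_vectors)
        if len(base) != dimension:
            raise ValueError("sample vector length must equal dimension")
        if distribution_type == "gaussian":
            noise = [rng.gauss(0.0, noise_factor) for _ in range(dimension)]
        elif distribution_type == "uniform":
            noise = [rng.uniform(-noise_factor, noise_factor) for _ in range(dimension)]
        else:
            raise ValueError(f"unsupported distribution_type: {distribution_type}")
        vector = [b + n for b, n in zip(base, noise)]
    if normalize:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0:
            vector = [x / norm for x in vector]
    return vector


if __name__ == "__main__":
    base_vectors = [[0.1, 0.2, 0.15], [-0.2, 0.3, -0.1]]
    print(sketch_knn_vector(3, sample_vectors=base_vectors, normalize=True))
```

Normalizing after the noise is added keeps every vector on the unit sphere, which is why the option is useful for cosine similarity datasets.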

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Ian Hoang <[email protected]>

Add support for dense and sparse vectors in synthetic data generation.

42ea8ab

Signed-off-by: Ian Hoang <[email protected]>

IanHoang requested review from OVI3D0, VijayanB, beaioun, gkamat and rishabh6788 as code owners November 5, 2025 21:06

IanHoang closed this Nov 5, 2025
