@IanHoang IanHoang commented Nov 5, 2025

Description

This PR adds support for generating synthetic dense vectors and sparse vectors in OpenSearch Benchmark.

The following has been added:

  • Two new generators have been added, and the Pydantic models have been modified, to support generating knn_vectors (dense vectors in OpenSearch) and sparse vectors.
  • The Pydantic models have also received small enhancements that make it easier to add support for future mapping field types.

Documentation will be added in a separate PR. It will show users how to use this feature, from basic to advanced usage, and how to verify that the generated data is distributed accurately.

Issues Resolved

#981

Testing

  • Generated test documents with knn_vectors and sparse vectors
  • Generated an entire corpus with both
  • Ran scripts to confirm that the generated distributions are accurate
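
The verification scripts themselves are not included in this PR. As a rough illustration only, the kind of sanity check they might perform could look like the sketch below. All function names here are hypothetical and are not part of OpenSearch Benchmark; the sketch only assumes that dense vectors are lists of floats and sparse vectors are token-to-weight mappings, as in the examples in this description.

```python
import math
import statistics


def l2_norm(vector):
    """Return the Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def check_dense_vector(vector, dimension, expect_unit_norm=False, tol=1e-6):
    """Hypothetical check: correct length and, optionally, unit norm."""
    if len(vector) != dimension:
        return False
    if expect_unit_norm and abs(l2_norm(vector) - 1.0) > tol:
        return False
    return True


def check_sparse_vector(sparse, num_tokens, min_weight, max_weight):
    """Hypothetical check: expected token count and weights within bounds."""
    if len(sparse) != num_tokens:
        return False
    return all(min_weight <= w <= max_weight for w in sparse.values())


def summarize(values):
    """Return (mean, population standard deviation) for comparison against
    the parameters the generator was configured with."""
    return statistics.mean(values), statistics.pstdev(values)


if __name__ == "__main__":
    print(check_dense_vector([0.6, 0.8], dimension=2, expect_unit_norm=True))
    print(check_sparse_vector({"1000": 0.7519, "1100": 0.4866}, 2, 0.01, 1.0))
    print(summarize([0.7519, 0.4866, 0.3432]))
```

Running checks like these over a sample of generated documents gives a quick signal that dimensions, bounds, and normalization match the configuration before a full corpus is generated.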

Example dense vectors generated

  "embedding_small": [
    0.5774467022710263,
    -0.8122808264515302,
    -0.9433050469559874
  ],
  "embedding_medium": [
    -0.09282607742135292,
    0.11205520855155038,
    -0.015302718833961362,
    0.13413846806568544,
    -0.10256925456890523,
    0.05507756403574536,
    -0.016291547532004385,
    0.10501385023442619,
    -0.07995607642893765,
    0.050937655681775315,
    -0.10521870350426658,
    0.09023463946852839,
    -0.042059970512528126,
    0.14183890032503618,
    -0.10401004224261967,
    0.06885905240898517,
    -0.04703317322533611,
    0.13484244839871967,
    -0.09587562605753583,
    0.05053255364747053,
    -0.0655384534526021,
    0.11210001623206463,
    -0.046876997474924166,
    0.137374785402781,
    -0.06880274649931827,
    0.07124468674743567,
    -0.07271733849171332
  ]

Example sparse vectors

  "sparse_embedding_default": {
    "1000": 0.7519,
    "1100": 0.4866,
    "1200": 0.3432,
    "1300": 0.4616,
    "1400": 0.1253,
    "1500": 0.361,
    "1600": 0.421,
    "1700": 0.028,
    "1800": 0.1804,
    "1900": 0.2676
  },
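
To make the shape of the sparse output concrete, here is a minimal sketch of how a token-to-weight mapping like the one above could be produced from the sparse_vector parameters used in this PR (num_tokens, min_weight, max_weight, token_id_start, token_id_step). This is not the generator added in the PR: the function name is hypothetical, and drawing weights uniformly and rounding to four decimals are assumptions inferred from the example output.

```python
import random


def sketch_sparse_vector(num_tokens=10, min_weight=0.01, max_weight=1.0,
                         token_id_start=1000, token_id_step=100, rng=None):
    """Hypothetical sketch: build a {token_id: weight} mapping.

    Token IDs are evenly spaced starting at token_id_start. Weights are
    drawn uniformly from [min_weight, max_weight] (an assumption) and
    rounded to 4 decimal places to mirror the example output.
    """
    rng = rng or random.Random()
    return {
        str(token_id_start + i * token_id_step):
            round(rng.uniform(min_weight, max_weight), 4)
        for i in range(num_tokens)
    }


if __name__ == "__main__":
    # A seeded generator keeps the sketch reproducible between runs.
    print(sketch_sparse_vector(rng=random.Random(42)))
```

With the defaults shown, the keys run from "1000" to "1900" in steps of 100, which matches the token IDs in the example.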

Example config

generator_overrides:
    knn_vector:
      dimension: 128
    sparse_vector:
      num_tokens: 10
      min_weight: 0.01
      max_weight: 1.0
      token_id_start: 1000
      token_id_step: 100

field_overrides:
    # Small vector (3D) - Simple random generation
    embedding_small:
      generator: generate_knn_vector
      params:
        dimension: 3 

    # Medium vector (128D) - Sample-based with Gaussian noise
    embedding_medium:
      generator: generate_knn_vector
      params:
        dimension: 128
        # Provide sample base vectors to add noise to
        sample_vectors:
          - [0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3, 0.1, -0.15, 0.2, 0.35, -0.4, 0.1, 0.2, 0.15, -0.3, 0.5, -0.1, 0.25, 0.4, -0.2, 0.3]
          - [-0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35, -0.25, 0.15, -0.2, 0.3, -0.1, 0.4, -0.3, 0.2, -0.15, 0.35]
          - [0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.35, 0.5, -0.4, 0.3, -0.2, 0.1, 0.4, -0.3, 0.2]
        noise_factor: 0.05      # Standard deviation for Gaussian noise
        distribution_type: gaussian  # Can be gaussian or uniform
        normalize: true         # Normalize after adding noise (useful for cosine similarity datasets)
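
For readers who want an intuition for what the sample-based parameters do, the following is a minimal sketch under stated assumptions, not the implementation in this PR. The function name is hypothetical. It assumes that one sample vector is picked at random per document, that uniform noise is drawn from [-noise_factor, noise_factor], and that vectors without samples are drawn uniformly from [-1, 1].

```python
import math
import random


def sketch_knn_vector(dimension, sample_vectors=None, noise_factor=0.05,
                      distribution_type="gaussian", normalize=False, rng=None):
    """Hypothetical sketch of dense vector generation."""
    rng = rng or random.Random()

    if sample_vectors:
        # Assumption: pick one base vector at random, then perturb it.
        base = rng.choice(sample_vectors)
        if len(base) != dimension:
            raise ValueError("sample vector length must equal dimension")
        if distribution_type == "gaussian":
            # noise_factor acts as the standard deviation of the noise.
            vector = [x + rng.gauss(0.0, noise_factor) for x in base]
        elif distribution_type == "uniform":
            # Assumption: uniform noise is bounded by +/- noise_factor.
            vector = [x + rng.uniform(-noise_factor, noise_factor) for x in base]
        else:
            raise ValueError(f"unknown distribution_type: {distribution_type}")
    else:
        # Assumption: simple random generation draws each value from [-1, 1].
        vector = [rng.uniform(-1.0, 1.0) for _ in range(dimension)]

    if normalize:
        # Scale to unit L2 norm, which is useful for cosine similarity data.
        norm = math.sqrt(sum(x * x for x in vector))
        if norm > 0:
            vector = [x / norm for x in vector]
    return vector


if __name__ == "__main__":
    samples = [[0.1, 0.2, 0.15, -0.3], [-0.2, 0.3, -0.1, 0.4]]
    print(sketch_knn_vector(4, sample_vectors=samples, normalize=True,
                            rng=random.Random(7)))
```

Normalizing after the noise is added, as the normalize comment in the config describes, keeps every generated vector on the unit sphere even though the noise changes its length.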


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
