Add benchmark_multi_table_aws #487

@amontanez24

Problem Description

As a user, I'd like a way to run the multi-table benchmark on an EC2 instance, since I do not have the compute power to do so on my machine.
We want to add a multi-table version of benchmark_single_table_aws.

Expected behavior

Add a new function to the sdgym.benchmark module:

def benchmark_multi_table_aws(
    output_destination,
    aws_access_key_id=None,
    aws_secret_access_key=None,
    synthesizers=['HMASynthesizer', 'MultiTableUniformSynthesizer'],
    custom_synthesizers=None,
    sdv_datasets=['NBA', 'financial', 'Student_loan', 'Biodegradability', 'fake_hotels', 'restbase', 'airbnb-simplified'],
    additional_datasets_folder=None,
    limit_dataset_size=False,
    compute_quality_score=True,
    compute_diagnostic_score=True,
    timeout=None,
    show_progress=False
):
    """
    Args:
        output_destination (str):
            An S3 bucket or filepath. The results output folder will be written here.
            Should be structured as:
            s3://{s3_bucket_name}/{path_to_file} or s3://{s3_bucket_name}.
        aws_access_key_id (str): The AWS access key id. Optional
        aws_secret_access_key (str): The AWS secret access key. Optional
        synthesizers (list[string] | sdgym.synthesizer.BaselineSynthesizer): List of synthesizers to use.
        sdv_datasets (list[str] or ``None``):Names of the SDV demo datasets to use for the benchmark. 
        additional_datasets_folder (str or ``None``): The path to an S3 bucket. Datasets found in this folder are
            run in addition to the SDV datasets. If ``None``, no additional datasets are used.
        limit_dataset_size (bool):
            We should still limit the dataset to 10 columns per table (not including primary/foreign keys). 
            But as for the # of rows: The overall dataset needs to be subsampled with referential integrity.
            We should use the [get_random_subset](https://docs.sdv.dev/sdv/multi-table-data/data-preparation/cleaning-your-data#get_random_subset) function to perform the subsample.
            For the main table, select the table with the larges # of rows; and for num rows, set it to 1000.
        compute_quality_score (bool):
            Whether or not to evaluate an overall quality score. In this case we should use the MultiTableQualityReport.
        compute_diagnostic_score (bool):
            Whether or not to evaluate an overall diagnostic score. In this case we should use the MultiTableDiagnosticReport.
        timeout (int or ``None``):
            The maximum number of seconds to wait for synthetic data creation. If ``None``, no
            timeout is enforced.
    """
    
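Here is a minimal sketch of the limit_dataset_size subsampling described in the docstring, assuming get_random_subset is importable from sdv.utils.poc with the signature shown in the linked SDV docs (the helper name _subsample_multi_table is hypothetical):

from sdv.utils.poc import get_random_subset

def _subsample_multi_table(data, metadata, num_rows=1000):
    # `data` is a dict mapping table names to DataFrames; `metadata` is the
    # SDV multi-table metadata object. The 10-column-per-table cap is a
    # separate step and is not shown here.
    # Use the table with the largest number of rows as the main table.
    main_table_name = max(data, key=lambda name: len(data[name]))
    # get_random_subset subsamples while preserving referential integrity.
    return get_random_subset(
        real_data=data,
        metadata=metadata,
        main_table_name=main_table_name,
        num_rows=num_rows,
    )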

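Similarly, a sketch of the scoring step, assuming MultiTableQualityReport and MultiTableDiagnosticReport refer to sdmetrics' multi-table QualityReport and DiagnosticReport (the helper name and the metadata-as-dict convention are assumptions):

from sdmetrics.reports.multi_table import DiagnosticReport, QualityReport

def _score_synthetic_data(real_data, synthetic_data, metadata_dict,
                          compute_quality_score=True, compute_diagnostic_score=True):
    # `metadata_dict` is the multi-table metadata as a plain dict,
    # e.g. metadata.to_dict() in SDV.
    scores = {}
    if compute_quality_score:
        quality = QualityReport()
        quality.generate(real_data, synthetic_data, metadata_dict, verbose=False)
        scores['quality_score'] = quality.get_score()
    if compute_diagnostic_score:
        diagnostic = DiagnosticReport()
        diagnostic.generate(real_data, synthetic_data, metadata_dict, verbose=False)
        scores['diagnostic_score'] = diagnostic.get_score()
    return scores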
Additional context

  • Once Add benchmark_multi_table function #486 is done, this should be relatively straightforward. You just have to adapt the startup script that we give the EC2 instance to use the benchmark_multi_table function.
  • Consider that we may add support for other cloud services (like GCP). This means we should abstract things in a way that any cloud can be plugged in. A sketch of one possible shape follows this list.
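As a sketch of what that abstraction could look like (the CloudProvider/EC2Provider names, the AMI, and the instance type are all hypothetical; only the boto3 run_instances call and its UserData startup-script mechanism are real AWS APIs):

import abc

import boto3

class CloudProvider(abc.ABC):
    """Abstracts machine provisioning so other clouds (e.g. GCP) can be plugged in."""

    @abc.abstractmethod
    def run_benchmark(self, startup_script):
        """Launch a machine that runs the given startup script."""

class EC2Provider(CloudProvider):

    def __init__(self, aws_access_key_id=None, aws_secret_access_key=None):
        self._client = boto3.client(
            'ec2',
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key,
        )

    def run_benchmark(self, startup_script):
        # The AMI and instance type are placeholders; the real values would
        # match whatever benchmark_single_table_aws already uses.
        return self._client.run_instances(
            ImageId='ami-xxxxxxxxxxxxxxxxx',
            InstanceType='m5.xlarge',
            MinCount=1,
            MaxCount=1,
            UserData=startup_script,
        )

# The startup script is the piece that changes for this issue: it should
# invoke benchmark_multi_table (from #486) instead of benchmark_single_table.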
