Unable to use a custom synthesizer with a timeout on a Colab Notebook #368

@npatki

Environment Details

  • SDGym version: 0.9.1 (latest)
  • Operating System: Linux, environment is a Colab Notebook

Error Description

In a Colab Notebook environment, I am unable to get results for custom synthesizers if I supply a timeout value. The synthesizer shows up in the results DataFrame, but all the associated values for it are NaN (even the ones for dataset size, initialization, etc.). All of the other, pre-defined synthesizers produce values.

This problem goes away if I remove the timeout value, or if I run the script on my local machine instead. So it is the combination of the following that causes the issue:

  • (a) running on a Colab notebook (or likely any interactive environment), and
  • (b) adding a custom synthesizer to the benchmark, and
  • (c) adding a timeout
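
Since removing the timeout avoids the problem, one possible stopgap is to drop the timeout only when the script runs interactively. The sketch below is a best-effort check, not part of SDGym: `running_in_notebook` and `choose_timeout` are hypothetical helper names, and the detection relies on the `google.colab` module being loaded or an active IPython shell being present.

```python
import sys


def running_in_notebook() -> bool:
    """Best-effort check for Colab or another IPython-based session."""
    # Colab loads the google.colab package into the running kernel.
    if 'google.colab' in sys.modules:
        return True
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook kernel.
        return False
    # get_ipython() returns None outside of an IPython session.
    return get_ipython() is not None


def choose_timeout(requested_seconds, interactive=None):
    """Return None (no timeout) in interactive sessions, else the requested value."""
    if interactive is None:
        interactive = running_in_notebook()
    return None if interactive else requested_seconds


timeout = choose_timeout(20 * 60)
print(timeout)
```

The resulting value could then be passed as `timeout=timeout` to `sdgym.benchmark_single_table`. This only sidesteps the bug by giving up the timeout in notebooks; it does not fix it.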

Steps to reproduce

The code below creates a custom synthesizer that is just a variant of GaussianCopula (setting marginals to uniform). Then it tries to run the benchmark for it.

import sdgym

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdgym import create_single_table_synthesizer

def get_trained_synthesizer(data, metadata):
    metadata_obj = Metadata.load_from_dict(metadata)
    synthesizer = GaussianCopulaSynthesizer(metadata_obj, default_distribution='uniform')
    synthesizer.fit(data)
    return synthesizer

def sample_from_synthesizer(synthesizer, n_rows):
    return synthesizer.sample(n_rows)

GCUniformSynthesizer = create_single_table_synthesizer(
    get_trained_synthesizer_fn=get_trained_synthesizer,
    sample_from_synthesizer_fn=sample_from_synthesizer,
    display_name='GCUniform'
)

results = sdgym.benchmark_single_table(
    synthesizers=['GaussianCopulaSynthesizer'],
    custom_synthesizers=[GCUniformSynthesizer],
    sdv_datasets=['KRK_v1'],
    limit_dataset_size=False,
    timeout=20*60, # 20 min
    output_filepath='results.csv',
    detailed_results_folder='/content/results',
    sdmetrics=[]
)

This script works as expected on a terminal. But if I run it on a Colab Notebook, I see NaN values produced:
[image: screenshot of the results DataFrame]
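
To check for the symptom without reading the table by eye, a small helper like the one below can list the synthesizers whose result rows are entirely NaN. This is only a sketch: `synthesizers_with_no_results` is a hypothetical helper, and the column names (`Synthesizer`, `Train_Time`, `Sample_Time`) in the toy DataFrame are assumptions that stand in for the real results.

```python
import pandas as pd


def synthesizers_with_no_results(results: pd.DataFrame,
                                 name_column: str = 'Synthesizer') -> list:
    """Return the names of synthesizers whose other columns are all NaN."""
    value_columns = [col for col in results.columns if col != name_column]
    # A row is affected only if every value column is missing.
    all_nan = results[value_columns].isna().all(axis=1)
    return results.loc[all_nan, name_column].tolist()


# Toy data mimicking the reported symptom (column names are assumed).
toy = pd.DataFrame({
    'Synthesizer': ['GaussianCopulaSynthesizer', 'GCUniform'],
    'Train_Time': [1.5, float('nan')],
    'Sample_Time': [0.2, float('nan')],
})
print(synthesizers_with_no_results(toy))
```

On the real output, the same call would be `synthesizers_with_no_results(results)`, adjusting `name_column` if the results use a different header.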

Additional Context

According to @frances-h: We ran into a similar issue when working on this PR.
