
Numerical column considered as regional? #459

@miaoli-04


Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 1.25.0 (sdv)
  • Python version: 3.12.8
  • Operating System:

Error Description

When trying to fit the "automobile" dataset from the UCI Machine Learning Repository, the 'city-mpg' column, which is continuous, seems to be interpreted as a location, and a column of strings is generated in the synthetic data. This might be related to the column name: if I rename the column to 'mpg', a column of the correct data type is returned.
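
The renaming workaround can be sketched roughly as follows. This is only an illustration, not SDV's own fix: `build_rename_maps` is a hypothetical helper, the column list is an illustrative subset, and the sketch assumes the misdetection is triggered by the column name alone, as the behaviour above suggests.

```python
def build_rename_maps(columns, overrides):
    """Return (forward, backward) rename maps for columns to disguise.

    `overrides` maps an original column name to a temporary neutral name.
    The helper and the names used below are illustrative, not part of SDV.
    """
    forward = {old: new for old, new in overrides.items() if old in columns}
    # Refuse temporary names that would collide with an existing column.
    clashes = [new for new in forward.values() if new in columns]
    if clashes:
        raise ValueError(f"temporary names already in use: {clashes}")
    backward = {new: old for old, new in forward.items()}
    return forward, backward


# Illustrative subset of column names; only 'city-mpg' matters here.
columns = ["highway-mpg", "city-mpg", "horsepower"]
forward, backward = build_rename_maps(columns, {"city-mpg": "mpg"})
print(forward)   # {'city-mpg': 'mpg'}
print(backward)  # {'mpg': 'city-mpg'}

# With pandas and SDV installed, the maps would be applied roughly like this:
#   X_safe = X.rename(columns=forward)
#   metadata = Metadata.detect_from_dataframe(X_safe)
#   ... fit and sample as usual ...
#   synthetic_data = synthetic_data.rename(columns=backward)
```

Renaming back after sampling keeps the synthetic table's schema identical to the original, so downstream code does not need to know about the temporary name.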

Steps to reproduce

import pandas as pd
from sdv.metadata import Metadata
from sdv.single_table import CTGANSynthesizer
from ucimlrepo import fetch_ucirepo

# Fetch the "automobile" dataset (id=10) from the UCI repository
automobile = fetch_ucirepo(id=10)

# Features as a pandas DataFrame
X = automobile.data.features

# Auto-detect the metadata from the DataFrame
metadata = Metadata.detect_from_dataframe(X)

# Fit CTGAN and sample synthetic rows
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(X)
synthetic_data = synthesizer.sample(num_rows=1000)
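
To see which columns were misdetected before fitting, one option is to compare the detected sdtypes with the columns that actually hold numeric data. A minimal sketch follows; `find_suspect_columns` is a hypothetical helper, the sample values are made up for illustration, and the commented usage assumes that `Metadata.to_dict()` exposes a `tables` → table → `columns` layout and that `Metadata.update_column` accepts `column_name` and `sdtype`, which has not been verified against SDV 1.25.0.

```python
def find_suspect_columns(detected_sdtypes, numeric_columns):
    """List numeric columns whose detected sdtype is not 'numerical'.

    detected_sdtypes: dict of column name -> sdtype string.
    numeric_columns: iterable of column names that hold numeric data.
    Columns missing from detected_sdtypes are also reported.
    """
    return [
        column
        for column in numeric_columns
        if detected_sdtypes.get(column) != "numerical"
    ]


# Illustrative values only; 'city' stands in for whatever non-numerical
# sdtype the detector actually assigned to the column.
detected = {"city-mpg": "city", "highway-mpg": "numerical"}
suspects = find_suspect_columns(detected, ["city-mpg", "highway-mpg"])
print(suspects)  # ['city-mpg']

# Assumed usage with pandas and SDV:
#   table = next(iter(metadata.to_dict()["tables"].values()))
#   detected = {name: spec["sdtype"] for name, spec in table["columns"].items()}
#   numeric = X.select_dtypes(include="number").columns
#   for column in find_suspect_columns(detected, numeric):
#       metadata.update_column(column_name=column, sdtype="numerical")
```

Overriding the sdtype this way would keep the original column name, unlike the renaming workaround.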

Labels: bug, under discussion
