SMOGN is creating a new class for target #38

@purp172

Hey!
Any idea why the algorithm is creating a new class (value) for my target? I'm analyzing the Room_Occupancy_Dataset from Kaggle. In this dataset the target has only four occupancy values (0, 1, 2, or 3 people in the room), although the model is expected to be able to predict cases with more than 3 people in the room. SMOGN is not balancing the data correctly: the majority class (0) keeps the same number of samples, and the minority classes (1, 2, 3) are not over-sampled. On top of that, it creates an extra value (4). I don't know if this is a bug, but I hope you can help me fix it. Here are my 2D array of relevance control points and the call to smogn.smoter:

import smogn

rg_mtrx = [
    [0, 0, 0],  ## under-sample ("majority")
    [1, 1, 0],  ## over-sample ("minority")
    [2, 1, 0],  ## over-sample ("minority")
    [3, 1, 0],  ## over-sample ("minority")
]

## conduct smogn
balanced_smogn = smogn.smoter(

    ## main arguments
    data = df,                   ## pandas dataframe
    y = 'Room_Occupancy_Count',  ## string ('header name')
    k = 5,                       ## positive integer (k < n)
    pert = 0.02,                 ## real number (0 < R < 1)
    samp_method = 'extreme',     ## string ('balance' or 'extreme')
    drop_na_col = False,         ## boolean (True or False)
    drop_na_row = False,         ## boolean (True or False)
    replace = True,              ## boolean (True or False)

    ## phi relevance arguments
    rel_thres = 0.50,         ## real number (0 < R < 1)
    rel_method = 'manual',    ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both', ## unused (rel_method = 'manual')
    # rel_coef = 1.50,        ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg = rg_mtrx ## 2d array (format: [x, y])
)
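
To make the symptom concrete, the target distribution before and after resampling can be compared. The following is only a minimal sketch, not part of smogn: `compare_target_counts` is a hypothetical helper, and the toy lists stand in for the real `Room_Occupancy_Count` column, which is not reproduced here.

```python
from collections import Counter


def compare_target_counts(before, after):
    """Summarize how a target's value counts changed after resampling.

    `before` and `after` are any iterables of target values, for example
    plain lists or pandas Series (hypothetical helper, not a smogn API).
    """
    before_counts = Counter(before)
    after_counts = Counter(after)

    # Values that appear only after resampling (e.g. an unexpected "4")
    new_values = sorted(set(after_counts) - set(before_counts))

    # Per-value change in count: positive = more rows, negative = fewer rows
    all_values = sorted(set(before_counts) | set(after_counts))
    deltas = {v: after_counts[v] - before_counts[v] for v in all_values}

    return {"new_values": new_values, "deltas": deltas}


# Toy data only (assumed for illustration, not taken from the real dataset)
toy_before = [0, 0, 0, 0, 1, 2, 3]
toy_after = [0, 0, 0, 0, 1, 2, 3, 4]

report = compare_target_counts(toy_before, toy_after)
print(report)
```

With the real data, the same helper could be called as `compare_target_counts(df['Room_Occupancy_Count'], balanced_smogn['Room_Occupancy_Count'])`. A non-empty `new_values` list would confirm that values outside the original set were generated, and zero deltas for 0, 1, 2 and 3 would confirm that no class was over- or under-sampled.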
