Reconsider balancing data by resampling? #323

@joelostblom

For the future, maybe we should reconsider the recommendation to rebalance data by duplicating observations in this section: https://python.datasciencebook.ca/pull317/classification1.html#balancing. Both this year and last, I have encountered students who find that the optimal K is 1 when they do this. From visualizing the data, it is impossible to see why: an exact copy of the data point they are predicting is hiding underneath it. Maybe this can be avoided by sampling in a smarter way, but we introduce rebalancing in the first classification chapter, before any evaluation has been introduced, so if we switch to a smarter resampling we might have to move it later as a more advanced topic (I am not sure what this would look like or whether it is possible).
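
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the book's code: the helper names (`upsample_by_duplication`, `loo_1nn_recall`, `count_rows_with_exact_copy`) and the synthetic two-class data are assumptions made up for illustration. It upsamples the minority class by duplicating rows, then runs a leave-one-out check with K=1 and counts how many rows have an exact copy elsewhere in the data.

```python
import random

def upsample_by_duplication(rows, labels, minority, rng):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    n_extra = n_majority - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return rows + [rows[i] for i in extra], labels + [labels[i] for i in extra]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def loo_1nn_recall(rows, labels, target):
    """Leave-one-out recall for `target` using a 1-nearest-neighbour rule."""
    hits = total = 0
    for i, (x, y) in enumerate(zip(rows, labels)):
        if y != target:
            continue
        # Nearest other row; an exact duplicate sits at distance 0.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: squared_distance(x, rows[j]),
        )
        hits += labels[nearest] == target
        total += 1
    return hits / total

def count_rows_with_exact_copy(rows):
    """Number of rows whose coordinates appear more than once."""
    counts = {}
    for r in rows:
        counts[r] = counts.get(r, 0) + 1
    return sum(1 for r in rows if counts[r] > 1)

# Hypothetical, heavily overlapping synthetic data: 50 majority, 10 minority points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
minority = [(rng.gauss(0.5, 1), rng.gauss(0.5, 1)) for _ in range(10)]
rows = majority + minority
labels = ["majority"] * 50 + ["minority"] * 10

balanced_rows, balanced_labels = upsample_by_duplication(rows, labels, "minority", rng)

recall_before = loo_1nn_recall(rows, labels, "minority")
recall_after = loo_1nn_recall(balanced_rows, balanced_labels, "minority")
n_with_copy = count_rows_with_exact_copy(balanced_rows)

print("rows with an exact copy after upsampling:", n_with_copy)
print("minority recall, K=1, before upsampling:", recall_before)
print("minority recall, K=1, after upsampling: ", recall_after)
```

Every duplicated row has its original at distance 0, so whenever one copy is held out the other is still among the training points and K=1 predicts it correctly. That inflates the K=1 score for reasons that have nothing to do with how well the classifier generalizes, and a scatter plot cannot show it because the copies are drawn on top of each other.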
