Consider Filtering Datasets #115

@Innixma

We should reconsider which datasets to use in TabRepo 2.0 relative to TabRepo 1.0.

Some datasets to consider removing:

  1. OVA_*: 6 datasets with many features, possibly image data? They might not be relevant for tabular evaluation, and they are very expensive to fit models on.
  2. volcanoes*: 10 datasets, all small and relatively simple? We could keep only a few of them.
  3. arcene: Many features, few rows, maybe worth removing?
  4. fri_*: 10 datasets that seem to be very similar?
  5. GAMETES_*: 6 datasets that seem to be similar?
  6. car: Often perfectly solvable; the winner is the model with the lowest epsilon value.
  7. kc1: synthetic?

Can we come up with an automated mechanism for determining if a dataset should be removed?
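
As a starting point for discussion, here is a minimal sketch of what such a mechanism could look like. It only turns the three patterns from the list above into rules: large families of similarly named datasets, many features relative to rows, and near-perfect solvability. Everything in it is an assumption for illustration: `DatasetInfo`, `family_of`, `flag_datasets`, and all thresholds are hypothetical and not part of TabRepo, and grouping by name prefix is only a crude proxy for "similar".

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    # Hypothetical per-dataset summary; not a TabRepo data structure.
    name: str
    n_rows: int          # assumed to be > 0
    n_features: int
    best_error: float    # lowest test error reached by any model, in [0, 1]


def family_of(name: str) -> str:
    """Name prefix before the first '_', '-' or digit, e.g. 'fri_c0_100_5' -> 'fri'."""
    prefix = re.split(r"[_\-\d]", name, maxsplit=1)[0]
    return prefix or name


def flag_datasets(
    datasets,
    max_family_size=3,          # assumed threshold: flag families larger than this
    max_feature_row_ratio=1.0,  # assumed threshold: flag if features / rows exceeds this
    min_best_error=1e-3,        # assumed threshold: flag if the best model is near-perfect
):
    """Return {dataset name: [reasons]} for datasets that trip at least one rule."""
    families = defaultdict(list)
    for d in datasets:
        families[family_of(d.name)].append(d.name)

    flags = {}
    for d in datasets:
        reasons = []
        if len(families[family_of(d.name)]) > max_family_size:
            reasons.append("large family of similar datasets")
        if d.n_features / d.n_rows > max_feature_row_ratio:
            reasons.append("high feature-to-row ratio")
        if d.best_error < min_best_error:
            reasons.append("near-perfectly solvable")
        if reasons:
            flags[d.name] = reasons
    return flags
```

The output is a list of candidates with reasons rather than a removal decision, so a flagged dataset would still need a manual look (for example, to keep a few members of a large family instead of dropping all of them).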
