Consider Filtering Datasets

We should reconsider which datasets to include in TabRepo 2.0 relative to TabRepo 1.0.

Some datasets we might consider removing:

1. OVA_*: 6 datasets with many features, potentially images? They might not be relevant for tabular evaluation, and they are very expensive to fit models on.
2. volcanoes*: 10 datasets, all small and relatively simple? We could keep only a few instead of all 10.
3. arcene: Many features and few rows; maybe worth removing?
4. fri_*: 10 datasets that seem to be very similar to each other?
5. GAMETES_*: 6 datasets that seem to be similar to each other?
6. car: Often perfectly solvable, so the winner is simply the model with the lowest epsilon value.
7. kc1: Possibly synthetic?

Can we come up with an automated mechanism for determining whether a dataset should be removed?
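
As a starting point for discussion, here is a minimal sketch of what a rule-based filter could look like. It is not an existing TabRepo API: `DatasetInfo`, `family`, `flag_datasets`, the thresholds, and the toy entries are all hypothetical, and the numbers are made-up placeholders rather than real dataset statistics. The three rules loosely mirror the concerns above: large families of near-duplicate datasets (like fri_* or volcanoes*), more features than rows (like arcene), and datasets that are solved almost perfectly (like car). It has no rule for detecting synthetic data.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    """Hypothetical per-dataset metadata; not an existing TabRepo type."""
    name: str
    n_rows: int
    n_features: int
    best_error: float  # assumed: lowest validation error over all models


def family(name: str) -> str:
    """Group datasets by their leading alphabetic prefix, e.g. 'grp_1' -> 'grp'."""
    match = re.match(r"[A-Za-z]+", name)
    return match.group(0) if match else name


def flag_datasets(infos, max_family_size=3, max_feature_ratio=1.0, min_best_error=1e-3):
    """Return {dataset name: [reasons]} for datasets that trip any heuristic.

    All thresholds are illustrative defaults, not tuned values.
    """
    family_sizes = Counter(family(d.name) for d in infos)
    flags = {}
    for d in infos:
        reasons = []
        # Many datasets sharing one prefix may be near-duplicates.
        if family_sizes[family(d.name)] > max_family_size:
            reasons.append("large_family")
        # More features than rows (scaled by max_feature_ratio).
        if d.n_features > max_feature_ratio * d.n_rows:
            reasons.append("more_features_than_rows")
        # The best model is essentially perfect, so rankings carry little signal.
        if d.best_error < min_best_error:
            reasons.append("trivially_solvable")
        if reasons:
            flags[d.name] = reasons
    return flags


# Toy entries with placeholder values, only to exercise the rules.
infos = [
    DatasetInfo("grp_1", 500, 10, 0.12),
    DatasetInfo("grp_2", 500, 10, 0.11),
    DatasetInfo("grp_3", 500, 10, 0.13),
    DatasetInfo("grp_4", 500, 10, 0.10),
    DatasetInfo("wide", 100, 10000, 0.20),
    DatasetInfo("easy", 1700, 6, 0.0),
    DatasetInfo("plain", 5000, 20, 0.15),
]
flags = flag_datasets(infos)
for name, reasons in flags.items():
    print(name, reasons)
```

A flag here would only mark a dataset for manual review, not remove it automatically. For large families, a follow-up step could keep a few representatives instead of dropping the whole group.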

Consider Filtering Datasets #115

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Consider Filtering Datasets #115

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions