We should decide which datasets to use for TabRepo 2.0, and which of the TabRepo 1.0 datasets to drop.
Some datasets worth considering for removal:
- OVA_*: 6 datasets, many features, potentially images? Might not be relevant for tabular evaluation. Also very expensive to fit models on.
- volcanoes*: 10 datasets, all small and relatively simple? Could keep only a few instead of having so many?
- arcene: Many features, few rows, maybe worth removing?
- fri_*: 10 datasets that seem to be very similar?
- GAMETES_*: 6 datasets that seem to be similar?
- car: Often perfectly solvable, so the winner is just the model with the lowest epsilon value.
- kc1: synthetic?
Can we come up with an automated mechanism for determining if a dataset should be removed?
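As a starting point for discussion, here is a rough sketch of two automated checks that follow from the concerns above: one flags datasets that are (near-)perfectly solvable, the other flags pairs of datasets whose per-model error profiles are almost identical. Everything here is an assumption for illustration: the function names, the thresholds, the input layout (`dataset -> model -> error`), and the toy numbers are all made up and are not part of TabRepo's API or results.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_saturated(errors, tol=1e-3):
    """Datasets whose best model error is at most `tol` (hypothetical threshold)."""
    return sorted(d for d, per_model in errors.items()
                  if min(per_model.values()) <= tol)


def flag_redundant(errors, threshold=0.98):
    """Dataset pairs whose error profiles across shared models correlate >= threshold."""
    # Only compare on models that were evaluated on every dataset.
    models = sorted(set.intersection(*(set(m) for m in errors.values())))
    pairs = []
    for a, b in combinations(sorted(errors), 2):
        r = pearson([errors[a][m] for m in models],
                    [errors[b][m] for m in models])
        if r >= threshold:
            pairs.append((a, b))
    return pairs


# Toy, invented error rates (NOT real TabRepo results); dataset names are placeholders.
errors = {
    "solved":   {"gbm": 0.00, "rf": 0.00, "knn": 0.06, "linear": 0.02},
    "twin_a":   {"gbm": 0.10, "rf": 0.12, "knn": 0.20, "linear": 0.30},
    "twin_b":   {"gbm": 0.11, "rf": 0.13, "knn": 0.21, "linear": 0.31},
    "distinct": {"gbm": 0.13, "rf": 0.14, "knn": 0.17, "linear": 0.12},
}

print(flag_saturated(errors))
print(flag_redundant(errors))
```

With only a handful of models, correlations between error profiles are noisy, so a check like this would probably need many model configurations per dataset and a rank-based correlation to be trustworthy. It also only produces candidates for review; whether a flagged dataset is actually dropped would still be a manual call.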