Hi OpenML team,
I’d like to propose a small usability enhancement to improve the dataset exploration workflow in openml-python.
Feature Request
Add a helper method, dataset.get_missing_summary(), that returns a simple summary of missing values for the dataset.
Motivation
Many users load a dataset and immediately need to check:
- how many missing values exist in total
- which columns contain missing values
- basic column-level counts
Currently, users must manually compute this after calling dataset.get_data().
A built-in helper would reduce repetitive code and improve the dataset exploration experience, especially for new users.
Proposed Behavior
import openml

dataset = openml.datasets.get_dataset(dataset_id)
df, *_ = dataset.get_data()
dataset.get_missing_summary()  # proposed method, not yet part of openml-python
Example output:
{
    "n_missing_total": 235,
    "missing_per_column": {
        "age": 10,
        "income": 20,
        "zipcode": 205
    }
}
Implementation Idea
- Implement this as a method inside the OpenMLDataset class.
- Internally, the method would:
  - Call .get_data()
  - Compute the missing summary using pandas (df.isna().sum())
  - Return a dictionary with overall and per-column counts
No changes needed to the core API; this is an isolated helper.
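To make the idea concrete, here is a rough sketch of the computation as a standalone function. The name missing_summary is hypothetical, and the toy DataFrame only stands in for the frame returned by dataset.get_data(). Listing only the columns that actually contain missing values is an assumption on my part, not something settled above.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> dict:
    """Hypothetical standalone version of the proposed get_missing_summary()."""
    # Per-column count of missing cells (NaN / None / NaT).
    per_column = df.isna().sum()
    return {
        "n_missing_total": int(per_column.sum()),
        # Assumption: only columns with at least one missing value are listed.
        "missing_per_column": {
            str(col): int(n) for col, n in per_column.items() if n > 0
        },
    }


# Toy frame standing in for the DataFrame returned by dataset.get_data().
toy = pd.DataFrame(
    {
        "age": [25, None, 40],
        "income": [None, None, 52000.0],
        "zipcode": ["1011", "2022", "3033"],
    }
)
summary = missing_summary(toy)
print(summary)
```

Inside OpenMLDataset, the method could presumably obtain the DataFrame from get_data() and delegate to logic like this. Casting to plain int and str keeps the result JSON-serializable, since pandas returns NumPy integer types by default.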
Benefits
- Improves ease of use
- No backward compatibility impact
- Lightweight feature (easy to maintain)
- Helps users performing initial dataset checks
Happy to open a PR implementing this.