get_missing_summary() helper method for OpenML datasets #1443

@Faisalhakimi22

Hi OpenML team,

I’d like to propose a small usability enhancement to improve the dataset exploration workflow in openml-python.

Feature Request

Add a helper method, dataset.get_missing_summary(), that returns a simple summary of missing values for the dataset.

Motivation

Many users load a dataset and immediately need to check:

  • how many missing values exist in total
  • which columns contain missing values
  • basic column-level counts

Currently, users must manually compute this after calling dataset.get_data().
A built-in helper would reduce repetitive code and improve the dataset exploration experience, especially for new users.
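For reference, the manual computation usually looks something like the sketch below. It uses a small hand-made DataFrame as a stand-in for the one returned by dataset.get_data(), so the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DataFrame returned by dataset.get_data()
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "age": [25, np.nan, 40, 31],
        "income": [50000.0, 62000.0, np.nan, np.nan],
        "zipcode": ["12345", None, "54321", "67890"],
    }
)

# Column-level counts of missing values
per_column = df.isna().sum()

# Total number of missing cells in the dataset
n_missing_total = int(per_column.sum())

# Names of the columns that contain at least one missing value
columns_with_missing = per_column[per_column > 0].index.tolist()

print(n_missing_total)
print(columns_with_missing)
```

None of this is difficult, but it is the same handful of lines repeated for every new dataset.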

Proposed Behavior

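import openml

dataset = openml.datasets.get_dataset(dataset_id)  # dataset_id: any OpenML dataset ID
df, *_ = dataset.get_data()  # current workflow: load the data, then inspect it by hand

summary = dataset.get_missing_summary()  # proposed helper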

Example output:

{
    "n_missing_total": 235,
    "missing_per_column": {
        "age": 10,
        "income": 20,
        "zipcode": 205
    }
}

Implementation Idea

  • Implement this as a method inside the OpenMLDataset class.
  • Internally, the method would:
  1. Call .get_data()
  2. Compute missing summary using pandas (df.isna().sum())
  3. Return a dictionary with overall and per-column counts

No changes are needed to the core API; this is an isolated helper.
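A minimal sketch of the three steps above, to make the idea concrete. ToyDataset is a hypothetical stand-in for OpenMLDataset so the example runs without network access, and summarize_missing is an illustrative name, not an existing openml-python function. The sketch assumes get_data() returns the DataFrame as its first element (as in the snippet under Proposed Behavior) and that only columns with at least one missing value are listed; both points are open to discussion.

```python
from __future__ import annotations

from typing import Any

import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> dict[str, Any]:
    """Return total and per-column missing-value counts for a DataFrame."""
    per_column = df.isna().sum()
    # Assumption: report only columns that actually contain missing values
    missing = per_column[per_column > 0]
    return {
        "n_missing_total": int(per_column.sum()),
        "missing_per_column": {str(col): int(n) for col, n in missing.items()},
    }


class ToyDataset:
    """Hypothetical stand-in for OpenMLDataset, used only for this sketch."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df

    def get_data(self):
        # Assumption: the DataFrame comes first, followed by other values
        return self._df, None, [], list(self._df.columns)

    def get_missing_summary(self) -> dict[str, Any]:
        df, *_ = self.get_data()       # step 1: load the data
        return summarize_missing(df)   # steps 2 and 3: count and package


toy = ToyDataset(
    pd.DataFrame(
        {
            "age": [25, np.nan, 40],
            "income": [np.nan, np.nan, 30000.0],
            "city": ["Kabul", "Herat", "Balkh"],
        }
    )
)
summary = toy.get_missing_summary()
print(summary)
```

Casting the counts to plain int keeps the returned dictionary free of NumPy scalar types, which makes it easier to print or serialize as JSON.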

Benefits

  • Improves ease of use
  • No backward compatibility impact
  • Lightweight feature (easy to maintain)
  • Helps users performing initial dataset checks

Happy to open a PR implementing this.
