
[ENH] loaders for MONSTER datasets #2985

@baraline

Describe the feature or idea you want to propose

Add loaders for the MONSTER datasets to the datasets module.

The only downside is that we would have to add huggingface_hub as an optional dependency. I'm not sure whether there are other channels we could load the datasets from to avoid adding another dependency.

Describe your proposed solution

Code to load the datasets from Hugging Face:

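import logging

import numpy as np
from aeon.utils.numba.general import z_normalise_series_3d
from huggingface_hub import hf_hub_download

# Module-level logger, used when reporting download failures below
logger = logging.getLogger(__name__)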

univariate_monster_datasets = [
    "CornellWhaleChallenge",
    "AudioMNIST",
    "WhaleSounds",
    "Pedestrian",
    "FruitFlies",
    "AudioMNIST-DS",
    "Traffic",
    "LakeIce",
    "MosquitoSound",
    "InsectSound",
]


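def load_monster(dataset_name, fold, normalize=True):
    """Download a MONSTER dataset from Hugging Face and split it by fold.

    Returns X_train, y_train, X_test, y_test, where the test cases are the
    rows listed in ``test_indices_fold_{fold}.txt`` of the dataset repo.
    """
    repo_id = f"monster-monash/{dataset_name}"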

    # Download data
    data_path = hf_hub_download(
        repo_id=repo_id, filename=f"{dataset_name}_X.npy", repo_type="dataset"
    )
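    # Memory-map the array; shape is (n_cases, n_channels, n_timepoints)
    X = np.load(data_path, mmap_mode="r")
    if normalize:
        # Note: normalising returns a new in-memory array of the same shape
        X = z_normalise_series_3d(X)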
    # Download labels: try "<name>_Y.npy" first, then fall back to "<name>_y.npy"
    label_filename = f"{dataset_name}_Y.npy"
    try:
        label_path = hf_hub_download(
            repo_id=repo_id, filename=label_filename, repo_type="dataset"
        )
    except Exception:  # not a bare except, so KeyboardInterrupt still propagates
        label_filename = f"{dataset_name}_y.npy"
        label_path = hf_hub_download(
            repo_id=repo_id, filename=label_filename, repo_type="dataset"
        )
    y = np.load(label_path)
    # Load test indices
    try:
        test_index_path = hf_hub_download(
            repo_id=repo_id,
            filename=f"test_indices_fold_{fold}.txt",
            repo_type="dataset",
        )
        test_index = np.loadtxt(test_index_path, dtype=int)
    except Exception as e:
        logger.error(f"Failed to load test indices: {e}")
        raise

    test_bool_index = np.zeros(len(y), dtype=bool)
    test_bool_index[test_index] = True
    return (
        X[~test_bool_index],
        y[~test_bool_index],
        X[test_bool_index],
        y[test_bool_index],
    )
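
The train/test split at the end of the function can be tried without downloading anything. The following is a small sketch of that step in isolation on synthetic data. The helper name `split_by_test_indices` and the range check are additions for illustration and are not part of the proposal above.

```python
import numpy as np


def split_by_test_indices(X, y, test_index):
    """Split (X, y) into train and test parts given the test row indices."""
    # ravel() also handles the case where np.loadtxt returns a 0-d array
    # because the indices file contains a single value.
    test_index = np.asarray(test_index, dtype=int).ravel()
    if test_index.size and (test_index.min() < 0 or test_index.max() >= len(y)):
        raise ValueError("test_index contains out-of-range values")
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[test_index] = True
    return X[~test_mask], y[~test_mask], X[test_mask], y[test_mask]


# Synthetic collection: 5 cases, 1 channel, 2 time points
X = np.arange(10, dtype=float).reshape(5, 1, 2)
y = np.array([0, 1, 0, 1, 0])
X_train, y_train, X_test, y_test = split_by_test_indices(X, y, [1, 3])
print(X_train.shape, X_test.shape)  # (3, 1, 2) (2, 1, 2)
print(y_test)  # [1 1]
```

Keeping the split as a separate function would also make it straightforward to unit test the loader without network access.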

Describe alternatives you've considered, if relevant

No response

Additional context

No response

    Labels

    datasets (Datasets and data loaders), enhancement (New feature, improvement request or other non-bug code enhancement), good first issue (Good for newcomers)
