Speed up train/test splits for EvaluationWindow #69

@shchur

Currently we use Dataset.map to select the past and future data for each evaluation window:

fev/src/fev/task.py

Lines 191 to 213 in b091e06

past_data = dataset.map(
    _select_past,
    fn_kwargs=dict(
        columns_to_slice=columns_to_slice,
        timestamp_column=self.timestamp_column,
        cutoff=self.cutoff,
        max_context_length=self.max_context_length,
    ),
    num_proc=min(num_proc, len(dataset)),
    desc="Selecting past data",
)
future_data = dataset.map(
    _select_future,
    fn_kwargs=dict(
        columns_to_slice=columns_to_slice,
        timestamp_column=self.timestamp_column,
        cutoff=self.cutoff,
        horizon=self.horizon,
    ),
    num_proc=min(num_proc, len(dataset)),
    desc="Selecting future data",
)

This caches the intermediate results on disk, which is inefficient and increases the total runtime of the benchmark.

We could instead work directly with the ListArray (https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html) representing each list-like column and generate the splits more efficiently in memory.
