Speed up train/test splits for `EvaluationWindow`

Currently we use `Dataset.map` to select the past/future data for each evaluation window
https://github.com/autogluon/fev/blob/b091e06fd8eb882eb26f22c8fe78542af689ecd0/src/fev/task.py#L191-L213

`Dataset.map` caches the intermediate results on disk, and this overhead increases the total runtime of the benchmark.

We could instead work directly with the `ListArray` (https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html) that represents each list-like column and generate the splits in memory, which should be more efficient.

	past_data = dataset.map(
	    _select_past,
	    fn_kwargs=dict(
	        columns_to_slice=columns_to_slice,
	        timestamp_column=self.timestamp_column,
	        cutoff=self.cutoff,
	        max_context_length=self.max_context_length,
	    ),
	    num_proc=min(num_proc, len(dataset)),
	    desc="Selecting past data",
	)

	future_data = dataset.map(
	    _select_future,
	    fn_kwargs=dict(
	        columns_to_slice=columns_to_slice,
	        timestamp_column=self.timestamp_column,
	        cutoff=self.cutoff,
	        horizon=self.horizon,
	    ),
	    num_proc=min(num_proc, len(dataset)),
	    desc="Selecting future data",
	)

Speed up train/test splits for `EvaluationWindow` #69
