Currently we use `Dataset.map` to select the past/future data for each evaluation window:
Lines 191 to 213 in b091e06
```python
past_data = dataset.map(
    _select_past,
    fn_kwargs=dict(
        columns_to_slice=columns_to_slice,
        timestamp_column=self.timestamp_column,
        cutoff=self.cutoff,
        max_context_length=self.max_context_length,
    ),
    num_proc=min(num_proc, len(dataset)),
    desc="Selecting past data",
)
future_data = dataset.map(
    _select_future,
    fn_kwargs=dict(
        columns_to_slice=columns_to_slice,
        timestamp_column=self.timestamp_column,
        cutoff=self.cutoff,
        horizon=self.horizon,
    ),
    num_proc=min(num_proc, len(dataset)),
    desc="Selecting future data",
)
```
This caches the intermediate results on disk, and the associated serialization overhead increases the total runtime of the benchmark.
We could instead work directly with the `pyarrow.ListArray` (https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html) representing each list-like column and generate the splits more efficiently in memory, for example along the lines of the sketch below.
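As a rough illustration only, here is a minimal sketch of that idea. It assumes `cutoff` is a negative integer offset from the end of each series and `horizon` is a fixed number of steps; the function name `split_list_column`, the inner helper `gather`, and the column name `target` are hypothetical and not part of the existing code.

```python
# Minimal sketch (not the repository's implementation): slice the flat values
# buffer of a pyarrow ListArray directly instead of going through Dataset.map.
import numpy as np
import pyarrow as pa


def split_list_column(column: pa.ChunkedArray, cutoff: int, horizon: int):
    """Split a list-typed column into past/future parts in memory.

    Assumes `cutoff` is a negative offset from the end of each series.
    """
    arr = column.combine_chunks()        # single pa.ListArray
    offsets = np.asarray(arr.offsets)    # shape (num_rows + 1,)
    starts, ends = offsets[:-1], offsets[1:]
    cut = ends + cutoff                  # per-row cutoff position (cutoff < 0)

    def gather(lo: np.ndarray, hi: np.ndarray) -> pa.ListArray:
        lengths = hi - lo
        # flat indices into arr.values for every element we keep
        flat = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])
        new_offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int32)
        values = arr.values.take(pa.array(flat))
        return pa.ListArray.from_arrays(pa.array(new_offsets), values)

    past = gather(starts, cut)
    future = gather(cut, np.minimum(cut + horizon, ends))
    return past, future


# Toy example
table = pa.table({"target": [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]})
past, future = split_list_column(table["target"], cutoff=-2, horizon=2)
print(past)    # expected: [[1.0, 2.0], [5.0]]
print(future)  # expected: [[3.0, 4.0], [6.0, 7.0]]
```

Because the slices are gathered from the flat `values` buffer of the `ListArray`, the whole operation stays in memory and avoids writing intermediate Arrow files to the cache directory.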