Task based refactor #736

markharley · 2022-09-19T17:05:17Z

Why are these changes needed?

Refactor to facilitate improvement of time-series fitting.
Delegate task specific logic to a Task sub-class, simplifying the AutoML class.
Introduce a TimeSeriesDataset class for encapsulation of time-series relevant data operations.
Introduce a TimeSeriesEstimator parent class to standardise time-series models.
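
A minimal sketch of the delegation idea in the list above, not the code in this PR. The names `GenericTask`, `TimeSeriesTask`, `task_factory` and `default_estimators`, and the estimator lists, are hypothetical stand-ins chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class Task(ABC):
    """Holds task-specific logic so the AutoML class does not have to."""

    def __init__(self, task_name: str):
        self.name = task_name

    @abstractmethod
    def default_estimators(self) -> List[str]:
        """Return the estimator names to try for this task."""


class GenericTask(Task):
    def default_estimators(self) -> List[str]:
        return ["lgbm", "xgboost", "rf"]  # illustrative names only


class TimeSeriesTask(Task):
    def default_estimators(self) -> List[str]:
        return ["arima", "prophet"]  # illustrative names only


def task_factory(task_name: str) -> Task:
    """Pick the Task sub-class from the task name (hypothetical helper)."""
    if task_name.startswith("ts_forecast"):
        return TimeSeriesTask(task_name)
    return GenericTask(task_name)


class AutoML:
    """Entry point that only delegates; it holds no per-task branches."""

    def fit(self, task: str) -> List[str]:
        return task_factory(task).default_estimators()
```

With this shape, adding a new kind of task means adding a sub-class and one branch in the factory, while the `AutoML` class stays unchanged.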

Related issue number

N/A

Checks

I've used pre-commit to lint the changes in this PR, or I've made sure lint with flake8 output is two 0s.
I've included any doc changes needed for https://microsoft.github.io/FLAML/. See https://microsoft.github.io/FLAML/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

sonichi · 2022-09-23T16:14:18Z

website/docs/Use-Cases/Task-Oriented-AutoML.md

@@ -1,4 +1,4 @@
-# Task Oriented AutoML
+# GenericTask Oriented AutoML


Is this and other changes from Task->GenericTask in the documentation a mistake?

Hey! Thanks for the review. Yeah, this and a couple of other instances appear to be mistaken replacements, likely a PyCharm refactor going over the top. I've fixed them up now.

flaml/nlp/huggingface/training_args.py

flaml/nlp/huggingface/data_collator.py

flaml/nlp/huggingface/utils.py

flaml/nlp/utils.py

flaml/data.py

flaml/default/estimator.py

liususan091219 · 2022-09-26T14:48:58Z

Was looking at the failed build and noticed that there were some import issues.

Python 3.6 doesn't include data classes, but there is a backport that could be added to the build script (and setup).

Also saw that importing holidays fails. That's not part of the [test] requirements in setup.py (see https://github.com/markharley/FLAML/blob/fafd6524c9466d8ab9c9049bb6102c34cd5819ec/setup.py#L91).

It sounds like new tests were added for time series tasks? Should the requirements in [ts_forecast] and [forecast] also be included in [test] in that case?

The holidays are only used in time series forecasting which is optional. Make sure the import statement is invoked only under the time series environment. See https://github.com/microsoft/FLAML/blob/main/flaml/automl.py#L1096 as an example for nlp (another optional environment).
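
A rough sketch of the guarded-import pattern suggested here, assuming a small helper of our own. `optional_import` is a hypothetical name, not an existing FLAML function, and the install hint reuses the [ts_forecast] extra mentioned above.

```python
import importlib


def optional_import(module_name: str, extra: str):
    """Import an optional dependency only when a feature needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Point the user at the extra that provides the missing package.
        raise ImportError(
            f"{module_name} is required for this task; "
            f"install it with: pip install flaml[{extra}]"
        ) from err
```

Calling `optional_import("holidays", extra="ts_forecast")` inside the forecasting code path, rather than at module top level, means users who never run a time series task do not need the package installed.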

qingyun-wu

Would it be more reasonable to put the "nlp" and "time_series" folders into the "automl" folder?

Co-authored-by: Xueqing Liu <[email protected]>

markharley · 2022-10-02T22:04:01Z

Was looking at the failed build and noticed that there were some import issues.

Python 3.6 doesn't include data classes, but there is a backport that could be added to the build script (and setup).

Also saw that importing holidays fails. That's not part of the [test] requirements in setup.py (see https://github.com/markharley/FLAML/blob/fafd6524c9466d8ab9c9049bb6102c34cd5819ec/setup.py#L91).

It sounds like new tests were added for time series tasks? Should the requirements in [ts_forecast] and [forecast] also be included in [test] in that case?

Hey! Thanks for the review @gmdiana-hershey. I've added dataclasses as a Python-version-dependent install requirement, thanks for spotting it! Holidays turned out not to be used outside of the tests at present, so I've removed the unnecessary test in test_forecast.
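
For reference, a small sketch of what a version-dependent requirement can look like. The requirement strings are the ones shown in the setup.py diff in this thread; `backport_marker_matches` is a hypothetical helper that only mirrors the marker's condition in plain Python. It relies on the fact that dataclasses has been part of the standard library since Python 3.7.

```python
import sys

install_requires = [
    "scipy>=1.4.1",
    "pandas>=1.1.4",
    "scikit-learn>=0.24",
    # Environment marker: pip installs the backport only on Python 3.6.
    "dataclasses>=0.8 ; python_version=='3.6'",
]


def backport_marker_matches(version_info=sys.version_info) -> bool:
    """Mirror the python_version=='3.6' marker for a given interpreter."""
    return tuple(version_info[:2]) == (3, 6)
```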

markharley · 2022-10-02T22:06:07Z

Was looking at the failed build and noticed that there were some import issues.

Python 3.6 doesn't include data classes, but there is a backport that could be added to the build script (and setup).

Also saw that importing holidays fails. That's not part of the [test] requirements in setup.py (see https://github.com/markharley/FLAML/blob/fafd6524c9466d8ab9c9049bb6102c34cd5819ec/setup.py#L91).

It sounds like new tests were added for time series tasks? Should the requirements in [ts_forecast] and [forecast] also be included in [test] in that case?

The holidays are only used in time series forecasting which is optional. Make sure the import statement is invoked only under the time series environment. See https://github.com/microsoft/FLAML/blob/main/flaml/automl.py#L1096 as an example for nlp (another optional environment).

Hey! Thanks for the review @liususan091219. I've applied your relative import suggestions. On holidays, it turns out that it isn't called outside of tests at present, so I've removed the offending test and redundant code paths.

markharley · 2022-10-02T22:15:55Z

Would it be more reasonable to put the "nlp" and "time_series" folders into the "automl" folder?

Hey! I think we ended up with this layout to avoid some circular imports, but I'll have another try at refactoring these into the automl subpackage 😄

sonichi · 2022-10-02T23:53:44Z

Some checks failed. I wonder if it would be easier to break the PR down into smaller PRs. For example, the first PR could just create the automl subpackage and move files to their corresponding locations. Otherwise this PR will take a long time to merge and you will often get conflicts.

qingyun-wu · 2022-10-03T01:25:31Z

Thank you @markharley! We do need to reorganize the whole structure of the flaml folder now that an automl folder is being added. I am attaching a proposal for the new structure (covering all content in flaml, not just the changes in this PR).

In this PR, perhaps you can just put the automl-related .py files and the time_series folder in the right place. We can come up with a plan for the other changes (and perhaps also discuss this proposed structure in the maintainer meeting on 10/10).

automl
- nlp
- time_series
  automl.py
  data.py
  ml.py
  model.py
  train_log.py
  config.py
  [and other files that need to be added in the automl folder]
tune
- searcher
- scheduler
  [and existing files in the tune folder]
onlineml
default

sonichi · 2022-10-17T21:59:27Z

flaml/automl/task.py

+    def __init__(
+        self,
+        task_name: str,
+        X_train: Union[np.ndarray, pd.DataFrame],
+        y_train: Union[np.ndarray, pd.DataFrame, pd.Series],
+    ):


The constructor here indicates that a Task object needs to know about X_train and y_train when it's constructed.
The implementation indicates that no reference to the dataset is stored inside the Task object.
Why? What's the relation between a Task object and a dataset exactly?

For example, at this stage we could infer whether it's a binary or multi-category classification, and whether it's a univariate or panel regression, so the user wouldn't have the hassle of specifying that.
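
A minimal sketch of that kind of inference, assuming plain Python labels rather than the numpy/pandas inputs in the constructor above. `infer_classification_kind` is a hypothetical helper, not part of this PR.

```python
from typing import Iterable


def infer_classification_kind(y_train: Iterable) -> str:
    """Guess "binary" or "multiclass" from the distinct training labels."""
    n_classes = len(set(y_train))
    if n_classes < 2:
        raise ValueError("need at least two distinct labels")
    return "binary" if n_classes == 2 else "multiclass"
```

A constructor that receives y_train could run a check like this once and keep only the result, which would be consistent with the Task object not storing a reference to the dataset.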

qingyun-wu · 2023-03-01T16:24:14Z

flaml/automl/task.py

+
+    def __init__(
+        self,
+        task_name: str,


Needs documentation on the allowed task_name values?
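
One possible shape for that documentation, as a sketch only. The `ALLOWED_TASK_NAMES` tuple is an assumed, illustrative subset and not the full list of task names FLAML accepts; the validation step is also an assumption.

```python
class Task:
    """Sketch of a documented constructor; the value list is illustrative."""

    # Assumed names for illustration, not the full set FLAML accepts.
    ALLOWED_TASK_NAMES = ("classification", "regression", "ts_forecast")

    def __init__(self, task_name: str):
        """
        Args:
            task_name: One of ALLOWED_TASK_NAMES, e.g. "classification".

        Raises:
            ValueError: If task_name is not a recognised task.
        """
        if task_name not in self.ALLOWED_TASK_NAMES:
            raise ValueError(
                f"Unknown task {task_name!r}; "
                f"expected one of {self.ALLOWED_TASK_NAMES}"
            )
        self.name = task_name
```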

qingyun-wu · 2023-03-01T16:27:47Z

setup.py

    "scipy>=1.4.1",
    "pandas>=1.1.4",
    "scikit-learn>=0.24",
+    "dataclasses>=0.8 ; python_version=='3.6'",


Suggested change

-    "dataclasses>=0.8 ; python_version=='3.6'",
+    "dataclasses>=0.8 ; python_version>='3.6'",

markharley and others added 30 commits August 11, 2022 09:41

Add AutoML factory

81cf07b

This allows us to swap out the generic AutoML logic for the time series specific when the task is in TS_FORECASTREGRESSION while maintaining backwards compatibility with the current flaml.automl public interface.

WIP

1f3f816

Factor out default estimator selection to flaml.data

049861b

Add tasks entity

1d08007

These will be used to separate task specific logic from the main AutoML entrypoint class

Merge branch 'time-series-extension' of https://github.com/markharley…

165e28e

…/FLAML into time-series-extension

Move TS specific data prep to AutoMLTS

3bf7b7c

Add time series data model

6015088

Pre-pull commit

a42e9dd

Merge branch 'time-series-extension' of https://github.com/markharley…

1c921d8

…/FLAML into time-series-extension

Another step towards refactor - test_forecast.py passes

4c941cd

Prepare TS models for receiving TimeSeriesDataset's

a2f262d

bugfix

c1f1b20

Add time series data object

0d4d6b2

It isn't pretty, but it seems to get to the model now

Merge branch 'time-series-extension' of https://github.com/markharley…

36877c5

…/FLAML into time-series-extension

All forecast tests pass except test_numpy; no Prophet

f86396e

Orbit is integrated, with degenerate search space. Prophet untested, …

5871c32

…numpy forecast test fails

All ts tests except numpy pass, no prophet

4886a66

Nicer variant of the Task class, still no prophet

366c038

Factored out more stuff to the Task class; test_forecast passes excep…

e074599

…t for CV

Further factoring out to use the task class throughout; test_forecast…

ae6a1d4

….py passes except for numpy and prophet

Further test fixes for the task class, test_forecast.py passes with p…

4448f35

…rophet, apart from test_numpy

Multiscale decomposition of a dataframe, with test

eaf7a36

Tweaks to TimeSeriesDataset as part of multiscale progress

ef2d5af

First full cycle of multiscale ARIMA, with tests

95cee71

Fix test_forecast

a66ac9f

Minor fix, now model actually produces reasonable-ish output

2e4aee3

Testing out ARIMA parameters

64e4f25

Tidy up time series-related dir structure

2580f88

Tidy up time series-related dir structure, ts tests pass like before

1751e38

Fix minor bug in DataTransformerTS; 2 tests fail

50114b1

sonichi reviewed Sep 23, 2022
