@skyeeiskowitz skyeeiskowitz commented Jan 26, 2021

Data Input

  • Removed the functions ingest_data and egest_data and replaced them with simpler preprocessing and postprocessing primitives: pyteller.primitives.preprocessing.format_data.json and pyteller.primitives.postprocessing.reformat_data.json (under pyteller/primitives/jsons/)

Pipeline Outputs

  • Added a default outputs block to pipelines, with forecast and actual as the outputs
  • fit and forecast outputs are dictionaries whose keys are the output names specified in the pipeline JSON
  • Added a tutorials folder with Python notebooks: one that steps through a pipeline and one that tunes a pipeline
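The outputs block described above might look like the following in a pipeline JSON. This is an illustrative sketch in the MLBlocks style; the "..." variable references are placeholders, and only the output names forecast and actual come from this PR:

```json
{
    "primitives": [
        "..."
    ],
    "outputs": {
        "default": [
            {"name": "forecast", "variable": "..."},
            {"name": "actual", "variable": "..."}
        ]
    }
}
```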

Benchmarking

  • Added a new benchmarking function that can run on Dask
  • The function produces detailed output and summary output for 3 pipelines and 11 signals

Style

  • Cleaned up code style
  • Added logging instead of print statements
  • Fixed the plotting function in utils.py
  • Updated the documentation to the Sphinx theme
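The switch from print statements to logging mentioned above can be sketched as follows (the function and module names are illustrative, not pyteller's actual code):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def fit_pipeline(name):
    # previously: print('Fitting pipeline ' + name)
    logger.info('Fitting pipeline %s', name)
    return name


fit_pipeline('pyteller.LSTM.LSTM')
```

Unlike print, the logging call can be silenced or redirected per-module by configuring the logger, without touching the library code.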

@skyeeiskowitz skyeeiskowitz changed the title Fixes to API, primitives to format data, optional 'visualization' outputs, and outputs from fit method Fixes to API, added primitives to format data, benchmarking function, tuning a pipeline Aug 3, 2021
@dyuliu dyuliu assigned sarahmish and unassigned sarahmish Sep 23, 2021
@dyuliu dyuliu requested review from dyuliu and removed request for pvk-developer September 23, 2021 13:27
@sarahmish sarahmish requested review from dyuliu and sarahmish and removed request for dyuliu and sarahmish September 23, 2021 13:27
@dyuliu dyuliu requested review from sarahmish and removed request for dyuliu September 23, 2021 14:09
in the subdirectories inside the [pyteller/pipelines](pyteller/pipelines) folder.

This is the list of pipelines available so far, which will grow over time:
# Quickstart

Does pyteller still support multiple input options? If yes, perhaps create a separate file (e.g., data_format.md) to introduce different input options and give some examples?
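To illustrate the kind of input options such a data_format.md could cover, here is a sketch of the same series in long form (one row per entity/timestamp observation) versus flat form (one column per entity). The column names are illustrative, not pyteller's required schema:

```python
import pandas as pd

# long form: one row per (entity, timestamp) observation
long_df = pd.DataFrame({
    'timestamp': [1, 1, 2, 2],
    'entity': ['A', 'B', 'A', 'B'],
    'value': [10.0, 20.0, 11.0, 21.0],
})

# flat (wide) form: one column per entity
flat_df = long_df.pivot(index='timestamp', columns='entity', values='value')
print(flat_df)
```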

## Releases

In every release, we run the pyteller benchmark and maintain an up-to-date [leaderboard](leaderboard.md), which can also be found in this [summary Google Sheets document](https://docs.google.com/spreadsheets/d/1OPwAslqfpWvzpUgiGoeEq-Wk_yK-GYPGpmS7TwEaSbw/edit?usp=sharing).

It is worthwhile to report some aggregated numbers such as mean runtime, failed times, average performance scores, in the summary file. We can refer to ORION's summary report (link).
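The kind of aggregation suggested here could be sketched with pandas over the detailed results table. The column names are taken from the detailed CSV in this PR; the values below are made up for illustration:

```python
import pandas as pd

# a toy stand-in for the detailed benchmark results
details = pd.DataFrame({
    'pipeline': ['arima', 'arima', 'lstm', 'lstm'],
    'MAPE': [0.12, 0.30, 0.25, None],
    'elapsed': [3.1, 2.9, 40.2, 5.0],
    'status': ['OK', 'OK', 'OK', 'ERROR'],
})

# aggregate per pipeline: mean runtime, failure count, average score
summary = details.groupby('pipeline').agg(
    mean_runtime=('elapsed', 'mean'),
    failed=('status', lambda s: (s != 'OK').sum()),
    avg_MAPE=('MAPE', 'mean'),
)
print(summary)
```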

datasets = ['a10', 'gasoline', 'calls']

In [5]: scores = benchmark(pipelines=pipelines, datasets=datasets, metrics=metrics, rank='MAPE')

results = benchmark(pipelines=pipelines, datasets=datasets, metrics=metrics, output_path=output_path, rank='MASE')

Benchmark will actually create many tasks to run. Do we save the meta information of these tasks somewhere? (ideas from Cardea and LM project)
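One possibility for keeping the meta information of the generated tasks, sketched here with illustrative fields (run_id and elapsed appear in the detailed CSV of this PR; the helper itself is hypothetical):

```python
import json
import uuid
from datetime import datetime, timezone


def make_task_meta(pipeline, dataset, signal):
    """Record what a benchmark task will run, before dispatching it."""
    return {
        'run_id': str(uuid.uuid4()),
        'pipeline': pipeline,
        'dataset': dataset,
        'signal': signal,
        'created': datetime.now(timezone.utc).isoformat(),
    }


meta = make_task_meta('pyteller.LSTM.LSTM', 'FRED', 'a10')
print(json.dumps(meta, indent=2))
```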

@@ -0,0 +1,8 @@
The current pyteller benchmark was run on 11 signals for 3 pipelines

It's more convenient to merge leaderboard.md to README.md either in the root directory or the current benchmark directory.

@@ -0,0 +1,34 @@
dataset,pipeline,signal,prediction length,iteration,MAE,MSE,RMSE,MASE,sMAPE,MAPE,under,over,strategy,lstm_1_units,dropout_1_rate,lstm_2_units,dropout_2_rate,window_size,p,d,q,status,elapsed,run_id

Why not just use the release number to name the file? For example, 0.1.0_summary.csv and 0.1.0_details.csv.
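The release-based naming suggested here could be derived from the package version string, e.g. (shown with a placeholder version; in practice this would presumably come from something like pyteller's `__version__`):

```python
def results_filenames(version):
    """Build release-stamped names for the benchmark output files."""
    return f'{version}_summary.csv', f'{version}_details.csv'


summary_name, details_name = results_filenames('0.1.0')
print(summary_name, details_name)
```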

@@ -0,0 +1,6 @@
pyteller.metrics.over\_pred

Are these files generated automatically? Should we include it in a PR?


The User Guides section covers different topics about Orion usage:
The User Guides section covers different topics about python usage:


Pyteller?


@sarahmish sarahmish left a comment


Great work! A lot of changes are proposed in this PR; there are some comments that I would recommend addressing before merging. Most of my comments are of a format/style nature.

General thoughts I have about pyteller:

  • rethink where data should be stored.
  • how do we store pipelines? According to the PR we have the path pyteller/pipelines/pyteller/; why not keep it simpler: pyteller/pipelines/?
  • edit the .gitignore to ignore all automatically generated files under api/.


Time series forecasting using MLPrimitives



I would add the following as well to be clear about where we are in the project

- License: [MIT](https://github.com/signals-dev/pyteller/blob/master/LICENSE)
- Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)

Comment on lines +4 to +8
| pipeline | Percentage of Times Beat ARIMA |
| -------------------------------- | ------------------------------ |
| pyteller.ARIMA.arima | 0 |
| pyteller.LSTM.LSTM | 36.3636364 |
| pyteller.persistence.persistence | 18.1818182 |

Interesting table for a leaderboard. What does the percentage mean in this case? I find it hard to interpret. In Orion we aggregate the score over all signals per dataset and compute in how many datasets the pipeline outperformed ARIMA. I gather you are using a percentage here because you are computing it per signal?
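Under the per-signal interpretation described here, the percentage could be computed like this. The scores below are made up; with 4 wins out of 11 signals, this reproduces a figure like the 36.36% seen in the table:

```python
import pandas as pd

# one MAPE score per (pipeline, signal); lower is better
scores = pd.DataFrame({
    'signal': list(range(11)) * 2,
    'pipeline': ['arima'] * 11 + ['lstm'] * 11,
    'MAPE': [0.2] * 11 + [0.1] * 4 + [0.3] * 7,
})

# percentage of signals on which the pipeline beat ARIMA
wide = scores.pivot(index='signal', columns='pipeline', values='MAPE')
beat_arima = 100 * (wide['lstm'] < wide['arima']).mean()
print(beat_arima)
```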

@@ -0,0 +1,6 @@
pyteller.Pyteller.evaluate

we should not include any file from api/*, to avoid this from happening add the following to .gitignore

# Sphinx documentation
docs/_build/
docs/**/api

1. Load the data
----------------

Here is an example of loading the **Alabama Weather** demo data which has multiple entities in long form:

in data.rst we do not introduce the difference between flat form and long form. I would either omit long form here or add a section in data.rst to explain the supported data formats.

from pyteller.core import Pyteller
pipeline = 'pyteller.LSTM.LSTM'

I'm wondering if we can simplify this to pipeline = 'LSTM' and generally to all other places as well
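The simplification to pipeline = 'LSTM' could be supported by a small resolver that falls back to the fully qualified name. The alias table and helper below are illustrative, not pyteller's actual API; only the full pipeline names appear in this PR:

```python
# hypothetical alias table mapping short names to full pipeline names
PIPELINE_ALIASES = {
    'LSTM': 'pyteller.LSTM.LSTM',
    'ARIMA': 'pyteller.ARIMA.arima',
    'persistence': 'pyteller.persistence.persistence',
}


def resolve_pipeline(name):
    """Accept either a short alias or an already-qualified pipeline name."""
    return PIPELINE_ALIASES.get(name, name)


print(resolve_pipeline('LSTM'))
print(resolve_pipeline('pyteller.LSTM.LSTM'))
```

Both calls resolve to the same full name, so existing code using qualified names keeps working.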

@@ -0,0 +1,46 @@
{

this file is a duplicate I think?

@@ -0,0 +1,61 @@
{

are we supposed to have 2 files of each?

Comment on lines +50 to +59
# convert index to datetime
# if index.dtype == 'float' or index.dtype == 'int':
# index = pd.to_datetime(index.values * 1e9)
# else:
# index = pd.to_datetime(index)
#
# if actuals[time_column].dtypes == 'float' or actuals[time_column].dtypes == 'int':
# actuals[time_column] = pd.to_datetime(actuals[time_column] * 1e9)
# else:
# actuals[time_column] = pd.to_datetime(actuals[time_column])

clean
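A cleaned-up version of the commented-out conversion might look like the following sketch, treating numeric values as epoch seconds, as the `* 1e9` (seconds to nanoseconds) in the original suggests; the function name is illustrative:

```python
import pandas as pd


def to_datetime_index(values):
    """Convert epoch seconds (int/float) or date-like values to datetimes."""
    values = pd.Index(values)
    if values.dtype.kind in 'if':  # int or float: epoch seconds -> nanoseconds
        return pd.to_datetime(values * 1e9)
    return pd.to_datetime(values)


idx = to_datetime_index([1609459200, 1609545600])
print(idx)
```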



def plot(dfs, output_path, labels=None):
def plot_forecast(dfs, output_path=None, labels=['actuals', 'predicted'], frequency=None):

we typically try to avoid mutable objects as the default arguments of a function, I recommend setting it to None and then adding the default if it is None.

def plot_forecast(dfs, output_path=None, labels=None, frequency=None):
    labels = labels or ['actuals', 'predicted']
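The reason for this convention is that a mutable default is created once and shared across calls. A minimal demonstration of the pitfall and the None-sentinel fix:

```python
def append_bad(item, acc=[]):
    # the SAME list object is reused across every call
    acc.append(item)
    return acc


def append_good(item, acc=None):
    # fresh list on each call unless the caller passes one
    acc = acc if acc is not None else []
    acc.append(item)
    return acc


assert append_bad('a') == ['a']
assert append_bad('b') == ['a', 'b']  # surprise: the earlier call leaked in
assert append_good('a') == ['a']
assert append_good('b') == ['b']      # fresh list each call
```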

'Keras>=2.1.6,<2.4',
'mlblocks>=0.4.0,<0.5',
'mlprimitives>=0.3.0,<0.4',
'pandas>=1,<2',

I think 'pandas>=1,<2' is already specified in mlprimitives, so we can skip it here. Similarly for 'scikit-learn>=0.21'.
