# Fixes to API, added primitives to format data, benchmarking function, tuning a pipeline (#3)
<i>An open source project from Data to AI Lab at MIT.</i>
</p>

<!-- Uncomment these lines after releasing the package to PyPI for version and downloads badges -->
<!--[](https://pypi.python.org/pypi/pyteller)-->
<!--[](https://pepy.tech/project/pyteller)-->
# pyteller

Time series forecasting using MLPrimitives

> **Collaborator comment:** I would add the following as well to be clear about where we are in the project.

- Documentation: https://signals-dev.github.io/pyteller
- Homepage: https://github.com/signals-dev/pyteller

# Overview

pyteller is a time series forecasting library built with the end user in mind.

## Table of Contents

* [I. Data Format](#data-format)
    * [I.1 Input](#input)
    * [I.2 Output](#output)
    * [I.3 Datasets in the library](#datasets-in-the-library)
* [II. pyteller Pipelines](#pyteller-pipelines)
    * [II.1 Current Available Pipelines](#current-available-pipelines)
* [III. Install](#install)
* [IV. Quick Start](#quick-start)
# Data Format

## Input

The expected input to pyteller pipelines is a .csv file with data in one of the following formats:
### Targets Table

#### Option 1: Single Entity (Academic Form)

The user must specify the following:
* `timestamp_col`: the **string** denoting which column contains the **pandas timestamp** or **python datetime** objects corresponding to the time at which the observation is made
* `target_signal`: an **integer** or **float** column with the observed target values at the indicated timestamps

This is an example of such a table, where the `timestamp_col` is 'timestamp' and the `target_signal` is 'value':

| timestamp   | value |
|-------------|-------|
| 7/1/14 1:00 | 6210  |
| 7/1/14 1:30 | 4656  |
| 7/1/14 2:00 | 3820  |
| 7/1/14 2:30 | 2873  |
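As a quick illustration (plain pandas, not part of the pyteller API), the academic-form table above can be built and its `timestamp_col` parsed into pandas timestamp objects like this:

```python
import pandas as pd

# Build the academic-form targets table from the example above
df = pd.DataFrame({
    'timestamp': ['7/1/14 1:00', '7/1/14 1:30', '7/1/14 2:00', '7/1/14 2:30'],
    'value': [6210, 4656, 3820, 2873],
})

# Parse the timestamp_col into pandas timestamp objects, as pyteller expects
df['timestamp'] = pd.to_datetime(df['timestamp'])
```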
#### Option 2: Multiple Entity (Flat Form)

The user must specify the following:
* `timestamp_col`: the **string** denoting which column contains the **pandas timestamp** or **python datetime** objects corresponding to the time at which the observation is made
* `entities`: the **list** denoting the columns the user wants to make forecasts for

This is an example of such a table, where the `timestamp_col` is 'timestamp' and the `entities` can be ['taxi 1', 'taxi 3']:

| timestamp   | taxi 1 | taxi 2 | taxi 3 |
|-------------|--------|--------|--------|
| 7/1/14 1:00 | 6210   | 510    | 6230   |
| 7/1/14 1:30 | 4656   | 5666   | 656    |
| 7/1/14 2:00 | 3820   | 2420   | 3650   |
| 7/1/14 2:30 | 2873   | 1373   | 3640   |
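As a sketch (plain pandas, not the pyteller API), selecting the `entities` columns from a flat-form table, or reshaping the whole table into the long form of Option 3, looks like this:

```python
import pandas as pd

flat = pd.DataFrame({
    'timestamp': ['7/1/14 1:00', '7/1/14 1:30'],
    'taxi 1': [6210, 4656],
    'taxi 2': [510, 5666],
    'taxi 3': [6230, 656],
})

# Keep only the entities the user asked to forecast for
entities = ['taxi 1', 'taxi 3']
subset = flat[['timestamp'] + entities]

# The same data in long form (Option 3): one row per (timestamp, entity)
long_form = flat.melt(id_vars='timestamp', var_name='entity', value_name='value')
```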
#### Option 3: Multiple Entity (Long Form)

The user must specify the following:
* `timestamp_col`: the **string** denoting which column contains the **pandas timestamp** or **python datetime** objects corresponding to the time at which the observation is made
* `entity_col`: the **string** denoting which column contains the entities you will separately make forecasts for
* `target_signal`: the **string** denoting which column contains the observed target values that you want to forecast

This is an example of such a table, where the `timestamp_col` is 'timestamp', the `entity_col` is 'region', and the `target_signal` is 'demand':

| timestamp     | region | demand  | Temp  | Rain |
|---------------|--------|---------|-------|------|
| 9/27/20 21:20 | DAYTON | 1841.6  | 65.78 | 0    |
| 9/27/20 21:20 | DEOK   | 2892.5  | 75.92 | 0    |
| 9/27/20 21:20 | DOM    | 11276   | 55.29 | 0    |
| 9/27/20 21:20 | DPL    | 2113.7  | 75.02 | 0.06 |
| 9/27/20 21:25 | DAYTON | 1834.1  | 65.72 | 0    |
| 9/27/20 21:25 | DEOK   | 2880.2  | 75.92 | 0    |
| 9/27/20 21:25 | DOM    | 11211.7 | 55.54 | 0    |
| 9/27/20 21:25 | DPL    | 2086.6  | 75.02 | 0.06 |
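Conversely (again plain pandas, not the pyteller API), a long-form table can be pivoted so that each entity becomes its own column, recovering the flat form of Option 2:

```python
import pandas as pd

long_form = pd.DataFrame({
    'timestamp': ['9/27/20 21:20', '9/27/20 21:20', '9/27/20 21:25', '9/27/20 21:25'],
    'region': ['DAYTON', 'DEOK', 'DAYTON', 'DEOK'],
    'demand': [1841.6, 2892.5, 1834.1, 2880.2],
})

# One column per entity, indexed by timestamp
flat = long_form.pivot(index='timestamp', columns='region', values='demand')
```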
## Output

The output of the pyteller pipelines is another table that contains the timestamps and the forecast value(s), matching the format of the input targets table.
## Datasets in the library

For development and evaluation of pipelines, we include the following datasets:

#### NYC taxi data
* Found on the [nyc website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), or the processed version maintained by Numenta [here](https://github.com/numenta/NAB/tree/master/data)
* No modifications were made from the Numenta version

#### Wind data
* Found on [kaggle](https://www.kaggle.com/sohier/30-years-of-european-wind-generation/metadata)

#### Weather data
* Maintained by Iowa State University's [IEM](https://mesonet.agron.iastate.edu/request/download.phtml?network=ILASOS)
* The downloaded data was from the selected network of 8A0 Albertville and the selected date range was 1/1/16 0:15 - 2/16/16 0:55

#### Traffic data
* Found on [Caltrans PeMS](http://pems.dot.ca.gov/?dnode=Clearinghouse)
* After downloading the FasTrak 5-Minute .txt files, the files for each day from 1/1/13 - 1/8/18 were compiled into one .csv file

#### Energy data
* Found on [kaggle](https://www.kaggle.com/robikscube/hourly-energy-consumption/metadata)
* No modifications were made after downloading pjm_hourly_est.csv
* We also use PJM electricity demand data found [here](https://dataminer2.pjm.com/feed/inst_load)
# pyteller Pipelines

pyteller uses MLPrimitives to build easy-to-use forecasting pipelines.

## Current Available Pipelines

The pipelines are included as **JSON** files, which can be found
in the subdirectories inside the [pyteller/pipelines](pyteller/pipelines) folder.

This is the list of pipelines available so far, which will grow over time:

> **Collaborator comment:** Does pyteller still support multiple input options? If yes, perhaps create a separate file (e.g., …)
| name        | location | description |
|-------------|----------|-------------|
| Persistence | [pyteller/pipelines/sandbox/persistence](pyteller/pipelines/sandbox/persistence) | uses the latest input to the model as the next output |
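As a minimal, hypothetical sketch (the spec and primitive name below are illustrative, not the actual persistence pipeline), a pipeline JSON in the MLBlocks style can be written and loaded like this:

```python
import json
import os
import tempfile

# A minimal, hypothetical pipeline spec in the MLBlocks JSON style;
# the primitive name is made up for illustration
spec = {'primitives': ['pyteller.primitives.Persistence']}

path = os.path.join(tempfile.mkdtemp(), 'persistence.json')
with open(path, 'w') as f:
    json.dump(spec, f)

with open(path) as f:
    loaded = json.load(f)
# loaded['primitives'] lists the primitives the pipeline chains together
```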
# Install
## Requirements

**pyteller** has been developed and tested on [Python 3.5, 3.6, 3.7 and 3.8](https://www.python.org/downloads/).

Also, although it is not strictly required, the usage of a [virtualenv](https://virtualenv.pypa.io/en/latest/)
is highly recommended in order to avoid interfering with other software installed in the system
in which **pyteller** is run.

These are the minimum commands needed to create a virtualenv using python3.6 for **pyteller**:

```bash
pip install virtualenv
virtualenv -p $(which python3.6) pyteller-venv
```

Afterwards, you have to execute this command to activate the virtualenv:

```bash
source pyteller-venv/bin/activate
```

Remember to execute it every time you start a new console to work on **pyteller**!
<!-- Uncomment this section after releasing the package to PyPI for installation instructions
## Install from PyPI

After creating the virtualenv and activating it, the easiest and recommended way to install **pyteller** is using [pip](https://pip.pypa.io/en/stable/):

```bash
pip install pyteller
```

This will pull and install the latest stable release from [PyPI](https://pypi.org/).
-->
## Install from source

With your virtualenv activated, you can clone the repository and install it from
source by running `make install` on the `stable` branch:

```bash
git clone git@github.com:signals-dev/pyteller.git
cd pyteller
git checkout stable
make install
```
## Install for Development

If you want to contribute to the project, a few more steps are required to make the project ready
for development.

Please head to the [Contributing Guide](https://signals-dev.github.io/pyteller/contributing.html#get-started)
for more details about this process.

# Quick Start

In this short tutorial we will guide you through a series of steps that will help you
get started with **pyteller**.

## 1. Load the data

The expected input to pyteller pipelines is a .csv file with target data. Depending on the
format of the data, the user should specify which columns contain the relevant information:

* `time_column`: Name of the column containing the timestamps.
* `target_column`: Name of the column containing the target values.
* `targets`: List of the subset of targets to extract.
* `entity_column`: Name of the column containing the entities.
* `entities`: Subset of entities to extract.

Here is an example of loading the [Alabama Weather](pyteller/data/AL_Weather.csv) demo data,
which has multiple entities in long form:

```python3
from pyteller.data import load_data

current_data, input_data = load_data('AL_Weather')
```

`current_data` will be used to fit the pipeline and `input_data` to forecast. Both are dataframes:
| station | valid       | tmpf | dwpf | relh  | drct |
|---------|-------------|------|------|-------|------|
| 8A0     | 1/1/16 0:15 | 41   | 39.2 | 93.24 | 350  |
| 4A6     | 1/1/16 0:15 | 41   | 32   | 70.08 | 360  |
| 8A0     | 1/1/16 0:35 | 39.2 | 37.4 | 93.19 | 360  |
| 4A6     | 1/1/16 0:35 | 41   | 32   | 70.08 | 360  |
| 8A0     | 1/1/16 0:55 | 37.4 | 37.4 | 100   | 360  |
| 4A6     | 1/1/16 0:55 | 39.2 | 32   | 75.16 | 350  |
## 2. Fit the pipeline

Once we have the data, create an instance of the `Pyteller` class, where the input arguments are
the forecast settings and the column headers of the data. In this example we use the `lstm`
pipeline and set the training epochs to 20.

```python3
from pyteller import Pyteller

pipeline = 'pyteller/pipelines/pyteller/LSTM/LSTM.json'

hyperparameters = {
    'keras.Sequential.LSTMTimeSeriesRegressor#1': {
        'epochs': 20
    }
}

pyteller = Pyteller(
    pipeline=pipeline,
    time_column='valid',
    targets='tmpf',
    entity_column='station',
    entities='8A0',
    pred_length=12,
    offset=0,
    hyperparameters=hyperparameters
)
```
Now call the `pyteller.fit` method to fit the pipeline to the data:

```python3
pyteller.fit(current_data)
```
## 3. Forecast

To make a forecast, the user calls the `pyteller.forecast` method:

```python3
output = pyteller.forecast(data=input_data)
```

The output is a `dictionary` which includes the `forecasts` and `actuals` dataframes. Here is `output['forecasts']`:

```
timestamp        8A0
2/4/16 18:15  42.800
2/4/16 18:35  42.800
2/4/16 18:55  44.800
```
## 4. Evaluate

To see metrics of the forecast accuracy, the user calls the `pyteller.evaluate` method:

```python3
scores = pyteller.evaluate(
    test_data=output['actuals'],
    forecast=output['forecasts'],
    metrics=['sMAPE', 'MAPE'])
```

The output is a dataframe of the scores:

```
       8A0
sMAPE  11.4
MAPE   11.7
```
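For reference, both metrics are straightforward to compute by hand. This is a stdlib-only sketch of the usual definitions (pyteller's own implementation may differ in details such as scaling conventions):

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 / len(actual) * sum(
        abs((a - f) / a) for a, f in zip(actual, forecast)
    )

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    return 100 / len(actual) * sum(
        2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)
    )

actual = [42.8, 42.8, 44.8]
forecast = [41.0, 43.5, 44.8]
```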
## Releases

In every release, we run a pyteller benchmark. We maintain an up-to-date leaderboard with the current scoring of the benchmarking procedure explained [here](benchmark).

Results obtained during the benchmarking process as well as previous benchmarks can be found
within the [benchmark/results](benchmark/results) folder as CSV files. In addition, you can find them in the [details Google Sheets document](https://docs.google.com/spreadsheets/d/1EQd2x4BPSYEs6KLLUKrxzY3e8TuysnYnaSYAsBiPwCA/edit?usp=sharing).

### Leaderboard

We summarize the results in the [leaderboard](benchmark/leaderboard.md) table. We showcase the percentage of times each pipeline wins over the ARIMA pipeline.

The summarized results can also be browsed in the following [summary Google Sheets document](https://docs.google.com/spreadsheets/d/1OPwAslqfpWvzpUgiGoeEq-Wk_yK-GYPGpmS7TwEaSbw/edit?usp=sharing).
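The win percentage in the leaderboard can be computed from a per-dataset results table. This is a hedged sketch with made-up numbers and an assumed column layout (the real CSVs in benchmark/results may be shaped differently):

```python
import pandas as pd

# Hypothetical benchmark results: one sMAPE score per (dataset, pipeline)
results = pd.DataFrame({
    'dataset': ['taxi', 'taxi', 'energy', 'energy'],
    'pipeline': ['lstm', 'arima', 'lstm', 'arima'],
    'sMAPE': [10.2, 11.4, 9.8, 9.1],
})

scores = results.pivot(index='dataset', columns='pipeline', values='sMAPE')

# A pipeline "wins" on a dataset when its error is lower than ARIMA's
win_rate = (scores['lstm'] < scores['arima']).mean() * 100
```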
# What's next?

For more details about **pyteller** and all its possibilities
and features, please check the [documentation site](https://signals-dev.github.io/pyteller/).