Predicting anomaly probabilities for wind turbine data using cluster- and pattern-based semi-supervised models.
Cluster- and pattern-based models are used to detect anomalies in time-series sensor data. Detecting anomalous trends and behaviours helps operators identify problems early, reducing maintenance costs and extending turbine life.
Column names in the CSV files must match those in the original CSV files; column order is not relevant. Average active power columns are not needed and, if provided, are removed automatically.
The dataset used in this project is uploaded to Azure Blob Storage:
- Melancthon Wind Turbine time series data from 9 turbines with 44 features each.
The pipeline for this project is built on Microsoft Azure Machine Learning. To run the code in this project, upload the CSV file to Azure Blob Storage and note the dataset's SUBSCRIPTION_ID, RESOURCE_GROUP, WORKSPACE_NAME, and DATASET_NAME.
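These four values are exactly what the azureml-core SDK needs to load the dataset; a minimal sketch with placeholder values (the real snippet, with your values filled in, is shown on the dataset's Consume tab):

from azureml.core import Workspace, Dataset

# Placeholder values; copy the real ones from the dataset's Consume tab.
workspace = Workspace(subscription_id="<SUBSCRIPTION_ID>",
                      resource_group="<RESOURCE_GROUP>",
                      workspace_name="<WORKSPACE_NAME>")
dataset = Dataset.get_by_name(workspace, name="<DATASET_NAME>")
df = dataset.to_pandas_dataframe()  # turbine time series as a pandas DataFrame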
Steps to install required packages:
- Create a virtual environment on the Azure compute instance terminal (see the example after this list).
- Clone this repository to your compute instance and install the requirements:
pip install -r requirements.txt
- Install the anomatools package:
pip install git+https://github.com/Vincent-Vercruyssen/anomatools.git@master
pip install dtaidistance
- Install the PBAD package:
  - Clone the PBAD repository.
  - Build the code by running the setup.py file:
  cd src/utils/cython_utils/
  python setup.py build_ext --inplace
  - If you receive an error, run the command a second time.
  - Note the location of the "src" folder; it will be required in the config file.
- Activate the virtual environment where the installation was completed:
conda activate environment_name
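As an example for the first step above, a typical conda setup might look like this (the environment name and Python version are placeholders, not fixed by this project):

# Create and activate a fresh environment for the installation steps above.
conda create --name environment_name python=3.8
conda activate environment_name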
- Update the data acquisition configuration (a hypothetical sketch follows this list):
  - SUBSCRIPTION_ID: Set the String value you received from the Azure Dataset Blob (Consume tab).
  - RESOURCE_GROUP: Set the String value you received from the Azure Dataset Blob (Consume tab).
  - WORKSPACE_NAME: Set the String value you received from the Azure Dataset Blob (Consume tab).
  - DATASET_NAME: Set the String value you received from the Azure Dataset Blob (Consume tab).
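Assuming the data acquisition configuration is a plain Python-style key/value file (the exact format depends on this repository's config file), the entries would look roughly like:

# Hypothetical values; copy the real strings from the dataset's Consume tab.
SUBSCRIPTION_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
RESOURCE_GROUP = "my-resource-group"
WORKSPACE_NAME = "my-aml-workspace"
DATASET_NAME = "melancthon-turbine-data"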
- Update the data processing configuration (a hypothetical sketch follows this list):
  - GROUP_TURBINE_NAME_LIST: Set the list of turbine names to be extracted.
  - EXCLUDE_FROM_MEAN_LIST: Set the list of turbines to be excluded from the mean calculation (if imputation is being done).
  - FEATURE_ID: Set the feature suffix that defines the imputation target (if imputation is being done).
  - TURBINE_NAME: Set the turbine name to be imputed (if imputation is being done).
  - MODEL_NAME: Set the String value for the model name, either pbad or ssdo.
  - LABEL_LIST: Set the list specifying the turbine names and labels for each turbine.
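A hypothetical sketch of the data processing section, again assuming Python-style config values; the keys come from this README, but the turbine names, suffix, and labels are illustrative only:

# Hypothetical example values -- adapt to your own turbines and labels.
GROUP_TURBINE_NAME_LIST = ["1528-07", "1528-22", "1528-43"]
EXCLUDE_FROM_MEAN_LIST = ["1528-43"]           # skipped when computing the imputation mean
FEATURE_ID = "_temperature"                    # suffix selecting the imputation target
TURBINE_NAME = "1528-43"                       # turbine whose readings get imputed
MODEL_NAME = "pbad"                            # or "ssdo"
LABEL_LIST = [("1528-07", 0), ("1528-22", 0), ("1528-43", 1)]  # per-turbine labels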
- Update the model training configuration:
  - PBAD_SC_PATH: Set the path to the source code of the PBAD package.
  - EXCLUDE_COLUMNS: Set the list of columns to be excluded from the model fit. Add further features here if you want to exclude them from your experiment.
  - ESTIMATORS: Set the SSDO parameter for the number of base estimators in the Isolation Forest (see the sketch after this list).
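As a rough illustration of what ESTIMATORS controls, here is the equivalent knob on scikit-learn's IsolationForest (this is not the project's internal call, and the data is synthetic):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(1000, 5))  # stand-in feature matrix
forest = IsolationForest(n_estimators=100)           # 100 plays the role of ESTIMATORS
forest.fit(X)
scores = -forest.score_samples(X)                    # higher score = more anomalous

More base estimators give more stable anomaly scores at the cost of training time.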
- Update the eda configuration used for plotting the results (a sketch follows this list):
  - PLOT_FOLDER: Set the path of the folder in which to save the anomaly probability plots.
  - COLORSCALE: Set the colorscale for the plots. A list of options can be accessed here.
  - PLOT_OPTION: Set the option to either save or show the plots (show can be used in notebooks).
  - TURBINE_NAME: Set the turbine name to be plotted.
  - START_TIME: Set the plot start time.
  - END_TIME: Set the plot end time.
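Since the pipeline writes the plots as HTML files, Plotly is a natural fit; a minimal sketch of how COLORSCALE and PLOT_OPTION would be applied (the file and column names here are assumptions, not taken from this repository):

import pandas as pd
import plotly.express as px

# Illustrative input: a timestamp column plus one anomaly probability per row.
df = pd.read_csv("anomaly_probabilities.csv")
fig = px.scatter(df, x="timestamp", y="anomaly_probability",
                 color="anomaly_probability",
                 color_continuous_scale="Viridis",  # stands in for COLORSCALE
                 title="Anomaly probability over time")
fig.write_html("plots/anomaly_probability.html")    # PLOT_OPTION = save
# fig.show()                                        # PLOT_OPTION = show (notebooks)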
To run the pipeline, type one of the following commands in the Azure Machine Learning terminal at the root of the project:
- For PBAD without imputation:
python run_pipeline.py pbad
- For PBAD with imputation:
python run_pipeline.py pbad impute
- For SSDO without imputation:
python run_pipeline.py ssdo
- For SSDO with imputation:
python run_pipeline.py ssdo impute
The model will output anomaly probability score plots as HTML files in the defined folder, one plot for each feature.
Sample results for both models (Turbines 1528-07, 1528-22, and 1528-43).