Predictive Maintenance Modeler using XGBoost's Accelerated Failure Time Implementation. This is a survival analysis model (see explanation below), applied to the automotive or industrial context. This is a personal project inspired by a work project; please feel free to fork the repository or contact me if you have suggestions.
In many real-world settings we are interested in estimating the duration of time until an event occurs, whether that event be a mechanical failure in an automotive or industrial context, death in a biological sense, or a customer purchase of a good or service (churn). This type of modeling is known as survival analysis in statistics, reliability analysis in engineering, or duration analysis in economics.
Survival analysis seeks to model the time to a single event for an entity (e.g. a human or machine). However, in many real-life situations, entities can potentially experience multiple types of events: a person can die from cancer, a heart attack, or a car accident. When only one of these types of events can occur, they are referred to as competing events, in the sense that each event competes with the others to yield the observed event of interest, and the occurrence of one type of event precludes the occurrence of the others. Consequently, the probabilities of each event are referred to as competing risks.
The code in this repository was developed to model a single outcome for classical time-to-failure analysis. Competing events are currently not supported.
Fundamental to both classical survival and competing events models is the idea of censoring, i.e. the label is not fully observed. As explained in the XGBoost documentation, there are four kinds of censoring:
- Uncensored: The observation's true lifetime and event are observed.
- Left-censored: The observation's true lifetime is only partly observed, though the event is observed. For instance, you're provided with telemetry data for a machine that already has some wear, but no data from the beginning of its life. Over time, you see degradation and failure of the machine.
- Right-censored: The observation's true lifetime is not fully observed and no event is observed (though it is assumed to eventually happen). For example, you are provided telemetry data for machines' first 12 months of life. During that period some are observed to fail, while others do not. The observations with no observed failure are right-censored, as their true lifetimes lie to the right of the observed timeline, either because the study ended before an event or because the subject exited the study early (attrition).
- Interval-censored: A given observation exhibits both left- and right-censoring. This often happens in discrete-time panel data, where an observation is observed repeatedly at time1, time2, time3, etc. The Scania dataset is an example of this type of censoring.
Critically, it is assumed that censoring is uncorrelated with the outcome(s) of interest. That is, censored observations are assumed to have the same probability of experiencing a subsequent event as those observations whose event was observed (the so-called "non-informative censoring" assumption).
In automotive and manufacturing applications, we are frequently interested in anticipating failures before they occur. There are several approaches to this, sorted by increasing level of modeling sophistication:
- Reactive maintenance - perform maintenance only when a component breaks
- Preventative maintenance - perform maintenance on a fixed (time- or usage-based) schedule, regardless of the actual physical wear
- Condition-based maintenance - perform maintenance based on current wear indicators, though the actual remaining useful life (RUL) is unclear
- Predictive maintenance - perform maintenance using a forecast to estimate the RUL
Of these strategies, predictive maintenance is the most efficient, as it promises to replace parts or components just in time, optimizing their lifecycle.
The Python environment used for this project was created using the Conda package manager, which, if you don't already have it, can be downloaded here. This code has only been developed for local execution. It hasn't been configured for any cloud or Docker implementation, though this would be a logical extension.
This code and environment were developed on a Mac. If you also use macOS, you can create the same Python 3.11.8 environment simply by running:

```shell
conda env create -f environment.yaml
conda activate pdm_v2  # activates the environment
```

Windows users should instead create a fresh virtual environment:

```shell
conda create --name pdm_v2 python=3.12.2  # creates a new virtual environment
conda activate pdm_v2
```

and then install the following packages into the pdm_v2 virtual environment:
- conda install:
  - xgboost
  - hydra-core
  - pandas
  - scikit-learn
  - matplotlib
  - lifelines
  - optuna
  - scikit-survival
- pip install:
  - you can ignore these dependencies
This hasn't been tested on Windows, so you may need to experiment a bit.
Prior to running the code you must configure the training and data build file, which is done in ./configs/config.yaml. Currently, this is populated with the default values for the Scania dataset (explained below).
This config file contains the following arguments:
- `mode`: whether the modeler should be used for training or prediction (inference). Accepted values: [train, predict]
- `data`:
  - `data_path`: path to CSV training data. Code expects the CSV data type and will throw an error otherwise
  - `unit_identifier`: ID feature name
  - `time_identifier`: time feature name
  - `target_feature`: outcome feature name. Note - this feature is expected to be binary or boolean
  - `lag_length`: only applies to panel data. Number of previous time-series lags to add as new columns per feature. Time-series observations < `lag_length` will be filled as far back as possible, using the values for a given ID's first time period to fill the rest
  - `sampling_n`: maximum number of observations per ID to sample. This reduces the influence of statistical outliers (e.g. vehicles with many time-series observations). Set to 1 to convert panel data to a simple cross-section by keeping only the last observation per `unit_identifier`. Note - `sampling_n` must be >= `lag_length`
- `training_config`:
  - `test_size`: proportion of unique IDs to use in cross-validation. The test set is constructed assuming a panel structure, though this works equivalently for cross-sectional data
  - `hyperoptimize`: whether to hyperoptimize using Optuna. This employs the default Tree-structured Parzen Estimator
  - `n_trials`: number of hyperparameter trials to execute. Ignored if `hyperoptimize` is false
  - `save_model`: whether to save the trained XGBoost model as JSON at the end
- `predict_config`:
  - `model_path`: path to a gzipped directory containing previously-trained model artifacts
- `seed`: initializer value for the pseudo-random number generator
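For reference, a filled-in config.yaml following this schema might look like the sketch below. All values (paths, feature names, counts) are illustrative only, not the Scania defaults:

```yaml
mode: train

data:
  data_path: ./data/train.csv      # CSV only
  unit_identifier: vehicle_id
  time_identifier: time_step
  target_feature: failure          # binary/boolean outcome
  lag_length: 3
  sampling_n: 10                   # must be >= lag_length

training_config:
  test_size: 0.2
  hyperoptimize: true
  n_trials: 50
  save_model: true

predict_config:
  model_path: ./output/2024-01-01/12-00-00/model.tar.gz

seed: 42
```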
Note - don't alter the Hydra configuration information, as this would change the default destination for output logs and artifacts.
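The `lag_length` back-filling behavior described above can be sketched in pandas. This is a hypothetical illustration of the idea, not necessarily this repo's exact implementation; column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 2, 2],
    "time_step":  [1, 2, 3, 1, 2],
    "sensor":     [10.0, 12.0, 15.0, 7.0, 9.0],
})

lag_length = 2
# each unit's first observed value, used to back-fill early periods
first_value = df.groupby("vehicle_id")["sensor"].transform("first")
for k in range(1, lag_length + 1):
    lagged = df.groupby("vehicle_id")["sensor"].shift(k)
    df[f"sensor_lag{k}"] = lagged.fillna(first_value)
```

Each feature gains `lag_length` lagged copies per unit; rows earlier than the k-th period reuse the unit's first observation rather than holding missing values.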
Ensure the current working directory of your IDE is set to ./src, then run:

```shell
python -m pdm
```
Logging is streamed to the console during the train/predict procedure. At the end, a log file, figures, and model artifacts are output to a new folder with the datetime when training is initiated: ./output/<YYYY-MM-DD>/<HH-mm-SS>.
To see an example of this code applied to a multivariate time-series dataset from Scania trucks, see the scania example.