Testing metrics with Tensorboard

Selecting the best metrics for evaluating a model

This is a multi-output regression problem, thus we can only train/evaluate our models using regression losses/metrics.

With tf.keras we have the option of using the following built in losses/metrics:

A possible strategy for selecting a set of metrics which can objectively measure a model's performance could be as follows:

Plot training and validation error curves of the default DNN model in tensorboard for all of the above metrics.
Select a subset of metrics which are not being minimized to a sufficient degree (relative to others). The idea is to correspond the poor performance of the default model with the relevant metric values. Metrics which are already converging to a very low value are not useful and can be discarded.

Using the above strategy, we have four candidate metrics as follows: RootMeanSquaredError, MeanAbsoluteError, MeanAbsolutePercentageError, and MeanSquaredError.

We can now set any one of these four metrics as our loss function whilst also tracking the values of the other three in tensorboard.

Model A will be objectively better than Model B if Model A manages to minimize a majority of these metrics better than Model B.

Here's an example of the plots we obtain when setting each of these metrics (excluding MeanAbsolutePercentageError) as our loss function (orange line is train curve, blue line is validation curve):

RMSE loss:

alt text

MAE loss:

alt text

MSE loss:

alt text

The above plots suggest that RootMeanSquaredError, MeanAbsoluteError and MeanSquaredError are all good options for the loss function.

Overall, it is recommended to set any one of these three options as the loss whilst also simultaneously checking the curves of the other three metrics (including MeanAbsolutePercentageError) in tensorboard.

RootMeanSquaredError or MeanSquaredError is a good & popular choice when we want to minimize the average error.
MeanAbsoluteError is a good choice when we want to optimize the median error i.e. there should be as many positive errors as there are negative errors.

Example of validation strategy

Comparing train curves of default UNet model trained on 900 events (orange) against the same model trained on 4000 events (red) using RMSE loss:

alt text

Comparing validation curves of default UNet model trained on 900 events (dark blue) against the same model trained on 4000 events (light blue) using RMSE loss:

alt text

As we can see from the above plots, an objectively better model will be produce lower error curves across a majority of the aforementioned four metrics. Since different metrics measure errors differently, some will be minimized to a greater degree than others and thus show much starker differences in a comparison.