
Evaluation


This page documents the evaluation method used for scoring the models. First we describe the targets and what an error means for each of them. Then we describe how the true values are collected and used to score the models.

Before going any further, here are definitions for a few terms we will encounter repeatedly:

  1. An epiweek is the fundamental unit of time based on MMWR weeks and is uniquely identified using a combination of a year and an MMWR week. For example, 201348 is the epiweek representing MMWR week 48 of year 2013.
  2. An epidemic season is an ordered set of epiweeks starting at 20xx30 and ending at 20yy29, where 20yy is 20xx + 1. A season is usually represented using both consecutive years, e.g. 2013-2014, or using just the first year, like 2013 (see the sketch after this list). Because the number of MMWR weeks in a year can be either 52 or 53, a season can also have either 52 or 53 epiweeks (e.g. season 2014-2015 has 53 weeks since year 2014 has 53 MMWR weeks).
  3. A target is something that models try to predict at each time step. Targets which specify properties of a season, like the peak week, are seasonal targets. On the other hand, targets like n weeks ahead wili% are weekly targets.
  4. There are 11 geographical regions: 10 for the HHS regions and 1 for the complete nation.
  5. Weighted influenza-like illness index, wili%, is the metric used in the time series. It is defined as the percentage of outpatient doctor visits for influenza-like illness, weighted by state population. This page on CDC.gov describes it in more detail.
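
To make the season convention concrete, here is a minimal sketch (a hypothetical helper, not part of this repository) that maps an epiweek, passed as an integer like 201348, to the first year of its season:

```js
// Map an epiweek (e.g. 201348) to the first year of its season.
// Weeks 30 through 52/53 belong to the season starting that year;
// weeks 01 through 29 belong to the season that started the year before.
function seasonOf (epiweek) {
  const year = Math.floor(epiweek / 100)
  const week = epiweek % 100
  return week >= 30 ? year : year - 1
}

seasonOf(201348) // 2013, i.e. season 2013-2014
seasonOf(201402) // 2013, still season 2013-2014
```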

Targets

At each time point, every model provides predictions for the following 7 targets (for each of the 11 regions):

  1. 1 week ahead wili% value.
  2. 2 week ahead wili% value.
  3. 3 week ahead wili% value.
  4. 4 week ahead wili% value.
  5. Peak week. The epiweek with the maximum wili% in the season.
  6. Peak wili%. The wili% value at the peak week.
  7. Onset week. The onset week for a given season is derived using a baseline wili% value set by the CDC for that season and region. It is defined as the first of the first 3 consecutive weeks with wili% equaling or exceeding the baseline (see the sketch after this list).
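
The onset rule can be read as a simple scan over a season's wili% series. Here is a minimal sketch (hypothetical names, not the project's actual code), assuming the series is an ordered array of { epiweek, wili } objects:

```js
// Find the onset week: the first of the first 3 consecutive weeks
// with wili% >= baseline. Returns null if no such run exists.
function onsetWeek (season, baseline) {
  let run = 0
  for (let i = 0; i < season.length; i++) {
    run = season[i].wili >= baseline ? run + 1 : 0
    if (run === 3) return season[i - 2].epiweek // first week of the run
  }
  return null
}
```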

Truth

The values of wili% are revised as a season progresses. The final wili% at epiweek 201802 (say) might not be equal to the wili% value for the same epiweek when queried at that time. Both the seasonal and the weekly targets might therefore vary during a live season (one whose wili% values are not yet settled).

The Delphi Epidata API provides ways to collect both the final wili% data and the (unsettled) data observed at a certain time in the season.
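
As a rough sketch of how one might query the API (the base URL, parameters, and response shape below are assumptions based on the Epidata fluview documentation at the time and may have changed):

```js
// Sketch: fetching wili% from the Delphi Epidata fluview endpoint.
// Omitting `issues` returns the latest (final) revision; passing an
// issue epiweek returns the data as it was observed at that time.
const base = 'https://delphi.cmu.edu/epidata/api.php' // assumed base URL

async function fetchWili (region, epiweek, issue) {
  let url = `${base}?source=fluview&regions=${region}&epiweeks=${epiweek}`
  if (issue) url += `&issues=${issue}`
  const res = await (await fetch(url)).json()
  return res.epidata[0].wili
}

// Final value vs. the value observed live at 201802:
// await fetchWili('nat', 201802)
// await fetchWili('nat', 201802, 201802)
```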

Some other subtleties related to seasonal targets follow:

  1. Since the peak week (and its value) can only be found once we have all the data for a season, it will not be available for a live season.
  2. A season is allowed to have multiple peak weeks if the corresponding values are close enough.
  3. Due to the way it's defined, the onset week for a season might be unavailable for a few weeks and then become available for the rest of the season without changing.
  4. Onset week can also be null for the whole season.

Scores

Due to the differences in targets and their true values, there are multiple ways to score a target prediction. As of now, scores are calculated and displayed for the models involved in this project in the following 3 places:

  1. Scoring the training data for weight estimation.
  2. Scores shown on the visualizer at http://flusightnetworks.io.
  3. CDC FluSight project where live models are submitted.

Among these, the 1st mimics the scoring used in the 3rd, so we only describe the first two. For each target, a model provides both a probability distribution and a point value corresponding to the bin with maximum probability. Therefore, there are two scores we calculate:

  1. Log score. Natural log of the probability assigned to the true bin. Higher is better.
  2. Absolute error. Absolute error between the model's point estimate and the truth value. Lower is better.

1. Training data scoring

Models in ./model-forecasts/component-models and ./model-forecasts/cv-ensemble-models are scored using a multivalue and multibin log scoring rule.

multivalue means that multiple truth values are allowed and the log score is the natural log of the sum of probabilities assigned to each true value. For example, if there are two peak weeks with wili values 3.2 and 5.3 respectively, a model providing a bin distribution such that bin [3.2, 3.3) has probability 0.2 and bin [5.3, 5.4) has probability 0.001 will have a log score of `Math.log(0.2 + 0.001) = -1.6044503709230613`.
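
A minimal sketch of the multivalue rule, assuming the distribution is given as a map from a bin's start value to its probability (names are hypothetical):

```js
// Multivalue log score: natural log of the summed probability
// assigned to every true value.
function multivalueLogScore (probs, trueValues) {
  const total = trueValues.reduce((sum, v) => sum + (probs[v] || 0), 0)
  return Math.log(total)
}

multivalueLogScore({ '3.2': 0.2, '5.3': 0.001 }, [3.2, 5.3])
// => Math.log(0.201) = -1.6044503709230613
```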

multibin means that multiple bins around the truth are considered for scoring instead of just one bin. As an example, consider that the truth is 2.3 and the bins (around the truth) with their probabilities are:

...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...

A single bin scoring rule will return a log score of Math.log(0.10) = -2.3025850929940455. If we instead use multibin scoring with a window of 2 bins around the truth (effectively 5 true bins), we get the score as:

Math.log(0.00 + 0.02 + 0.10 + 0.20 + 0.08) = -0.916290731874155

We use a window of 5 bins (a total of 11 bins) for wili% targets (peak wili% and week ahead targets) and a window of 1 bin (a total of 3 bins) for week targets (onset and peak week).
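
Combining the window with the multibin rule, here is a sketch (hypothetical names; bins is the ordered array of bin probabilities and trueIndex the index of the true bin):

```js
// Multibin log score: sum the probabilities of the true bin and
// `window` bins on each side, then take the natural log.
function multibinLogScore (bins, trueIndex, window) {
  let total = 0
  const lo = Math.max(0, trueIndex - window)
  const hi = Math.min(bins.length - 1, trueIndex + window)
  for (let i = lo; i <= hi; i++) total += bins[i]
  return Math.log(total)
}

// The example above: window of 2 around the true bin [2.3, 2.4)
multibinLogScore([0.00, 0.00, 0.02, 0.10, 0.20, 0.08, 0.01], 3, 2)
// => Math.log(0.40) = -0.916290731874155
```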

The script to generate these scores is in ./scripts/generate-scores.js, which uses the true data, as available at the time the predictions were made, from the file ./scores/target-multivals.csv. The generated scores are in ./scores/scores.csv.

2. Visualizer scoring

The visualizer uses the package flusight-csv-tools for collecting truth and scoring models.

The scores shown here differ from the above scheme in the following ways:

  1. The truth matched against is based on the latest revision of data (not the data observed at the time of prediction).
  2. Scores are calculated using a single bin and a single value.