Evaluation
This page documents the evaluation method used for scoring the models. First we describe the targets and what an error means for each of them. Then we describe how the true values are collected and used to score the models.
Before going any further, here are definitions for a few terms we will encounter repeatedly:
- An *epiweek* is the fundamental unit of time, based on MMWR weeks, and is uniquely identified by a combination of a year and an MMWR week. For example, 201348 is the epiweek representing MMWR week 48 of year 2013.
- An epidemic *season* is an ordered set of epiweeks starting at 20xx30 and ending at 20yy29, where 20yy is 20xx + 1. A season is usually represented using both consecutive years, e.g. 2013-2014, or just the first year, e.g. 2013. Because the number of MMWR weeks in a year can be either 52 or 53, a season can also have either 52 or 53 epiweeks (e.g. season 2014-2015 has 53 weeks since year 2014 has 53 MMWR weeks). A sketch mapping epiweeks to seasons follows these definitions.
- A *target* is something that models try to predict at each time step. Targets that specify properties of a season, like the peak week, are *seasonal* targets. On the other hand, targets like n weeks ahead are *weekly* targets.
- There are 11 geographical *regions*: 10 identifying the 10 HHS regions and 1 for the complete nation.
- The weighted influenza-like illness index, *wili%*, is the metric used in the time series. It is defined as the percentage of outpatient doctor visits for influenza-like illness, weighted by state population. This page on CDC.gov describes it in more detail.
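For illustration, here is a minimal sketch, in the spirit of the project's JavaScript scripts, of how an epiweek maps to a season under the conventions above (the helper names are ours, not part of the repository):

```js
// Split an epiweek like 201348 into its year and MMWR week number.
function parseEpiweek(epiweek) {
  return { year: Math.floor(epiweek / 100), week: epiweek % 100 };
}

// Return the first year of the season an epiweek belongs to.
// Weeks >= 30 start a new season; earlier weeks belong to the season
// that began the previous year.
function seasonOf(epiweek) {
  const { year, week } = parseEpiweek(epiweek);
  return week >= 30 ? year : year - 1;
}

seasonOf(201348); // 2013, i.e. season 2013-2014
seasonOf(201402); // 2013, still season 2013-2014
```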
At each time point, every model provides predictions for the following 7 targets (for each of the 11 regions):
- 1 week ahead wili% value.
- 2 week ahead wili% value.
- 3 week ahead wili% value.
- 4 week ahead wili% value.
- Peak week. The epiweek with the maximum wili% in the season.
- Peak wili%. The wili% value at the peak week.
- Onset week. The onset week for a given season is derived using a baseline wili% value set by the CDC for that season and region. It is defined as the first week of the first 3 consecutive weeks with wili% equaling or exceeding the baseline (see the sketch after this list).
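Here is a minimal sketch of the onset definition, assuming the season's observed wili% values and the CDC baseline are available as plain arrays and a number (the function and variable names are illustrative, not taken from the project's scripts):

```js
// Return the onset epiweek: the first week of the first run of 3
// consecutive weeks with wili% at or above the baseline, or null if
// no such run exists in the data seen so far.
function onsetWeek(epiweeks, wili, baseline) {
  for (let i = 0; i + 2 < wili.length; i++) {
    if (wili[i] >= baseline && wili[i + 1] >= baseline && wili[i + 2] >= baseline) {
      return epiweeks[i];
    }
  }
  return null;
}

// With a baseline of 2.2, the onset here is 201348:
onsetWeek(
  [201346, 201347, 201348, 201349, 201350, 201351],
  [1.8, 2.1, 2.3, 2.5, 2.9, 3.4],
  2.2
); // => 201348
```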
Since the values of wili% are revised as a season progresses, the final wili% at epiweek 201802 (say) might not be equal to the wili% value for the same epiweek when queried at that time. Both the seasonal and the weekly targets might vary during a live season (one whose wili% values are not yet settled).
The Delphi Epidata API provides ways to collect both the final wili% data and the (unsettled) data observed at a certain time in the season.
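As a rough sketch, a query for both the latest (settled) value and the value as first observed might look like the following; the endpoint URL and exact parameters below are assumptions and should be checked against the Epidata API documentation:

```js
// Hypothetical Epidata queries for national wili% at epiweek 201802.
// The base URL and parameter names are assumptions, not verified here.
const base = 'https://delphi.cmu.edu/epidata/api.php';

// Latest revision of the data:
fetch(`${base}?source=fluview&regions=nat&epiweeks=201802`)
  .then(res => res.json())
  .then(data => console.log(data.epidata));

// Data as it was first issued at epiweek 201802 (unsettled):
fetch(`${base}?source=fluview&regions=nat&epiweeks=201802&issues=201802`)
  .then(res => res.json())
  .then(data => console.log(data.epidata));
```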
Some other subtleties related to seasonal targets follow:
- Since a peak week (and the value) can only be found when we get all the data of a season, it will not be available for a live season.
- A season is allowed to have multiple peak weeks if the corresponding values are close enough.
- Due to the way it's defined, the onset week for a season might be unavailable for a few weeks and then become available for the rest of the season without changing.
- Onset week can also be null for the whole season.
Due to the differences in targets and their true values, there are multiple ways to score a target prediction. There are currently 3 places where scores are calculated and displayed for the models involved in this project:
- Scoring the training data for weight estimation.
- Scores shown on the visualizer at http://flusightnetworks.io
- CDC FluSight project where live models are submitted.
Among these, the 1st mimics the scoring used in the 3rd, so we only describe the first two. For each target, a model provides both a probability distribution and a point value corresponding to the bin with maximum probability. Therefore, there are two scores we calculate (a small sketch follows the list below):
- Log score. Natural log of probability assigned to the true bin. Higher is better.
- Absolute error. Absolute error between the model's point estimate and the truth value. Lower is better.
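A minimal sketch of both scores, assuming a forecast given as an array of probabilities over wili% bins of width 0.1 (the bin layout and helper names are ours for illustration):

```js
// bins[i] is the probability assigned to the interval [i / 10, i / 10 + 0.1).
function binIndex(value) {
  return Math.floor(value * 10);
}

// Log score: natural log of the probability assigned to the true bin.
function logScore(bins, truth) {
  return Math.log(bins[binIndex(truth)]);
}

// Absolute error between the point estimate and the truth; here the
// point estimate is taken as the midpoint of the most probable bin.
function absoluteError(bins, truth) {
  const best = bins.indexOf(Math.max(...bins));
  const pointEstimate = best / 10 + 0.05;
  return Math.abs(pointEstimate - truth);
}
```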
Models in `./model-forecasts/component-models` and `./model-forecasts/cv-ensemble-models` are scored using a multivalue and multibin log scoring rule.
Multivalue means that multiple truth values are allowed and the log score is the natural log of the sum of probabilities assigned to each true value. For example, if there are two peak weeks with wili% values 3.2 and 5.3 respectively, a model providing a bin distribution such that bin [3.2, 3.3) has probability 0.2 and bin [5.3, 5.4) has probability 0.001 will have a log score of `Math.log(0.2 + 0.001) = -1.6044503709230613`.
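A minimal sketch of the multivalue rule, assuming the forecast is given as a map from bin start values to probabilities (the data layout is ours for illustration):

```js
// Multivalue log score: sum the probabilities assigned to every true
// bin, then take the natural log.
function multivalueLogScore(binProbs, truths) {
  const total = truths.reduce((sum, t) => sum + (binProbs.get(t) || 0), 0);
  return Math.log(total);
}

// Two admissible peak wili% values, 3.2 and 5.3, as in the example above:
const binProbs = new Map([[3.2, 0.2], [5.3, 0.001]]);
multivalueLogScore(binProbs, [3.2, 5.3]); // -1.6044503709230613
```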
Multibin means that multiple bins around the truth are considered for scoring instead of just one bin. As an example, suppose the truth is 2.3 and the bins (around the truth) with their probabilities are:
```
...
[2.0, 2.1) 0.00
[2.1, 2.2) 0.00
[2.2, 2.3) 0.02
[2.3, 2.4) 0.10 // The true bin
[2.4, 2.5) 0.20
[2.5, 2.6) 0.08
[2.6, 2.7) 0.01
...
```
A single bin scoring rule will return a log score of `Math.log(0.10) = -2.3025850929940455`. If we instead use multibin scoring with a window of 2 bins around the truth (effectively 5 true bins), we get the score as `Math.log(0.00 + 0.02 + 0.10 + 0.20 + 0.08) = -0.916290731874155`.
We use a window of 5 bins (a total of 11 bins) for wili% targets (peak wili% and the week-ahead targets) and a window of 1 bin (a total of 3 bins) for week targets (onset week and peak week).
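A minimal sketch of the multibin rule over an array of bin probabilities; the indexing and edge handling are ours and simplified relative to whatever `./scripts/generate-scores.js` actually does:

```js
// Multibin log score: sum the probabilities in a window of bins around
// the true bin (windowSize = 2 means 2 bins on each side, 5 bins in total).
function multibinLogScore(bins, trueIndex, windowSize) {
  let total = 0;
  const lo = Math.max(0, trueIndex - windowSize);
  const hi = Math.min(bins.length - 1, trueIndex + windowSize);
  for (let i = lo; i <= hi; i++) {
    total += bins[i];
  }
  return Math.log(total);
}

// The example above: the true bin [2.3, 2.4) is at index 3, window of 2.
const exampleBins = [0.00, 0.00, 0.02, 0.10, 0.20, 0.08, 0.01];
multibinLogScore(exampleBins, 3, 2); // -0.916290731874155
```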
The script to generate these scores is in `./scripts/generate-scores.js`, which uses true data as available at the time the predictions were made, from the file `./scores/target-multivals.csv`. The generated scores are in `./scores/scores.csv`.
Scoring used in the visualizer relies on the package `flusight-csv-tools` for collecting truth data and scoring models.
The scores shown here differ from the above scheme in the following ways:
- The truth matched against is based on the latest revision of data (not the data observed at the time of prediction).
- Scores are calculated using single bin and single value rules.