Skip to content

Outliers #47

@stweb75

Description

@stweb75

Rbeast::beast can accommodate outliers with:

hasOutlier [default FALSE]
boolean; if true, fit a model with an outlier component that refers to potential spikes or dips at isolated data points: Y = trend + outlier + error if season='none',and Y = trend + season + outlier + error if season ~= 'none'.

ocp.minmax [default c(0,10)]
a vector of 2 integers (>=0); the min and max numbers of outlier-type changepoints (ocp) allowed in the time seriestrend component. Ocp refers to spikes or dips at isolated times that can't be modeled as trends or seasonal terms.

There is also the undocumented:

prior$outlierSigFactor = 2.5 # above which datapoints considered outliers

In my experience:

  • outliers 'absorb' variance and can result in overfitting of the trend and season components.
  • there often are many more outliers than there appear to be in a plot of the time series; often ncp_mode = ncp_max = ocp.minmax[2].

Below are three examples comparing Rbeast:beast and forecast::tsoutliers. The latter function is explained at Detecting time series in outliers. It is similarly based on a decomposition but identifies fewer outliers than does the former. I have also tried increasing outlierSigFactor.

I don't know how Rbeast::beast identifies outliers but suspect that prior$outlierSigFactor acts similarly to the 3IQR(r) tolerance interval in forecast::tsoutliers. For a symmetric distribution:
Q3(r) = Q2(r) + 0.5IQR(r) ~ mean(r) + 0.5IQR(r); IQR(r)~1.349sd(r); which gives a tolerance interval +/-1.349(0.5 + 3)IQR(r) ~ 4.7sd(r). That's a far more strict criteria than is 2.5sd(r). A further concern is that IQR(r) or mad(r) are robust estimators of scale and sd(r) is not.

Anyhow, my current strategy to avoid overfitting outliers is to set ocp.minmax[2] = length(tsoutliers(data)$index) or, being generous, 1.5length(tsoutliers(data)$index).

library(forecast)
library(Rbeast)
## Example 1
Data(Nile)
# Rbeast::beast(outlierSigFactor = 2.5)
# mode 7 outliers and Pr(ncp=7)=0.54
# 4 ocp with cpPr > 0.95
mod <- beast(Nile, hasOutlier=TRUE)
plot(mod)
mod
# no outliers by forecast::tsoutliers()
tsoutliers(Nile)
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 1 outliers and and Pr(ncp=1)=0.95
# 1 ocp with cpPr > 0.95
mod <- beast123(Y = as.numeric(Nile),
  season="none",
  metadata = list(time = time(Nile),
    deltaTime = 1,
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod
## Example 2
# Rbeast::beast(outlierSigFactor = 2.5)
# mode 10 outliers and Pr(ncp=10)=0.83
# 7 ocp with cpPr > 0.95
mod <- beast(googletrend_beach, hasOutlier=TRUE)
plot(mod)
mod
# five outliers by forecast::tsoutliers()
out <- tsoutliers(googletrend_beach)
# these match the top 5 outliers detected by Rbeast::beast()
out
time(googletrend_beach)[out$index]
# the ts is complex and the outliers are not obvious by visual inspection
plot(googletrend_beach)
points(time(googletrend_beach)[out$index], googletrend_beach[out$index], col="red")
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 2 outliers and and Pr(ncp=2)=0.94
# 1 ocp with cpPr > 0.95
mod <- beast123(Y = googletrend_beach,
  season="harmonic",
  metadata = list(time = time(googletrend_beach),
    deltaTime = 1/12,
    period = 1.0,    
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod
# Rbeast::beast(ocp.minmax=c(0,5))
# mode 5 outliers and and Pr(ncp=5)=0.84
# 3 ocp with cpPr > 0.95
mod <- beast(googletrend_beach, hasOutlier=TRUE, ocp.minmax=c(0,5))
plot(mod)
mod
## Example 3
data(co2)
# mode 10 outliers by Rbeast::beast() however Pr(ncp=10)=0.30
# zero ocp with cpPr > 0.95
mod <- beast(co2, hasOutlier=TRUE)
plot(mod)
mod
# no outliers by forecast::tsoutliers()
tsoutliers(co2)
# Rbeast::beast(outlierSigFactor = 4.7)
# mode 0 outliers and and Pr(ncp=0)=0.579
# zero ocp with cpPr > 0.95
mod <- beast123(Y = co2,
  season="harmonic",
  metadata = list(time = time(co2),
    deltaTime = 1/12,
    period = 1.0,    
    hasOutlier = TRUE),
  prior = list(outlierSigFactor = 4.7),
  extra = list(dumpInputData=TRUE,
    computeTrendOrder=TRUE,
    computeTrendSlope = TRUE))
plot(mod)
mod

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions