STL Regression Anomaly Detection — anomalize_ss_anom_la

For Single Site, Anomaly Detection, Longitudinal analyses where the time_period is shorter than a year, this function runs timetk::anomalize() to identify outliers in the time series using STL (seasonal and trend decomposition using LOESS) regression. For year-level analyses, the input table is returned unchanged and a different anomaly detection method is applied at the *_output stage.
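The idea behind STL-based anomaly detection can be sketched in base R. This is a minimal illustration under stated assumptions, not the code that this function or timetk runs: it uses stats::stl() directly, assumes a frequency of 3 observations per cycle, and flags remainders with an IQR rule whose multiplier of 3 is an illustrative choice.

```r
# Monthly counts taken from the example input on this page (14 observations)
counts <- c(15, 24, 100, 93, 47, 65, 33, 92, 153, 122, 5, 99, 10, 30)

# Assumption for this sketch: 3 observations per seasonal cycle (one quarter)
count_ts <- stats::ts(counts, frequency = 3)

# STL splits the series into seasonal, trend, and remainder components
decomp <- stats::stl(count_ts, s.window = "periodic")
parts <- as.data.frame(decomp$time.series)

# Flag remainders that fall far outside the interquartile range
# (the multiplier of 3 is illustrative, not a documented package default)
q <- stats::quantile(parts$remainder, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
parts$anomaly <- parts$remainder < q[[1]] - 3 * iqr |
  parts$remainder > q[[2]] + 3 * iqr
```

The three components sum back to the observed values, so an unusual observation shows up as a large remainder once the seasonal and trend parts are removed.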

Usage

anomalize_ss_anom_la(fot_input_tbl, grp_vars, time_var, var_col)

Arguments

fot_input_tbl

tabular input || required

A table, typically output by compute_fot()

grp_vars

string or vector || required

The variable(s) to be used as grouping variables in the analysis. These variables are also preserved in the cross-join, so the join should not introduce any NAs in them.

time_var

string || required

The variable with the time period date information (typically time_start)

var_col

string || required

The variable with the numerical statistic of interest, i.e. the values in which anomalies are detected

Value

For yearly analyses, the input table is returned unchanged and anomaly detection is performed with a control chart in the *_output step. For smaller time increments, this function returns the input data frame with all of its original columns plus the columns generated by timetk::anomalize(). These include an anomaly indicator and variables describing the decomposition of the time series.
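The two return paths can be sketched in base R. This is a hedged illustration rather than the package's implementation: sketch_route(), detect_fn, and add_flag() are hypothetical names, and reading the increment from a time_increment column is an assumption based on the example input.

```r
# Hypothetical helper: route a table according to its time increment
sketch_route <- function(tbl, detect_fn) {
  # Assumption: the increment is stored in a `time_increment` column
  if (all(tbl$time_increment == "year")) {
    return(tbl)  # yearly data passes through unchanged
  }
  detect_fn(tbl)  # sub-year data gets anomaly columns appended
}

# Stand-in detector that only adds a placeholder indicator column
add_flag <- function(tbl) {
  tbl$anomaly <- NA_character_
  tbl
}

yearly <- data.frame(time_increment = rep("year", 3), count = c(10, 12, 11))
monthly <- data.frame(time_increment = rep("month", 3), count = c(10, 12, 11))

yearly_out <- sketch_route(yearly, add_flag)    # same columns as the input
monthly_out <- sketch_route(monthly, add_flag)  # input columns plus `anomaly`
```

Downstream code can therefore distinguish the two cases by checking whether the anomaly columns are present in the returned table.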

Examples

# sample single-site, longitudinal input data (modeled after EVP)
sample_ss_la_input <- dplyr::tibble('variable' = rep('scd', 14),
                             'site' = rep('Site A', 14),
                             'count' = c(15, 24, 100, 93, 47, 65,
                                         33, 92, 153, 122, 5, 99,
                                         10, 30),
                             'time_start'=c('2018-01-01', '2018-02-01',
                                '2018-03-01', '2018-04-01', '2018-05-01',
                                '2018-06-01', '2018-07-01', '2018-08-01',
                                '2018-09-01', '2018-10-01', '2018-11-01',
                                '2018-12-01', '2019-01-01', '2019-02-01'),
                             'time_increment' = rep('month', 14))
# convert time_start to Date, then run timetk anomaly detection on the series
anomalize_ss_anom_la(fot_input_tbl = dplyr::mutate(sample_ss_la_input,
                         time_start = as.Date(time_start)),
                     grp_vars = c('site','variable'),
                     time_var = 'time_start',
                     var_col = 'count')
#> Joining with `by = join_by(time_start, site, variable)`
#> frequency = 3 observations per 1 quarter
#> trend = 14 (Number of observations insufficient for shorter trend cycles.
#> Joining with `by = join_by(time_start, site, variable)`
#> # A tibble: 14 × 16
#>    time_start site   variable count time_increment observed season trend
#>    <date>     <chr>  <chr>    <dbl> <chr>             <dbl>  <dbl> <dbl>
#>  1 2018-01-01 Site A scd         15 month                15  -12.1  44.7
#>  2 2018-02-01 Site A scd         24 month                24  -24.7  50.3
#>  3 2018-03-01 Site A scd        100 month               100   36.8  55.9
#>  4 2018-04-01 Site A scd         93 month                93  -12.1  61.5
#>  5 2018-05-01 Site A scd         47 month                47  -24.7  67.1
#>  6 2018-06-01 Site A scd         65 month                65   36.8  72.0
#>  7 2018-07-01 Site A scd         33 month                33  -12.1  76.8
#>  8 2018-08-01 Site A scd         92 month                92  -24.7  74.0
#>  9 2018-09-01 Site A scd        153 month               153   36.8  71.2
#> 10 2018-10-01 Site A scd        122 month               122  -12.1  67.8
#> 11 2018-11-01 Site A scd          5 month                 5  -24.7  64.4
#> 12 2018-12-01 Site A scd         99 month                99   36.8  60.3
#> 13 2019-01-01 Site A scd         10 month                10  -12.1  56.2
#> 14 2019-02-01 Site A scd         30 month                30  -24.7  51.1
#> # ℹ 8 more variables: remainder <dbl>, seasadj <dbl>, anomaly <chr>,
#> #   anomaly_direction <dbl>, anomaly_score <dbl>, recomposed_l1 <dbl>,
#> #   recomposed_l2 <dbl>, observed_clean <dbl>