For Single Site, Anomaly Detection, Longitudinal analyses where the time_period is
smaller than a year, this function executes timetk::anomalize() to identify
outliers in the time series using STL decomposition. For year-level analyses, the
input table is returned unchanged and a different anomaly detection method is
applied at the *_output stage.
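The branching described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `dispatch_by_increment()` and `flag_none()` are hypothetical names, and the detection step is passed in as a function so the sketch runs in base R without timetk.

```r
# Hypothetical helper illustrating the year vs. sub-year branching.
# `detect_fn` stands in for the sub-year anomaly detection step
# (timetk::anomalize() in the documented function).
dispatch_by_increment <- function(tbl, time_increment, detect_fn) {
  if (time_increment == "year") {
    # Year-level data: return the input untouched; detection happens later
    return(tbl)
  }
  # Sub-year data: run the supplied detection step
  detect_fn(tbl)
}

yearly <- data.frame(
  time_start = as.Date(c("2018-01-01", "2019-01-01")),
  count = c(15, 24)
)

# Placeholder detector that only adds an `anomaly` column (assumption)
flag_none <- function(tbl) transform(tbl, anomaly = "No")

# Year-level input comes back unchanged
identical(dispatch_by_increment(yearly, "year", flag_none), yearly)

# Sub-year input gains the detector's columns
names(dispatch_by_increment(yearly, "month", flag_none))
```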
Arguments
- fot_input_tbl
tabular input || required
A table, typically output by compute_fot()
- grp_vars
string or vector || required
The variable(s) to be used as grouping variables in the analysis. These variables will also be preserved in the cross-join, meaning there should not be any NAs as an artifact of the join for these variables.
- time_var
string || required
The variable with the time period date information (typically time_start)
- var_col
string || required
The variable with the numerical statistic of interest for the Euclidean distance computation
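To illustrate why the grouping variables stay free of NAs after a cross-join, here is a base R sketch on toy data. How the function builds its join internally is not shown in this page, so the steps below are an assumption about the general technique; the object names (`fot`, `grid`, `filled`) are made up for the example.

```r
# Toy input with one missing month (2018-02-01 is absent)
fot <- data.frame(
  site = c("Site A", "Site A"),
  variable = c("scd", "scd"),
  time_start = as.Date(c("2018-01-01", "2018-03-01")),
  count = c(15, 100)
)

# Every time period in the observed range
all_periods <- data.frame(
  time_start = seq(min(fot$time_start), max(fot$time_start), by = "month")
)

# Distinct combinations of the grouping variables
groups <- unique(fot[, c("site", "variable")])

# merge() with by = NULL returns the Cartesian product (a cross join)
grid <- merge(groups, all_periods, by = NULL)

# Left join the observed values back onto the full grid
filled <- merge(grid, fot,
                by = c("site", "variable", "time_start"),
                all.x = TRUE)
filled <- filled[order(filled$time_start), ]

# Grouping columns are complete; only the statistic is NA for the gap
filled
```

Because the grouping columns come from the grid rather than from the joined table, a period with no data still carries its site and variable labels.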
Value
For yearly analyses, the input table is returned unchanged and anomaly
detection is carried out via a control chart in the *_output step.
For smaller time increments, this function returns a dataframe with all
columns from the original input table plus the columns generated by
timetk::anomalize(). These include an anomaly indicator and variables
related to the decomposition of the time series.
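For intuition about the decomposition columns, the following base R sketch splits a series with stats::stl() and flags large remainders. It is not timetk's implementation: the seasonal frequency of 3 and the 3 x IQR fence are illustrative assumptions, and the column names simply mirror those in the returned table.

```r
# Monthly counts taken from the example input
counts <- c(15, 24, 100, 93, 47, 65, 33, 92, 153, 122, 5, 99, 10, 30)

# Assumed seasonal frequency: 3 monthly observations per quarter
series <- ts(counts, frequency = 3)

# STL splits the series into seasonal, trend, and remainder components
fit <- stats::stl(series, s.window = "periodic")
parts <- fit$time.series
remainder <- as.numeric(parts[, "remainder"])

# Illustrative outlier rule: remainder outside a 3 x IQR fence
q <- unname(stats::quantile(remainder, c(0.25, 0.75)))
fence <- 3 * (q[2] - q[1])

decomposed <- data.frame(
  observed  = counts,
  season    = as.numeric(parts[, "seasonal"]),
  trend     = as.numeric(parts[, "trend"]),
  remainder = remainder,
  anomaly   = ifelse(remainder < q[1] - fence | remainder > q[2] + fence,
                     "Yes", "No")
)

decomposed
```

The three components sum back to the observed value in every row, which is why the anomaly flag can be read as "the part of the observation that season and trend do not explain is unusually large".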
Examples
# sample single-site, longitudinal input data (modeled after EVP)
sample_ss_la_input <- dplyr::tibble(
  'variable' = c('scd', 'scd', 'scd', 'scd', 'scd', 'scd', 'scd',
                 'scd', 'scd', 'scd', 'scd', 'scd', 'scd', 'scd'),
  'site' = c('Site A', 'Site A', 'Site A', 'Site A', 'Site A',
             'Site A', 'Site A', 'Site A', 'Site A', 'Site A',
             'Site A', 'Site A', 'Site A', 'Site A'),
  'count' = c(15, 24, 100, 93, 47, 65, 33,
              92, 153, 122, 5, 99, 10, 30),
  'time_start' = c('2018-01-01', '2018-02-01', '2018-03-01',
                   '2018-04-01', '2018-05-01', '2018-06-01',
                   '2018-07-01', '2018-08-01', '2018-09-01',
                   '2018-10-01', '2018-11-01', '2018-12-01',
                   '2019-01-01', '2019-02-01'),
  'time_increment' = c('month', 'month', 'month', 'month', 'month',
                       'month', 'month', 'month', 'month', 'month',
                       'month', 'month', 'month', 'month')
)
# execute 'anomalization' from timetk package to find anomalies
anomalize_ss_anom_la(
  fot_input_tbl = sample_ss_la_input %>%
    dplyr::mutate(time_start = as.Date(time_start)),
  grp_vars = c('site', 'variable'),
  time_var = 'time_start',
  var_col = 'count'
)
#> Joining with `by = join_by(time_start, site, variable)`
#> frequency = 3 observations per 1 quarter
#> trend = 14 (Number of observations insufficient for shorter trend cycles.
#> Joining with `by = join_by(time_start, site, variable)`
#> # A tibble: 14 × 16
#> time_start site variable count time_increment observed season trend
#> <date> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 2018-01-01 Site A scd 15 month 15 -12.1 44.7
#> 2 2018-02-01 Site A scd 24 month 24 -24.7 50.3
#> 3 2018-03-01 Site A scd 100 month 100 36.8 55.9
#> 4 2018-04-01 Site A scd 93 month 93 -12.1 61.5
#> 5 2018-05-01 Site A scd 47 month 47 -24.7 67.1
#> 6 2018-06-01 Site A scd 65 month 65 36.8 72.0
#> 7 2018-07-01 Site A scd 33 month 33 -12.1 76.8
#> 8 2018-08-01 Site A scd 92 month 92 -24.7 74.0
#> 9 2018-09-01 Site A scd 153 month 153 36.8 71.2
#> 10 2018-10-01 Site A scd 122 month 122 -12.1 67.8
#> 11 2018-11-01 Site A scd 5 month 5 -24.7 64.4
#> 12 2018-12-01 Site A scd 99 month 99 36.8 60.3
#> 13 2019-01-01 Site A scd 10 month 10 -12.1 56.2
#> 14 2019-02-01 Site A scd 30 month 30 -24.7 51.1
#> # ℹ 8 more variables: remainder <dbl>, seasadj <dbl>, anomaly <chr>,
#> # anomaly_direction <dbl>, anomaly_score <dbl>, recomposed_l1 <dbl>,
#> # recomposed_l2 <dbl>, observed_clean <dbl>