Euclidean Distance Computation — ms_anom_euclidean

Computes the Euclidean distance between the var_col values at each site and the overall, all-site mean. This function is the backend for most of the Multi-Site, Anomaly Detection, Longitudinal analyses.

Usage

ms_anom_euclidean(fot_input_tbl, grp_vars, var_col)

Arguments

fot_input_tbl

tabular input || required

A table, typically output by compute_fot()

grp_vars

string or vector || required

The variable(s) to use as grouping variables in the analysis. These variables are preserved in the cross-join, so the join should not introduce any NAs for them.
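
The following is a minimal base R sketch of the cross-join idea described above, not the package's internal code. The toy data, the object names (toy, grid, filled), and the use of expand.grid() and merge() are assumptions made for illustration. It shows why the grouping variable never picks up NAs: the grid is built from the grouping values themselves, so only the statistic column can be missing after the join.

```r
# Hypothetical toy input: Site B has no row for 2019
toy <- data.frame(
  site       = c("Site A", "Site A", "Site B"),
  time_start = c("2018-01-01", "2019-01-01", "2018-01-01"),
  count      = c(15, 24, 92),
  stringsAsFactors = FALSE
)

# Cross-join every grouping value with every time point
grid <- expand.grid(
  site       = unique(toy$site),
  time_start = unique(toy$time_start),
  stringsAsFactors = FALSE
)

# Left join the observed data onto the full grid;
# the grouping column comes from the grid, so it is never NA
filled <- merge(grid, toy, by = c("site", "time_start"), all.x = TRUE)

# Time periods without data get a count of 0
filled$count[is.na(filled$count)] <- 0

filled
```

In this sketch, the missing Site B / 2019 combination appears in the result with a count of 0 rather than being dropped.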

var_col

string || required

The variable containing the numerical statistic of interest for the Euclidean distance computation.

Value

Returns the original data frame, with any time periods that have no data filled in with 0s, plus the all-site mean and median of var_col and the Euclidean distance of each site from the all-site mean (mean_allsiteprop, median, and dist_eucl_mean in the example output below).
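
As a rough illustration of the quantities described above, the sketch below computes a per-period all-site mean and then one Euclidean distance per site in base R. It assumes the distance is taken between each site's raw series and the all-site mean series over time; the actual function also returns a site_loess column, so its computation may involve smoothing that is not reproduced here. The toy data and the names mean_allsite, sq_diff, and dist_by_site are hypothetical.

```r
# Hypothetical toy input: two sites, three yearly time points
toy <- data.frame(
  site       = rep(c("Site A", "Site B"), each = 3),
  time_start = rep(c("2018-01-01", "2019-01-01", "2020-01-01"), times = 2),
  count      = c(15, 24, 100, 92, 153, 122),
  stringsAsFactors = FALSE
)

# All-site mean of the statistic at each time point
toy$mean_allsite <- ave(toy$count, toy$time_start, FUN = mean)

# Squared difference between each site's value and the all-site mean
sq_diff <- (toy$count - toy$mean_allsite)^2

# Euclidean distance per site: square root of the summed squared differences
dist_by_site <- sqrt(tapply(sq_diff, toy$site, sum))

dist_by_site
```

With only two sites, each site sits the same distance from the mean, so both distances are equal; differences between sites only emerge with three or more sites.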

Examples

# sample multi-site, longitudinal input data (modeled after EVP)
sample_ms_la_input <- dplyr::tibble('variable' = c('scd', 'scd', 'scd',
                                                   'scd', 'scd', 'scd',
                                                   'scd', 'scd', 'scd',
                                                   'scd', 'scd', 'scd',
                                                   'scd', 'scd'),
                             'site' = c('Site A','Site A','Site A',
                                        'Site A','Site A','Site A',
                                        'Site A','Site B','Site B',
                                        'Site B','Site B','Site B',
                                        'Site B','Site B'),
                             'count' = c(15, 24, 100, 93, 47, 65,
                                         33, 92, 153, 122, 5, 99,
                                         10, 30),
                             'time_start'=c('2018-01-01','2019-01-01',
                                   '2020-01-01', '2021-01-01', '2022-01-01',
                                   '2023-01-01', '2024-01-01','2018-01-01',
                                   '2019-01-01', '2020-01-01', '2021-01-01',
                                   '2022-01-01', '2023-01-01', '2024-01-01'),
                             'time_increment' = c('year','year','year',
                                   'year', 'year','year', 'year','year',
                                   'year','year','year','year','year',
                                   'year'))

# compute euclidean distance for each site & variable combination
ms_anom_euclidean(fot_input_tbl = sample_ms_la_input %>%
                        dplyr::mutate(time_start = as.Date(time_start)),
                  grp_vars = c('site', 'variable'),
                  var_col = 'count')
#> Joining with `by = join_by(time_start, site, variable)`
#> Joining with `by = join_by(site, variable, time_start)`
#> Joining with `by = join_by(variable, time_start)`
#> # A tibble: 14 × 9
#>    site   time_start variable count mean_allsiteprop median date_numeric
#>    <chr>  <date>     <chr>    <dbl>            <dbl>  <dbl>        <dbl>
#>  1 Site A 2018-01-01 scd         15             53.5   53.5        17532
#>  2 Site A 2019-01-01 scd         24             88.5   88.5        17897
#>  3 Site A 2020-01-01 scd        100            111    111          18262
#>  4 Site A 2021-01-01 scd         93             49     49          18628
#>  5 Site A 2022-01-01 scd         47             73     73          18993
#>  6 Site A 2023-01-01 scd         65             37.5   37.5        19358
#>  7 Site A 2024-01-01 scd         33             31.5   31.5        19723
#>  8 Site B 2018-01-01 scd         92             53.5   53.5        17532
#>  9 Site B 2019-01-01 scd        153             88.5   88.5        17897
#> 10 Site B 2020-01-01 scd        122            111    111          18262
#> 11 Site B 2021-01-01 scd          5             49     49          18628
#> 12 Site B 2022-01-01 scd         99             73     73          18993
#> 13 Site B 2023-01-01 scd         10             37.5   37.5        19358
#> 14 Site B 2024-01-01 scd         30             31.5   31.5        19723
#> # ℹ 2 more variables: site_loess <dbl>, dist_eucl_mean <dbl>