Hotspots Anomaly Detection Eligibility Determination
Source: R/anomaly_detection.R
compute_dist_anomalies.Rd
For each group in a dataframe, this function determines whether the group is eligible
for anomaly detection analysis by examining the values in the var_col.
The following conditions will disqualify a group from the anomaly detection analysis:
(1) Mean < 0.02 or Median < 0.01
(2) Mean < 0.05 and range < 0.01
(3) Coefficient of variation < 0.1 and sample size < 11
If every group is disqualified by these criteria, a warning will display in the
console indicating that no groups were eligible.
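The three conditions above can be sketched for a single group's values as follows. This is a minimal illustration and not the package's implementation: check_eligibility_sketch is a hypothetical helper, it returns a logical value rather than the character flag stored in analysis_eligible, and it assumes that the range is max minus min and that the coefficient of variation is sd divided by mean (the latter is consistent with the cov_val column in the example output below).

```r
# Hypothetical helper, not part of the package: applies the three documented
# disqualifying conditions to the numeric values of one group.
check_eligibility_sketch <- function(x) {
  mean_val   <- mean(x, na.rm = TRUE)
  median_val <- median(x, na.rm = TRUE)
  sd_val     <- sd(x, na.rm = TRUE)
  range_val  <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)  # assumed: max - min
  cov_val    <- sd_val / mean_val                            # assumed: sd / mean
  total_ct   <- sum(!is.na(x))                               # sample size

  disqualified <-
    (mean_val < 0.02 || median_val < 0.01) ||   # condition (1)
    (mean_val < 0.05 && range_val < 0.01) ||    # condition (2)
    (cov_val < 0.1 && total_ct < 11)            # condition (3)

  # TRUE means the group would be eligible under this sketch
  !disqualified
}

# Values with a large mean and wide spread pass all three checks
check_eligibility_sketch(c(100, 140, 0, 137, 111, 0))

# A near-zero mean triggers condition (1)
check_eligibility_sketch(c(0.001, 0.002, 0.001))

# Low variation in a small sample triggers condition (3)
check_eligibility_sketch(c(10, 10.1, 10.2, 9.9))
```

In this sketch a group needs at least two non-missing values, since sd() is undefined for a single value.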
Arguments
- df_tbl
tabular input || required
A dataframe with at least one numerical variable and any variables needed for grouping
- grp_vars
string or vector || required
The variable(s) to be used as grouping variables in the analysis
- var_col
string || required
The variable containing the numerical statistic of interest for the Euclidean distance computation
- denom_cols
string or vector || required
The variable containing a denominator, or any other variables that should be preserved without nulls after the cross_join takes place
Value
This function returns the original df_tbl with the addition of the
summary statistics used in the eligibility computation and a flag
(analysis_eligible) indicating whether a given group (as defined by
grp_vars) is eligible for anomaly detection analysis. This table can
then be passed into detect_outliers to identify anomalous values.
Examples
# create sample input (modeled after EVP)
sample_ms_input <- dplyr::tibble('site' = c('Site A', 'Site A', 'Site A',
                                            'Site A', 'Site B', 'Site B',
                                            'Site B', 'Site B'),
                                 'variable' = c('dx', 'dx', 'drug', 'drug',
                                                'dx', 'dx', 'drug', 'drug'),
                                 'count' = c(100, 140, 39, 42, 137, 111,
                                             12, 15),
                                 'total_var' = c(1000, 1000, 200, 200, 1500,
                                                 1500, 100, 100))
# execute the full analysis, including compute_dist_anomalies and
# detect_outliers
anomaly_output1 <- compute_dist_anomalies(df_tbl = sample_ms_input,
                                          grp_vars = 'variable',
                                          var_col = 'count',
                                          denom_cols = 'total_var')
#> Joining with `by = join_by(site)`
#> Joining with `by = join_by(site, variable, total_var)`
anomaly_output1
#> # A tibble: 12 × 14
#> site variable total_var count mean_val median_val sd_val mad_val cov_val
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Site A dx 1000 100 81.3 106. 64.8 48.9 0.797
#> 2 Site A dx 1000 140 81.3 106. 64.8 48.9 0.797
#> 3 Site A dx 200 0 81.3 106. 64.8 48.9 0.797
#> 4 Site A drug 1000 0 18 13.5 18.5 20.0 1.03
#> 5 Site A drug 200 39 18 13.5 18.5 20.0 1.03
#> 6 Site A drug 200 42 18 13.5 18.5 20.0 1.03
#> 7 Site B dx 1500 137 81.3 106. 64.8 48.9 0.797
#> 8 Site B dx 1500 111 81.3 106. 64.8 48.9 0.797
#> 9 Site B dx 100 0 81.3 106. 64.8 48.9 0.797
#> 10 Site B drug 1500 0 18 13.5 18.5 20.0 1.03
#> 11 Site B drug 100 12 18 13.5 18.5 20.0 1.03
#> 12 Site B drug 100 15 18 13.5 18.5 20.0 1.03
#> # ℹ 5 more variables: max_val <dbl>, min_val <dbl>, range_val <dbl>,
#> # total_ct <int>, analysis_eligible <chr>
anomaly_output2 <- detect_outliers(df_tbl = anomaly_output1,
                                   column_analysis = 'count',
                                   column_variable = 'variable')
#> Joining with `by = join_by(site, variable, total_var, count, mean_val,
#> median_val, sd_val, mad_val, cov_val, max_val, min_val, range_val, total_ct,
#> analysis_eligible)`
anomaly_output2
#> # A tibble: 12 × 17
#> site variable total_var count mean_val median_val sd_val mad_val cov_val
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Site A dx 1000 100 81.3 106. 64.8 48.9 0.797
#> 2 Site A dx 1000 140 81.3 106. 64.8 48.9 0.797
#> 3 Site A dx 200 0 81.3 106. 64.8 48.9 0.797
#> 4 Site A drug 1000 0 18 13.5 18.5 20.0 1.03
#> 5 Site A drug 200 39 18 13.5 18.5 20.0 1.03
#> 6 Site A drug 200 42 18 13.5 18.5 20.0 1.03
#> 7 Site B dx 1500 137 81.3 106. 64.8 48.9 0.797
#> 8 Site B dx 1500 111 81.3 106. 64.8 48.9 0.797
#> 9 Site B dx 100 0 81.3 106. 64.8 48.9 0.797
#> 10 Site B drug 1500 0 18 13.5 18.5 20.0 1.03
#> 11 Site B drug 100 12 18 13.5 18.5 20.0 1.03
#> 12 Site B drug 100 15 18 13.5 18.5 20.0 1.03
#> # ℹ 8 more variables: max_val <dbl>, min_val <dbl>, range_val <dbl>,
#> # total_ct <int>, analysis_eligible <chr>, lower_tail <dbl>,
#> # upper_tail <dbl>, anomaly_yn <chr>