This function will identify anomalies in a dataframe using the
hotspots::outliers() function. It assumes:
(1) No time component;
(2) Table has a column indicating whether a particular group or row is eligible for analysis;
(3) numerical variable exists that should be used for anomaly detection analysis
These conditions are met by the output of compute_dist_anomalies(), which is typically
the input for this function
Usage
detect_outliers(
df_tbl,
tail_input = "both",
p_input = 0.9,
column_analysis,
column_variable,
column_eligible = "analysis_eligible"
)Arguments
- df_tbl
tabular input || required
A table meeting the previously described criteria. This input will typically be the table output by
compute_dist_anomalies()- tail_input
string || defaults to
bothA string indicating whether outliers are identified for positive values, negative values, or both.
Acceptable inputs are
positive,negative, orboth- p_input
numeric || defaults to
0.9The p-value threshold that should be used to identify anomalies
- column_analysis
string || required
The name of the numerical column that is the target of the anomaly detection analysis
- column_variable
string || required
The name of the column with the variable to which the analysis column is related (ex: concept_id for the Concept Set Distribution module)
- column_eligible
string || defaults to
analysis_eligibleThe name of the column that indicates eligibility for analysis
Examples
# create sample input (modeled after EVP)
sample_ms_input <- dplyr::tibble('site' = c('Site A', 'Site A', 'Site A',
'Site A', 'Site B', 'Site B',
'Site B', 'Site B'),
'variable' = c('dx', 'dx', 'drug', 'drug',
'dx', 'dx', 'drug', 'drug'),
'count' = c(100, 140, 39, 42, 137, 111,
12, 15),
'total_var' = c(1000, 1000, 200, 200, 1500,
1500, 100, 100))
# execute the full analysis, including compute_dist_anomalies and
# detect_outliers
anomaly_output1 <- compute_dist_anomalies(df_tbl = sample_ms_input,
grp_vars = 'variable',
var_col = 'count',
denom_cols = 'total_var')
#> Joining with `by = join_by(site)`
#> Joining with `by = join_by(site, variable, total_var)`
anomaly_output1
#> # A tibble: 12 × 14
#> site variable total_var count mean_val median_val sd_val mad_val cov_val
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Site A dx 1000 100 81.3 106. 64.8 48.9 0.797
#> 2 Site A dx 1000 140 81.3 106. 64.8 48.9 0.797
#> 3 Site A dx 200 0 81.3 106. 64.8 48.9 0.797
#> 4 Site A drug 1000 0 18 13.5 18.5 20.0 1.03
#> 5 Site A drug 200 39 18 13.5 18.5 20.0 1.03
#> 6 Site A drug 200 42 18 13.5 18.5 20.0 1.03
#> 7 Site B dx 1500 137 81.3 106. 64.8 48.9 0.797
#> 8 Site B dx 1500 111 81.3 106. 64.8 48.9 0.797
#> 9 Site B dx 100 0 81.3 106. 64.8 48.9 0.797
#> 10 Site B drug 1500 0 18 13.5 18.5 20.0 1.03
#> 11 Site B drug 100 12 18 13.5 18.5 20.0 1.03
#> 12 Site B drug 100 15 18 13.5 18.5 20.0 1.03
#> # ℹ 5 more variables: max_val <dbl>, min_val <dbl>, range_val <dbl>,
#> # total_ct <int>, analysis_eligible <chr>
anomaly_output2 <- detect_outliers(df_tbl = anomaly_output1,
column_analysis = 'count',
column_variable = 'variable')
#> Joining with `by = join_by(site, variable, total_var, count, mean_val,
#> median_val, sd_val, mad_val, cov_val, max_val, min_val, range_val, total_ct,
#> analysis_eligible)`
anomaly_output2
#> # A tibble: 12 × 17
#> site variable total_var count mean_val median_val sd_val mad_val cov_val
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Site A dx 1000 100 81.3 106. 64.8 48.9 0.797
#> 2 Site A dx 1000 140 81.3 106. 64.8 48.9 0.797
#> 3 Site A dx 200 0 81.3 106. 64.8 48.9 0.797
#> 4 Site A drug 1000 0 18 13.5 18.5 20.0 1.03
#> 5 Site A drug 200 39 18 13.5 18.5 20.0 1.03
#> 6 Site A drug 200 42 18 13.5 18.5 20.0 1.03
#> 7 Site B dx 1500 137 81.3 106. 64.8 48.9 0.797
#> 8 Site B dx 1500 111 81.3 106. 64.8 48.9 0.797
#> 9 Site B dx 100 0 81.3 106. 64.8 48.9 0.797
#> 10 Site B drug 1500 0 18 13.5 18.5 20.0 1.03
#> 11 Site B drug 100 12 18 13.5 18.5 20.0 1.03
#> 12 Site B drug 100 15 18 13.5 18.5 20.0 1.03
#> # ℹ 8 more variables: max_val <dbl>, min_val <dbl>, range_val <dbl>,
#> # total_ct <int>, analysis_eligible <chr>, lower_tail <dbl>,
#> # upper_tail <dbl>, anomaly_yn <chr>