Source and Concept Vocabularies — scv_process • sourceconceptvocabularies

This concept-set testing module computes frequency distributions of source-to-concept or concept-to-source pairs in order to highlight mapping patterns and the impact of concept standardization. The user provides the domain definitions (domain_tbl) and a concept set containing the concepts of interest (concept_set). Sample versions of these inputs are included as data in the package and are accessible with sourceconceptvocabularies::. Results can optionally be stratified by site, age group, and/or time. This function is compatible with both the OMOP and the PCORnet CDMs, based on the user's selection.
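To make the idea of a pair-level frequency distribution concrete, the sketch below counts concept/source pairs in a toy table and divides each count by the total for its standard concept. It is an illustration only, not the package's implementation: the data and source IDs are made up, base R is used for self-containment, and the assumption that concept_prop equals ct divided by denom_concept_ct is inferred from the column names in the example output on this page.

```r
# Toy facts: one row per clinical fact, with its standard (mapped) concept
# and the source concept it came from. All values are illustrative.
facts <- data.frame(
  concept_id        = c(320128, 320128, 320128, 316866, 316866),
  source_concept_id = c(1001, 1001, 1002, 1002, 1002)
)

# Count how often each concept/source pair occurs
pairs <- aggregate(ct ~ concept_id + source_concept_id,
                   data = transform(facts, ct = 1L), FUN = sum)

# Denominator: total number of facts for each standard concept
pairs$denom_concept_ct <- ave(pairs$ct, pairs$concept_id, FUN = sum)

# Assumed definition: share of the concept's facts that use this source code
pairs$concept_prop <- pairs$ct / pairs$denom_concept_ct
pairs
```

A concept whose proportion is spread across several source codes points to many-to-one mapping, which is the kind of pattern this module is meant to surface.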

Usage

scv_process(
  cohort,
  concept_set,
  omop_or_pcornet,
  domain_tbl = sourceconceptvocabularies::scv_domain_file,
  code_type,
  code_domain,
  jaccard_index = FALSE,
  multi_or_single_site = "single",
  anomaly_or_exploratory = "exploratory",
  p_value = 0.9,
  age_groups = NULL,
  time = FALSE,
  time_span = c("2012-01-01", "2020-01-01"),
  time_period = "year"
)

Arguments

cohort

tabular input || required

The cohort to be used for data quality testing. This table should contain, at minimum:

site | character | the name(s) of institutions included in your cohort
person_id / patid | integer / character | the patient identifier
start_date | date | the start of the cohort period
end_date | date | the end of the cohort period

Note that the start and end dates included in this table will be used to limit the search window for the analyses in this module.
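As a minimal sketch of the expected shape, the snippet below builds a small local cohort table with the four required columns and checks it. The values are invented, and whether scv_process accepts a local data frame rather than a database-backed table (as in the Examples section) is an assumption.

```r
# Hypothetical OMOP-style cohort; all values are invented for illustration
cohort_example <- data.frame(
  site       = c("synth1", "synth1", "synth2"),
  person_id  = c(1L, 2L, 7L),
  start_date = as.Date(c("2012-01-01", "2013-06-15", "2012-01-01")),
  end_date   = as.Date(c("2020-01-01", "2019-12-31", "2020-01-01")),
  stringsAsFactors = FALSE
)

# Basic sanity checks before passing the table on
required_cols <- c("site", "person_id", "start_date", "end_date")
stopifnot(all(required_cols %in% names(cohort_example)))
stopifnot(all(cohort_example$start_date <= cohort_example$end_date))
```

For PCORnet data, the identifier column would be named patid instead of person_id.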

concept_set

tabular input or vector || required

For analyses where time = FALSE, this input should be a table containing the standard CDM concepts OR source concepts of interest for the analysis. This input should contain at least one of the following:

concept_id | integer | the concept_id of interest (required for OMOP)
concept_code | character | the code of interest (required for PCORnet)

For certain PCORnet applications, it should also contain:

vocabulary_id | character | the vocabulary of the code, which should match what is listed in the domain table's vocabulary_field

For analyses where time = TRUE, this input should be a vector with up to 5 standard CDM OR source concepts of interest for the analysis. This limitation is applied to reduce computational strain. We recommend running a cross-sectional analysis first to identify potential concepts of interest, then using these as input for the longitudinal analysis.
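The three input shapes described above might look like the following. This is a sketch with illustrative values: the concept IDs and codes are examples only, and the vocabulary_id value "10" is an assumption about how ICD-10 codes are labeled in the dx_type field.

```r
# OMOP, time = FALSE: a table keyed on concept_id
concept_set_omop <- data.frame(concept_id = c(320128L, 316866L))

# PCORnet, time = FALSE: codes plus the vocabulary they belong to
concept_set_pcornet <- data.frame(
  concept_code  = c("I10", "I11.9"),
  vocabulary_id = c("10", "10"),  # assumed to match values in the vocabulary_field
  stringsAsFactors = FALSE
)

# time = TRUE: a plain vector of at most 5 concepts
concept_vec <- c(320128L, 316866L)
stopifnot(length(concept_vec) <= 5)
```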

omop_or_pcornet

string || required

A string, either omop or pcornet, indicating the CDM format of the data

domain_tbl

tabular input || defaults to sourceconceptvocabularies::scv_domain_file

A table that defines the domains where concepts should be identified. This input should contain:

domain | character | the name of the CDM table where the concepts can be identified
concept_field | character | the name of the field in the CDM table where standard CDM codes can be identified (e.g. drug_concept_id or dx)
source_concept_field | character | the name of the field in the CDM table where source codes can be identified (e.g. drug_source_concept_id or raw_dx)
date_field | character | the name of the field in the CDM table that should be used for temporal filtering
vocabulary_field | character | for PCORnet applications, the name of the field in the domain table with a vocabulary identifier to differentiate concepts from one another (e.g. dx_type); can be set to NA for OMOP applications

To see an example of this input, see ?sourceconceptvocabularies::scv_domain_file
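For comparison with the packaged example, a PCORnet-flavored domain table could be sketched as below. The dx, raw_dx, and dx_type field names come from the descriptions above; the diagnosis table name and the admit_date date field are assumptions about a typical PCORnet setup and should be checked against your own CDM.

```r
# Hypothetical single-row domain table for PCORnet diagnoses
pcornet_domain_tbl <- data.frame(
  domain               = "diagnosis",   # assumed CDM table name
  concept_field        = "dx",
  source_concept_field = "raw_dx",
  date_field           = "admit_date",  # assumed date field
  vocabulary_field     = "dx_type",
  stringsAsFactors = FALSE
)

# Confirm all five expected columns are present
expected_cols <- c("domain", "concept_field", "source_concept_field",
                   "date_field", "vocabulary_field")
stopifnot(all(expected_cols %in% names(pcornet_domain_tbl)))
```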

code_type

string || required

A string identifying the type of concept that has been provided in the concept_set.

Acceptable values are cdm (the standard, mapped code that is included in the CDM) or source (the "raw" concept from the source system)

code_domain

string || required

The string name of the domain where the concepts can be identified. This input should match at least one of the domains in the domain_tbl. It is used to filter that table down to the relevant domain, which lets the user store multiple domains in a single table for reuse in other analyses.
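The filtering role of code_domain can be illustrated with a two-domain table. This is a base R sketch of the idea, not the package's internal code; the table contents are standard OMOP field names used for illustration.

```r
# A domain table holding two OMOP domains for reuse across analyses
domain_tbl_example <- data.frame(
  domain               = c("condition_occurrence", "drug_exposure"),
  concept_field        = c("condition_concept_id", "drug_concept_id"),
  source_concept_field = c("condition_source_concept_id",
                           "drug_source_concept_id"),
  date_field           = c("condition_start_date", "drug_exposure_start_date"),
  vocabulary_field     = NA_character_,  # not needed for OMOP
  stringsAsFactors = FALSE
)

# code_domain selects which row applies to the current analysis
code_domain <- "drug_exposure"
stopifnot(code_domain %in% domain_tbl_example$domain)
relevant_domain <- domain_tbl_example[domain_tbl_example$domain == code_domain, ]
relevant_domain$concept_field
```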

jaccard_index

boolean || defaults to FALSE

A boolean indicating whether a Jaccard index should be computed at the visit level to determine how often two mapped concepts co-occur in the same encounter. This computation can help identify potential instances of post-coordination for SNOMED concepts.

This is only applicable for the Single Site, Anomaly Detection, Cross-Sectional check.
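For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies that standard definition to the sets of visits in which two concepts appear. The visit IDs and the helper name jaccard are hypothetical, and the package may compute the index differently in practice.

```r
# Visits (encounters) in which each mapped concept appears; toy IDs
visits_concept_a <- c(101, 102, 103, 104)
visits_concept_b <- c(103, 104, 105)

# Standard Jaccard index: |A intersect B| / |A union B|
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# 2 shared visits out of 5 distinct visits
jaccard(visits_concept_a, visits_concept_b)
```

Values near 1 mean the two concepts almost always appear in the same encounters, while values near 0 mean they rarely do.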

multi_or_single_site

string || defaults to single

A string, either single or multi, indicating whether a single-site or multi-site analysis should be executed

anomaly_or_exploratory

string || defaults to exploratory

A string, either anomaly or exploratory, indicating the type of analysis to conduct. Exploratory analyses give a high-level summary of the data to examine fact representation within the cohort. Anomaly detection analyses are specialized to identify outliers within the cohort.

p_value

numeric || defaults to 0.9

The p value to be used as a threshold in the Multi-Site, Anomaly Detection, Cross-Sectional analysis

age_groups

tabular input || defaults to NULL

If you would like to stratify the results by age group, create a table or CSV file with the following columns and use it as input to this parameter:

min_age | integer | the minimum age for the group (e.g. 10)
max_age | integer | the maximum age for the group (e.g. 20)
group | character | a string label for the group (e.g. 10-20, Young Adult, etc.)

If you would not like to stratify by age group, leave as NULL
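An age group table with the three columns above might be built as follows. The cut points and labels are arbitrary, and the choice of non-overlapping ranges is an assumption, since how the bounds are treated (inclusive or exclusive) is not stated here.

```r
# Hypothetical age groupings; ranges chosen so they do not overlap
age_groups_example <- data.frame(
  min_age = c(0L, 10L, 20L),
  max_age = c(9L, 19L, 29L),
  group   = c("0-9", "10-19", "20-29"),
  stringsAsFactors = FALSE
)

# Each group should have a valid range and a unique label
stopifnot(all(age_groups_example$min_age <= age_groups_example$max_age))
stopifnot(!anyDuplicated(age_groups_example$group))
```

The same table can be saved with write.csv(age_groups_example, "age_groups.csv", row.names = FALSE) if a CSV file is preferred; the file name is only an example.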

time

boolean || defaults to FALSE

A boolean to indicate whether to execute a longitudinal analysis

time_span

vector - length 2 || defaults to c('2012-01-01', '2020-01-01')

A vector indicating the lower and upper bounds of the time series for longitudinal analyses

time_period

string || defaults to year

A string indicating the distance between dates within the specified time_span. Defaults to year, but other time periods such as month or week are also acceptable
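To see how time_span and time_period interact, the sketch below uses base R's seq() to list the period start dates implied by the default settings. It only illustrates the spacing of dates and is not necessarily how the package builds its time series.

```r
# Default longitudinal settings from the Usage section
time_span   <- c("2012-01-01", "2020-01-01")
time_period <- "year"

# One date per period between the lower and upper bounds (inclusive)
period_starts <- seq(as.Date(time_span[1]), as.Date(time_span[2]),
                     by = time_period)
length(period_starts)
```

Switching time_period to "month" or "week" yields many more periods over the same span, which increases the size of the longitudinal output accordingly.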

Value

This function will return a dataframe summarizing the mapping patterns for each concept provided by the user. For a more detailed description of the output specific to each check type, see the PEDSpace metadata repository.

Examples


#' Source setup file
source(system.file('setup.R', package = 'sourceconceptvocabularies'))

#' Create in-memory RSQLite database using data in extdata directory
conn <- mk_testdb_omop()

#' Establish connection to database and generate internal configurations
initialize_dq_session(session_name = 'scv_process_test',
                      working_directory = my_directory,
                      db_conn = conn,
                      is_json = FALSE,
                      file_subdirectory = my_file_folder,
                      cdm_schema = NA)
#> Connected to: :memory:@NA

#' Build mock study cohort
cohort <- cdm_tbl('person') %>% dplyr::distinct(person_id) %>%
  dplyr::mutate(start_date = as.Date(-5000),
                #RSQLite does not store date objects,
                #hence the numerics
                end_date = as.Date(15000),
                site = ifelse(person_id %in% c(1:6), 'synth1', 'synth2'))

#' Prepare input tables
scv_domain_tbl <- dplyr::tibble(domain = 'condition_occurrence',
                                concept_field = 'condition_concept_id',
                                source_concept_field =
                                  'condition_source_concept_id',
                                date_field = 'condition_start_date',
                                vocabulary_field = NA)

scv_concept_set <- read_codeset('dx_hypertension')

#' Execute `scv_process` function
#' This example will use the single site, exploratory, cross sectional
#' configuration
scv_process_example <- scv_process(cohort = cohort,
                                   multi_or_single_site = 'single',
                                   anomaly_or_exploratory = 'exploratory',
                                   time = FALSE,
                                   omop_or_pcornet = 'omop',
                                   code_type = 'cdm',
                                   code_domain = 'condition_occurrence',
                                   domain_tbl = scv_domain_tbl,
                                   concept_set = scv_concept_set) %>%
  suppressMessages()
#> ┌ Output Function Details ──────────────────────────────────────┐
#> │ You can optionally use this dataframe in the accompanying     │
#> │ `scv_output` function. Here are the parameters you will need: │
#> │                                                               │
#> │ Always Required: process_output, code_type                    │
#> │ Required for Check: num_codes, num_mappings                   │
#> │ Optional: vocab_tbl                                           │
#> │                                                               │
#> │ See ?scv_output for more details.                             │
#> └───────────────────────────────────────────────────────────────┘

scv_process_example
#> # A tibble: 1 × 10
#>   site     domain            concept_id source_concept_id    ct denom_concept_ct
#>   <chr>    <chr>                  <int>             <int> <int>            <int>
#> 1 combined condition_occurr…     320128            320128     5                5
#> # ℹ 4 more variables: denom_source_ct <int>, concept_prop <dbl>,
#> #   source_prop <dbl>, output_function <chr>

#' Execute `scv_output` function
scv_output_example <- scv_output(process_output = scv_process_example,
                                 code_type = 'cdm',
                                 vocab_tbl = NULL) %>%
  suppressMessages()

scv_output_example[[1]]


#' Easily convert the graph into an interactive ggiraph or plotly object with
#' `make_interactive_squba()`

make_interactive_squba(scv_output_example[[1]])