For demonstration purposes, we generate simulated data. The code and reference documents used to generate the data can be found here.
For this application, I created three data frames. To populate the application with your own data, you will need similar data frames.
-
'VAR_info'
This data frame contains information on every variable used in the application. Like a data dictionary, it includes the variable name, variable label, category (metadata or participant-level data), variable type, value range, and, for numeric variables, the minimum and maximum values.
-
‘ppt_all_fc’
‘ppt_all_fc’ is a data frame containing all participant-level data across studies, where all categorical variables were converted to factor variables.
The example syntax I used is:
… %>% mutate(across(c(SMOKESTAT, ALCSTAT), ~ factor(., levels = c(1, 2, 3), labels = c('Never', 'Ex', 'Current'))))
Also, note that NA values of factors were converted to an explicit factor level ('Missing') to control the display, using the following syntax:
… %>% mutate(across(c(ETHNICBACK, SEX, EDUHIGHS, MARISTAT, DECEASED), ~ fct_na_value_to_level(.x, level = 'Missing')))
-
‘studymeta’
Similarly, ‘studymeta’ is a data frame that contains all study-level data, where all categorical variables were converted to factor variables.
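To make the shape of these data frames more concrete, here is a minimal sketch, not the code used to build the shipped data. The toy values, the `STUDYID` and `AGE` columns, and all `VAR_info` column names are assumptions made for illustration. Only `SMOKESTAT`, `ALCSTAT`, the 1/2/3 coding, and the 'Missing' level come from the syntax above. It assumes forcats 1.0.0 or later, which provides `fct_na_value_to_level()`.

```r
library(dplyr)
library(forcats)

# Hypothetical participant-level data with numeric codes (1 = Never, 2 = Ex, 3 = Current)
ppt_raw <- data.frame(
  STUDYID   = c("S01", "S01", "S02", "S02"),  # hypothetical study identifier
  AGE       = c(54, 61, 47, NA),              # hypothetical numeric variable
  SMOKESTAT = c(1, 3, NA, 2),
  ALCSTAT   = c(2, 2, 1, NA),
  stringsAsFactors = FALSE
)

# Factor copy: label the codes, then turn NA into an explicit 'Missing' level
ppt_all_fc <- ppt_raw %>%
  mutate(across(c(SMOKESTAT, ALCSTAT),
                ~ factor(.x, levels = c(1, 2, 3), labels = c("Never", "Ex", "Current")))) %>%
  mutate(across(c(SMOKESTAT, ALCSTAT),
                ~ fct_na_value_to_level(.x, level = "Missing")))

# Hypothetical data dictionary; the column names are illustrative only
VAR_info <- data.frame(
  name     = c("AGE", "SMOKESTAT", "ALCSTAT"),
  label    = c("Age (years)", "Smoking status", "Alcohol status"),
  category = c("participant-level", "participant-level", "participant-level"),
  type     = c("numeric", "categorical", "categorical"),
  range    = c("18-100", "Never / Ex / Current", "Never / Ex / Current"),
  min      = c(18, NA, NA),   # only filled in for numeric variables
  max      = c(100, NA, NA),
  stringsAsFactors = FALSE
)

levels(ppt_all_fc$SMOKESTAT)
```

Keeping `ppt_raw` alongside `ppt_all_fc` gives both a numeric and a factor version of the same data, which is convenient when different plots need different types.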
- First, use stringi::stri_rand_lipsum to generate random lorem ipsum text, separate the strings by punctuation, and get a list of study full names (n = 30).
- Continents, countries, and country income level data source: World Bank
- Take random samples from the list of countries, grouped by continent.
- Use simstudy::genData to generate data based on self-defined distributions.
- Assume that studies from the same continent follow a similar distribution.
- Define variable distribution and missing patterns for one continent first
- For each continent, update the distribution and missing patterns both manually and with some randomness
- For each continent, generate a data pool using the updated distribution
- Define a function that takes a random sample from the above data pool for each study and validates the result, to ensure consistency between the metadata and the participant-level data for each study
- Combine data for each study into one
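The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the original generation script. The variable names (`AGE`, `SMOKESTAT`, `STUDYID`, `N`), the three continents, all distribution parameters and missing rates, and the helpers `make_def()`, `make_pool()` and `sample_study()` are hypothetical. Studies are assigned directly to a continent rather than to countries sampled from the World Bank list, and missingness is injected with base R as a stand-in for whatever missing patterns you define.

```r
library(stringi)
library(simstudy)

set.seed(2024)

# 1. Study names: split lorem ipsum on punctuation and keep 30 unique phrases
lipsum  <- stri_rand_lipsum(10)
phrases <- unlist(stri_split_regex(lipsum, "[.,;:!?]+\\s*", omit_empty = TRUE))
phrases <- stri_trim_both(phrases)
study_names <- head(unique(phrases[nchar(phrases) > 0]), 30)

# 2. Self-defined distributions (hypothetical parameters)
make_def <- function(age_mean, smoke_probs) {
  def <- defData(varname = "AGE", dist = "normal", formula = age_mean, variance = 100)
  defData(def, varname = "SMOKESTAT", dist = "categorical", formula = smoke_probs)
}

# 3. Per-continent settings, set manually (hypothetical values)
continents <- data.frame(
  continent   = c("Africa", "Asia", "Europe"),
  smoke_probs = c("0.5;0.25;0.25", "0.25;0.5;0.25", "0.25;0.25;0.5"),
  miss_rate   = c(0.10, 0.05, 0.02),
  stringsAsFactors = FALSE
)

# 4. One data pool per continent; the age mean gets a small random shift
make_pool <- function(i, n_pool = 5000) {
  def  <- make_def(age_mean = 60 + rnorm(1, sd = 3),
                   smoke_probs = continents$smoke_probs[i])
  pool <- as.data.frame(genData(n_pool, def))
  pool$SMOKESTAT[runif(n_pool) < continents$miss_rate[i]] <- NA  # simple missingness
  pool
}
pools <- lapply(seq_len(nrow(continents)), make_pool)
names(pools) <- continents$continent

# 5. Sample one study from its continent's pool and validate it against its metadata
sample_study <- function(pool, study_id, continent, n) {
  ppt <- pool[sample(nrow(pool), n), c("AGE", "SMOKESTAT")]
  ppt$STUDYID <- study_id
  meta <- data.frame(STUDYID = study_id, CONTINENT = continent, N = n,
                     stringsAsFactors = FALSE)
  stopifnot(nrow(ppt) == meta$N)  # metadata must agree with participant-level data
  list(meta = meta, ppt = ppt)
}

# 6. Combine all studies into one participant-level and one study-level data frame
study_continent <- sample(continents$continent, length(study_names), replace = TRUE)
studies <- Map(function(id, cont) sample_study(pools[[cont]], id, cont, sample(100:500, 1)),
               study_names, study_continent)
ppt_all   <- do.call(rbind, lapply(studies, `[[`, "ppt"))
studymeta <- do.call(rbind, lapply(studies, `[[`, "meta"))
rownames(ppt_all) <- NULL
rownames(studymeta) <- NULL
```

Drawing every study from a shared continent pool keeps studies within a continent similar, while the random shift in step 4 keeps continents from being identical.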
-
Set the factor levels properly
-
For both the metadata and the participant-level data, create two copies: one with factors and one that remains numeric, for different visualisation purposes.
-
Create a data frame with all variable information
-
With the following syntax, all required data are saved in the package's data/ folder, ready for use by the package.
usethis::use_data(VAR_info, overwrite = TRUE) # variable information
usethis::use_data(studymeta, overwrite = TRUE) # study metadata
usethis::use_data(ppt_all_fc, overwrite = TRUE) # participant-level data, converted to factors
The simstudy package was used to generate the simulated data.