|
| 1 | +--- |
| 2 | +title: "Customized Distributions" |
| 3 | +output: rmarkdown::html_vignette |
| 4 | +vignette: > |
| 5 | + %\VignetteIndexEntry{Customized Distributions} |
| 6 | + %\VignetteEngine{knitr::rmarkdown} |
| 7 | + \usepackage[utf8]{inputenc} |
| 8 | +--- |
| 9 | + |
| 10 | +```{r chunkname, echo=-1} |
| 11 | +data.table::setDTthreads(2) |
| 12 | +``` |
| 13 | + |
| 14 | +```{r, echo = FALSE, message = FALSE} |
| 15 | +library(simstudy) |
| 16 | +library(ggplot2) |
| 17 | +library(scales) |
| 18 | +library(grid) |
| 19 | +library(gridExtra) |
| 20 | +library(survival) |
| 21 | +library(gee) |
| 22 | +library(data.table) |
| 23 | +library(ordinal) |
| 24 | +
|
| 25 | +odds <- function (p) p/(1 - p) # TODO temporary remove when added to package |
| 26 | +plotcolors <- c("#B84226", "#1B8445", "#1C5974") |
| 27 | +
|
| 28 | +cbbPalette <- c("#B84226","#B88F26", "#A5B435", "#1B8446", |
| 29 | + "#B87326","#B8A526", "#6CA723", "#1C5974") |
| 30 | +
|
| 31 | +ggtheme <- function(panelback = "white") { |
| 32 | + |
| 33 | + ggplot2::theme( |
| 34 | + panel.background = element_rect(fill = panelback), |
| 35 | + panel.grid = element_blank(), |
| 36 | + axis.ticks = element_line(colour = "black"), |
| 37 | + panel.spacing =unit(0.25, "lines"), # requires package grid |
| 38 | + panel.border = element_rect(fill = NA, colour="gray90"), |
| 39 | + plot.title = element_text(size = 8,vjust=.5,hjust=0), |
| 40 | + axis.text = element_text(size=8), |
| 41 | + axis.title = element_text(size = 8) |
| 42 | + ) |
| 43 | + |
| 44 | +} |
| 45 | +
|
| 46 | +``` |
| 47 | + |
| 48 | +Custom distributions can be specified in `defData` and `defDataAdd` by setting the argument *dist* to "custom". When defining a custom distribution, you provide the name of the user-defined function as a string in the *formula* argument. The arguments of the custom function are listed in the *variance* argument, separated by commas and formatted as "**arg_1 = val_form_1, arg_2 = val_form_2, $\dots$, arg_K = val_form_K**". |
| 49 | + |
| 50 | +Here, the *arg_k's* represent the names of the arguments passed to the customized function, where $k$ ranges from $1$ to $K$. You can use values or formulas for each *val_form_k*. If formulas are used, ensure that the variables have been previously generated. Double dot notation is available in specifying *value_formula_k*. One important requirement of the custom function is that the parameter list used to define the function must include an argument"**n = n**", but do not include $n$ in the definition as part of `defData` or `defDataAdd`. |
| 51 | + |
| 52 | +### Example 1 |
| 53 | + |
| 54 | +Here is an example where we would like to generate data from a zero-inflated beta distribution. In this case, there is a user-defined function `zeroBeta` that takes on shape parameters $a$ and $b$, as well as $p_0$, the proportion of the sample that is zero. Note that the function also takes an argument $n$ that will not to be be specified in the data definition; $n$ will represent the number of observations being generated: |
| 55 | + |
| 56 | +```{r} |
| 57 | +zeroBeta <- function(n, a, b, p0) { |
| 58 | + betas <- rbeta(n, a, b) |
| 59 | + is.zero <- rbinom(n, 1, p0) |
| 60 | + betas*!(is.zero) |
| 61 | +} |
| 62 | +``` |
| 63 | + |
| 64 | +The data definition specifies a new variable $zb$ that sets $a$ and $b$ to 0.75, and $p_0 = 0.02$: |
| 65 | + |
| 66 | +```{r} |
| 67 | +def <- defData( |
| 68 | + varname = "zb", |
| 69 | + formula = "zeroBeta", |
| 70 | + variance = "a = 0.75, b = 0.75, p0 = 0.02", |
| 71 | + dist = "custom" |
| 72 | +) |
| 73 | +``` |
| 74 | + |
| 75 | +The data are generated: |
| 76 | + |
| 77 | +```{r} |
| 78 | +set.seed(1234) |
| 79 | +dd <- genData(100000, def) |
| 80 | +``` |
| 81 | + |
| 82 | +```{r, echo = FALSE} |
| 83 | +dd |
| 84 | +``` |
| 85 | + |
| 86 | +A plot of the data reveals dis-proportion of zero's: |
| 87 | + |
| 88 | +```{r, fig.width = 6, fig.height = 3, echo = FALSE} |
| 89 | +ggplot(data = dd, aes(x = zb)) + |
| 90 | + geom_histogram(binwidth = 0.01, boundary = 0, fill = "grey60") + |
| 91 | + theme(panel.grid = element_blank()) |
| 92 | +``` |
| 93 | + |
| 94 | +### Example 2 |
| 95 | + |
| 96 | +In this second example, we are generating sets of truncated Gaussian distributions with means ranging from $-1$ to $1$. The limits of the truncation vary across three different groups. `rnormt` is a customized (user-defined) function that generates the truncated Gaussiandata. The function requires four arguments (the left truncation value, the right truncation value, the distribution average and the standard deviation). |
| 97 | + |
| 98 | +```{r} |
| 99 | +rnormt <- function(n, min, max, mu, s) { |
| 100 | + |
| 101 | + F.a <- pnorm(min, mean = mu, sd = s) |
| 102 | + F.b <- pnorm(max, mean = mu, sd = s) |
| 103 | + |
| 104 | + u <- runif(n, min = F.a, max = F.b) |
| 105 | + qnorm(u, mean = mu, sd = s) |
| 106 | + |
| 107 | +} |
| 108 | +``` |
| 109 | + |
| 110 | + |
| 111 | +In this example, truncation limits vary based on group membership. Initially, three groups are created, followed by the generation of truncated values. For Group 1, truncation occurs within the range of $-1$ to $1$, for Group 2, it's $-2$ to $2$ and for Group 3, it's $-3$ to $3$. We'll generate three data sets, each with a distinct mean denoted by M, using the double-dot notation to implement these different means. |
| 112 | + |
| 113 | +```{r} |
| 114 | +def <- |
| 115 | + defData( |
| 116 | + varname = "limit", |
| 117 | + formula = "1/4;1/2;1/4", |
| 118 | + dist = "categorical" |
| 119 | + ) |> |
| 120 | + defData( |
| 121 | + varname = "tn", |
| 122 | + formula = "rnormt", |
| 123 | + variance = "min = -limit, max = limit, mu = ..M, s = 1.5", |
| 124 | + dist = "custom" |
| 125 | + ) |
| 126 | +``` |
| 127 | + |
| 128 | +The data generation requires three calls to `genData`. The output is a list of three data sets: |
| 129 | + |
| 130 | +```{r} |
| 131 | +mus <- c(-1, 0, 1) |
| 132 | +dd <-lapply(mus, function(M) genData(100000, def)) |
| 133 | +``` |
| 134 | + |
| 135 | +Here are the first six observations from each of the three data sets: |
| 136 | + |
| 137 | +```{r, echo=FALSE} |
| 138 | +lapply(dd, function(D) head(D)) |
| 139 | +``` |
| 140 | + |
| 141 | +A plot highlights the group differences. |
| 142 | + |
| 143 | +```{r, fig.width = 8, fig.height = 6, echo = FALSE} |
| 144 | +pfunc <- function(dx, i) { |
| 145 | + ggplot(data = dx, aes(x = tn)) + |
| 146 | + geom_histogram(aes(fill = factor(limit)), binwidth = 0.05, boundary = 0, alpha = .8) + |
| 147 | + facet_grid( ~ limit) + |
| 148 | + theme(panel.grid = element_blank(), |
| 149 | + legend.position = "none") + |
| 150 | + scale_fill_manual(values = plotcolors) + |
| 151 | + scale_x_continuous(breaks = seq(-3, 3, by =1)) + |
| 152 | + scale_y_continuous(limits = c(0, 1000)) + |
| 153 | + ggtitle(paste("mu =", mus[i])) |
| 154 | +} |
| 155 | +
|
| 156 | +plist <- lapply(seq_along(dd), function(a) pfunc(dd[[a]], a)) |
| 157 | +grid.arrange(grobs = plist, nrow = 3) |
| 158 | +``` |
| 159 | + |
0 commit comments