
Commit fa230fc

finer edits
1 parent ce8066e commit fa230fc

File tree: 2 files changed, +36 -22 lines changed

vignettes/HPC-computing.Rmd

Lines changed: 17 additions & 6 deletions
@@ -157,6 +157,9 @@ Given the above specifications, you may decide that each of the 300 computing no
 rc <- 100 # number of times the design row was repeated
 Design300 <- expandDesign(Design, repeat_conditions = rc)
 Design300
+
+# compare the Design.IDs
+print(Design, show.IDs = TRUE)
 print(Design300, show.IDs = TRUE)
 
 # target replication number for each condition
@@ -438,9 +441,9 @@ scancel <jobid> # cancel a specific job
 scancel -u <username> # cancel all queued and running jobs for a specific user
 ```
 
-## My HPC cluster excution time/RAM is limited and terminates before the simulation is complete
+## My HPC cluster execution time is limited and terminates before the simulation is complete
 
-This issue is important whenever the HPC cluster has mandatory time/RAM limits for the job submissions, where the array job may not complete within the assigned resources --- hence, if not properly managed, will discard any valid replication information when abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to be perfectly knowledgeable about this information beforehand) or how large the final objects will grow as the simulation progresses.
+This issue is important whenever the HPC cluster has mandatory time/RAM limits for the job submissions, where the array job may not complete within the assigned resources --- hence, if not properly managed, any valid replication information will be discarded when the job is abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to know this information perfectly beforehand).
 
 To avoid this time/resource waste it is **strongly recommended** to add a `max_time` argument to the `control` list (see `help(runArraySimulation)` for supported specifications) which is less than the Slurm specifications. This control flag will halt the `runArraySimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if the argument is *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting this to around 90-95% of the respective `#SBATCH --time=` input should, however, be sufficient in most cases.
 
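For orientation (not part of this commit), a minimal sketch of how the `max_time` control described above might be added to a `runArraySimulation()` call; the non-`control` arguments are assumed to mirror the vignette's earlier setup, and the specific time values are placeholders for a 12-hour Slurm allocation.

```r
# Sketch only: halt each array job safely before the Slurm limit so completed
# replications are saved. Assumes #SBATCH --time=12:00:00; check
# help(runArraySimulation) for the accepted max_time formats.
res <- runArraySimulation(design = Design300, replications = replications,
                          generate = Generate, analyse = Analyse,
                          summarise = Summarise, iseed = iseed,
                          arrayID = getArrayID(type = 'slurm'),
                          filename = 'mysim', dirname = 'mysimfiles',
                          control = list(max_time = "11:00:00"))  # ~90-95% of 12h
```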
@@ -491,6 +494,8 @@ Missed
 2 30 2000 10000
 ```
 
+### Create new conditions for missing replications, and use `rbindDesign()`
+
 Next, build a new simulation structure containing only the missing information components.
 
 ```{r include=FALSE}
@@ -504,6 +509,8 @@ subDesign <- Design[c(1,3),]
 replications_missed <- subset(Missed, select=MISSED_REPLICATIONS)
 ```
 
+Notice that the `Design.ID` terms below are associated with the problematic conditions in the original `Design` object.
+
 ```{r}
 print(subDesign, show.IDs = TRUE)
 replications_missed
@@ -513,7 +520,7 @@ At this point, you can return to the above logic of organizing the simulation sc
 original `Design` object in its construction so that the internal
 `Design.ID` attributes are properly tracked.
 
-Finally, we now glue on the new `subDesign` information to the original expanded version using `rbindDesign()` with `keep.IDs = TRUE`, though telling the scheduler to only evaluate these new rows in the `#SBATCH --array` specification (this is technically unnecessary, but is conceptually clear and keeps all simulation files and array IDs consistent).
+Finally, the new `subDesign` information is row-bound to the original expanded version using `rbindDesign()` with `keep.IDs = TRUE` (the default), while telling the scheduler to evaluate only these new rows in the `#SBATCH --array` specification.
 
 ```{r}
 rc <- 50
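The R chunk is truncated by the diff context here; a hedged sketch of how the expansion and row-binding might look, based on the functions named above (the object names `subDesign100` and `Design400` are illustrative, and the actual chunk in the vignette may differ).

```r
# Hypothetical continuation: expand the two missed conditions across 50 new
# arrays each (100 new rows) and append them to the original 300-row expanded
# design, preserving the Design.ID attributes.
rc <- 50
subDesign100 <- expandDesign(subDesign, repeat_conditions = rc)
Design400 <- rbindDesign(Design300, subDesign100, keep.IDs = TRUE)
print(Design400, show.IDs = TRUE)
```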
@@ -535,9 +542,11 @@ iseed <- 1276149341
 ```
 Again, this approach simply expands the original simulation with 300 array jobs to
 one with 400 array jobs as though the added structure was an intended part of the
-initial design (which is obviously wasn't, but is organized as such).
+initial design (which it obviously wasn't, but is organized as such). This also ensures that the random number generation is properly accounted for, as the new conditions to evaluate will be uncorrelated with the previous array evaluation jobs.
 
-Finally, in the `.slurm` submission file you no longer want to evaluate the first 1-300 cases,
+### Submit the new job, evaluating only the new conditions
+
+Finally, in your new `.slurm` submission file you no longer want to evaluate the first 1-300 cases,
 as these `.rds` files have already been evaluated, and instead want to change the `--array` line from
 
 ```
@@ -547,7 +556,9 @@ to
 ```
 #SBATCH --array=301-400
 ```
-Submit this job to compute all the missing replication information, which stores these files into the same working directory but with the new information stored as `mysim-301.rds` through `mysim-400.rds`. In this example, there will now be a total of 400 files that have been saved. Once complete, run
+Submit this job to compute all the missing replication information, which stores these files into the same working directory but with the new information stored as `mysim-301.rds` through `mysim-400.rds`. In this example, there will now be a total of 400 files that have been saved.
+
+Once complete, run
 ```{r eval=FALSE}
 # See if any missing still
 SimCollect('mysimfiles', check.only=TRUE)
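Once the check above reports nothing missing, the saved files would presumably be combined with the same function without the flag; a minimal, hedged sketch:

```r
# Aggregate all 400 mysim-*.rds files into a single results object
Final <- SimCollect('mysimfiles')
Final
```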

vignettes/SimDesign-intro.Rmd

Lines changed: 19 additions & 16 deletions
@@ -31,14 +31,15 @@ par(mar=c(3,3,1,1)+.1)
 
 Whether you are interested in evaluating the performance of a new optimizer or estimation criteria,
 re-evaluating previous research claims (e.g., ANOVA is 'robust' to violations of normality),
-determine power rates for an upcoming research proposal,
-or simply to appease a strange thought in your head about a new statistical idea you heard about,
+want to determine power rates for an upcoming research proposal (cf. the `Spower` package),
+or simply wish to appease a strange thought in your head about a new statistical idea you heard about,
 designing Monte Carlo simulations can be incredibly rewarding and are extremely important to those who are statistically oriented.
+
 However, organizing simulations can be a challenge, particularly to those new to the topic,
-where all too often coders resort to the inefficient and error prone strategies (e.g., the dreaded
+where all too often investigators resort to inefficient and error-prone strategies (e.g., the dreaded
 "for-loop" strategy, *for*-ever resulting in confusing, error prone, and simulation specific code).
 The package `SimDesign` is one attempt to fix these and other issues that often arise when designing Monte Carlo simulation experiments, while also providing a templated setup that is designed to support many
-useful features that can be useful when evaluating simulation research.
+features that are useful when evaluating simulation research, for novice and advanced users alike.
 
 Generally speaking, Monte Carlo simulations can be broken into three major components:
 
@@ -61,8 +62,8 @@ options(digits = 2)
 ```
 
 After loading the `SimDesign` package, we begin by defining the required user-constructed functions. To expedite this process,
-a call to `SimFunctions()` will create a template to be filled in, where all the necessary functional arguments have been pre-assigned, and only the body of the functions need to be modified. The documentation of each argument can be found in the respective
-R help files, however there organization is very simple conceptually.
+a call to `SimFunctions()` can be used to create a suitable template, where all the necessary functional arguments have been pre-assigned and only the body of each function needs to be modified. The documentation of each argument can be found in the respective
+R help files; however, the organization is conceptually simple.
 
 To begin, the following code should be copied and saved to an external source (i.e., text) file.
 
@@ -73,7 +74,7 @@ SimFunctions()
 
 Alternatively, if you are lazy (read: efficient) or just don't like copy-and-pasting, `SimFunctions()` can write the output to a file
 by providing a `filename` argument. The following creates a file (`mysim.R`) containing the simulation
-design/execution and required user-defined functions.
+design/execution and required user-defined functions. For RStudio users, this will also automatically open the file in a new editor window.
 
 ```{r eval=FALSE}
 SimDesign::SimFunctions('mysim')
@@ -92,7 +93,7 @@ portions of the code (e.g., one analyse function for fitting and extracting comp
 
 # Simulation: Determine estimator efficiency
 
-As a toy example, let's consider how the following question can be investigated with `SimDesign`:
+As a toy example, let's consider the following investigation using `SimDesign`:
 
 *Question*: How does trimming affect recovering the mean of a distribution? Investigate this using
 different sample sizes with Gaussian and $\chi^2$ distributions. Also, demonstrate the effect of using the
@@ -102,19 +103,19 @@ median to recover the mean.
 
 First, define the condition combinations that should be investigated. In this case we wish to study
 4 different sample sizes, and use a symmetric and skewed distribution. The use of `createDesign()` is
-extremely helpful here to create a completely crossed-design for each combination (there are 8 in total).
+required to create a completely crossed-design for each combination (there are 8 in total).
 
 ```{r}
 Design <- createDesign(sample_size = c(30, 60, 120, 240),
                        distribution = c('norm', 'chi'))
 Design
 ```
 
-Each row in `Design` represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where $N=30$ and the distribution should be normal.
+Each row in `Design` represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where $N=30$ and the distribution is from the Gaussian/normal family.
 
 ## Define the functions
 
-We first start by defining the data generation functional component. The only argument accepted by this function is `condition`, which will always be a *single row from the Design data.frame object* of class `data.frame`. Conditions are run sequentially from row 1 to the last row in `Design`. It is also
+We first start by defining the data generation functional component. The only argument accepted by this function is `condition`, which will always be a *single row from the Design data.frame object*. Conditions are run sequentially from row 1 to the last row in `Design`. It is also
 possible to pass a `fixed_objects` object to the function for including fixed sets of population parameters and other conditions, however for this simple simulation this input is not required.
 
 ```{r}
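The body of this generate chunk is cut off by the diff; a hedged reconstruction consistent with the surrounding description (population mean of 3, 'norm' versus 'chi' conditions) is sketched below, though the actual vignette code may differ.

```r
Generate <- function(condition, fixed_objects) {
    # condition is a single row of Design; draw N observations with
    # population mean 3 from the requested distribution family
    N <- condition$sample_size
    if (condition$distribution == 'norm') {
        dat <- rnorm(N, mean = 3)
    } else {
        dat <- rchisq(N, df = 3)  # chi-squared with df = 3 has mean 3
    }
    dat
}
```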
@@ -131,7 +132,7 @@ Generate <- function(condition, fixed_objects) {
 ```
 
 As we can see above, `Generate()` will return a numeric vector of length $N$ containing the data to
-be analysed each with a population mean of 3 (because a $\chi^2$ distribution has a mean equal to its df).
+be analysed, each with a population mean of 3 (because a $\chi^2$ distribution has a mean equal to its df).
 Next, we define the `analyse` component to analyse said data:
 
 ```{r}
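The analyse chunk body is likewise omitted here; a hedged sketch returning four named estimates of the mean (the specific trimming proportions are assumptions based on the stated research question):

```r
Analyse <- function(condition, dat, fixed_objects) {
    # four named estimators of the population mean
    c(mean   = mean(dat),
      trim.1 = mean(dat, trim = 0.1),
      trim.2 = mean(dat, trim = 0.2),
      median = median(dat))
}
```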
@@ -147,7 +148,7 @@ Analyse <- function(condition, dat, fixed_objects) {
 ```
 
 This function accepts the data previously returned from `Generate()` (`dat`), the condition vector previously
-mentioned.
+mentioned, and returns 4 named elements. Note that the element names do not have to be constant across the row conditions; however, it will often make conceptual sense to keep them so.
 
 At this point, we may conceptually think of the first two functions as being evaluated independently $R$ times to obtain
 $R$ sets of results. In other words, if we wanted the number of replications to be 100, the first two functions
@@ -181,11 +182,13 @@ a rectangular form, such as in a `matrix`, `data.frame`, or `tibble`. Well, you'
 ## Putting it all together
 
 The last stage of the `SimDesign` work-flow is to pass the four defined elements to the `runSimulation()`
-function which, unsurprisingly given it's name, runs the simulation. There are numerous options available in the
+function which, unsurprisingly given its name, runs the simulation.
+
+There are numerous options available in the
 function, and these should be investigated by reading the `help(runSimulation)` HTML file. Options for
 performing simulations in parallel, storing/resuming temporary results, debugging functions,
 and so on are available. Below we simply request that each condition be run 1000 times on a
-single processor, and finally store the results to an object called `results`.
+single processor, and finally store the results to an object called `res`.
 
 ```{r include=FALSE}
 set.seed(1234)
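The truncated chunk presumably contains the main call; a minimal sketch consistent with the prose (1000 replications per condition on a single processor, stored in `res`):

```r
res <- runSimulation(design = Design, replications = 1000,
                     generate = Generate, analyse = Analyse,
                     summarise = Summarise)
res
```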
@@ -282,7 +285,7 @@ to obtain average estimates, their associated sampling error, their efficiency,
 Summarise(condition, results)
 ```
 
-This process is then repeated for each row in the `Design` object until the entire simulation study
+This process is then repeated for each row `condition` in the `Design` object until the entire simulation study
 is complete.
 
 Of course, `runSimulation()` does much more than this conceptual outline, which is why it exists. Namely, errors and warnings are controlled and tracked, data is re-drawn when needed, parallel processing is supported, debugging is easier with the `debug` input (or by inserting `browser()` directly), temporary and full results can be saved to external files, the simulation state can be saved/restored, built-in safety features are included, and more. The point, however, is that you as the user *should not be bogged down with the nitty-gritty details of setting up the simulation work-flow/features*; instead, you should be focusing your time on the important generate-analyse-summarise steps, organized in the body of the above functions, that are required to obtain your interesting simulation results. After all, the point of designing a computer simulation experiment is to understand the resulting output, not to become a master of all aspects of your select computing language pertaining to object storage, parallel processing, RAM storage, defensive coding, progress reporting, reproducibility, post-processing, ..., ad nauseam.
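The `Summarise()` definition itself falls outside the hunks shown; a hedged sketch of one way to obtain the average estimates, sampling error, and efficiency mentioned in the hunk header, using base R rather than SimDesign's built-in helpers:

```r
Summarise <- function(condition, results, fixed_objects) {
    # results holds the R x 4 matrix of replicated estimates; the population
    # mean is 3 in every condition
    c(bias = colMeans(results) - 3,
      SE   = apply(results, 2, sd),
      RMSE = sqrt(colMeans((results - 3)^2)))
}
```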
