```
scancel <jobid>         # cancel a specific job
scancel -u <username> # cancel all queued and running jobs for a specific user
```
## My HPC cluster execution time is limited and terminates before the simulation is complete
This issue is important whenever the HPC cluster has mandatory time/RAM limits for job submissions, where the array job may not complete within the assigned resources --- hence, if not properly managed, any valid replication information will be discarded when the job is abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to be perfectly knowledgeable about this information beforehand).
To avoid this time/resource waste it is **strongly recommended** to add a `max_time` argument to the `control` list (see `help(runArraySimulation)` for supported specifications) which is less than the Slurm specifications. This control flag will halt the `runArraySimulation()` executions early and return only the complete simulation results up to this point. However, this will only work if the argument is *non-trivially less than the allocated Slurm resources*; otherwise, you'll run the risk that the job terminates before the `SimDesign` functions have the chance to store the successfully completed replications. Setting this to around 90-95% of the respective `#SBATCH --time=` input should, however, be sufficient in most cases.
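For example, assuming a Slurm allocation of `#SBATCH --time=04:00:00`, a corresponding call might be sketched as follows. The design objects, `iseed`, and the exact `max_time` specification are placeholders here; see `help(runArraySimulation)` for the supported formats.

```{r eval=FALSE}
library(SimDesign)

# sketch only: Design, Generate, Analyse, Summarise, and iseed are
# assumed to be defined earlier in the submission script
arrayID <- getArrayID(type = 'slurm')

# halt early at roughly 92% of the four-hour Slurm limit so that the
# completed replications are safely stored before the scheduler kills the job
runArraySimulation(design = Design, replications = 1000,
                   generate = Generate, analyse = Analyse,
                   summarise = Summarise, iseed = iseed,
                   arrayID = arrayID, filename = 'mysim',
                   control = list(max_time = "03:40:00"))
```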
```
Missed
2 30 2000 10000
```
### Create new conditions for missing replications, and use `rbindDesign()`
Next, build a new simulation structure containing only the missing information components.
Notice that the `Design.ID` terms below are associated with the problematic conditions in the original `Design` object.
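For instance, if rows 2 and 5 of the original `Design` were flagged as incomplete, the subset might be constructed along these lines (the row indices are hypothetical):

```{r eval=FALSE}
# hypothetical rows flagged as incomplete in the summary above;
# subsetting a Design object retains the associated Design.ID attributes
subDesign <- Design[c(2, 5), ]
```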
```{r}
print(subDesign, show.IDs = TRUE)
replications_missed
```

At this point, you can return to the above logic of organizing the simulation scripts, making sure to use the original `Design` object in its construction so that the internal `Design.ID` attributes are properly tracked.
Finally, the new `subDesign` information is row-bound to the original expanded version using `rbindDesign()` with `keep.IDs = TRUE` (the default), while telling the scheduler to only evaluate these new rows in the `#SBATCH --array` specification.
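The key step can be sketched as follows, where the object names are placeholders for the expanded design constructed earlier and the expanded version of the new subset:

```{r eval=FALSE}
# keep.IDs = TRUE is the default, retaining the original Design.ID attributes
Design_total <- rbindDesign(Design_expanded, subDesign_expanded, keep.IDs = TRUE)
nrow(Design_total)   # in this example, now 400 rows of array jobs
```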
```{r}
rc <- 50
iseed <- 1276149341
```
Again, this approach simply expands the original simulation with 300 array jobs to
one with 400 array jobs as though the added structure was an intended part of the
initial design (which it obviously wasn't, but is organized as such). This also ensures that the random number generation is properly accounted for, as the new conditions to evaluate will be uncorrelated with the previous array evaluation jobs.
### Submit the new job, evaluating only the new conditions
Finally, in your new `.slurm` submission file you no longer want to evaluate the first 1-300 cases,
as these `.rds` files have already been evaluated, and instead want to change the `--array` line from
```
#SBATCH --array=1-300
```

to
```
#SBATCH --array=301-400
```
Submit this job to compute all the missing replication information, which saves the results into the same working directory as `mysim-301.rds` through `mysim-400.rds`. In this example, there will now be a total of 400 saved files.
`vignettes/SimDesign-intro.Rmd`
Whether you are interested in evaluating the performance of a new optimizer or estimation criteria,
re-evaluating previous research claims (e.g., ANOVA is 'robust' to violations of normality),
want to determine power rates for an upcoming research proposal (cf. the `Spower` package),
or simply wish to appease a strange thought in your head about a new statistical idea you heard about,
designing Monte Carlo simulations can be incredibly rewarding and extremely important to those who are statistically oriented.
However, organizing simulations can be a challenge, particularly to those new to the topic,
where all too often investigators resort to inefficient and error-prone strategies (e.g., the dreaded
"for-loop" strategy, *for*-ever resulting in confusing, error-prone, and simulation-specific code).
The package `SimDesign` is one attempt to fix these and other issues that often arise when designing Monte Carlo simulation experiments, while also providing a templated setup that is designed to support many
useful features when evaluating simulation research, for both novice and advanced users.
Generally speaking, Monte Carlo simulations can be broken into three major components:
```{r include=FALSE}
options(digits = 2)
```
After loading the `SimDesign` package, we begin by defining the required user-constructed functions. To expedite this process,
a call to `SimFunctions()` can be used to create a suitable template, where all the necessary functional arguments have been pre-assigned and only the body of the functions needs to be modified. The documentation of each argument can be found in the respective
R help files; however, the organization is conceptually simple.
To begin, the following code should be copied and saved to an external source (i.e., text) file.
```{r eval=FALSE}
SimFunctions()
```
Alternatively, if you are lazy (read: efficient) or just don't like copy-and-pasting, `SimFunctions()` can write the output to a file
by providing a `filename` argument. The following creates a file (`mysim.R`) containing the simulation
design/execution and required user-defined functions. For RStudio users, this will also automatically open the file in a new coding window.
```{r eval=FALSE}
SimDesign::SimFunctions('mysim')
```
# Simulation: Determine estimator efficiency
As a toy example, let's consider the following investigation using `SimDesign`:
*Question*: How does trimming affect recovering the mean of a distribution? Investigate this using
different sample sizes with Gaussian and $\chi^2$ distributions. Also, demonstrate the effect of using the
median to recover the mean.
First, define the condition combinations that should be investigated. In this case we wish to study
4 different sample sizes, and use a symmetric and skewed distribution. The use of `createDesign()` is
required to create a completely crossed design for each combination (there are 8 in total).
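A sketch of such a fully-crossed design follows; the condition names and levels here are illustrative of a setup with four sample sizes and two distribution families:

```{r eval=FALSE}
library(SimDesign)

# fully crossed: 4 sample sizes x 2 distributions = 8 conditions
Design <- createDesign(sample_size = c(30, 60, 120, 240),
                       distribution = c('norm', 'chi'))
Design
```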
Each row in `Design` represents a unique condition to be studied in the simulation. In this case, the first condition to be studied comes from row 1, where $N=30$ and the distribution is from the Gaussian/normal family.
## Define the functions
We first start by defining the data generation functional component. The only argument accepted by this function is `condition`, which will always be a *single row from the Design data.frame object*. Conditions are run sequentially from row 1 to the last row in `Design`. It is also
possible to pass a `fixed_objects` object to the function for including fixed sets of population parameters and other conditions, however for this simple simulation this input is not required.
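A generate component along these lines might be sketched as follows, assuming the `sample_size` and `distribution` condition names from a design like the one above (the distributional parameters are illustrative, chosen so both families share a population mean of 3):

```{r eval=FALSE}
Generate <- function(condition, fixed_objects) {
    N <- condition$sample_size
    # both populations have mean 3, so the mean-recovery target is the same
    dat <- switch(condition$distribution,
                  norm = rnorm(N, mean = 3),
                  chi  = rchisq(N, df = 3))
    dat
}
```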
The analyse function, in turn, accepts the data previously returned from `Generate()` (`dat`) along with the condition vector previously
mentioned, and returns 4 named elements. Note that the element names do not have to be constant across the row-conditions; however, it will often make conceptual sense to do so.
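For instance, an analyse component returning four named statistics per replication might be sketched as:

```{r eval=FALSE}
Analyse <- function(condition, dat, fixed_objects) {
    # four named estimators of the population mean
    ret <- c(mean = mean(dat),
             trim.1 = mean(dat, trim = .1),
             trim.2 = mean(dat, trim = .2),
             median = median(dat))
    ret
}
```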
At this point, we may conceptually think of the first two functions as being evaluated independently $R$ times to obtain
$R$ sets of results. In other words, if we wanted the number of replications to be 100, the first two functions
## Putting it all together
The last stage of the `SimDesign` work-flow is to pass the four defined elements to the `runSimulation()`
function which, unsurprisingly given its name, runs the simulation.
There are numerous options available in the
function, and these should be investigated by reading the `help(runSimulation)` HTML file. Options for
performing simulations in parallel, storing/resuming temporary results, debugging functions,
and so on are available. Below we simply request that each condition be run 1000 times on a
single processor, and finally store the results to an object called `res`.
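A minimal sketch of such a call, assuming the design and functions defined above:

```{r eval=FALSE}
# 1000 replications per design row, run sequentially on one processor
res <- runSimulation(design = Design, replications = 1000,
                     generate = Generate, analyse = Analyse,
                     summarise = Summarise)
res
```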
```{r include=FALSE}
set.seed(1234)
```

to obtain average estimates, their associated sampling error, their efficiency, and so on:

```{r eval=FALSE}
Summarise(condition, results)
```
This process is then repeated for each row `condition` in the `Design` object until the entire simulation study
is complete.
Of course, `runSimulation()` does much more than this conceptual outline, which is why it exists. Namely, errors and warnings are controlled and tracked, data is re-drawn when needed, parallel processing is supported, debugging is easier with the `debug` input (or by inserting `browser()` directly), temporary and full results can be saved to external files, the simulation state can be saved/restored, built-in safety features are included, and more. The point, however, is that you as the user *should not be bogged down with the nitty-gritty details of setting up the simulation work-flow/features*; instead, you should be focusing your time on the important generate-analyse-summarise steps, organized in the body of the above functions, that are required to obtain your interesting simulation results. After all, the point of designing a computer simulation experiment is to understand the resulting output, not to become a master of all aspects of your selected computing language pertaining to object storage, parallel processing, RAM storage, defensive coding, progress reporting, reproducibility, post-processing, ..., ad nauseam.