r-pkgs/r.rmd at master · Smudgerville/r-pkgs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
layout: default
title: R code
output: bookdown::html_chapter
---

# R code

The most important part of a package is the `R/` directory --- it contains all of your R code! Even if you do nothing else, putting your R files in this directory is valuable because it gives you access to some useful tools.

In this chapter you'll learn:

* How to organise the R code in your package.
* How to make a fluid package development workflow.
* What happens when you install a package.
* The difference between a library and package.

## Getting started {#getting-started}

The easiest way to get started with a package is to run `devtools::create("path/to/package/pkgname")`. This makes the package directory, `path/to/package/pkgname/`, and adds four items to make the smallest usable package:

1. An RStudio project file, `pkgname.Rproj`.
1. An `R/` directory.
1. A basic `DESCRIPTION` file.
1. A basic `NAMESPACE` file.

In this chapter, you'll learn about the `R/` directory and the RStudio project file. Ignore the other files for now: you'll learn about `DESCRIPTION` in [package metadata](#description) and `NAMESPACE` in [namespaces](#namespace).

__Never__ use `package.skeleton()` to create a package. It's designed for an older era of package development, and mostly serves to make your life harder, not easier. Currently I don't recommend using RStudio's "create a new package" tool because it uses `package.skeleton()`. That will be fixed by the time the book is published.

The first principle of using a package is that all R code goes in `R/`. If you have existing code for your new package, now's a good time to copy it into `R/`.

## RStudio projects {#projects}

To get started with your new package in RStudio, double-click the `pkgname.Rproj` file that `create()` just made. This will open a new RStudio project for your package. Projects are a great way to develop packages because:

*   Each project is isolated; unrelated things are kept unrelated.

*   You get handy code navigation tools like `F2` to jump to a function
    definition and `Ctrl + .` to look up functions by name.

*   You get useful keyboard shortcuts for common package development tasks.
    You'll learn about them throughout the book. But to see them all, press
    Alt + Shift + K or use the Help | Keyboard shortcuts menu.

    ```{r, echo = FALSE}
    bookdown::embed_png("screenshots/keyboard-shortcuts.png")
    ```

(If you want to learn more RStudio tips and tricks, follow @[rstudiotips](https://twitter.com/rstudiotips) on twitter.)

`create()` makes an `.Rproj` file for you. If you have an existing package that doesn't have a `.Rproj` file, you can use `devtools::use_rstudio("path/to/package")` to add it. If you don't use RStudio, you can get many of the benefits by starting a new R session and ensuring the working directory is set to the package directory.

`.Rproj` files are just text files. The project file created by devtools looks like this:

```
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
Encoding: UTF-8

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace
```

Never modify this file by hand. Instead, use the friendly project options dialog, accessible from the projects menu in the top-right corner of RStudio.

```{r, echo = FALSE}
bookdown::embed_png("screenshots/project-options-1.png", dpi = 220)
bookdown::embed_png("screenshots/project-options-2.png", dpi = 220)
```

## Organising and running code {#r-code}

The first advantage of using a package is that it's easy to re-load your code. There are two main options for re-loading your code:

* `devtools::load_all()`, reloads all code in the package. In RStudio access via __Cmd + Shift + L__,
  which also saves all open files, saving you a keystroke.

* Build & reload, only available in RStudio via __Cmd + Shift + B__. This installs the package, restarts R, and then reloads the package with
  `library()` (doing this by hand is painful).

These commands lead to a fluid development workflow:

1. Edit R files in the editor.

1. Press Cmd + Shift + L (or Cmd + Shift + B).

1. Explore the code in the console.

1. Rinse and repeat.

While you're free to arrange functions into files as you wish, the two extremes are bad: don't put all functions into one file and don't put each function into its own separate file. (It's OK if some files only contain one function, particularly if the function is large or has a lot of documentation.).

My rule of thumb is that if I can't remember the name of the file where a function lives, I need to either separate the functions into more files or give the file a better name. (Unfortunately you can't use subdirectories inside `R/`. The next best thing is to use a common prefix, e.g., `abc-*.R`.).

The arrangement of functions within files is less important if you master two important RStudio keyboard shortcuts that let you jump to the definition of a function:

*   Click a function name in code and press __F2__.

*   Press __Ctrl + .__ then start typing the name.

    ```{r, echo = FALSE}
    bookdown::embed_png("screenshots/file-finder.png", dpi = 220)
    ```

After navigating to a function using one of these tools, you can go back to where you were by clicking the back arrow at the top-left of the editor (`r bookdown::embed_png("screenshots/arrows.png", dpi = 240)`), or by pressing Cmd-F9.

Congratulations, you now understand the basics of using a package! In the rest of this chapter, you'll learn more about the various forms of a package, and exactly what happens when you run `install.packages()` or `install_github()`.

### Avoiding side effects {#side-effects}

One big difference between a script and a package is that the code in a package should not have side effects. Your code should only create objects (mostly functions), and you should not call functions that affect the global state.

* __Don't use `library()` or `require()`__. Use the [DESCRIPTION](description.html)
  to specify your package's requirements.

* __Don't modify global `options()` or graphics `par()`__. Put state changing
  operations in functions that the user can call when they want.

* __Don't save files to disk with `write()`, `write.csv()`, or `saveRDS()`__.
  Use [data/](data.html) to cache important data files.

There are two reasons to avoid side-effects. The first reason is pragmatic: while functions with side-effects will work while you're developing a package locally with `load_all()`, they won't work when you're using a package. This is because your R code is only run once when the package is built, and not every time `library()` is called. The second reason is principled: you shouldn't change global states behind your users' backs.

### When you __do__ need side-effects

Occasionally, packages do need side-effects. This is most common if your package talks to an external system --- you might need to do some initial setup when the package loads. To do that, you can use two special functions: `.onLoad()` and `.onAttach()`. These are called when the package is loaded and attached. You'll learn about the distinction between the two in [Namespaces](#namespace). For now, you should always use `.onLoad()` unless explicitly directed otherwise.

Some common uses of `.onLoad()` and `.onAttach()` are:

*   To dynamically load a compiled DLL. In most cases, you won't need to
    use `.onLoad()` to do this. Instead, you'll use a special namespace
    construct; see [namespaces](#namespace) for details.

*   To display an informative message when the package loads. This might make
    usage conditions clear, or display useful tips. Startup messages is one
    place where you should use `.onAttach()` instead of `.onLoad()`. To display
    startup messages, always use `packageStartupMessage()`, and not `message()`.
    (This allows `suppressPackageStartupMessages()` to selectively suppress
    package startup messages).

    ```{r, eval = FALSE}
    .onAttach <- function(libname, pkgname) {
      packageStartupMessage("Welcome to my package")
    }
    ```

*   To connect R to another programming language. For example, if you use RJava
    to talk to a `.jar` file, you need to call `rJava::.jpackage()`. To
    make C++ classes available as reference classes in R with Rcpp modules,
    you call `Rcpp::loadRcppModules()`.

*   To register vignette engines with `tools::vignetteEngine()`.

*   To set custom options for your package with `options()`. To avoid conflicts
    with other packages, ensure that you prefix option names with the name
    of your package. Also be careful not to override options that the user
    has already set.

    I use the following code in devtools to set up useful options:

    ```{r, eval = FALSE}
    .onLoad <- function(libname, pkgname) {
      op <- options()
      op.devtools <- list(
        devtools.path = "~/R-dev",
        devtools.install.args = "",
        devtools.name = "Your name goes here",
        devtools.desc.author = '"First Last <first.last@example.com> [aut, cre]"',
        devtools.desc.license = "What license is it under?",
        devtools.desc.suggests = NULL,
        devtools.desc = list()
      )
      toset <- !(names(op.devtools) %in% names(op))
      if(any(toset)) options(op.devtools[toset])

      invisible()
    }
    ```

As you can see in the examples, `.onLoad()` and `.onAttach()` are called with two arguments: `libname` and `pkgname`. They're rarely used (they're a holdover from the days when you needed to use `library.dynam()` to load compiled code). They give the path where the package is installed (the "library"), and the name of the package.

If you use `.onLoad()`, consider using `.onUnload()` to clean up any side effects. By convention, `.onLoad()` and friends are usually saved in a file called `zzz.R`. (Note that `.First.lib()` and `.Last.lib()` are old versions of `.onLoad()` and `.onUnload()` and should no longer be used.)

### S4 classes, generics and methods

Another type of side-effect is defining S4 classes, methods and generics. R packages capture these side-effects so they can be replayed when the package is loaded, but they need to be called in the right order. For example, before you can define a method, you must have defined both the generic and the class. This requires that the R files be sourced in a specific order. This order is controlled by the `Collate` field in the `DESCRIPTION`. This is described in more detail in [documenting S4](#man-s4).

### CRAN notes {#r-cran}

If you're planning on submitting your package to CRAN, you must use only ASCII characters in your `.R` files. You can still include unicode characters in strings, but you need to use the special unicode escape `"\u1234"` format. The easiest way to do that is to use `stringi::stri_escape_unicode()`:

```{r}
x <- "This is a bullet •"
y <- "This is a bullet \u2022"
identical(x, y)

cat(stringi::stri_escape_unicode(x))
```

Your R directory should not include any files other than R code (NB: Subdirectories will be silently ignored).

## What is a package? {#package}

To make your first package, all you need to know is what you've learnt above. But to master package development, particularly when you're distributing a package to others, you'll need to understand the differences between the five types of packages: source, bundled, binary, installed and in memory. Doing so will help you understand exactly what happens when you install a package with `install.packages()` or with `devtools::install_github()`.

### Source packages

So far we've just worked with a __source__ package: the development version of a package that lives on your computer. A source package is just a directory with components like `R/`, `DESCRIPTION`, and so on.

### Bundled packages

A __bundled__ package is a package compressed into a single file. By convention, package bundles in R use the extension `.tar.gz`. This convention comes from Linux: it means that multiple files have been collapsed into a single file (`.tar`) and then compressed using gzip (`.gz`). While a bundle is not that useful on its own, it is a useful intermediary between other steps. In the rare case that you do need a bundle, call `devtools::build()` to make it.

If you decompress a bundle, you'll see it looks almost the same as your source package. The main differences between a decompressed bundle and a source package are:

* Vignettes are built so that you get html and pdf output instead of
  markdown or latex input.

* Your source package might contain temporary files used to save time during
  development, like compilation artefacts in `src/`. These are never found in
  a bundle.

* Any files listed in `.Rbuildignore` are not included in the bundle.

`.Rbuildignore` prevents files in the source package appearing in the bundled package. It allows you to have additional directories in your source package that are not included in the package bundle. This is particularly useful when you generate package contents (e.g. data) from other files. Those files should be included in the source package, but only the results need to be distributed. This is particularly important for CRAN packages (where the set of allowed top-level directories is fixed). Each line gives a Perl-compatible regular expression that is matched case-insensitively against the path to each file (i.e. `dir(full.names = TRUE)` run from the package root directory) - if the regular expression matches, the file is excluded.

If you wish to exclude a specific file or directory (which is the most common use case), you __MUST__ escape the regular expression. For example, to exclude a directory called notes, use `$notes^`. The regular expression `notes` will match any file name containing notes, e.g. `R/notes.R`, `man/important-notes.R`, `data/endnotes.Rdata`, etc. The safest way to exclude a specific file or directory is to use `devtools::use_build_ignore("notes")`, which will do the escaping for you.

### Binary packages

If you want to distribute your package to an R user who doesn't have package development tools, you'll need to make a __binary__ package. Like a package bundle, a binary package is a single file. But if you uncompress it, you'll see that the internal structure is rather different from a source package:

* There are no `.R` files in the `R/` directory - instead there are three
  files that store the parsed functions in an efficient format. This is
  basically the result of loading all the R code and then saving the
  functions with `save()`. (In the process, this adds a little extra metadata to
  make things as fast as possible).

* A `Meta/` directory contains a number of `Rds` files. These files contain
  cached metadata about the package, like what topics the help files cover and
  parsed versions of the `DESCRIPTION` files. (You can use `readRDS()` to see
  exactly what's in those files). These files make package loading faster
  by caching costly computations.

* An `html/` directory contains files needed for HTML help.

* If you had any code in the `src/` directory there will now be a `libs/`
  directory that contains the results of compiling code for 32 bit
  (`i386/`) and 64 bit (`x64/`).

* The contents of `inst/` are moved to the top-level directory.

Binary packages are platform specific: you can't install a Windows binary package on a Mac or vice versa. Also, Mac binary packages end in `.tgz` and Windows binary packages end in `.zip`. You can use `devtools::build(binary = TRUE)` to make a binary package.

The following diagram summarises the files present in the root directory for the source, bundled and binary versions of devtools.

```{r, echo = FALSE}
bookdown::embed_png("diagrams/package-files.png")
```

### Installed packages {#install}

An __installed__ package is just a binary package that's been decompressed into a package library (described next). The following diagram illustrates the many ways a package can be installed. This diagram is complicated! In an ideal world installing a package would involve stringing together a set of simple steps: source -> bundle, bundle -> binary, binary -> installed. In the real world it's not this simple: each step in the sequence is slow, and there are often faster shortcuts available.

```{r, echo = FALSE}
bookdown::embed_png("diagrams/installation.png")
```

The tool that powers all package installations is the command line tool `R CMD install` - it can install a source, bundle or a binary package. Devtools functions provide wrappers which allow you to access this tool from R rather than from the command line. `install()` is effectively a wrapper for `R CMD install`. `build()` is a wrapper for `R CMD build` that turns source packages into bundles. `install_github()` downloads a source package from GitHub, runs `build()` to make vignettes, and then uses `R CMD install` to do the install. `install_url()`, `install_gitorious()`, `install_bitbucket()` work similarly for packages found elsewhere on the internet.

`install.packages()` and `devtools::install_github()` allow you to install a remote package. They both work by first downloading the package. `install.packages()` normally downloads a binary package built by CRAN. This makes installation very speedy. `install_github()` has to work a little differently - it first downloads a source package, builds it and then installs it.

You can prevent files in the package bundle from being included in the installed package using `.Rinstignore`. This works the same way as `.Rbuildignore`, described above. It is rarely needed.

### In memory packages

To use a package, you must load it into memory. When you're not developing a package, you load it into memory with `library()`. When you are developing a package, you can either use `load_all()` or "Build and reload". You now know enough about packages to understand the difference between the two: `load_all()` skips the installation step and goes directly from on-disk to in-memory.

The following diagram summarises the three ways of loading a package into memory:

```{r, echo = FALSE}
bookdown::embed_png("diagrams/loading.png")
```

### Exercises

1.  Go to CRAN and download the source and binary for XXX. Unzip and compare.
    How do they differ?

1.  Download the __source__ packages for XXX, YYY, ZZZ. What directories do they
    contain?

## What is a library? {#library}

A library is simply a directory containing installed packages. You can have multiple libraries on your computer and almost every one has at least two: one for packages you’ve installed, and one for the recommended packages that come with every R installation (like base, stats, etc). Normally, that first directory varies based on the version of R that you’re using. That's why it seems like you lose all of your packages when you reinstall R --- they're still on your hard drive, but R can't find them.

You can use `.libPaths()` to see which libraries are currently active. Here are mine:

```{r, eval = FALSE}
.libPaths()
#> [1] "/Users/hadley/R"
#> [2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"
lapply(.libPaths(), dir)
#> [[1]]
#>   [1] "AnnotationDbi"   "ash"             "assertthat"
#>   ...
#> [163] "xtable"          "yaml"            "zoo"
#>
#> [[2]]
#>  [1] "base"         "boot"         "class"        "cluster"
#>  [5] "codetools"    "compiler"     "datasets"     "foreign"
#>  [9] "graphics"     "grDevices"    "grid"         "KernSmooth"
#> [13] "lattice"      "MASS"         "Matrix"       "methods"
#> [17] "mgcv"         "nlme"         "nnet"         "parallel"
#> [21] "rpart"        "spatial"      "splines"      "stats"
#> [25] "stats4"       "survival"     "tcltk"        "tools"
#> [29] "translations" "utils"
```

The first lib path is for the packages I've installed (I've installed at lot!). The second is for so-called "recommended" packages that come with every installation of R.

When you use `library(pkg)` to load a package, R looks through each path in `.libPaths()` to see if a directory called `pkg` exists. If it doesn't, you'll get an error message:

```{r, error = TRUE}
library(blah)
```

The main difference between `library()` and `require()` is what happens when a package isn't found. While `library()` throws an error, `require()` prints a message and returns FALSE. In practice this distinction isn't important because when building a package you should __NEVER__ use either inside a package. See [package dependencies](#dependencies) for what you should do instead.

When you start learning R it's easy to get confused between libraries and packages, because you use `library()` function to load a _package_. However, the distinction between libraries and packages is important and useful. One important application is packrat, which automates the process of managing project specific libraries. This means that when you upgrade a package in one project, it only affects that project, not every project on your computer. This is useful because it allows you to play around with cutting-edge packages without affecting other projects' use of older, more reliable packages. This is also useful when you're both developing and using a package.

### Exercises

1.  Where is your default library? What happens to that library when
    you install a new package from CRAN?

1.  Can you have multiple versions of the same package installed at the same
    time?