|
| 1 | +--- |
| 2 | +title: Thirteen Simple Steps for Creating An R Package with an External C++ Library |
| 3 | + |
| 4 | +# Use letters for affiliations |
| 5 | +author: |
| 6 | + - name: Dirk Eddelbuettel |
| 7 | + affiliation: 1 |
| 8 | + |
| 9 | +address: |
| 10 | + - code: 1 |
| 11 | + address: Department of Statistics, University of Illinois, Urbana-Champaign, IL, USA |
| 12 | + |
| 13 | +# For footer text TODO(fold into template, allow free form two-authors) |
| 14 | +lead_author_surname: Eddelbuettel |
| 15 | + |
| 16 | +# Place DOI URL or CRAN Package URL here |
| 17 | +#doi: "https://cran.r-project.org/package=anytime" |
| 18 | + |
| 19 | +# Abstract |
| 20 | +abstract: | |
| 21 | + We describe how we extend R with an external C++ code library by using the Rcpp
| 22 | + package. Our working example uses the recent machine learning library and application |
| 23 | + 'Corels' providing optimal yet easily interpretable rule lists \citep{arxiv:corels} |
| 24 | + which we bring to R in the form of the \pkg{RcppCorels} package |
| 25 | + \citep{github:rcppcorels}. We discuss each step in the process, and derive a set of |
| 26 | + simple rules and recommendations which are illustrated with the concrete example. |
| 27 | +
|
| 28 | +
|
| 29 | +# Font size of the document, values of 9pt (default), 10pt, 11pt and 12pt |
| 30 | +fontsize: 9pt |
| 31 | + |
| 32 | +# Optional: Force one-column layout, default is two-column |
| 33 | +two_column: true |
| 34 | + |
| 35 | +# Optional: Enables lineno mode, but only if one_column mode is also true |
| 36 | +#lineno: true |
| 37 | + |
| 38 | +# Optional: Enable one-sided layout, default is two-sided |
| 39 | +#one_sided: true |
| 40 | + |
| 41 | +# Optional: Enable section numbering, default is unnumbered |
| 42 | +#numbersections: true |
| 43 | + |
| 44 | +# Optional: Specify the depth of section number, default is 5 |
| 45 | +#secnumdepth: 5 |
| 46 | + |
| 47 | +# Optional: Skip inserting final break between acknowledgements, default is false |
| 48 | +skip_final_break: true |
| 49 | + |
| 50 | +# Optional: Bibliography |
| 51 | +# bibliography: references |
| 52 | + |
| 53 | +# Optional: Enable a 'Draft' watermark on the document |
| 54 | +watermark: false |
| 55 | + |
| 56 | +# Customize footer, eg by referencing the vignette |
| 57 | +footer_contents: "Thirteen Steps for R and C++ Library Packages" |
| 58 | + |
| 59 | +# Produce a pinp document |
| 60 | +output: |
| 61 | + pinp::pinp: |
| 62 | + collapse: true |
| 63 | + keep_tex: false |
| 64 | + |
| 65 | +header-includes: > |
| 66 | + \newcommand{\proglang}[1]{\textsf{#1}} |
| 67 | + \newcommand{\pkg}[1]{\textbf{#1}} |
| 68 | +
|
| 69 | +vignette: > |
| 70 | + %\VignetteIndexEntry{Rcpp-Libraries} |
| 71 | + %\VignetteKeywords{Rcpp, Package, Library} |
| 72 | + %\VignettePackage{Rcpp} |
| 73 | + %\VignetteEngine{knitr::rmarkdown} |
| 74 | + %\VignetteEncoding{UTF-8} |
| 75 | +--- |
| 76 | + |
| 77 | +```{r initialsetup, include=FALSE} |
| 78 | +knitr::opts_chunk$set(cache=TRUE) |
| 79 | +cwd <- getwd() |
| 80 | +``` |
| 81 | + |
| 82 | +# Introduction |
| 83 | + |
| 84 | +The process of building a new package with Rcpp can range from the very easy---a single |
| 85 | +simple C++ function---to the very complex. If, and how, external resources are utilised |
| 86 | +makes a big difference as this too can range from the very simple---making use of a |
| 87 | +header-only library, or directly including a few C++ source files without further |
| 88 | +dependencies---to the very complex. <!-- A recent example of a very involved build with |
| 89 | +dependencies on several external libraries which may not be packaged is the \pkg{arrow} |
| 90 | +package \citep{CRAN:arrow}. --> |
| 91 | + |
| 92 | +Yet a lot of the important action happens in the middle ground. Packages may bring their |
| 93 | +own source code, but also depend on just one or two external libraries. This paper |
| 94 | +describes one such approach in detail: how we turned the Corels application |
| 95 | +\citep{arxiv:corels,github:corels} (provided as a standalone C++-based executable) into an |
| 96 | +R-callable package \pkg{RcppCorels} \citep{github:rcppcorels} via \pkg{Rcpp} |
| 97 | +\citep{CRAN:Rcpp,JSS:Rcpp}. |
| 98 | + |
| 99 | +# The Thirteen Key Steps |
| 100 | + |
| 101 | +## Ensure Use of a Suitable license |
| 102 | + |
| 103 | +Before embarking on such a journey, it is best to ensure that the licensing framework is |
| 104 | +suitable. Many different open-source licenses exist, yet a few key ones dominate and can
| 105 | +generally be used _with each other_. There is however a fair amount of possible legalese |
| 106 | +involved, so it is useful to check inter-license compatibility, as well as general |
| 107 | +usability of the license in question. Several sites can help via license recommendations, |
| 108 | +and checks for interoperability. One example is the site at |
| 109 | +[choosealicense.com](https://choosealicense.com/) (which is backed by GitHub); another
| 110 | +is [tldrlegal.com](https://tldrlegal.com/). License choice is a complex topic, and
| 111 | +general recommendations are difficult to make besides the key point of sticking to |
| 112 | +already-established and known licenses. |
| 113 | + |
| 114 | +## Ensure the Software builds |
| 115 | + |
| 116 | +In order to see how hard it may be to combine an external entity, either a program or a library,
| 117 | +with R, it helps to ensure that the external entity actually still builds and runs. |
| 118 | + |
| 119 | +This may seem like a small and obvious step, but experience suggests that it is worth
| 120 | +asserting the ability to build with current tools, and possibly also with more than one
| 121 | +compiler or build-system. Consideration of other platforms used by R also matters a great
| 122 | +deal as one of the strengths of the R package system is its ability to cover the three key |
| 123 | +operating system families. |
| 124 | + |
| 125 | +## Ensure it still works |
| 126 | + |
| 127 | +This may seem like a variation on the previous point, but besides the ability to _build_ |
| 128 | +we also need to ensure the ability to _run_ the software. If the external entity has |
| 129 | +tests and demos, it is highly recommended to run them. If there are reference results, we
| 130 | +should ensure that they are still obtained, and also that the run-time performance is
| 131 | +still (at a minimum) reasonable.
| 132 | + |
| 133 | +## Ensure it is compelling |
| 134 | + |
| 135 | +This is of course a very basic litmus test: is the new software relevant? Is it helpful?
| 136 | +Would others benefit from having it packaged and maintained? |
| 137 | + |
| 138 | +## Start an Rcpp package |
| 139 | + |
| 140 | +The first step in getting a new package combining R and C++ is often the creation of a new
| 141 | +Rcpp package. There are several helper functions to choose from. A natural first choice |
| 142 | +is `Rcpp.package.skeleton()` from the \pkg{Rcpp} package \citep{CRAN:Rcpp}. It can be |
| 143 | +improved by having the optional helper package \pkg{pkgKitten} \citep{CRAN:pkgKitten} |
| 144 | +around as its `kitten()` function smooths some rougher edges left by the underlying Base
| 145 | +R function `package.skeleton()`. This step is shown below in the appendix, and
| 146 | +corresponds to the first commit, followed by a first edit of file `DESCRIPTION`. |
| 147 | + |
| 148 | +Any code added by the helper functions, often just a simple `helloWorld()` variant, can be |
| 149 | +run to ensure that the package is indeed functional. More importantly, at this stage, we |
| 150 | +can also start building the package as a compressed tar archive and run the R checker on |
| 151 | +it. |
| 152 | + |
| 153 | +## Integrate External Package |
| 154 | + |
| 155 | +Given a basic package with C++ support, we can now turn to integrating the external |
| 156 | +package. The complexity of this step can, as alluded to earlier, vary from very easy to
| 157 | +very complex. Simple cases include just depending on library headers which can either
| 158 | +be copied to the package, or be provided by another package such as \pkg{BH}
| 159 | +\citep{CRAN:BH}. It may also be a dependency on a fairly standard library available on
| 160 | +most if not all systems. The graphics formats bmp, jpeg or png may be examples; text
| 161 | +formats like JSON or XML are another. One difficulty, though, may be that _run-time_ |
| 162 | +support does not always guarantee _compile-time_ support. In these cases, a `-dev` or |
| 163 | +`-devel` package may need to be installed. |
| 164 | + |
| 165 | +In the concrete case of Corels, we |
| 166 | + |
| 167 | +- copied all existing C++ source and header files over into the `src/` directory; |
| 168 | +- renamed all header files from `*.hh` to `*.h` to comply with an R preference; |
| 169 | +- created a minimal `src/Makevars` file, initially with link instructions for GMP, later relaxed to
| 170 | + conditional use of GMP (see below); |
| 171 | +- moved `main.cc` to a subdirectory as we cannot build with another `main()` function (and R will not include files from subdirectories); |
| 172 | +- added a minimal R-callable function along with a `logger` instance. |
| 173 | + |
| 174 | +Here, the last step was needed as the file `main.cc` provided a global instance referred |
| 175 | +to from other files. Hence, a minimal R-callable wrapper is being added at this stage |
| 176 | +(shown in the appendix as well). Actual functionality will be added later. |
| 177 | + |
| 178 | +We will come back to the step concerning the link instructions. |
| 179 | + |
| 180 | +At this point we have a package for R also containing the library we want to add.
| 181 | + |
| 182 | +## Make the External Code compliant with R Policies |
| 183 | + |
| 184 | +R has fairly strict guidelines, defined both in the _CRAN Repository Policy_ document at |
| 185 | +the CRAN website, and in the manual _Writing R Extensions_. Certain standard C and C++
| 186 | +functions are not permitted as their use could interfere with running code from R. This
| 187 | +includes somewhat obvious recommendations ("do not call `abort`" as it would terminate the
| 188 | +R session) but extends to not using native print methods in order to cooperate better
| 189 | +with the input and output facilities of R. So here, and reflecting that last aspect, we |
| 190 | +changed all calls to `printf()` to calls to `Rprintf()`. Similarly, R prefers its own |
| 191 | +(well-tested) random-number generators so we replaced one (scaled) call to `random() / |
| 192 | +RAND_MAX` with the equivalent call to R's `unif_rand()`. We also avoided one use of |
| 193 | +`stdout` in `rulelib.h`. |
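
As a rough sketch of what such a change can look like, the snippet below routes all messages through one helper so that the output call only needs to be swapped in a single place. The names `format_message()` and `emit()` and the flag `BUILDING_FOR_R` are hypothetical and not taken from Corels or RcppCorels, which simply replaced the calls directly. `Rprintf()` is declared in the R header `R_ext/Print.h`. Without the flag the code falls back to standard output, so it still compiles outside of R.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

#if defined(BUILDING_FOR_R)      // hypothetical flag, e.g. set via PKG_CXXFLAGS
#include <R_ext/Print.h>         // declares Rprintf()
#endif

// Format a printf-style message into a std::string (hypothetical helper).
inline std::string format_message(const char* fmt, ...) {
    assert(fmt != nullptr);
    va_list args;
    va_start(args, fmt);
    va_list args_copy;
    va_copy(args_copy, args);
    // First pass: ask vsnprintf how many characters are needed.
    const int len = std::vsnprintf(nullptr, 0, fmt, args);
    va_end(args);
    if (len < 0) {               // formatting error: return an empty message
        va_end(args_copy);
        return std::string();
    }
    // Second pass: write into a buffer with room for the terminating NUL.
    std::vector<char> buf(static_cast<std::size_t>(len) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, args_copy);
    va_end(args_copy);
    return std::string(buf.data(), static_cast<std::size_t>(len));
}

// Single output point: the only place that differs between the two builds.
inline void emit(const std::string& msg) {
#if defined(BUILDING_FOR_R)
    Rprintf("%s", msg.c_str());          // goes through R's console handling
#else
    std::fputs(msg.c_str(), stdout);     // standalone build only
#endif
}
```

A call such as `emit(format_message("%d rules\n", n))` then behaves the same in both builds, and only `emit()` knows about R.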
| 194 | + |
| 195 | +The requirement for such changes may seem excessive at first, but the value added stemming |
| 196 | +from consistent application of the CRAN Policies is appreciated by most R users. |
| 197 | + |
| 198 | +## Complete the Interface |
| 199 | + |
| 200 | +In order to further test the package, and of course also for actual use, we need to expose |
| 201 | +the key parameters and arguments. Corels parsed command-line arguments; we can translate |
| 202 | +this directly into suitable arguments for the main function. At a first pass, we created the |
| 203 | +following interface: |
| 204 | + |
| 205 | +```c++ |
| 206 | +// [[Rcpp::export]] |
| 207 | +bool corels(std::string rules_file, |
| 208 | + std::string labels_file, |
| 209 | + std::string log_dir, |
| 210 | + std::string meta_file = "", |
| 211 | + bool run_bfs = false, |
| 212 | + bool calculate_size = false, |
| 213 | + bool run_curiosity = false, |
| 214 | + int curiosity_policy = 0, |
| 215 | + bool latex_out = false, |
| 216 | + int map_type = 0, |
| 217 | + int verbosity = 0, |
| 218 | + int max_num_nodes = 100000, |
| 219 | + double regularization = 0.01, |
| 220 | + int logging_frequency = 1000, |
| 221 | + int ablation = 0) { |
| 222 | + |
| 223 | + // actual function body omitted |
| 224 | +} |
| 225 | +``` |
| 226 | +
|
| 227 | +Rcpp facilitates the integration by adding another wrapper exposing all the function
| 228 | +arguments, and setting up required arguments without default (the first three) along with
| 229 | +optional arguments given a default. The user can now call `corels()` from R with three
| 230 | +required arguments (the two input files plus the log directory) as well as a number of
| 231 | +optional arguments. |
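
The same split between required and defaulted arguments can be illustrated in plain C++, independent of Rcpp. The sketch below is only an illustration under stated assumptions: the struct `CorelsOptions` and the function `make_options()` are hypothetical names, and only a subset of the parameters from the signature above is mirrored. The idea is that values formerly read from the command line become ordinary parameters, with defaults for the optional ones.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for a subset of the options in the signature above.
struct CorelsOptions {
    std::string rules_file;
    std::string labels_file;
    std::string log_dir;
    std::string meta_file;      // empty string means "not supplied"
    int max_num_nodes;
    double regularization;
};

// Required inputs come first and have no default; optional ones carry the
// default that a command-line parser would otherwise have supplied.
inline CorelsOptions make_options(const std::string& rules_file,
                                  const std::string& labels_file,
                                  const std::string& log_dir,
                                  const std::string& meta_file = "",
                                  int max_num_nodes = 100000,
                                  double regularization = 0.01) {
    // The three required arguments must not be empty.
    assert(!rules_file.empty() && !labels_file.empty() && !log_dir.empty());
    return CorelsOptions{rules_file, labels_file, log_dir,
                         meta_file, max_num_nodes, regularization};
}
```

With such a helper, a standalone `main()` and an R-callable wrapper could both build the same options object, which keeps the two entry points consistent.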
| 232 | +
|
| 233 | +## Add Sample Data |
| 234 | +
|
| 235 | +R packages can access data files that are shipped with them. That is a very useful feature,
| 236 | +and we therefore also copy in the files included in the Corels repository and its `data/`
| 237 | +directory. |
| 238 | +
|
| 239 | +```{r} |
| 240 | +fs::dir_tree("../../../rcppcorels/inst/sample_data") |
| 241 | +``` |
| 242 | + |
| 243 | +## Set up working example |
| 244 | + |
| 245 | +Combining the two preceding steps, we can now offer an illustrative example. It is |
| 246 | +included in the help page for function `corels()` and can be run from R via
| 247 | +`example("corels")`. |
| 248 | + |
| 249 | +```r |
| 250 | +library(RcppCorels) |
| 251 | + |
| 252 | +.sysfile <- function(f) # helper function |
| 253 | + system.file("sample_data",f,package="RcppCorels") |
| 254 | + |
| 255 | +rules_file <- .sysfile("compas_train.out") |
| 256 | +labels_file <- .sysfile("compas_train.label")
| 257 | +meta_file <- .sysfile("compas_train.minor") |
| 258 | +logdir <- tempdir() |
| 259 | + |
| 260 | +stopifnot(file.exists(rules_file), |
| 261 | + file.exists(labels_file), |
| 262 | + file.exists(meta_file), |
| 263 | + dir.exists(logdir)) |
| 264 | + |
| 265 | +corels(rules_file, labels_file, logdir, meta_file, |
| 266 | + verbosity = 100, |
| 267 | + regularization = 0.015, |
| 268 | + curiosity_policy = 2, # by lower bound |
| 269 | + map_type = 1) # permutation map |
| 270 | + |
| 271 | +cat("See ", logdir, " for result file.") |
| 272 | +``` |
| 273 | + |
| 274 | +In the example, we pass the two required arguments for rules and labels files, the
| 275 | +optional argument for the 'meta' file as well as an added required argument for the output
| 276 | +directory. As R policy prohibits writing in user directories, we use the
| 277 | +temporary directory of the current session, and report its value at the end. For the
| 278 | +remaining arguments default values are used.
| 279 | + |
| 280 | +## Finesse Library Dependencies |
| 281 | + |
| 282 | +One common difficulty when bringing an external library to R via a package consists in dealing with
| 283 | +an external dependency. In the case of 'Corels', the GNU GMP library for multi-precision arithmetic |
| 284 | +is an optional extension which, if available, improves and accelerates internal processing. |
| 285 | + |
| 286 | +The simplest approach is to declare a compile-time variable in the `src/Makevars` file. Using |
| 287 | +`-DGMP` _defines_ the `GMP` variable at the level of the C and C++ code. One can then condition on |
| 288 | +the variable. A very standard approach, also used here, is `#if defined(GMP) ... #else ... #endif`
| 289 | +where one of the two code branches is in effect depending on whether the `GMP` variable is defined |
| 290 | +or not. |
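
The pattern can be sketched as follows. The function `count_ones()` is a hypothetical helper chosen for illustration and is not taken from the Corels sources. `mpz_init_set_ui()`, `mpz_popcount()` and `mpz_clear()` are standard GMP functions. When `-DGMP` is absent, the portable branch is compiled instead and no GMP header or library is needed.

```cpp
#include <string>

#if defined(GMP)
#include <gmp.h>                 // only needed when built with -DGMP
#endif

// Count the set bits of a word (hypothetical helper for illustration).
inline unsigned long count_ones(unsigned long word) {
#if defined(GMP)
    // GMP branch: delegate to the library's population count.
    mpz_t z;
    mpz_init_set_ui(z, word);
    const unsigned long n = mpz_popcount(z);
    mpz_clear(z);
    return n;
#else
    // Portable branch: each pass clears the lowest set bit.
    unsigned long n = 0;
    for (; word != 0; word &= word - 1) ++n;
    return n;
#endif
}

// Report which branch was compiled in, which can help when debugging a build.
inline std::string backend_name() {
#if defined(GMP)
    return "gmp";
#else
    return "builtin";
#endif
}
```

Both branches return the same results, so callers do not need to know which one was selected at compile time.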
| 291 | + |
| 292 | +In order to detect presence of a required (or optional) library, tools like 'autoconf' or 'cmake' |
| 293 | +are often used. For example, one can rely on an existing 'autoconf' macro provided by the GMP
| 294 | +documentation to detect presence of the GNU GMP header and library. We are making use of this
| 295 | +facility here to deploy GMP when it is available. As 'Corels' can be built with and without GMP,
| 296 | +the build and installation succeeds either way---but deployment of the more-featureful variant
| 297 | +using GMP is automated.
| 298 | + |
| 299 | +## Finalise License and Copyright |
| 300 | + |
| 301 | +It is good (and common) practice to clearly attribute authorship. Here, credit is given to |
| 302 | +the 'Corels' team and authors as well as to the authors of the underlying 'rulelib' code |
| 303 | +used by 'Corels' via the file `inst/AUTHORS` (which will be installed as `AUTHORS` with |
| 304 | +the package). In addition, the file `inst/LICENSE` clarifies the GNU GPL-3 license for 'RcppCorels' and 'Corels', and the MIT license for 'rulelib'.
| 305 | + |
| 306 | +## Additional Bonus: Some more 'meta' files |
| 307 | + |
| 308 | +Several files help to improve the package. For example, `.Rbuildignore` allows listed
| 309 | +files to be excluded from the resulting R package, keeping it well-defined. Similarly, `.gitignore`
| 310 | +can exclude files from being added to the `git` repository. We also like `.editorconfig`
| 311 | +for consistent editing defaults across a range of modern editors.
| 312 | + |
| 313 | +# Summary |
| 314 | + |
| 315 | +We describe a series of steps to turn the standalone library 'Corels' described by
| 316 | +\citet{arxiv:corels} into an R package \pkg{RcppCorels} using the facilities offered by
| 317 | +\pkg{Rcpp} \citep{CRAN:Rcpp}. Along the way, we illustrate key aspects of the R package
| 318 | +standards and CRAN Repository Policy, providing a template for other research software
| 319 | +wishing to provide their implementations in a form that is accessible to R users.
| 320 | + |
| 321 | + |
| 322 | +\bibliography{Rcpp} |
| 323 | +\bibliographystyle{jss} |
| 324 | + |
| 325 | +\newpage |
| 326 | +\onecolumn |
| 327 | + |
| 328 | +## Appendix 1: Creating the basic package |
| 329 | + |
| 330 | +```sh |
| 331 | +edd@rob:~/git$ r --packages Rcpp --eval 'Rcpp.package.skeleton("RcppCorels")' |
| 332 | + |
| 333 | +Attaching package: ‘utils’ |
| 334 | + |
| 335 | +The following objects are masked from ‘package:Rcpp’: |
| 336 | + |
| 337 | + .DollarNames, prompt |
| 338 | + |
| 339 | +Creating directories ... |
| 340 | +Creating DESCRIPTION ... |
| 341 | +Creating NAMESPACE ... |
| 342 | +Creating Read-and-delete-me ... |
| 343 | +Saving functions and data ... |
| 344 | +Making help files ... |
| 345 | +Done. |
| 346 | +Further steps are described in './RcppCorels/Read-and-delete-me'. |
| 347 | + |
| 348 | +Adding Rcpp settings |
| 349 | + >> added Imports: Rcpp |
| 350 | + >> added LinkingTo: Rcpp |
| 351 | + >> added useDynLib directive to NAMESPACE |
| 352 | + >> added importFrom(Rcpp, evalCpp) directive to NAMESPACE |
| 353 | + >> added example src file using Rcpp attributes |
| 354 | + >> added Rd file for rcpp_hello_world |
| 355 | + >> compiled Rcpp attributes |
| 356 | +edd@rob:~/git$ |
| 357 | +edd@rob:~/git$ mv RcppCorels/ rcppcorels # prefer lowercase directories |
| 358 | +edd@rob:~/git$ |
| 359 | +``` |
| 360 | + |
| 361 | +## Appendix 2: A Minimal src/Makevars |
| 362 | + |
| 363 | +In the file shown here, use of GMP is unconditional: we define `GMP` as a compiler flag, and |
| 364 | +instruct the linker to link with the GMP library. |
| 365 | + |
| 366 | +```sh |
| 367 | + |
| 368 | +CXX_STD = CXX11 |
| 369 | + |
| 370 | +PKG_CXXFLAGS = -I. -DGMP |
| 371 | + |
| 372 | +PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) -lgmp |
| 373 | + |
| 374 | +``` |
| 375 | + |
| 376 | +## Appendix 3: A Placeholder Wrapper |
| 377 | + |
| 378 | +```c++ |
| 379 | + |
| 380 | +#include "queue.h" |
| 381 | + |
| 382 | +#include <Rcpp.h> |
| 383 | + |
| 384 | +/* |
| 385 | + * Logs statistics about the execution of the algorithm and dumps it to a file. |
| 386 | + * To turn off, pass verbosity <= 1 |
| 387 | + */ |
| 388 | +NullLogger* logger; |
| 389 | + |
| 390 | +// [[Rcpp::export]] |
| 391 | +bool corels() { |
| 392 | + return true; // more to fill in, naturally |
| 393 | +} |
| 394 | +``` |