Commit 50536d7

add new vignette rmd source file from its repo
1 parent 6ae0ab4 commit 50536d7

1 file changed

+394
-0
lines changed

vignettes/rmd/Rcpp-libraries.Rmd

Lines changed: 394 additions & 0 deletions
@@ -0,0 +1,394 @@
1+
---
2+
title: Thirteen Simple Steps for Creating An R Package with an External C++ Library
3+
4+
# Use letters for affiliations
5+
author:
6+
- name: Dirk Eddelbuettel
7+
affiliation: 1
8+
9+
address:
10+
- code: 1
11+
address: Department of Statistics, University of Illinois, Urbana-Champaign, IL, USA
12+
13+
# For footer text TODO(fold into template, allow free form two-authors)
14+
lead_author_surname: Eddelbuettel
15+
16+
# Place DOI URL or CRAN Package URL here
17+
#doi: "https://cran.r-project.org/package=anytime"
18+
19+
# Abstract
20+
abstract: |
21+
We describe how we extend R with an external C++ code library by using the Rcpp
22+
package. Our working example uses the recent machine learning library and application
23+
'Corels' providing optimal yet easily interpretable rule lists \citep{arxiv:corels}
24+
which we bring to R in the form of the \pkg{RcppCorels} package
25+
\citep{github:rcppcorels}. We discuss each step in the process, and derive a set of
26+
simple rules and recommendations which are illustrated with the concrete example.
27+
28+
29+
# Font size of the document, values of 9pt (default), 10pt, 11pt and 12pt
30+
fontsize: 9pt
31+
32+
# Optional: Force one-column layout, default is two-column
33+
two_column: true
34+
35+
# Optional: Enables lineno mode, but only if one_column mode is also true
36+
#lineno: true
37+
38+
# Optional: Enable one-sided layout, default is two-sided
39+
#one_sided: true
40+
41+
# Optional: Enable section numbering, default is unnumbered
42+
#numbersections: true
43+
44+
# Optional: Specify the depth of section number, default is 5
45+
#secnumdepth: 5
46+
47+
# Optional: Skip inserting final break between acknowledgements, default is false
48+
skip_final_break: true
49+
50+
# Optional: Bibliography
51+
# bibliography: references
52+
53+
# Optional: Enable a 'Draft' watermark on the document
54+
watermark: false
55+
56+
# Customize footer, eg by referencing the vignette
57+
footer_contents: "Thirteen Steps for R and C++ Library Packages"
58+
59+
# Produce a pinp document
60+
output:
61+
pinp::pinp:
62+
collapse: true
63+
keep_tex: false
64+
65+
header-includes: >
66+
\newcommand{\proglang}[1]{\textsf{#1}}
67+
\newcommand{\pkg}[1]{\textbf{#1}}
68+
69+
vignette: >
70+
%\VignetteIndexEntry{Rcpp-Libraries}
71+
%\VignetteKeywords{Rcpp, Package, Library}
72+
%\VignettePackage{Rcpp}
73+
%\VignetteEngine{knitr::rmarkdown}
74+
%\VignetteEncoding{UTF-8}
75+
---
76+
77+
```{r initialsetup, include=FALSE}
78+
knitr::opts_chunk$set(cache=TRUE)
79+
cwd <- getwd()
80+
```
81+
82+
# Introduction
83+
84+
The process of building a new package with Rcpp can range from the very easy---a single
85+
simple C++ function---to the very complex. If, and how, external resources are utilised
86+
makes a big difference as this too can range from the very simple---making use of a
87+
header-only library, or directly including a few C++ source files without further
88+
dependencies---to the very complex. <!-- A recent example of a very involved build with
89+
dependencies on several external libraries which may not be packaged is the \pkg{arrow}
90+
package \citep{CRAN:arrow}. -->
91+
92+
Yet a lot of the important action happens in the middle ground. Packages may bring their
93+
own source code, but also depend on just one or two external libraries. This paper
94+
describes one such approach in detail: how we turned the Corels application
95+
\citep{arxiv:corels,github:corels} (provided as a standalone C++-based executable) into an
96+
R-callable package \pkg{RcppCorels} \citep{github:rcppcorels} via \pkg{Rcpp}
97+
\citep{CRAN:Rcpp,JSS:Rcpp}.
98+
99+
# The Thirteen Key Steps
100+
101+
## Ensure Use of a Suitable license
102+
103+
Before embarking on such a journey, it is best to ensure that the licensing framework is
104+
suitable. Many different open-source licenses exist, yet a few key ones dominate and can
105+
generally be used _with each other_. There is however a fair amount of possible legalese
106+
involved, so it is useful to check inter-license compatibility, as well as general
107+
usability of the license in question. Several sites can help via license recommendations,
108+
and checks for interoperability. For example, the site at
109+
[choosealicense.com](https://choosealicense.com/) (which is backed by GitHub) can help, as
110+
can [tldrlegal.com](https://tldrlegal.com/). License choice is a complex topic, and
111+
general recommendations are difficult to make besides the key point of sticking to
112+
already-established and known licenses.
113+
114+
## Ensure the Software builds
115+
116+
In order to see how hard it may be to combine an external entity, either a program or a library,
117+
with R, it helps to ensure that the external entity actually still builds and runs.
118+
119+
This may seem like a small and obvious step, but experience suggests that it is worth
120+
asserting the ability to build with current tools, and possibly also with more than one
121+
compiler or build-system. Consideration of other platforms used by R also matters a great
122+
deal as one of the strengths of the R package system is its ability to cover the three key
123+
operating system families.
124+
125+
## Ensure it still works
126+
127+
This may seem like a variation on the previous point, but besides the ability to _build_
128+
we also need to ensure the ability to _run_ the software. If the external entity has
129+
tests and demos, it is highly recommended to run them. If there are reference results, we
130+
should ensure that they are still obtained, and also that the run-time performance is
131+
still (at a minimum) reasonable.
132+
133+
## Ensure it is compelling
134+
135+
This is of course a very basic litmus test: is the new software relevant? Is it helpful?
136+
Would others benefit from having it packaged and maintained?
137+
138+
## Start an Rcpp package
139+
140+
The first step in getting a new package combining R and C++ is often the creation of a new
141+
Rcpp package. There are several helper functions to choose from. A natural first choice
142+
is `Rcpp.package.skeleton()` from the \pkg{Rcpp} package \citep{CRAN:Rcpp}. It can be
143+
improved by having the optional helper package \pkg{pkgKitten} \citep{CRAN:pkgKitten}
144+
around as its `kitten()` function smoothes some rougher edges left by the underlying Base
145+
R function `package.skeleton()`. This step is shown below in the appendix, and
146+
corresponds to the first commit, followed by a first edit of file `DESCRIPTION`.
147+
148+
Any code added by the helper functions, often just a simple `helloWorld()` variant, can be
149+
run to ensure that the package is indeed functional. More importantly, at this stage, we
150+
can also start building the package as a compressed tar archive and run the R checker on
151+
it.
152+
153+
## Integrate External Package
154+
155+
Given a basic package with C++ support, we can now turn to integrating the external
156+
package. The complexity of this step can, as alluded to earlier, vary from very easy to
157+
very complex. Simple cases include just depending on library headers which can either
158+
be copied to the package, or be provided by another package such as \pkg{BH}
159+
\citep{CRAN:BH}. It may also be a dependency on a fairly standard library available on
160+
most if not all systems. The graphics formats bmp, jpeg or png may be one example; text
161+
formats like JSON or XML are another. One difficulty, though, may be that _run-time_
162+
support does not always guarantee _compile-time_ support. In these cases, a `-dev` or
163+
`-devel` package may need to be installed.
164+
165+
In the concrete case of Corels, we
166+
167+
- copied all existing C++ source and header files over into the `src/` directory;
168+
- renamed all header files from `*.hh` to `*.h` to comply with an R preference;
169+
- created a minimal `src/Makevars` file, initially with link instructions for GMP, later relaxed to
170+
conditional use of GMP (see below);
171+
- moved `main.cc` to a subdirectory as we cannot build with another `main()` function (and R will not include files from subdirectories);
172+
- added a minimal R-callable function along with a `logger` instance.
173+
174+
Here, the last step was needed as the file `main.cc` provided a global instance referred
175+
to from other files. Hence, a minimal R-callable wrapper is being added at this stage
176+
(shown in the appendix as well). Actual functionality will be added later.
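
The role of the global instance can be sketched in a few lines of plain C++. This is an illustration only and not the Corels code: `DemoLogger`, `record()` and `do_work()` are hypothetical stand-ins, and we assume the library sources refer to the global through an `extern` declaration. The point is that every `extern` declaration needs exactly one definition in a file that is compiled into the package, which `main.cc` no longer provides once it is moved aside.

```c++
#include <string>

// Hypothetical stand-in for the logger class used by the library
struct DemoLogger {
    int entries = 0;
    void record(const std::string& msg) { (void)msg; ++entries; }
};

// Declaration only: this is what library sources would see via a header
extern DemoLogger* logger;

// Library-style code that relies on the global instance
void do_work() {
    if (logger != nullptr) logger->record("step");
}

// Definition: formerly supplied by the file holding main(); it now has to
// live in a source file that remains part of the package build
DemoLogger* logger = nullptr;
```

Without the final definition, the shared library would fail to link (or to load) because of an unresolved symbol, which is why the placeholder wrapper carries the instance.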
177+
178+
We will come back to the step concerning the link instructions.
179+
180+
At this point we have a package for R also containing the library we want to add.
181+
182+
## Make the External Code compliant with R Policies
183+
184+
R has fairly strict guidelines, defined both in the _CRAN Repository Policy_ document at
185+
the CRAN website, and in the manual _Writing R Extensions_. Certain standard C and C++
186+
functions are not permitted as their use could interfere with running code from R. This
187+
includes somewhat obvious recommendations ("do not call `abort`" as it would terminate the
188+
R session) but extends to not using native print methods in order to cooperate better
189+
with the input and output facilities of R. So here, and reflecting that last aspect, we
190+
changed all calls to `printf()` to calls to `Rprintf()`. Similarly, R prefers its own
191+
(well-tested) random-number generators so we replaced one (scaled) call to `random() /
192+
RAND_MAX` with the equivalent call to R's `unif_rand()`. We also avoided one use of
193+
`stdout` in `rulelib.h`.
194+
195+
The requirement for such changes may seem excessive at first, but the value added stemming
196+
from consistent application of the CRAN Policies is appreciated by most R users.
197+
198+
## Complete the Interface
199+
200+
In order to further test the package, and of course also for actual use, we need to expose
201+
the key parameters and arguments. Corels parsed command-line arguments; we can translate
202+
this directly into suitable arguments for the main function. At a first pass, we created the
203+
following interface:
204+
205+
```c++
206+
// [[Rcpp::export]]
207+
bool corels(std::string rules_file,
208+
std::string labels_file,
209+
std::string log_dir,
210+
std::string meta_file = "",
211+
bool run_bfs = false,
212+
bool calculate_size = false,
213+
bool run_curiosity = false,
214+
int curiosity_policy = 0,
215+
bool latex_out = false,
216+
int map_type = 0,
217+
int verbosity = 0,
218+
int max_num_nodes = 100000,
219+
double regularization = 0.01,
220+
int logging_frequency = 1000,
221+
int ablation = 0) {
222+
223+
// actual function body omitted
224+
}
225+
```
226+
227+
Rcpp facilitates the integration by adding another wrapper exposing all the function
228+
arguments, and setting up required arguments without default (the first three) along with
229+
optional arguments given a default. The user can now call `corels()` from R with three
230+
required arguments (the two input files plus the log directory) as well as a number of
231+
optional arguments.
232+
233+
## Add Sample Data
234+
235+
R packages can access data files that are shipped with them. That is a very useful feature,
236+
and we therefore also copy in the files included in the Corels repository and its `data/`
237+
directory.
238+
239+
```{r}
240+
fs::dir_tree("../../../rcppcorels/inst/sample_data")
241+
```
242+
243+
## Set up working example
244+
245+
Combining the two preceding steps, we can now offer an illustrative example. It is
246+
included in the help page for function `corels()` and can be run from R via
247+
`example("corels")`.
248+
249+
```r
250+
library(RcppCorels)
251+
252+
.sysfile <- function(f) # helper function
253+
system.file("sample_data",f,package="RcppCorels")
254+
255+
rules_file <- .sysfile("compas_train.out")
256+
labels_file <- .sysfile("compas_train.label")
257+
meta_file <- .sysfile("compas_train.minor")
258+
logdir <- tempdir()
259+
260+
stopifnot(file.exists(rules_file),
261+
file.exists(labels_file),
262+
file.exists(meta_file),
263+
dir.exists(logdir))
264+
265+
corels(rules_file, labels_file, logdir, meta_file,
266+
verbosity = 100,
267+
regularization = 0.015,
268+
curiosity_policy = 2, # by lower bound
269+
map_type = 1) # permutation map
270+
271+
cat("See ", logdir, " for result file.")
272+
```
273+
274+
In the example, we pass the two required arguments for rules and labels files, the
275+
optional argument for the 'meta' file as well as an added required argument for the output
276+
directory. As R policy prohibits writing in user directories, we default to using the
277+
temporary directory of the current session, and report its value at the end. For other
278+
arguments default values are used.
279+
280+
## Finesse Library Dependencies
281+
282+
One common difficulty when bringing an external library to R via a package consists in dealing with
283+
an external dependency. In the case of 'Corels', the GNU GMP library for multi-precision arithmetic
284+
is an optional extension which, if available, improves and accelerates internal processing.
285+
286+
The simplest approach is to declare a compile-time variable in the `src/Makevars` file. Using
287+
`-DGMP` _defines_ the `GMP` variable at the level of the C and C++ code. One can then condition on
288+
the variable. A very standard approach, also used here, is `#if defined(GMP) ... #else ... #endif`
289+
where one of the two code branches is in effect depending on whether the `GMP` variable is defined
290+
or not.
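
A minimal sketch of this pattern follows. It is not taken from the Corels sources: the helper `count_ones()` is hypothetical and merely stands in for an operation that GMP could accelerate. When the file is compiled with `-DGMP` (and linked with `-lgmp`) the first branch is used; otherwise the portable fallback is compiled and no external library is needed.

```c++
#if defined(GMP)
#include <gmp.h>   // only pulled in when compiled with -DGMP
#endif

// Hypothetical helper, not part of Corels: count the set bits in a word
unsigned long count_ones(unsigned long x) {
#if defined(GMP)
    // GMP branch: delegate to the library's population count
    mpz_t v;
    mpz_init_set_ui(v, x);
    unsigned long n = mpz_popcount(v);
    mpz_clear(v);
    return n;
#else
    // Fallback branch: portable loop without any external dependency
    unsigned long n = 0;
    while (x != 0) {
        n += x & 1UL;
        x >>= 1;
    }
    return n;
#endif
}
```

Both branches return the same result, so callers do not need to know which variant was built; only the build flags differ.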
291+
292+
In order to detect presence of a required (or optional) library, tools like 'autoconf' or 'cmake'
293+
are often used. For example, one can rely on an existing 'autoconf' macro provided by the GMP
294+
documentation to detect presence of the GNU GMP header and library. We are making use of this
295+
facility here to deploy GMP when it is available. As 'Corels' can be built with and without GMP,
296+
the build and installation succeeds either way---but deployment of the more-featureful variant with
297+
use of GMP is automated.
298+
299+
## Finalise License and Copyright
300+
301+
It is good (and common) practice to clearly attribute authorship. Here, credit is given to
302+
the 'Corels' team and authors as well as to the authors of the underlying 'rulelib' code
303+
used by 'Corels' via the file `inst/AUTHORS` (which will be installed as `AUTHORS` with
304+
the package). In addition, the file `inst/LICENSE` clarifies the GNU GPL-3 license for 'RcppCorels' and 'Corels', and the MIT license for 'rulelib'.
305+
306+
## Additional Bonus: Some more 'meta' files
307+
308+
Several files help to improve the package. For example, `.Rbuildignore` allows us to exclude
309+
listed files from the resulting R package keeping it well-defined. Similarly, `.gitignore`
310+
can exclude files from being added to the `git` repository. We also like `.editorconfig`
311+
for consistent editing defaults across a range of modern editors.
312+
313+
# Summary
314+
315+
We describe a series of steps to turn the standalone library 'Corels' described by
316+
\citet{arxiv:corels} into an R package \pkg{RcppCorels} using the facilities offered by
317+
\pkg{Rcpp} \citep{CRAN:Rcpp}. Along the way, we illustrate key aspects of the R package
318+
standards and CRAN Repository Policy, providing a template for other research software
319+
wishing to provide their implementations in a form that is accessible to R users.
320+
321+
322+
\bibliography{Rcpp}
323+
\bibliographystyle{jss}
324+
325+
\newpage
326+
\onecolumn
327+
328+
## Appendix 1: Creating the basic package
329+
330+
```sh
331+
edd@rob:~/git$ r --packages Rcpp --eval 'Rcpp.package.skeleton("RcppCorels")'
332+
333+
Attaching package: ‘utils’
334+
335+
The following objects are masked from ‘package:Rcpp’:
336+
337+
.DollarNames, prompt
338+
339+
Creating directories ...
340+
Creating DESCRIPTION ...
341+
Creating NAMESPACE ...
342+
Creating Read-and-delete-me ...
343+
Saving functions and data ...
344+
Making help files ...
345+
Done.
346+
Further steps are described in './RcppCorels/Read-and-delete-me'.
347+
348+
Adding Rcpp settings
349+
>> added Imports: Rcpp
350+
>> added LinkingTo: Rcpp
351+
>> added useDynLib directive to NAMESPACE
352+
>> added importFrom(Rcpp, evalCpp) directive to NAMESPACE
353+
>> added example src file using Rcpp attributes
354+
>> added Rd file for rcpp_hello_world
355+
>> compiled Rcpp attributes
356+
edd@rob:~/git$
357+
edd@rob:~/git$ mv RcppCorels/ rcppcorels # prefer lowercase directories
358+
edd@rob:~/git$
359+
```
360+
361+
## Appendix 2: A Minimal src/Makevars
362+
363+
In the file shown here, use of GMP is unconditional: we define `GMP` as a compiler flag, and
364+
instruct the linker to link with the GMP library.
365+
366+
```sh
367+
368+
CXX_STD = CXX11
369+
370+
PKG_CXXFLAGS = -I. -DGMP
371+
372+
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) -lgmp
373+
374+
```
375+
376+
## Appendix 3: A Placeholder Wrapper
377+
378+
```c++
379+
380+
#include "queue.h"
381+
382+
#include <Rcpp.h>
383+
384+
/*
385+
* Logs statistics about the execution of the algorithm and dumps it to a file.
386+
* To turn off, pass verbosity <= 1
387+
*/
388+
NullLogger* logger;
389+
390+
// [[Rcpp::export]]
391+
bool corels() {
392+
return true; // more to fill in, naturally
393+
}
394+
```
