Commit 3043827

[R] Update vignette "XGBoost presentation" (dmlc#10749)
Co-authored-by: Jiaming Yuan <[email protected]>
1 parent 7794d3d commit 3043827


R-package/vignettes/xgboostPresentation.Rmd

Lines changed: 98 additions & 116 deletions
@@ -6,7 +6,7 @@ output:
   number_sections: yes
   toc: yes
 bibliography: xgboost.bib
-author: Tianqi Chen, Tong He, Michaël Benesty
+author: Tianqi Chen, Tong He, Michaël Benesty, David Cortes
 vignette: >
   %\VignetteIndexEntry{XGBoost presentation}
   %\VignetteEngine{knitr::rmarkdown}
@@ -25,50 +25,34 @@ The purpose of this Vignette is to show you how to use **XGBoost** to build a mo
 
 It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
 
+- *tree learning* algorithm (in different varieties).
 - *linear* model ;
-- *tree learning* algorithm.
 
-It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objective functions easily.
+It supports various objective functions, including *regression*, *classification* (binary and multi-class) and *ranking*. The package is made to be extensible, so that users are also allowed to define their own objective functions easily.
 
 It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.
 
 It has several features:
 
-* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
+* Speed: it can automatically do parallel computations with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
 * Input Type: it takes several types of input data:
   * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
   * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
   * Data File: local data files ;
-  * `xgb.DMatrix`: its own class (recommended).
-* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
+  * Data frames (class `data.frame` and sub-classes from it such as `data.table`), taking
+    both numeric and categorical (factor) features.
+  * `xgb.DMatrix`: its own class (recommended, also supporting numeric and categorical features).
 * Customization: it supports customized objective functions and evaluation functions (a sketch of a custom objective follows below).
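
As an aside on the *Customization* bullet above: a custom objective is usually written as an R function that returns the gradient and hessian of the loss. The sketch below is illustrative only; it reuses the agaricus `train` data loaded later in this vignette and assumes the long-standing `function(preds, dtrain)` convention. Depending on the package version, the function is passed either as `objective` inside `params` or through a dedicated argument of `xgb.train()`, so check `?xgb.train` before relying on it.

```r
# Illustrative sketch of a custom binary-logistic objective (not part of the
# vignette's own code). Assumes the classic convention of a function taking
# (preds, dtrain) and returning the per-observation gradient and hessian.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  probs  <- 1 / (1 + exp(-preds))   # raw scores -> probabilities
  grad   <- probs - labels          # first derivative of the log-loss
  hess   <- probs * (1 - probs)     # second derivative of the log-loss
  list(grad = grad, hess = hess)
}

# Hypothetical usage; the exact way to supply a custom objective can differ
# across XGBoost versions (params entry vs. dedicated argument).
bstCustom <- xgb.train(
  data = xgb.DMatrix(train$data, label = train$label, nthread = 1),
  params = list(objective = logregobj, max_depth = 2, eta = 1),
  nthread = 2,
  nrounds = 2
)
```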
 
 ## Installation
 
-
-### GitHub version
-
-
-For weekly updated version (highly recommended), install from *GitHub*:
-
-```{r installGithub, eval=FALSE}
-install.packages("drat", repos = "https://cran.rstudio.com")
-drat:::addRepo("dmlc")
-install.packages("xgboost", repos = "http://dmlc.ml/drat/", type = "source")
-```
-
-> *Windows* user will need to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) first.
-
-### CRAN version
-
-
-The version 0.4-2 is on CRAN, and you can install it by:
+The package can be easily installed from CRAN:
 
 ```{r, eval=FALSE}
 install.packages("xgboost")
 ```
 
-Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost/)
+For the development version, see the [GitHub page](https://github.com/dmlc/xgboost) and the [installation docs](https://xgboost.readthedocs.io/en/stable/install.html) for further instructions.
 
 ## Learning
 
@@ -124,7 +108,7 @@ dim(train$data)
 dim(test$data)
 ```
 
-This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge dataset very efficiently.
+This dataset is kept very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.
 
 As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and the `label` vector is a `numeric` vector (`{0,1}`):
 
@@ -144,7 +128,7 @@ We are using the `train` data. As explained above, both `data` and `label` are s
 
 In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.
 
-We will train decision tree model using the following parameters:
+We will train a decision tree model using the following parameters:
 
 * `objective = "binary:logistic"`: we will train a binary classification model (note that this is set automatically when `y` is a `factor`) ;
 * `max_depth = 2`: the trees won't be deep, because our case is very simple ;
@@ -156,12 +140,36 @@ bstSparse <- xgboost(
     x = train$data
     , y = factor(train$label, levels = c(0, 1))
     , objective = "binary:logistic"
-    , params = list(max_depth = 2, eta = 1)
+    , max_depth = 2
+    , eta = 1
     , nrounds = 2
     , nthread = 2
 )
 ```
 
+Note that, while the R function `xgboost()` follows typical R idioms for statistical modeling packages,
+such as an x/y division of the data passed as the first arguments, it also offers a more flexible `xgb.train`
+interface which is more consistent across different language bindings (e.g. arguments are the same as
+in the Python XGBoost library) and which exposes some additional functionality. The `xgb.train`
+interface uses XGBoost's own DMatrix class to pass data to it, and instead accepts the model parameters
+as a named list:
+
+```{r}
+bstTrInterface <- xgb.train(
+    data = xgb.DMatrix(train$data, label = train$label, nthread = 1)
+    , params = list(
+        objective = "binary:logistic"
+        , max_depth = 2
+        , eta = 1
+    )
+    , nthread = 2
+    , nrounds = 2
+)
+```
+
+For the rest of this tutorial, we'll nevertheless be using the `xgboost()` interface, which will be
+more familiar to users of packages such as GLMNET or Ranger.
+
 > The more complex the relationship between your features and your `label`, the more passes you need.
 
 #### Parameter variations
@@ -174,78 +182,51 @@ Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R**
 bstDense <- xgboost(
     x = as.matrix(train$data),
     y = factor(train$label, levels = c(0, 1)),
-    params = list(max_depth = 2, eta = 1),
-    nrounds = 2,
-    nthread = 2
+    max_depth = 2,
+    eta = 1,
+    nthread = 2,
+    nrounds = 2
 )
 ```
 
-##### xgb.DMatrix
+##### Data frame
 
-**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.
+As another alternative, XGBoost will also accept `data.frame` objects, from which it can
+use numeric, integer and factor columns:
 
-```{r trainingDmatrix, message=F, warning=F}
-dtrain <- xgb.DMatrix(data = train$data, label = train$label, nthread = 2)
-bstDMatrix <- xgb.train(
-    data = dtrain,
-    params = list(
-        max_depth = 2,
-        eta = 1,
-        nthread = 2,
-        objective = "binary:logistic"
-    ),
+```{r}
+df_train <- as.data.frame(as.matrix(train$data))
+bstDF <- xgboost(
+    x = df_train,
+    y = factor(train$label, levels = c(0, 1)),
+    max_depth = 2,
+    eta = 1,
+    nthread = 2,
     nrounds = 2
 )
 ```
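
Since the added text above stresses that factor columns are treated as categorical features, here is a small toy sketch of that usage (illustrative only: the data frame, column names and labels below are made up, and the call simply mirrors the `xgboost()` x/y signature already shown):

```r
# Toy illustration: a data.frame mixing a numeric column and a factor
# (categorical) column, fitted through the same x/y interface as above.
df_toy <- data.frame(
  size  = c(1.2, 3.4, 2.2, 0.5, 4.1, 2.9),
  color = factor(c("red", "green", "red", "blue", "green", "blue"))
)
y_toy <- factor(c(0, 1, 0, 0, 1, 1), levels = c(0, 1))

bstToy <- xgboost(
  x = df_toy,
  y = y_toy,
  max_depth = 2,
  eta = 1,
  nthread = 1,
  nrounds = 2
)
```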
 
-##### Verbose option
-
-**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
+##### Verbosity levels
 
-One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced techniques).
+**XGBoost** has several features to help you see how the learning progresses internally. The purpose is to help you
+set the best parameters, which is key to your model quality. Note that when using the `xgb.train` interface,
+one can also use a separate evaluation dataset (e.g. a different subset of the data than the training dataset) on
+which to monitor metrics of interest, and it also offers an `xgb.cv` function which automatically splits the data
+to create evaluation subsets for you.
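
On the `xgb.cv` mention above, a minimal cross-validation sketch (assuming `xgb.cv()` follows the same DMatrix-plus-`params` pattern as `xgb.train()`; argument details may vary across versions):

```r
# Illustrative 5-fold cross-validation on the training data; at each round the
# evaluation metric is averaged over the held-out folds.
cv <- xgb.cv(
  data = xgb.DMatrix(train$data, label = train$label, nthread = 1),
  params = list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2),
  nrounds = 5,
  nfold = 5
)
print(cv)
```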
 
-```{r trainingVerbose0, message=T, warning=F}
-# verbose = 0, no message
-bst <- xgb.train(
-    data = dtrain
-    , params = list(
-        max_depth = 2
-        , eta = 1
-        , nthread = 2
-        , objective = "binary:logistic"
-    )
-    , nrounds = 2
-    , verbose = 0
-)
-```
+One of the simplest ways to see the training progress is to set the `verbosity` option:
 
 ```{r trainingVerbose1, message=T, warning=F}
-# verbose = 1, print evaluation metric
-bst <- xgb.train(
-    data = dtrain
-    , params = list(
-        max_depth = 2
-        , eta = 1
-        , nthread = 2
-        , objective = "binary:logistic"
-    )
-    , nrounds = 2
-    , verbose = 1
-)
-```
-
-```{r trainingVerbose2, message=T, warning=F}
-# verbose = 2, also print information about tree
-bst <- xgb.train(
-    data = dtrain
-    , params = list(
-        max_depth = 2
-        , eta = 1
-        , nthread = 2
-        , objective = "binary:logistic"
-    )
-    , nrounds = 2
-    , verbose = 2
+# verbosity = 1, print evaluation metric
+bst <- xgboost(
+    x = train$data,
+    y = factor(train$label, levels = c(0, 1)),
+    max_depth = 2,
+    eta = 1,
+    nthread = 2,
+    objective = "binary:logistic",
+    nrounds = 5,
+    verbosity = 1
 )
 ```
 
@@ -267,56 +248,53 @@ print(length(pred))
 print(head(pred))
 ```
 
-These numbers doesn't look like *binary classification* `{0,1}`. We need to perform a simple transformation before being able to use these results.
-
-## Transform the regression in a binary classification
-
-
-The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model.
-
-How can we use a *regression* model to perform a binary classification?
-
-If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observation is classified as `1` (or `0` otherwise).
+These numbers reflect the predicted probabilities of belonging to the class '1' in the 'y' data. Consequently,
+the probability of belonging to the class '0' is then $P(y=0) = 1 - P(y=1)$. This implies: if the number is greater
+than 0.5, then according to the model it is more likely that an observation will be of class '1', whereas if the
+number is lower than 0.5, it is more likely that the observation will be of class '0':
 
 ```{r predictingTest, message=F, warning=F}
 prediction <- as.numeric(pred > 0.5)
 print(head(prediction))
 ```
 
+Note that one can also control the prediction type directly to obtain classes instead of probabilities.
+
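
Regarding the note just above about controlling the prediction type: a hedged sketch follows. It assumes that the `predict()` method for models fitted with `xgboost()` exposes a `type` argument accepting "class" (check the predict documentation for your installed version); if it does not, the thresholding shown above works everywhere.

```r
# Assumed API (verify against your installed version): ask predict() for class
# labels directly instead of probabilities. The model object name is
# illustrative; substitute the one fitted earlier in the vignette.
pred_class <- predict(bst, test$data, type = "class")
head(pred_class)
```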
 ## Measuring model performance
 
 
-To measure the model performance, we will compute a simple metric, the *average error*.
+To measure the model performance, we will compute a simple metric, the *accuracy rate*.
 
 ```{r predictingAverageError, message=F, warning=F}
-err <- mean(as.numeric(pred > 0.5) != test$label)
-print(paste("test-error=", err))
+acc <- mean(as.numeric(pred > 0.5) == test$label)
+print(paste("test-acc=", acc))
 ```
 
 > Note that the algorithm has not seen the `test` data during the model construction.
 
 Steps explanation:
 
 1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1` and `0` otherwise ;
-2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of error between true data and computed probabilities ;
-3. `mean(vectorOfErrors)` computes the *average error* itself.
+2. `probabilityVectorPreviouslyComputed == test$label` computes a vector indicating whether the predicted class matches the real data ;
+3. `mean(vectorOfMatches)` computes the *accuracy rate* itself.
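
As a complementary check (base R only, no additional XGBoost API assumed), a confusion matrix shows the same comparison broken down by class; the accuracy above is the share of counts on its diagonal:

```r
# Cross-tabulate predicted classes against the true test labels.
table(predicted = prediction, actual = test$label)
```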
 
-The most important thing to remember is that **to do a classification, you just do a regression to the** `label` **and then apply a threshold**.
+The most important thing to remember is that **to obtain the predicted class of an observation, a threshold needs to be applied to the predicted probabilities**.
 
 *Multiclass* classification works in a similar way.
 
-This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
+This metric is **`r round(acc, 2)`** and is pretty high: our yummy mushroom model works well!
 
 ## Advanced features
 
 
 Most of the features below have been implemented to help you to improve your model by offering a better understanding of its content.
 
 
-### Dataset preparation
+### Dataset preparation for xgb.train
 
 
-For the following advanced features, we need to put data in `xgb.DMatrix` as explained above.
+For the following advanced features, we'll be using the `xgb.train()` interface instead of the `xgboost()`
+interface, so we need to put data in an `xgb.DMatrix` as explained earlier:
 
 ```{r DMatrix, message=F, warning=F}
 dtrain <- xgb.DMatrix(data = train$data, label = train$label, nthread = 2)
@@ -332,7 +310,7 @@ One of the special feature of `xgb.train` is the capacity to follow the progress
 
 One way to measure progress in the learning of a model is to provide **XGBoost** with a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
 
-> in some way it is similar to what we have done above with the average error. The main difference is that below it was after building the model, and now it is during the construction that we measure errors.
+> in some way it is similar to what we have done above with the prediction accuracy. The main difference is that there it was computed after building the model, whereas here the quality of predictions is measured during construction.
 
 For the purpose of this example, we use the `evals` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.
 
@@ -352,7 +330,9 @@ bst <- xgb.train(
 )
 ```
 
-**XGBoost** has computed at each round the same average error metric than seen above (we set `nrounds` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
+**XGBoost** has computed at each round the same average log-loss (the negative logarithm of the Bernoulli likelihood)
+that it uses as the optimization objective to minimize, on both of the datasets. Obviously, the `train_logloss` number is
+related to the training dataset (the one the algorithm learns from) and the `test_logloss` number to the test dataset.
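
For reference, the log-loss reported at each round is the standard binary cross-entropy (stated here as a reminder, not taken from the vignette):

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

where $\hat{p}_i$ is the predicted probability that observation $i$ belongs to class '1'.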
 
 Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
 
@@ -442,12 +422,12 @@ err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
 print(paste("test-error=", err))
 ```
 
-### View feature importance/influence from the learnt model
+### View feature importance/influence from the fitted model
 
 
 Feature importance is similar to R gbm package's relative influence (rel.inf).
 
-```
+```{r}
 importance_matrix <- xgb.importance(model = bst)
 print(importance_matrix)
 xgb.plot.importance(importance_matrix = importance_matrix)
@@ -456,15 +436,15 @@ xgb.plot.importance(importance_matrix = importance_matrix)
 #### View the trees from a model
 
 
-You can dump the tree you learned using `xgb.dump` into a text file.
+XGBoost can output the trees it fitted in a standard tabular format:
 
-```{r dump, message=T, warning=F}
-xgb.dump(bst, with_stats = TRUE)
+```{r}
+xgb.model.dt.tree(bst)
 ```
 
 You can plot the trees from your model using `xgb.plot.tree`:
 
-```
+```{r}
 xgb.plot.tree(model = bst)
 ```
 
@@ -475,7 +455,9 @@ xgb.plot.tree(model = bst)
 
 Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
 
-Hopefully for you, **XGBoost** implements such functions.
+XGBoost models can be saved through R functions such as `save` and `saveRDS`, but in addition, it also offers
+its own serialization format, which might have better compatibility guarantees across versions of XGBoost and
+which can also be loaded into other language bindings:
 
 ```{r saveModel, message=F, warning=F}
 # save model to binary local file
@@ -507,7 +489,7 @@ file.remove(fname)
 
 > result is `0`? We are good!
 
-In some very specific cases, you will want to save the model as a *R* binary vector. See below how to do it.
+In some very specific cases, you will want to save the model as an *R* raw vector. See below how to do it.
 
 ```{r saveLoadRBinVectorModel, message=F, warning=F}
 # save model to R's raw vector