xgboost loss functions
xgboost is an extremely fast R package for learning nonlinear machine learning models using gradient boosting algorithms. It supports several different kinds of outputs via the objective argument. However, it is currently missing:
- For left-, right-, and interval-censored outputs, AFT (accelerated failure time, https://en.wikipedia.org/wiki/Accelerated_failure_time_model) losses (Gaussian, Logistic).
- For count data with an upper bound, the binomial loss, i.e. the negative log-likelihood of the binomial distribution (https://en.wikipedia.org/wiki/Binomial_distribution).
Other R packages such as gbm implement the Cox loss for boosting with censored output regression. However, gbm supports neither the AFT nor the binomial loss.
Other R packages such as glmnet implement the binomial loss for regularized linear models. However, glmnet fits linear models, which may be less accurate than boosting for some applications/data sets.
Figure out a method for passing these outputs to xgboost. In both cases (binomial/censored) the outputs can be represented as a 2-column matrix. Typically in R the
- censored outputs would be specified via Surv(lower.limit, upper.limit, type="interval2")
- binomial/count outputs would be specified as in glmnet, via a "two-column matrix of counts or proportions (the second column is treated as the target class)." The loss function in this case is the negative log-likelihood of the binomial distribution. It can actually be used (inefficiently) in the current version of xgboost by duplicating the feature matrix and then using the logistic loss with non-uniform weights: y = [n-vector of ones, n-vector of zeros], w = [n-vector of success counts, n-vector of failure counts], as sketched below. However, this is relatively inefficient because a data set of size 2n must be constructed. In the proposed GSOC project we should implement the loss so that it works on the original data set of size n.
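As a concrete illustration of the two representations and of the 2n-row workaround described above, here is a minimal R sketch; all data are simulated and the object names are hypothetical, so this is only a starting point, not part of the proposed implementation:
library(survival)
library(xgboost)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 2), n, 2) # n x p feature matrix
# Censored outputs: lower/upper limits. Under the interval2 convention,
# equal limits mean an uncensored observation and an NA upper limit means
# right-censored. (y.censored only shows the typical encoding; xgboost
# cannot consume it yet.)
lower.limit <- rexp(n)
upper.limit <- ifelse(runif(n) < 0.3, NA, lower.limit)
y.censored <- Surv(lower.limit, upper.limit, type = "interval2")
# Binomial/count outputs: two columns of success and failure counts.
successes <- rbinom(n, size = 10, prob = plogis(X[, 1]))
failures <- 10 - successes
# Workaround with current xgboost: duplicate the features (2n rows),
# label the first copy 1 and the second copy 0, and weight each row
# by the corresponding success/failure count.
dtrain <- xgb.DMatrix(
  data = rbind(X, X),
  label = c(rep(1, n), rep(0, n)),
  weight = c(successes, failures))
fit <- xgb.train(
  params = list(objective = "binary:logistic"),
  data = dtrain, nrounds = 10)
# The proposed binomial objective would accept (successes, failures)
# directly and work on the original n rows.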
In xgboost, implement the binomial loss for count outputs, and the Gaussian/Logistic AFT losses for censored outputs, along with:
- Docs
- Tests
This project will provide support for two common but currently unimplemented loss functions / output data types in the xgboost package.
Students, please contact the mentors below after completing at least one of the tests listed further down this page.
- Toby Hocking <[email protected]> is a machine learning researcher and R package developer.
- Hyunsu Cho <[email protected]> is an expert in XGBoost internals and the core C++ stack.
Students, please do one or more of the following tests before contacting the mentors above.
MENTORS: write several tests that potential students can do to demonstrate their capabilities for this particular project. Ask some hard questions that will give you insight about how the students write code to solve problems. You’ll see that the harder the questions that you ask, the easier it will be for you to choose between the students that apply for your project! Please modify the suggestions below to make them specific for your project.
- Easy: Create an Rmd file with source code that performs 5-fold cross-validation to evaluate the predictive accuracy of xgboost for a non-default objective function, e.g. count:poisson. Make a figure that shows the test error for xgboost and an un-informed baseline that ignores all input features (i.e. all predictions should be equal to the mean of the labels in the train data). A rough sketch of such a cross-validation loop is given below.
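A rough, hypothetical sketch of the kind of cross-validation loop this test asks for (simulated data, squared error as the test metric, the figure left to the student):
library(xgboost)
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 3), n, 3)
y <- rpois(n, lambda = exp(X[, 1])) # simulated count labels
fold <- sample(rep(1:5, length.out = n)) # assign each row to one of 5 folds
err.list <- list()
for (k in 1:5) {
  is.test <- fold == k
  dtrain <- xgb.DMatrix(X[!is.test, ], label = y[!is.test])
  fit <- xgb.train(params = list(objective = "count:poisson"),
                   data = dtrain, nrounds = 50)
  pred.xgb <- predict(fit, X[is.test, ])
  pred.baseline <- mean(y[!is.test]) # un-informed baseline: mean of train labels
  err.list[[k]] <- data.frame(
    fold = k,
    xgboost = mean((pred.xgb - y[is.test])^2),
    baseline = mean((pred.baseline - y[is.test])^2))
}
err <- do.call(rbind, err.list) # one row of test errors per fold, ready to plot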
- Easy: Compile the XGBoost R package from the latest source using CMake:
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
mkdir build
cd build
cmake .. -DR_LIB=ON # R_LIB option enables R package
make -j4
make install # this command installs XGBoost R package
- Easy: Write a customized objective function in XGBoost-R. Consider the following function, which penalizes over-estimation twice as much as under-estimation:
my_loss(y, yhat) = max(yhat - y, 0.5 * (y - yhat))^2
The first and second partial derivatives of my_loss with respect to the second argument are
grad(y, yhat) = ifelse(yhat > y, 2 * (yhat - y), -0.5 * (y - yhat))
hess(y, yhat) = ifelse(yhat > y, 2, 0.5)
See the example of a customized objective at https://github.com/dmlc/xgboost/blob/master/R-package/demo/custom_objective.R; a minimal sketch of how my_loss could be plugged in follows.
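Here is a minimal sketch of supplying such a customized objective via the obj argument of xgb.train (only the grad/hess logic comes from the item above; the data and tuning parameters are hypothetical):
library(xgboost)
my.objective <- function(preds, dtrain) {
  y <- getinfo(dtrain, "label")
  # first and second partial derivatives of my_loss with respect to yhat
  grad <- ifelse(preds > y, 2 * (preds - y), -0.5 * (y - preds))
  hess <- ifelse(preds > y, 2, 0.5)
  list(grad = grad, hess = hess)
}
set.seed(1)
X <- matrix(rnorm(200), 100, 2)
y <- X[, 1] + rnorm(100)
dtrain <- xgb.DMatrix(X, label = y)
fit <- xgb.train(params = list(max_depth = 2, eta = 0.1),
                 data = dtrain, nrounds = 20, obj = my.objective)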
- Easy [WIP]: Add diagnostic logging at an arbitrary point in the C++ codebase. This is really helpful when debugging the core.
- Medium: write a vignette in LaTeX or MathJax explaining how to use the logistic loss with non-uniform weights to get the binomial loss function in xgboost; a possible starting identity is sketched below.
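With s_i successes, f_i failures, predicted score \hat{y}_i, and p_i = 1/(1 + e^{-\hat{y}_i}) (notation chosen here, not taken from the packages), the binomial negative log-likelihood decomposes as
-\log P(S_i = s_i) = -\bigl[ s_i \log p_i + f_i \log(1 - p_i) \bigr] + \text{const} = s_i \, \ell(1, \hat{y}_i) + f_i \, \ell(0, \hat{y}_i) + \text{const},
where \ell(y, \hat{y}) denotes the logistic (binary cross-entropy) loss, so each observation contributes the same objective as one duplicated row with label 1 and weight s_i plus one with label 0 and weight f_i.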
- Medium: Create your own loss function in C++ [WIP]
- Hard: Can the student write a package with Rd files, tests, and vignettes? If your package interfaces with non-R code, can the student write in that other language?
Students, please post a link to your test results here.