- Changed `CV_lrnr_sl` to `cv_sl`.
- Added `Lrnr_glmtree`, which uses the `partykit` R package to fit recursive partitioning and regression trees in a generalized linear model.
- Added fold-specific SL coefficients to the output of `cv_sl`, and removed the coefficients column from the returned `cv_risk` table.
- Added `get_sl_revere_risk` argument to `Lrnr_sl`'s `cv_risk` method to provide the option (with default of `FALSE`) to add a super learner's revere-based risk (not a true cross-validated risk) to `cv_risk` output.
- Changed default metalearner to `Lrnr_nnls` for binary and continuous outcomes.
- Added `cv_control` argument to `Lrnr_sl`, which allows users to define specific cross-validation structures for fitting the super learner. This is intended for use in a nested cross-validation scheme (such as cross-validated super learner, `cv_sl`, or when `Lrnr_sl` is considered in the list of candidate `learners` in another `Lrnr_sl`). In addition to constructing clustered cross-validation with respect to `id`, `cv_control` can also be used to construct stratified cross-validation folds for `Lrnr_sl`.
- `Lrnr_caret` now works for binary and categorical outcomes. Previous versions stated that these discrete outcome types were supported by `Lrnr_caret`, but the functionality would break.
- Added public function for `sl3_Task`, `get_folds`, which takes in `origami::make_folds` arguments and returns the folds. This function is now called by `task$folds`, and it can be called in train as well, to obtain folds from a task that has a non-default fold structure.
- Learners that use CV internally (i.e., as part of their procedure to select tuning parameters), including `Lrnr_caret`, `Lrnr_glmnet`, `Lrnr_hal9001`, and `Lrnr_sl`, use `task$get_folds` to create folds. The learners' folds respect the default CV fold structure in `sl3` tasks (clustered CV when `id` is supplied in the task; stratified CV when outcomes are categorical or binary, with `id` nested in strata if `id` is supplied to the task). However, `V` can be modified according to the learner-specific parameters. (`Lrnr_sl` has a few extra CV tuning arguments, which are thoroughly documented in `cv_control`; modifications are only recommended for advanced use of `Lrnr_sl`.)
- Fixed learner parameter `formula` bug, which was causing formulas with "." to return an empty task, and therefore learners with these formulas to fail.
- Fixed bug in `Lrnr_cv_selector` metalearner, which was using the wrong folds to calculate the cross-validated risk estimate. This impacted `Lrnr_cv_selector` when `eval_function` was not a loss function, e.g. AUC. By calling `task$folds` on the metalearner's training task, we were deriving folds from the matrix of cross-validated predictions, and not using the folds for cross-validating the candidates. We now require the folds for cross-validating the candidates (i.e., the folds in the task for training `Lrnr_sl`) to be supplied when `Lrnr_cv_selector`'s `eval_function` is not a loss function.
- `Lrnr_caret` and `Lrnr_rpart` factor binary outcomes in their `train` methods, thereby considering a classification prediction problem. To avoid this behavior and consider a regression prediction problem with a binary outcome (e.g., to minimize the squared error or negative log likelihood loss in a binary outcome prediction problem), users can set `factor_binary_outcome = FALSE` when they instantiate the learner.
- Tasks can be created without an outcome. This comes in handy when creating a task that is used only for prediction, not for training, and leads to the task's outcome type being set to "none" if it's not supplied.
- When the variable type of the outcome (i.e., `outcome_type`) is necessary for a learner's `predict` method (e.g., if categorical outcome predictions need to be "packed" together), the outcome type in the training task should be used. That is, `private$.training_outcome_type` should be used to obtain the outcome type in a learner's `predict` method; the task supplied to `predict` should not be used. The following learners were referring to the task supplied to `predict` in order to retain the outcome type, and they were modified to use the training task's outcome type instead: `Lrnr_svm`, `Lrnr_randomForest`, `Lrnr_ranger`, `Lrnr_rpart`, `Lrnr_polspline`. The issue with pulling the outcome type from the task supplied to `predict` is that the outcome type of that task might be "none", if the `outcome` argument is not supplied to it.
- Updated the learner template (inst/templates/Lrnr_template.R) to reflect the new formatting guidelines for learner documentation.
- Updated documentation for `sl3_Task` parameters (man-roxygen/sl3_Task_extra.R). Specifically, `drop_missing_outcome` and `flag` were added; the `offset` description was fixed; a description of `folds` was added, including how to modify it and the default; and a description was added of how the default cross-validation structure considers `id` and discrete (binary and categorical) outcome types to construct clustered and stratified cross-validation schemes, respectively.
- Added documentation for the function `process_data` (R/process_data.R), which is called when instantiating a task, to process the covariates and identify missingness in the outcome.
- Added `Lrnr_grfcate`, a prediction function estimator for conditional average treatment effect (CATE), which uses the `causal_forest` function in the `grf` package. This learner is intended for use in the `tmle3mopttx` package, where CATE estimation and prediction is required.
- Added flexibility and error handling to optional `sl3_Task` argument `outcome_type`. Either `"binomial"`, `"binary"` or `binomial()` can be supplied for a binary outcome; `"continuous"`, `"gaussian"`, or `gaussian()` for a continuous outcome; `"categorical"`, `"multinomial"`, or `multinomial()` for a categorical outcome. As before, when `outcome_type` is not supplied, we will try to detect it from the outcome values. If the supplied `outcome_type` differs from the detected one, a warning is now thrown. If `outcome_type` is supplied but invalid, then an error is thrown upon `sl3_Task` instantiation, as opposed to learner training.
- Cross-validated super learner (`cv_sl`) returns the cross-validated predictions for the super learner and its candidates.
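
The `outcome_type` handling described above (accepted aliases, detection when nothing is supplied, a warning on mismatch, an error on invalid input) can be sketched in base R. This is a minimal illustration, not the `sl3` implementation: the helpers `normalize_outcome_type`, `detect_outcome_type`, and `resolve_outcome_type` are hypothetical names, the detection rule is deliberately crude, and the canonical labels are an assumption made for the example.

```r
# Hypothetical helper (not part of sl3): map the accepted spellings of
# `outcome_type` onto one canonical label per kind of outcome.
normalize_outcome_type <- function(outcome_type) {
  # Family objects such as binomial() or gaussian() store their name in $family.
  if (inherits(outcome_type, "family")) {
    outcome_type <- outcome_type$family
  }
  if (!is.character(outcome_type) || length(outcome_type) != 1L) {
    stop("outcome_type must be a single string or a family object")
  }
  aliases <- c(
    binomial = "binomial", binary = "binomial",
    continuous = "continuous", gaussian = "continuous",
    categorical = "categorical", multinomial = "categorical"
  )
  if (!outcome_type %in% names(aliases)) {
    stop("invalid outcome_type: ", outcome_type)
  }
  unname(aliases[outcome_type])
}

# Crude detection from the outcome values, for illustration only.
detect_outcome_type <- function(y) {
  n_unique <- length(unique(y[!is.na(y)]))
  if (n_unique <= 2L) {
    "binomial"
  } else if (is.factor(y) || is.character(y)) {
    "categorical"
  } else {
    "continuous"
  }
}

# Fail early on invalid input; warn when supplied and detected types disagree.
resolve_outcome_type <- function(y, outcome_type = NULL) {
  detected <- detect_outcome_type(y)
  if (is.null(outcome_type)) {
    return(detected)
  }
  supplied <- normalize_outcome_type(outcome_type)
  if (!identical(supplied, detected)) {
    warning("supplied outcome_type '", supplied,
            "' differs from detected '", detected, "'")
  }
  supplied
}
```

Calling `resolve_outcome_type(c(0, 1, 1, 0), "binary")` returns the canonical label silently, while an unsupported string such as `"poisson"` stops immediately instead of failing later during training.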
- Updates to `Lrnr_nnls` to support binary outcomes, including support for convexity of the resultant model fit and warnings on prediction quality.
- Refined, clearer documentation for `Lrnr_define_interactions`.
- Tweaks to `Lrnr_bound` to better support more flexible bounding for continuous outcomes (automatically setting a maximum of infinity).
- Changes to `Lrnr_cv_selector` to support improved computation of the CV-risk, averaging the risk strictly across validation/holdout sets.
- Bug fixes for `Lrnr_earth` (improving formals recognition), `Lrnr_glmnet` (allowing offsets), and `Lrnr_caret` (reformatting of arguments).
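
One reading of "averaging the risk strictly across validation/holdout sets" is to compute the mean loss within each validation set first and then average those fold-specific risks, which differs from pooling all observations when folds have unequal sizes. The base R sketch below illustrates that reading and a discrete selector built on top of it. The helper `cv_risk_by_fold` and the toy data are hypothetical and are not taken from `sl3`.

```r
# Squared-error loss, evaluated per observation.
loss_sq <- function(pred, obs) (pred - obs)^2

# Hypothetical helper (not sl3's implementation): for each candidate column of
# `preds`, take the mean loss within every validation set, then average those
# fold-specific risks across folds.
cv_risk_by_fold <- function(preds, obs, fold_id, loss = loss_sq) {
  folds <- sort(unique(fold_id))
  fold_risks <- vapply(folds, function(v) {
    in_fold <- fold_id == v
    colMeans(loss(preds[in_fold, , drop = FALSE], obs[in_fold]))
  }, numeric(ncol(preds)))
  # Force a candidates-by-folds matrix even when there is a single candidate.
  fold_risks <- matrix(fold_risks, nrow = ncol(preds),
                       dimnames = list(colnames(preds), NULL))
  rowMeans(fold_risks)
}

# Toy cross-validated predictions from two candidates, with unequal fold sizes.
obs <- c(1, 2, 3, 4, 5, 6)
fold_id <- c(1, 1, 2, 2, 2, 2)
preds <- cbind(
  good = obs + 0.1,
  bad = obs + c(2, 2, 0, 0, 0, 0)
)

risks <- cv_risk_by_fold(preds, obs, fold_id)
# The selector keeps the candidate with the smallest cross-validated risk.
selected <- names(which.min(risks))
```

For the `bad` candidate, the fold-specific risks are 4 and 0, so the fold-averaged risk is 2, whereas pooling all six observations would give 8/6.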
- Additional arguments for 'Keras' learners `Lrnr_lstm_keras` and `Lrnr_gru_keras` provide support for callback functions list and 2-layer networks. Default `callbacks` list provides early stopping criteria with respect to 'Keras' defaults and `patience` of 10 epochs. Also, these two 'Keras' learners now call `args_to_list` upon initialization, and set the verbose argument according to `options("keras.fit_verbose")` or `options("sl3.verbose")`.
- Update `Lrnr_xgboost` to support prediction tasks consisting of one observation (e.g., leave-one-out cross-validation).
- Update `Lrnr_sl` by adding a new private slot `.cv_risk` to store the risk estimates, using this to avoid unnecessary re-computation in the `print` method (the `.cv_risk` slot is populated on the first `print` call, and only ever re-printed thereafter).
- Update documentation of `default_metalearner` to use native markdown tables.
- Fix `Lrnr_screener_importance`'s pairing of (a) covariates returned by the importance function with (b) covariates as they are defined in the task. This issue only arose when discrete covariates were automatically one-hot encoded upon task initialization (i.e., when `colnames(task$X) != task$nodes$covariates`).
- Reformat `importance_plot` to plot variables in decreasing order of importance, so the most important variables are placed at the top of the dotchart.
- Enhanced functionality in `sl3` task's `add_interactions` method to support interactions that involve factors. This method is most commonly used by `Lrnr_define_interactions`, which is intended for use with another learner (e.g., `Lrnr_glmnet` or `Lrnr_glm`) in a `Pipeline`.
- Modified `Lrnr_gam` formula (if not specified by the user) to not use `mgcv`'s default `k=10` degrees of freedom for each smooth `s` term when there are fewer than `k=10` degrees of freedom. This bypasses an `mgcv::gam` error, and tends to be relevant only for small n.
- Added `options(java.parameters = "-Xmx2500m")` and a warning message when `Lrnr_bartMachine` is initialized, if this option has not already been set. This option was incorporated since the default RAM of 500MB for a Java virtual machine often errors due to memory issues with `Lrnr_bartMachine`.
- Incorporated `stratify_cv` argument in `Lrnr_glmnet`, which stratifies internal cross-validation folds such that binary outcome prevalence in training and validation folds roughly matches the prevalence in the training task.
- Incorporated `min_screen` argument in `Lrnr_screener_coefs`, which tries to ensure that at least `min_screen` number of covariates are selected. If this argument is specified and the `learner` argument in `Lrnr_screener_coefs` is a `Lrnr_glmnet`, then `lambda` is increased until `min_screen` number of covariates are selected and a warning is produced. If `min_screen` is specified and the `learner` argument in `Lrnr_screener_coefs` is not a `Lrnr_glmnet`, then it will error.
- Updated `Lrnr_hal9001` to work with v0.4.0 of the `hal9001` package.
- Added `formula` parameter and `process_formula` function to the base learner, `Lrnr_base`, whose methods carry over to all other learners. When a `formula` is supplied as a learner parameter, the `process_formula` function constructs a design matrix by supplying the `formula` to `model.matrix`. This implementation allows `formula` to be supplied to all learners, even those without native `formula` support. The `formula` should be an object of class `"formula"`, or a character string that can be coerced to that class.
- Added factory function for performance-based risks for binary outcomes with `ROCR` performance measures, `custom_ROCR_risk`. Supports cutoff-dependent and scalar `ROCR` performance measures. The risk is defined as 1 - performance, and is transformed back to the performance measure in `cv_risk` and `importance` functions. This change prompted the revision of argument names `loss_fun` and `loss_function` to `eval_fun` and `eval_function`, respectively, since the evaluation of predictions relative to the observations can be either a risk or a loss function. This argument name change impacted the following: `Lrnr_solnp`, `Lrnr_optim`, `Lrnr_cv_selector`, `cv_risk`, `importance`, and `CV_Lrnr_sl`.
- Added name attribute to all loss functions, where naming was defined in terms of the risk implied by each loss function (i.e., the common name for the expected loss). The names in `cv_risk` and `importance` tables now swap "risk" with this name attribute.
- Incorporated stratified cross-validation when `folds` are not supplied to the `sl3_Task` and the outcome is a discrete (i.e., binary or categorical) variable.
- Added to the `importance` method the option to evaluate importance over `covariate_groups`, by removing/permuting all covariates in the same group together.
- Added `Lrnr_ga` as another metalearner.
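
The `formula` mechanism described above (expand a formula with `model.matrix`, then hand the resulting design matrix to a learner that has no formula interface) can be sketched in base R. This is only an illustration of the idea under stated assumptions: `design_from_formula` is a hypothetical helper rather than `sl3`'s `process_formula`, and `stats::lm.fit` stands in for a matrix-only learner.

```r
# Hypothetical helper (not sl3's process_formula): expand a formula into a
# design matrix so that a matrix-only fitter can consume it.
design_from_formula <- function(formula, data) {
  # Accept a formula object or a character string that can be coerced to one.
  formula <- stats::as.formula(formula)
  stats::model.matrix(formula, data = data)
}

data <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(0, 1, 0, 1, 1, 0, 1, 0),
  g = factor(rep(c("a", "b", "c", "a"), times = 2))
)
y <- c(1.2, 2.9, 3.1, 4.8, 5.2, 7.1, 6.8, 8.3)

# Main effects, an x1-by-x2 interaction, and dummy columns for the factor g.
X <- design_from_formula("~ x1 * x2 + g", data)

# lm.fit has no formula interface; it only takes a numeric design matrix.
fit <- stats::lm.fit(x = X, y = y)

# A "." formula expands to all columns of `data`.
X_all <- design_from_formula(~ ., data)
```

Because the expansion happens before fitting, interaction and dummy columns are available to any downstream fitter, whether or not it understands formulas itself.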
- Updates to variable importance functionality, including calculation of risk ratio and risk differences under covariate deletion or permutation.
- Addition of an `importance_plot` to summarize variable importance findings.
- Additions of new methods `reparameterize` and `retrain` to `Lrnr_base`, which allow modification of the covariate set while training on a conserved task and prediction on a new task using previously trained learners, respectively.
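
The risk ratio and risk difference under covariate permutation can be illustrated with a small base R sketch. It is a simplification under stated assumptions: the risk here is an in-sample mean squared error from `lm`, not a cross-validated risk from a fitted super learner, and `permutation_importance` is a hypothetical helper, not the `sl3` `importance` function.

```r
set.seed(1)
n <- 200
data <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
data$y <- 2 * data$x1 + rnorm(n, sd = 0.5)  # x2 is pure noise

fit <- lm(y ~ x1 + x2, data = data)

# Mean squared error of the fitted model on a data set.
mse <- function(fit, newdata) {
  mean((newdata$y - predict(fit, newdata = newdata))^2)
}

# Hypothetical helper: permute one covariate, then compare the perturbed risk
# with the baseline risk as a ratio and as a difference.
permutation_importance <- function(fit, data, covariate) {
  baseline <- mse(fit, data)
  permuted <- data
  permuted[[covariate]] <- sample(permuted[[covariate]])
  perturbed <- mse(fit, permuted)
  c(
    risk_ratio = perturbed / baseline,
    risk_difference = perturbed - baseline
  )
}

# One column per covariate; rows are the two importance metrics.
imp <- sapply(c("x1", "x2"), function(v) permutation_importance(fit, data, v))
```

A covariate that matters (`x1` here) yields a risk ratio well above 1 and a clearly positive risk difference, while a noise covariate (`x2`) stays near a ratio of 1 and a difference of 0.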
[missing]
[missing]
[missing]
- Updates to variable importance functionality, including use of risk ratios.
- Change `Lrnr_hal9001` and `Lrnr_glmnet` to respect observation-level IDs.
- Removal of `Remotes` and deprecation of `Lrnr_rfcde` and `Lrnr_condensier`:
  - Both of these learner classes provided support for conditional density estimation (CDE) and were useful when support for CDE was more limited. Unfortunately, both packages are un-maintained or updated only very sporadically, resulting in both frequent bugs and presenting an obstacle for an eventual CRAN release (both packages are GitHub-only).
  - `Lrnr_rfcde` wrapped https://github.com/tpospisi/RFCDE, a sporadically maintained tool for conditional density estimation (CDE). Support for this has been removed in favor of built-in CDE tools, including, among others, `Lrnr_density_semiparametric`.
  - `Lrnr_condensier` wrapped https://github.com/osofr/condensier, which provided a pooled hazards approach to CDE. This package contained an implementation error (osofr/condensier#15) and was removed from CRAN. Support for this has been removed in favor of `Lrnr_density_semiparametric` and `Lrnr_haldensify`, both of which more reliably provide CDE support.
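
Respecting observation-level IDs in internal cross-validation generally means assigning folds per ID rather than per row, so that repeated measures from one unit never straddle training and validation sets. The base R sketch below shows one way to build such a fold vector. `clustered_fold_id` is a hypothetical helper, and this entry does not state how `sl3` constructs or passes these folds internally.

```r
# Hypothetical helper: assign V folds at the ID level, so rows that share an
# ID always land in the same fold.
clustered_fold_id <- function(id, V = 5) {
  ids <- unique(id)
  # Balanced fold labels over the unique IDs, in random order.
  id_fold <- sample(rep(seq_len(V), length.out = length(ids)))
  # Map each row back to the fold of its ID.
  id_fold[match(id, ids)]
}

set.seed(1)
id <- rep(1:10, each = 3)  # 10 units with 3 observations each
fold_id <- clustered_fold_id(id, V = 5)
```

A per-row fold vector of this shape is the kind of input that `glmnet::cv.glmnet` accepts through its `foldid` argument.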
- Sampling methods for Monte Carlo integration and related procedures.
- A metalearner for the cross-validation selector (discrete super learner).
- A learner for bounding, including support for bounded losses.
- Resolution of a number of older issues (see #264).
- Relaxation of checks inside `Stack` objects for time series learners.
- Addition of a learner property table to `README.Rmd`.
- Maintenance and documentation updates.
- Overhaul of data preprocessing.
- New screening methods and convex combination in `Lrnr_nnls`.
- Bug fixes, including covariate subsetting and better handling of `NA`s.
- Package and documentation cleanup; continuous integration and testing fixes.
- Reproducibility updates (including new versioning and DOI minting).
- Fixes incorrect handling of missingness in the automatic imputation procedure.
- Adds new standard learners, including from the `gam` and `caret` packages.
- Adds custom learners for conditional density estimation, including semiparametric methods based on conditional mean and conditional mean/variance estimation as well as generalized functionality for density estimation via a pooled hazards approach.
- Default metalearners based on task outcome types.
- Handling of imputation internally in task objects.
- Addition of several new learners, including from the `gbm`, `earth`, `polspline` packages.
- Fixing errors in existing learners (e.g., subtle parallelization in `xgboost` and `ranger`).
- Support for multivariate outcomes.
- Sets default cross-validation to be revere-style.
- Support for cross-validated super learner and variable importance.
- A full-featured and stable release of the project.
- Numerous learners are included and many bugs have been fixed relative to earlier versions (esp v1.0.0) of the software.
- An initial stable release.