
Commit c0042d0

added weibull loss function
1 parent 2406748 commit c0042d0

7 files changed: +86 −10 lines changed

API_REFERENCE.md

Lines changed: 3 additions & 3 deletions
@@ -14,10 +14,10 @@ The learning rate. Must be greater than zero and not more than one. The higher t
 Used to randomly split training observations into training and validation if ***validation_set_indexes*** is not specified when fitting.
 
 #### loss_function (default = "mse")
-Determines the loss function used. Allowed values are "mse", "binomial", "poisson", "gamma", "tweedie", "group_mse", "mae", "quantile", "negative_binomial" and "cauchy". This is used together with ***link_function***. When ***loss_function*** is "group_mse" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model. The ***loss_function*** "quantile" is used together with the ***quantile*** constructor parameter.
+Determines the loss function used. Allowed values are "mse", "binomial", "poisson", "gamma", "tweedie", "group_mse", "mae", "quantile", "negative_binomial", "cauchy" and "weibull". This is used together with ***link_function***. When ***loss_function*** is "group_mse" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model. The ***loss_function*** "quantile" is used together with the ***quantile*** constructor parameter.
 
 #### link_function (default = "identity")
-Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***loss_function*** "mse" and ***link_function*** "identity". For logistic regression use ***loss_function*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma", "tweedie" or "negative_binomial" ***loss_function***, depending on the data. The ***loss_function*** "poisson", "gamma", "tweedie" or "negative_binomial" should only be used with the "log" ***link_function***. Inappropriate combinations of ***loss_function*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
+Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***loss_function*** "mse" and ***link_function*** "identity". For logistic regression use ***loss_function*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma", "tweedie", "negative_binomial" or "weibull" ***loss_function***, depending on the data. The ***loss_function*** "poisson", "gamma", "tweedie", "negative_binomial" or "weibull" should only be used with the "log" ***link_function***. Inappropriate combinations of ***loss_function*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
 
 #### n_jobs (default = 0)
 Multi-threading parameter. If ***0*** then uses all available cores for multi-threading. Any other positive integer specifies the number of cores to use (***1*** means single-threading).

@@ -50,7 +50,7 @@ Limits 1) the number of terms already in the model that can be considered as int
 ***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.
 
 #### dispersion_parameter (default = 1.5)
-Specifies the variance power when ***loss_function*** is "tweedie". Specifies a dispersion parameter when ***loss_function*** is "negative_binomial" or "cauchy".
+Specifies the variance power when ***loss_function*** is "tweedie". Specifies a dispersion parameter when ***loss_function*** is "negative_binomial", "cauchy" or "weibull".
 
 #### validation_tuning_metric (default = "default")
 Specifies which metric to use for validating the model and tuning ***m***. Available options are "default" (using the same methodology as when calculating the training error), "mse", "mae", "negative_gini", "rankability" and "group_mse". The default is often a choice that fits well with respect to the ***loss_function*** chosen. However, if you want to use ***loss_function*** or ***dispersion_parameter*** as tuning parameters then the default is not suitable. "rankability" uses a methodology similar to the one described in https://towardsdatascience.com/how-to-calculate-roc-auc-score-for-regression-models-c0be4fdf76bb except that the metric is inverted and can be weighted by sample weights. "group_mse" requires that the "group" argument in the ***fit*** method is provided.
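For intuition, the new "weibull" error implemented in cpp/functions.h in this commit can be written out explicitly. In the sketch below (an interpretation, not text from the repository), k denotes ***dispersion_parameter*** and \mu the model prediction, used as the Weibull scale parameter:

```latex
% Per-observation error for loss_function = "weibull",
% with shape k = dispersion_parameter and prediction \mu as the scale:
E(y, \mu) = k \log \mu + (1 - k) \log y + \left( \frac{y}{\mu} \right)^{k}
% Up to the additive constant -\log k, this equals the negative
% log-likelihood of a Weibull density with shape k and scale \mu,
% so minimizing it amounts to maximum-likelihood fitting of the scale.
```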

cpp/APLRRegressor.h

Lines changed: 10 additions & 4 deletions
@@ -261,6 +261,8 @@ void APLRRegressor::throw_error_if_loss_function_does_not_exist()
         loss_function_exists=true;
     else if(loss_function=="cauchy")
         loss_function_exists=true;
+    else if(loss_function=="weibull")
+        loss_function_exists=true;
     if(!loss_function_exists)
         throw std::runtime_error("Loss function "+loss_function+" is not available in APLR.");
 }

@@ -288,7 +290,7 @@ void APLRRegressor::throw_error_if_dispersion_parameter_is_invalid()
         if(dispersion_parameter_is_invalid)
             throw std::runtime_error("Invalid dispersion_parameter (variance power). It must not equal 1.0 or 2.0 and cannot be below 1.0.");
     }
-    else if(loss_function=="negative_binomial" || loss_function=="cauchy")
+    else if(loss_function=="negative_binomial" || loss_function=="cauchy" || loss_function=="weibull")
     {
         bool dispersion_parameter_is_in_invalid{std::islessequal(dispersion_parameter, 0.0)};
         if(dispersion_parameter_is_in_invalid)

@@ -373,7 +375,7 @@ void APLRRegressor::throw_error_if_response_contains_invalid_values(const Vector
         std::string error_message{"Response values for the logit link function or binomial loss_function cannot be less than zero or greater than one."};
         throw_error_if_response_is_not_between_0_and_1(y,error_message);
     }
-    else if(loss_function=="gamma" || (loss_function=="tweedie" && std::isgreater(dispersion_parameter,2)) )
+    else if(loss_function=="gamma" || (loss_function=="tweedie" && std::isgreater(dispersion_parameter,2)))
     {
         std::string error_message;
         if(loss_function=="tweedie")

@@ -382,10 +384,10 @@ void APLRRegressor::throw_error_if_response_contains_invalid_values(const Vector
             error_message="Response values for the "+loss_function+" loss_function must be greater than zero.";
         throw_error_if_response_is_not_greater_than_zero(y,error_message);
     }
-    else if(link_function=="log" || loss_function=="poisson" || loss_function=="negative_binomial"
+    else if(link_function=="log" || loss_function=="poisson" || loss_function=="negative_binomial" || loss_function=="weibull"
         || (loss_function=="tweedie" && std::isless(dispersion_parameter,2) && std::isgreater(dispersion_parameter,1)))
     {
-        std::string error_message{"Response values for the log link function or poisson loss_function or negative binomial loss function or tweedie loss_function when dispersion_parameter<2 cannot be less than zero."};
+        std::string error_message{"Response values for the log link function or poisson loss_function or negative binomial loss function or weibull loss function or tweedie loss_function when dispersion_parameter<2 cannot be less than zero."};
         throw_error_if_vector_contains_negative_values(y,error_message);
     }
     else if(validation_tuning_metric=="negative_gini")

@@ -685,6 +687,10 @@ VectorXd APLRRegressor::calculate_neg_gradient_current(const VectorXd &sample_we
     {
         ArrayXd residuals{y_train.array()-predictions_current.array()};
         output=2*residuals / (dispersion_parameter*dispersion_parameter + residuals.pow(2));
+    }
+    else if(loss_function=="weibull")
+    {
+        output=dispersion_parameter / predictions_current.array() * ((y_train.array()/predictions_current.array()).pow(dispersion_parameter) - 1);
     }
 
     if(link_function!="identity")
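The new weibull branch of calculate_neg_gradient_current is consistent with the error function added in cpp/functions.h: the gradient expression is the negated derivative of the per-observation error with respect to the prediction. A minimal Python sketch of that consistency check, using a finite difference (function names here are illustrative, not part of APLR):

```python
import math

def weibull_error(y, mu, k):
    # Mirrors calculate_weibull_errors in the commit:
    # k*log(mu) + (1-k)*log(y) + (y/mu)^k
    return k * math.log(mu) + (1 - k) * math.log(y) + (y / mu) ** k

def weibull_neg_gradient(y, mu, k):
    # Mirrors the weibull branch of calculate_neg_gradient_current:
    # k/mu * ((y/mu)^k - 1)
    return k / mu * ((y / mu) ** k - 1)

# Central finite difference of -d(error)/d(mu) should match the formula.
y, mu, k = 2.0, 1.5, 1.5
h = 1e-6
fd = -(weibull_error(y, mu + h, k) - weibull_error(y, mu - h, k)) / (2 * h)
print(abs(fd - weibull_neg_gradient(y, mu, k)) < 1e-6)  # True
```

Note that the negative gradient vanishes when the prediction equals the response, as expected for a loss minimized at mu = y.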

cpp/functions.h

Lines changed: 10 additions & 0 deletions
@@ -162,6 +162,14 @@ VectorXd calculate_cauchy_errors(const VectorXd &y,const VectorXd &predicted,dou
     return errors;
 }
 
+VectorXd calculate_weibull_errors(const VectorXd &y,const VectorXd &predicted,double dispersion_parameter)
+{
+    VectorXd errors{dispersion_parameter*predicted.array().log() + (1-dispersion_parameter) * y.array().log() +
+        (y.array()/predicted.array()).pow(dispersion_parameter)};
+
+    return errors;
+}
+
 VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &loss_function="mse",
     double dispersion_parameter=1.5, const VectorXi &group=VectorXi(0), const std::set<int> &unique_groups={}, double quantile=0.5)
 {

@@ -186,6 +194,8 @@ VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const Vect
         errors=calculate_negative_binomial_errors(y,predicted,dispersion_parameter);
     else if(loss_function=="cauchy")
         errors=calculate_cauchy_errors(y,predicted,dispersion_parameter);
+    else if(loss_function=="weibull")
+        errors=calculate_weibull_errors(y,predicted,dispersion_parameter);
 
     if(sample_weight.size()>0)
         errors=errors.array()*sample_weight.array();
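As a sanity check on calculate_weibull_errors: for a fixed response y and shape k, the per-observation error should be smallest when the prediction equals the response. A small Python sketch mirroring the C++ formula (names illustrative, not part of APLR):

```python
import math

def weibull_error(y, mu, k):
    # Mirrors calculate_weibull_errors: k*log(mu) + (1-k)*log(y) + (y/mu)^k
    return k * math.log(mu) + (1 - k) * math.log(y) + (y / mu) ** k

# At mu == y the error reduces to log(y) + 1; any other prediction scores worse.
y, k = 2.0, 1.5
at_truth = weibull_error(y, y, k)
print(round(at_truth, 4))  # 1.6931, i.e. log(2) + 1
print(all(weibull_error(y, y * m, k) > at_truth for m in (0.5, 0.9, 1.1, 2.0)))  # True
```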

cpp/test ALRRegressor weibull.cpp

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+#include <iostream>
+#include "term.h"
+#include "../dependencies/eigen-master/Eigen/Dense"
+#include <vector>
+#include <numeric>
+#include "APLRRegressor.h"
+#include <cmath>
+
+
+using namespace Eigen;
+
+int main()
+{
+    std::vector<bool> tests;
+    tests.reserve(1000);
+
+    //Model
+    APLRRegressor model{APLRRegressor()};
+    model.m=100;
+    model.v=0.1;
+    model.bins=300;
+    model.n_jobs=0;
+    model.loss_function="weibull";
+    model.link_function="log";
+    model.verbosity=3;
+    model.max_interaction_level=0;
+    model.max_interactions=1000;
+    model.min_observations_in_split=20;
+    model.ineligible_boosting_steps_added=10;
+    model.max_eligible_terms=5;
+    model.dispersion_parameter=1.5;
+
+    //Data
+    MatrixXd X_train{load_csv_into_eigen_matrix<MatrixXd>("data/X_train.csv")};
+    MatrixXd X_test{load_csv_into_eigen_matrix<MatrixXd>("data/X_test.csv")};
+    VectorXd y_train{load_csv_into_eigen_matrix<MatrixXd>("data/y_train.csv")};
+    VectorXd y_test{load_csv_into_eigen_matrix<MatrixXd>("data/y_test.csv")};
+
+    VectorXd sample_weight{VectorXd::Constant(y_train.size(),1.0)};
+
+    std::cout<<X_train;
+
+    //Fitting
+    //model.fit(X_train,y_train);
+    model.fit(X_train,y_train,sample_weight);
+    //model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)});
+    std::cout<<"feature importance\n"<<model.feature_importance<<"\n\n";
+
+    VectorXd predictions{model.predict(X_test)};
+    MatrixXd li{model.calculate_local_feature_importance(X_test)};
+
+    //Saving results
+    save_as_csv_file("data/output.csv",predictions);
+
+    std::cout<<predictions.mean()<<"\n\n";
+    tests.push_back(is_approximately_equal(predictions.mean(),23.6979,0.00001));
+
+    //Test summary
+    std::cout<<"\n\nTest summary\n"<<"Passed "<<std::accumulate(tests.begin(),tests.end(),0)<<" out of "<<tests.size()<<" tests.";
+}

examples/train_aplr_cross_validation.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 
 #Training model
 param_grid = {"max_interaction_level":[0,1,2,3,100],"min_observations_in_split":[1, 20, 50, 100, 200]}
-loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial and cauchy.
+loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial, cauchy and weibull.
 link_function="identity" #other available link functions are logit and log.
 grid_search_cv = GridSearchCV(APLRRegressor(random_state=random_state,verbosity=1,m=1000,v=0.1,loss_function=loss_function,link_function=link_function),param_grid,cv=5,n_jobs=4,scoring="neg_mean_squared_error")
 grid_search_cv.fit(data_train[predictors].values,data_train[response].values)

examples/train_aplr_validation.py

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
 best_validation_result=np.inf
 param_grid=ParameterGrid({"max_interaction_level":[0,1,2,3,100],"min_observations_in_split":[1, 20, 50, 100, 200]})
 best_model=None
-loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial and cauchy.
+loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial, cauchy and weibull.
 link_function="identity" #other available link functions are logit and log.
 for params in param_grid:
     model = APLRRegressor(random_state=random_state,verbosity=3,m=1000,v=0.1,loss_function=loss_function,link_function=link_function,**params)

setup.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 
 setuptools.setup(
     name='aplr',
-    version='3.0.0',
+    version='3.1.0',
     description='Automatic Piecewise Linear Regression',
     ext_modules=[sfc_module],
     author="Mathias von Ottenbreit",
