
Commit c0042d0

added weibull loss function
1 parent 2406748 commit c0042d0

7 files changed: +86 −10 lines changed

API_REFERENCE.md

Lines changed: 3 additions & 3 deletions
@@ -14,10 +14,10 @@ The learning rate. Must be greater than zero and not more than one. The higher t
 Used to randomly split training observations into training and validation if ***validation_set_indexes*** is not specified when fitting.
 
 #### loss_function (default = "mse")
-Determines the loss function used. Allowed values are "mse", "binomial", "poisson", "gamma", "tweedie", "group_mse", "mae", "quantile", "negative_binomial" and "cauchy". This is used together with ***link_function***. When ***loss_function*** is "group_mse" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model. The ***loss_function*** "quantile" is used together with the ***quantile*** constructor parameter.
+Determines the loss function used. Allowed values are "mse", "binomial", "poisson", "gamma", "tweedie", "group_mse", "mae", "quantile", "negative_binomial", "cauchy" and "weibull". This is used together with ***link_function***. When ***loss_function*** is "group_mse" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model. The ***loss_function*** "quantile" is used together with the ***quantile*** constructor parameter.
 
 #### link_function (default = "identity")
-Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***loss_function*** "mse" and ***link_function*** "identity". For logistic regression use ***loss_function*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma", "tweedie" or "negative_binomial" ***loss_function***, depending on the data. The ***loss_function*** "poisson", "gamma", "tweedie" or "negative_binomial" should only be used with the "log" ***link_function***. Inappropriate combinations of ***loss_function*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
+Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***loss_function*** "mse" and ***link_function*** "identity". For logistic regression use ***loss_function*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma", "tweedie", "negative_binomial" or "weibull" ***loss_function***, depending on the data. The ***loss_function*** "poisson", "gamma", "tweedie", "negative_binomial" or "weibull" should only be used with the "log" ***link_function***. Inappropriate combinations of ***loss_function*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
 
 #### n_jobs (default = 0)
 Multi-threading parameter. If ***0*** then uses all available cores for multi-threading. Any other positive integer specifies the number of cores to use (***1*** means single-threading).

@@ -50,7 +50,7 @@ Limits 1) the number of terms already in the model that can be considered as int
 ***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.
 
 #### dispersion_parameter (default = 1.5)
-Specifies the variance power when ***loss_function*** is "tweedie". Specifies a dispersion parameter when ***loss_function*** is "negative_binomial" or "cauchy".
+Specifies the variance power when ***loss_function*** is "tweedie". Specifies a dispersion parameter when ***loss_function*** is "negative_binomial", "cauchy" or "weibull".
 
 #### validation_tuning_metric (default = "default")
 Specifies which metric to use for validating the model and tuning ***m***. Available options are "default" (using the same methodology as when calculating the training error), "mse", "mae", "negative_gini", "rankability" and "group_mse". The default is often a choice that fits well with respect to the ***loss_function*** chosen. However, if you want to use ***loss_function*** or ***dispersion_parameter*** as tuning parameters then the default is not suitable. "rankability" uses a methodology similar to the one described in https://towardsdatascience.com/how-to-calculate-roc-auc-score-for-regression-models-c0be4fdf76bb except that the metric is inverted and can be weighted by sample weights. "group_mse" requires that the "group" argument in the ***fit*** method is provided.
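For intuition, the new "weibull" error implemented in cpp/functions.h in this commit can be written out explicitly. In the sketch below (an interpretation, not text from the repository), k denotes ***dispersion_parameter*** and \mu the model prediction, used as the Weibull scale parameter:

```latex
% Per-observation error for loss_function = "weibull",
% with shape k = dispersion_parameter and prediction \mu as the scale:
E(y, \mu) = k \log \mu + (1 - k) \log y + \left( \frac{y}{\mu} \right)^{k}
% Up to the additive constant -\log k, this equals the negative
% log-likelihood of a Weibull density with shape k and scale \mu,
% so minimizing it amounts to maximum-likelihood fitting of the scale.
```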

cpp/APLRRegressor.h

Lines changed: 10 additions & 4 deletions
@@ -261,6 +261,8 @@ void APLRRegressor::throw_error_if_loss_function_does_not_exist()
         loss_function_exists=true;
     else if(loss_function=="cauchy")
         loss_function_exists=true;
+    else if(loss_function=="weibull")
+        loss_function_exists=true;
     if(!loss_function_exists)
         throw std::runtime_error("Loss function "+loss_function+" is not available in APLR.");
 }

@@ -288,7 +290,7 @@ void APLRRegressor::throw_error_if_dispersion_parameter_is_invalid()
         if(dispersion_parameter_is_invalid)
             throw std::runtime_error("Invalid dispersion_parameter (variance power). It must not equal 1.0 or 2.0 and cannot be below 1.0.");
     }
-    else if(loss_function=="negative_binomial" || loss_function=="cauchy")
+    else if(loss_function=="negative_binomial" || loss_function=="cauchy" || loss_function=="weibull")
     {
         bool dispersion_parameter_is_in_invalid{std::islessequal(dispersion_parameter, 0.0)};
         if(dispersion_parameter_is_in_invalid)

@@ -373,7 +375,7 @@ void APLRRegressor::throw_error_if_response_contains_invalid_values(const Vector
         std::string error_message{"Response values for the logit link function or binomial loss_function cannot be less than zero or greater than one."};
         throw_error_if_response_is_not_between_0_and_1(y,error_message);
     }
-    else if(loss_function=="gamma" || (loss_function=="tweedie" && std::isgreater(dispersion_parameter,2)) )
+    else if(loss_function=="gamma" || (loss_function=="tweedie" && std::isgreater(dispersion_parameter,2)))
     {
         std::string error_message;
         if(loss_function=="tweedie")

@@ -382,10 +384,10 @@ void APLRRegressor::throw_error_if_response_contains_invalid_values(const Vector
             error_message="Response values for the "+loss_function+" loss_function must be greater than zero.";
         throw_error_if_response_is_not_greater_than_zero(y,error_message);
     }
-    else if(link_function=="log" || loss_function=="poisson" || loss_function=="negative_binomial"
+    else if(link_function=="log" || loss_function=="poisson" || loss_function=="negative_binomial" || loss_function=="weibull"
         || (loss_function=="tweedie" && std::isless(dispersion_parameter,2) && std::isgreater(dispersion_parameter,1)))
     {
-        std::string error_message{"Response values for the log link function or poisson loss_function or negative binomial loss function or tweedie loss_function when dispersion_parameter<2 cannot be less than zero."};
+        std::string error_message{"Response values for the log link function or poisson loss_function or negative binomial loss function or weibull loss function or tweedie loss_function when dispersion_parameter<2 cannot be less than zero."};
         throw_error_if_vector_contains_negative_values(y,error_message);
     }
     else if(validation_tuning_metric=="negative_gini")

@@ -685,6 +687,10 @@ VectorXd APLRRegressor::calculate_neg_gradient_current(const VectorXd &sample_we
     {
         ArrayXd residuals{y_train.array()-predictions_current.array()};
         output=2*residuals / (dispersion_parameter*dispersion_parameter + residuals.pow(2));
+    }
+    else if(loss_function=="weibull")
+    {
+        output=dispersion_parameter / predictions_current.array() * ((y_train.array()/predictions_current.array()).pow(dispersion_parameter) - 1);
     }
 
     if(link_function!="identity")
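The new weibull branch of calculate_neg_gradient_current is consistent with the error function added in cpp/functions.h: the gradient expression is the negated derivative of the per-observation error with respect to the prediction. A minimal Python sketch of that consistency check, using a finite difference (function names here are illustrative, not part of APLR):

```python
import math

def weibull_error(y, mu, k):
    # Mirrors calculate_weibull_errors in the commit:
    # k*log(mu) + (1-k)*log(y) + (y/mu)^k
    return k * math.log(mu) + (1 - k) * math.log(y) + (y / mu) ** k

def weibull_neg_gradient(y, mu, k):
    # Mirrors the weibull branch of calculate_neg_gradient_current:
    # k/mu * ((y/mu)^k - 1)
    return k / mu * ((y / mu) ** k - 1)

# Central finite difference of -d(error)/d(mu) should match the formula.
y, mu, k = 2.0, 1.5, 1.5
h = 1e-6
fd = -(weibull_error(y, mu + h, k) - weibull_error(y, mu - h, k)) / (2 * h)
print(abs(fd - weibull_neg_gradient(y, mu, k)) < 1e-6)  # True
```

Note that the negative gradient vanishes when the prediction equals the response, as expected for a loss minimized at mu = y.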

cpp/functions.h

Lines changed: 10 additions & 0 deletions
@@ -162,6 +162,14 @@ VectorXd calculate_cauchy_errors(const VectorXd &y,const VectorXd &predicted,dou
     return errors;
 }
 
+VectorXd calculate_weibull_errors(const VectorXd &y,const VectorXd &predicted,double dispersion_parameter)
+{
+    VectorXd errors{dispersion_parameter*predicted.array().log() + (1-dispersion_parameter) * y.array().log() +
+        (y.array()/predicted.array()).pow(dispersion_parameter)};
+
+    return errors;
+}
+
 VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &loss_function="mse",
     double dispersion_parameter=1.5, const VectorXi &group=VectorXi(0), const std::set<int> &unique_groups={}, double quantile=0.5)
 {

@@ -186,6 +194,8 @@ VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const Vect
         errors=calculate_negative_binomial_errors(y,predicted,dispersion_parameter);
     else if(loss_function=="cauchy")
         errors=calculate_cauchy_errors(y,predicted,dispersion_parameter);
+    else if(loss_function=="weibull")
+        errors=calculate_weibull_errors(y,predicted,dispersion_parameter);
 
     if(sample_weight.size()>0)
         errors=errors.array()*sample_weight.array();
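As a sanity check on calculate_weibull_errors: for a fixed response y and shape k, the per-observation error should be smallest when the prediction equals the response. A small Python sketch mirroring the C++ formula (names illustrative, not part of APLR):

```python
import math

def weibull_error(y, mu, k):
    # Mirrors calculate_weibull_errors: k*log(mu) + (1-k)*log(y) + (y/mu)^k
    return k * math.log(mu) + (1 - k) * math.log(y) + (y / mu) ** k

# At mu == y the error reduces to log(y) + 1; any other prediction scores worse.
y, k = 2.0, 1.5
at_truth = weibull_error(y, y, k)
print(round(at_truth, 4))  # 1.6931, i.e. log(2) + 1
print(all(weibull_error(y, y * m, k) > at_truth for m in (0.5, 0.9, 1.1, 2.0)))  # True
```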

cpp/test ALRRegressor weibull.cpp

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+#include <iostream>
+#include "term.h"
+#include "../dependencies/eigen-master/Eigen/Dense"
+#include <vector>
+#include <numeric>
+#include "APLRRegressor.h"
+#include <cmath>
+
+
+using namespace Eigen;
+
+int main()
+{
+    std::vector<bool> tests;
+    tests.reserve(1000);
+
+    //Model
+    APLRRegressor model{APLRRegressor()};
+    model.m=100;
+    model.v=0.1;
+    model.bins=300;
+    model.n_jobs=0;
+    model.loss_function="weibull";
+    model.link_function="log";
+    model.verbosity=3;
+    model.max_interaction_level=0;
+    model.max_interactions=1000;
+    model.min_observations_in_split=20;
+    model.ineligible_boosting_steps_added=10;
+    model.max_eligible_terms=5;
+    model.dispersion_parameter=1.5;
+
+    //Data
+    MatrixXd X_train{load_csv_into_eigen_matrix<MatrixXd>("data/X_train.csv")};
+    MatrixXd X_test{load_csv_into_eigen_matrix<MatrixXd>("data/X_test.csv")};
+    VectorXd y_train{load_csv_into_eigen_matrix<MatrixXd>("data/y_train.csv")};
+    VectorXd y_test{load_csv_into_eigen_matrix<MatrixXd>("data/y_test.csv")};
+
+    VectorXd sample_weight{VectorXd::Constant(y_train.size(),1.0)};
+
+    std::cout<<X_train;
+
+    //Fitting
+    //model.fit(X_train,y_train);
+    model.fit(X_train,y_train,sample_weight);
+    //model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)});
+    std::cout<<"feature importance\n"<<model.feature_importance<<"\n\n";
+
+    VectorXd predictions{model.predict(X_test)};
+    MatrixXd li{model.calculate_local_feature_importance(X_test)};
+
+    //Saving results
+    save_as_csv_file("data/output.csv",predictions);
+
+    std::cout<<predictions.mean()<<"\n\n";
+    tests.push_back(is_approximately_equal(predictions.mean(),23.6979,0.00001));
+
+    //Test summary
+    std::cout<<"\n\nTest summary\n"<<"Passed "<<std::accumulate(tests.begin(),tests.end(),0)<<" out of "<<tests.size()<<" tests.";
+}

examples/train_aplr_cross_validation.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 
 #Training model
 param_grid = {"max_interaction_level":[0,1,2,3,100],"min_observations_in_split":[1, 20, 50, 100, 200]}
-loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial and cauchy.
+loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial, cauchy and weibull.
 link_function="identity" #other available link functions are logit and log.
 grid_search_cv = GridSearchCV(APLRRegressor(random_state=random_state,verbosity=1,m=1000,v=0.1,loss_function=loss_function,link_function=link_function),param_grid,cv=5,n_jobs=4,scoring="neg_mean_squared_error")
 grid_search_cv.fit(data_train[predictors].values,data_train[response].values)

examples/train_aplr_validation.py

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
 best_validation_result=np.inf
 param_grid=ParameterGrid({"max_interaction_level":[0,1,2,3,100],"min_observations_in_split":[1, 20, 50, 100, 200]})
 best_model=None
-loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial and cauchy.
+loss_function="mse" #other available families are binomial, poisson, gamma, tweedie, group_mse, mae, quantile, negative_binomial, cauchy and weibull.
 link_function="identity" #other available link functions are logit and log.
 for params in param_grid:
     model = APLRRegressor(random_state=random_state,verbosity=3,m=1000,v=0.1,loss_function=loss_function,link_function=link_function,**params)

setup.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 
 setuptools.setup(
     name='aplr',
-    version='3.0.0',
+    version='3.1.0',
     description='Automatic Piecewise Linear Regression',
     ext_modules=[sfc_module],
     author="Mathias von Ottenbreit",
