Commit 7cb9dfa

added "quantile" family
1 parent 742c90a commit 7cb9dfa

9 files changed: +119 additions, −18 deletions

API_REFERENCE.md

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,6 @@
 # APLRRegressor
 
-## class aplr.APLRRegressor(m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5, validation_tuning_metric:str="default")
+## class aplr.APLRRegressor(m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5, validation_tuning_metric:str="default", quantile:float=0.5)
 
 ### Constructor parameters
 
@@ -14,7 +14,7 @@ The learning rate. Must be greater than zero and not more than one. The higher t
 Used to randomly split training observations into training and validation if ***validation_set_indexes*** is not specified when fitting.
 
 #### family (default = "gaussian")
-Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma", "tweedie", "group_gaussian" and "mae". This is used together with ***link_function***. When ***family*** is "group_gaussian" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model.
+Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma", "tweedie", "group_gaussian", "mae" and "quantile". This is used together with ***link_function***. When ***family*** is "group_gaussian" then the "group" argument in the ***fit*** method must be provided. In the latter case APLR will try to minimize group MSE when training the model. The ***family*** "quantile" is used together with the ***quantile*** constructor parameter.
 
 #### link_function (default = "identity")
 Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***family*** "gaussian" and ***link_function*** "identity". For logistic regression use ***family*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma" or "tweedie" ***family***, depending on the data. The ***family*** "poisson", "gamma" or "tweedie" should only be used with the "log" ***link_function***. Inappropriate combinations of ***family*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
@@ -55,6 +55,10 @@ Specifies the variance power for the "tweedie" ***family***.
 #### validation_tuning_metric (default = "default")
 Specifies which metric to use for validating the model and tuning ***m***. Available options are "default" (using the same methodology as when calculating the training error), "mse", "mae", "negative_gini" and "rankability". The default is often a choice that fits well with respect to the ***family*** chosen. However, if you want to use ***family*** or ***tweedie_power*** as tuning parameters then the default is not suitable. "rankability" uses a methodology similar to the one described in https://towardsdatascience.com/how-to-calculate-roc-auc-score-for-regression-models-c0be4fdf76bb except that the metric is inverted and can be weighted by sample weights.
 
+#### quantile (default = 0.5)
+Specifies the quantile to use when ***family*** is "quantile".
+
 ## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[], group:npt.ArrayLike = np.empty(0))
 
 ***This method fits the model to data.***
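The new "quantile" family corresponds to the pinball (quantile) loss: absolute errors are weighted asymmetrically, by `quantile` when the model undershoots and by `1 - quantile` when it overshoots. A minimal plain-Python sketch of that weighting (illustrative only, not part of APLR's API):

```python
def quantile_error(y, predicted, quantile=0.5):
    """Pinball-style error as used by the "quantile" family:
    scale the absolute error by (1 - quantile) when the prediction
    overshoots the observation and by quantile when it undershoots."""
    error = abs(y - predicted)
    if y < predicted:
        return error * (1 - quantile)
    return error * quantile

# With quantile=0.9, undershooting costs 9x more than overshooting,
# which pushes fitted predictions toward the upper tail.
overshoot_cost = quantile_error(10.0, 12.0, quantile=0.9)
undershoot_cost = quantile_error(10.0, 8.0, quantile=0.9)
```

With `quantile=0.5` both sides are weighted equally, which is why the median case behaves like a rescaled "mae" family.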

aplr/aplr.py

Lines changed: 5 additions & 2 deletions
@@ -5,7 +5,7 @@
 
 
 class APLRRegressor():
-    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5, validation_tuning_metric:str="default"):
+    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5, validation_tuning_metric:str="default", quantile:float=0.5):
         self.m=m
         self.v=v
         self.random_state=random_state
@@ -23,6 +23,7 @@ def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaus
         self.verbosity=verbosity
         self.tweedie_power=tweedie_power
         self.validation_tuning_metric=validation_tuning_metric
+        self.quantile=quantile
 
         #Creating aplr_cpp and setting parameters
         self.APLRRegressor=aplr_cpp.APLRRegressor()
@@ -47,6 +48,7 @@ def __set_params_cpp(self):
         self.APLRRegressor.verbosity=self.verbosity
         self.APLRRegressor.tweedie_power=self.tweedie_power
         self.APLRRegressor.validation_tuning_metric=self.validation_tuning_metric
+        self.APLRRegressor.quantile=self.quantile
 
     def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[],group:npt.ArrayLike = np.empty(0)):
         self.__set_params_cpp()
@@ -116,7 +118,8 @@ def get_params(self, deep=True):
             "ineligible_boosting_steps_added":self.ineligible_boosting_steps_added,
             "max_eligible_terms":self.max_eligible_terms,
             "tweedie_power":self.tweedie_power,
-            "validation_tuning_metric":self.validation_tuning_metric
+            "validation_tuning_metric":self.validation_tuning_metric,
+            "quantile":self.quantile
         }
 
     #For sklearn

cpp/APLRRegressor.h

Lines changed: 20 additions & 5 deletions
@@ -141,13 +141,14 @@ class APLRRegressor
     double max_training_prediction_or_response;
     std::vector<size_t> validation_indexes;
     std::string validation_tuning_metric;
+    double quantile;
 
     //Methods
     APLRRegressor(size_t m=1000,double v=0.1,uint_fast32_t random_state=std::numeric_limits<uint_fast32_t>::lowest(),std::string family="gaussian",
         std::string link_function="identity", size_t n_jobs=0, double validation_ratio=0.2,double intercept=NAN_DOUBLE,
         size_t reserved_terms_times_num_x=100, size_t bins=300,size_t verbosity=0,size_t max_interaction_level=1,size_t max_interactions=100000,
         size_t min_observations_in_split=20, size_t ineligible_boosting_steps_added=10, size_t max_eligible_terms=5,double tweedie_power=1.5,
-        std::string validation_tuning_metric="default");
+        std::string validation_tuning_metric="default", double quantile=0.5);
     APLRRegressor(const APLRRegressor &other);
     ~APLRRegressor();
     void fit(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight=VectorXd(0),const std::vector<std::string> &X_names={},const std::vector<size_t> &validation_set_indexes={},
@@ -174,15 +175,15 @@ class APLRRegressor
 APLRRegressor::APLRRegressor(size_t m,double v,uint_fast32_t random_state,std::string family,std::string link_function,size_t n_jobs,
     double validation_ratio,double intercept,size_t reserved_terms_times_num_x,size_t bins,size_t verbosity,size_t max_interaction_level,
     size_t max_interactions,size_t min_observations_in_split,size_t ineligible_boosting_steps_added,size_t max_eligible_terms,double tweedie_power,
-    std::string validation_tuning_metric):
+    std::string validation_tuning_metric, double quantile):
     reserved_terms_times_num_x{reserved_terms_times_num_x},intercept{intercept},m{m},v{v},
     family{family},link_function{link_function},validation_ratio{validation_ratio},n_jobs{n_jobs},random_state{random_state},
     bins{bins},verbosity{verbosity},max_interaction_level{max_interaction_level},
     intercept_steps{VectorXd(0)},max_interactions{max_interactions},interactions_eligible{0},validation_error_steps{VectorXd(0)},
     min_observations_in_split{min_observations_in_split},ineligible_boosting_steps_added{ineligible_boosting_steps_added},
     max_eligible_terms{max_eligible_terms},number_of_base_terms{0},tweedie_power{tweedie_power},min_training_prediction_or_response{NAN_DOUBLE},
     max_training_prediction_or_response{NAN_DOUBLE}, validation_tuning_metric{validation_tuning_metric},
-    validation_indexes{std::vector<size_t>(0)}
+    validation_indexes{std::vector<size_t>(0)}, quantile{quantile}
 {
 }
 
@@ -198,7 +199,7 @@ APLRRegressor::APLRRegressor(const APLRRegressor &other):
     max_eligible_terms{other.max_eligible_terms},number_of_base_terms{other.number_of_base_terms},
     feature_importance{other.feature_importance},tweedie_power{other.tweedie_power},min_training_prediction_or_response{other.min_training_prediction_or_response},
     max_training_prediction_or_response{other.max_training_prediction_or_response},validation_tuning_metric{other.validation_tuning_metric},
-    validation_indexes{other.validation_indexes}
+    validation_indexes{other.validation_indexes}, quantile{other.quantile}
 {
 }
 
@@ -250,6 +251,8 @@ void APLRRegressor::throw_error_if_family_does_not_exist()
         family_exists=true;
     else if(family=="mae")
         family_exists=true;
+    else if(family=="quantile")
+        family_exists=true;
     if(!family_exists)
         throw std::runtime_error("Family "+family+" is not available in APLR.");
 }
@@ -613,6 +616,18 @@ VectorXd APLRRegressor::calculate_neg_gradient_current(const VectorXd &sample_we
         double mae{calculate_errors(y_train,predictions_current,sample_weight_train,"mae").mean()};
         output=(y_train.array() - predictions_current.array()).sign()*mae;
     }
+    else if(family=="quantile")
+    {
+        double mae{calculate_errors(y_train,predictions_current,sample_weight_train,"mae").mean()};
+        output=(y_train.array() - predictions_current.array()).sign()*mae;
+        for (Eigen::Index i = 0; i < y_train.size(); ++i)
+        {
+            if(y_train[i]<predictions_current[i])
+                output[i] *= 1-quantile;
+            else
+                output[i] *= quantile;
+        }
+    }
 
     if(link_function!="identity")
         output=output.array()*differentiate_predictions().array();
@@ -1056,7 +1071,7 @@ void APLRRegressor::calculate_and_validate_validation_error(size_t boosting_step
 void APLRRegressor::calculate_validation_error(size_t boosting_step, const VectorXd &predictions)
 {
     if(validation_tuning_metric=="default")
-        validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,family,tweedie_power,group_validation,unique_groups_validation),sample_weight_validation);
+        validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,family,tweedie_power,group_validation,unique_groups_validation,quantile),sample_weight_validation);
     else if(validation_tuning_metric=="mse")
         validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,FAMILY_GAUSSIAN),sample_weight_validation);
     else if(validation_tuning_metric=="mae")
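The new "quantile" branch in `calculate_neg_gradient_current` reuses the MAE-style negative gradient (the sign of the residual scaled by the mean absolute error) and then reweights each element by `quantile` or `1 - quantile` depending on the sign of the residual. A plain-Python sketch of the same arithmetic, mirroring the C++ loop above (illustrative, not APLR's actual code):

```python
def neg_gradient_quantile(y_train, predictions, quantile=0.5):
    """Mirror of the C++ "quantile" gradient branch: start from
    sign(y - pred) * mean_absolute_error, then scale each element by
    (1 - quantile) where the prediction overshoots and by quantile
    where it undershoots."""
    residuals = [y - p for y, p in zip(y_train, predictions)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    output = []
    for r in residuals:
        g = (1 if r > 0 else -1 if r < 0 else 0) * mae  # sign(residual) * MAE
        g *= (1 - quantile) if r < 0 else quantile       # asymmetric weighting
        output.append(g)
    return output
```

Note that for `quantile=0.5` both branches multiply by 0.5, so the gradient is simply half the "mae" gradient, which matches the median interpretation.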

cpp/functions.h

Lines changed: 18 additions & 1 deletion
@@ -133,7 +133,22 @@ VectorXd calculate_absolute_errors(const VectorXd &y,const VectorXd &predicted)
     return errors;
 }
 
-VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &family="gaussian",double tweedie_power=1.5, const VectorXi &group=VectorXi(0), const std::set<int> &unique_groups={})
+VectorXd calculate_quantile_errors(const VectorXd &y,const VectorXd &predicted, double quantile)
+{
+    VectorXd errors{calculate_absolute_errors(y,predicted)};
+    for (Eigen::Index i = 0; i < y.size(); ++i)
+    {
+        if(y[i]<predicted[i])
+            errors[i] *= 1-quantile;
+        else
+            errors[i] *= quantile;
+    }
+
+    return errors;
+}
+
+VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &family="gaussian",
+    double tweedie_power=1.5, const VectorXi &group=VectorXi(0), const std::set<int> &unique_groups={}, double quantile=0.5)
 {
     VectorXd errors;
     if(family=="gaussian")
@@ -150,6 +165,8 @@ VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const Vect
         errors=calculate_group_gaussian_errors(y,predicted,group,unique_groups);
     else if(family=="mae")
         errors=calculate_absolute_errors(y,predicted);
+    else if(family=="quantile")
+        errors=calculate_quantile_errors(y,predicted,quantile);
 
     if(sample_weight.size()>0)
         errors=errors.array()*sample_weight.array();
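A useful sanity check on the error definition in `calculate_quantile_errors`: the constant prediction that minimizes the mean asymmetric error over a sample sits at the corresponding empirical quantile. A small pure-Python check of that property (assuming nothing about APLR itself):

```python
def mean_quantile_error(y, prediction, quantile):
    """Mean asymmetric absolute error of one constant prediction,
    using the same weighting as calculate_quantile_errors."""
    total = 0.0
    for yi in y:
        err = abs(yi - prediction)
        total += err * ((1 - quantile) if yi < prediction else quantile)
    return total / len(y)

y = list(range(1, 101))  # sample 1..100
# Scanning all constant predictions, the minimizer should land near the
# 90th percentile of the sample when quantile = 0.9.
best = min(y, key=lambda c: mean_quantile_error(y, c, 0.9))
```

This is the property that makes boosting against this error produce quantile predictions rather than means.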

cpp/pythonbinding.cpp

Lines changed: 8 additions & 5 deletions
@@ -11,14 +11,16 @@ namespace py = pybind11;
 
 PYBIND11_MODULE(aplr_cpp, m) {
     py::class_<APLRRegressor>(m, "APLRRegressor",py::module_local())
-        .def(py::init<int&, double&, int&, std::string&,std::string&,int&,double&,double&,int&,int&,int&,int&,int&,int&,int&,int&,double&,std::string&>(),
+        .def(py::init<int&, double&, int&, std::string&,std::string&,int&,double&,double&,int&,int&,int&,int&,int&,int&,int&,int&,double&,std::string&,
+            double&>(),
             py::arg("m")=1000,py::arg("v")=0.1,py::arg("random_state")=0,py::arg("family")="gaussian",py::arg("link_function")="identity",
             py::arg("n_jobs")=0,py::arg("validation_ratio")=0.2,py::arg("intercept")=NAN_DOUBLE,
             py::arg("reserved_terms_times_num_x")=100,py::arg("bins")=300,py::arg("verbosity")=0,
             py::arg("max_interaction_level")=1,py::arg("max_interactions")=100000,py::arg("min_observations_in_split")=20,
             py::arg("ineligible_boosting_steps_added")=10,py::arg("max_eligible_terms")=5,
             py::arg("tweedie_power")=1.5,
-            py::arg("validation_tuning_metric")="default"
+            py::arg("validation_tuning_metric")="default",
+            py::arg("quantile")=0.5
         )
         .def("fit", &APLRRegressor::fit,py::arg("X"),py::arg("y"),py::arg("sample_weight")=VectorXd(0),py::arg("X_names")=std::vector<std::string>(),
             py::arg("validation_set_indexes")=std::vector<size_t>(),py::arg("prioritized_predictors_indexes")=std::vector<size_t>(),
@@ -67,24 +69,25 @@ PYBIND11_MODULE(aplr_cpp, m) {
         .def_readwrite("max_training_prediction_or_response",&APLRRegressor::max_training_prediction_or_response)
         .def_readwrite("validation_tuning_metric",&APLRRegressor::validation_tuning_metric)
         .def_readwrite("validation_indexes",&APLRRegressor::validation_indexes)
+        .def_readwrite("quantile",&APLRRegressor::quantile)
         .def(py::pickle(
             [](const APLRRegressor &a) { // __getstate__
                 /* Return a tuple that fully encodes the state of the object */
                 return py::make_tuple(a.m,a.v,a.random_state,a.family,a.n_jobs,a.validation_ratio,a.intercept,a.bins,a.verbosity,
                     a.max_interaction_level,a.max_interactions,a.validation_error_steps,a.term_names,a.term_coefficients,a.terms,a.intercept_steps,
                     a.interactions_eligible,a.min_observations_in_split,a.ineligible_boosting_steps_added,a.max_eligible_terms,
                     a.number_of_base_terms,a.feature_importance,a.link_function,a.tweedie_power,a.min_training_prediction_or_response,a.max_training_prediction_or_response,
-                    a.validation_tuning_metric,a.validation_indexes);
+                    a.validation_tuning_metric,a.validation_indexes,a.quantile);
             },
             [](py::tuple t) { // __setstate__
-                if (t.size() != 28)
+                if (t.size() != 29)
                     throw std::runtime_error("Invalid state!");
 
                 /* Create a new C++ instance */
                 APLRRegressor a(t[0].cast<size_t>(),t[1].cast<double>(),t[2].cast<uint_fast32_t>(),t[3].cast<std::string>(),
                     t[22].cast<std::string>(),t[4].cast<size_t>(),t[5].cast<double>(),
                     t[6].cast<double>(),100,t[7].cast<size_t>(),t[8].cast<size_t>(),t[9].cast<size_t>(),t[10].cast<double>(),t[17].cast<size_t>(),
-                    t[23].cast<double>());
+                    t[23].cast<double>(),t[28].cast<double>());
 
                 a.validation_error_steps=t[11].cast<VectorXd>();
                 a.term_names=t[12].cast<std::vector<std::string>>();
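The pickle hunk above grows the state tuple from 28 to 29 entries and checks the size in `__setstate__`, so pickles from older builds fail loudly instead of loading with a silently missing `quantile`. A toy pure-Python sketch of the same tuple-based state pattern (illustrative; class and field names are hypothetical, not APLR's API):

```python
import pickle

class ToyModel:
    """Toy illustration of the pybind11 py::pickle pattern in the diff:
    __getstate__ packs every field into a tuple and __setstate__
    validates the tuple length before unpacking."""
    def __init__(self, family="gaussian", quantile=0.5):
        self.family = family
        self.quantile = quantile

    def __getstate__(self):
        return (self.family, self.quantile)

    def __setstate__(self, state):
        if len(state) != 2:  # bump this when adding a field
            raise RuntimeError("Invalid state!")
        self.family, self.quantile = state

restored = pickle.loads(pickle.dumps(ToyModel(family="quantile", quantile=0.9)))
```

The length check is the versioning mechanism: every new constructor parameter requires bumping both the tuple contents and the expected size, exactly as this commit does.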

cpp/test ALRRegressor quantile.cpp

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+#include <iostream>
+#include "term.h"
+#include "../dependencies/eigen-master/Eigen/Dense"
+#include <vector>
+#include <numeric>
+#include "APLRRegressor.h"
+#include <cmath>
+
+
+using namespace Eigen;
+
+int main()
+{
+    std::vector<bool> tests;
+    tests.reserve(1000);
+
+    //Model
+    APLRRegressor model{APLRRegressor()};
+    model.m=100;
+    model.v=1.0;
+    model.bins=10;
+    model.n_jobs=1;
+    model.family="quantile";
+    model.verbosity=3;
+    model.max_interaction_level=100;
+    model.max_interactions=30;
+    model.min_observations_in_split=50;
+    model.ineligible_boosting_steps_added=10;
+    model.max_eligible_terms=5;
+    model.quantile=0.5;
+
+    //Data
+    MatrixXd X_train{load_csv_into_eigen_matrix<MatrixXd>("data/X_train.csv")};
+    MatrixXd X_test{load_csv_into_eigen_matrix<MatrixXd>("data/X_test.csv")};
+    VectorXd y_train{load_csv_into_eigen_matrix<MatrixXd>("data/y_train.csv")};
+    VectorXd y_test{load_csv_into_eigen_matrix<MatrixXd>("data/y_test.csv")};
+
+    VectorXd sample_weight{VectorXd::Constant(y_train.size(),1.0)};
+
+    std::cout<<X_train;
+
+    //Fitting
+    //model.fit(X_train,y_train);
+    model.fit(X_train,y_train,sample_weight);
+    //model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)});
+    std::cout<<"feature importance\n"<<model.feature_importance<<"\n\n";
+
+    VectorXd predictions{model.predict(X_test)};
+    MatrixXd li{model.calculate_local_feature_importance(X_test)};
+
+    //Saving results
+    save_as_csv_file("data/output.csv",predictions);
+
+    std::cout<<predictions.mean()<<"\n\n";
+    tests.push_back(is_approximately_equal(predictions.mean(),23.7883,0.00001));
+
+    //Test summary
+    std::cout<<"\n\nTest summary\n"<<"Passed "<<std::accumulate(tests.begin(),tests.end(),0)<<" out of "<<tests.size()<<" tests.";
+}

examples/train_aplr_cross_validation.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 
 #Training model
 param_grid = {"max_interaction_level":[0,1,2,3,100],"min_observations_in_split":[1, 20, 50, 100, 200]}
-family="gaussian" #other available families are binomial, poisson, gamma, tweedie, group_gaussian and mae.
+family="gaussian" #other available families are binomial, poisson, gamma, tweedie, group_gaussian, mae and quantile.
 link_function="identity" #other available link functions are logit and log.
 grid_search_cv = GridSearchCV(APLRRegressor(random_state=random_state,verbosity=1,m=1000,v=0.1,family=family,link_function=link_function),param_grid,cv=5,n_jobs=4,scoring="neg_mean_squared_error")
 grid_search_cv.fit(data_train[predictors].values,data_train[response].values)
