
Commit 3827e2d: added group_gaussian family
1 parent c246776

10 files changed (+193, -21 lines)

API_REFERENCE.md

Lines changed: 5 additions & 2 deletions

@@ -14,7 +14,7 @@ The learning rate. Must be greater than zero and not more than one. The higher t
 Used to randomly split training observations into training and validation if ***validation_set_indexes*** is not specified when fitting.
 
 #### family (default = "gaussian")
-Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma" and "tweedie". This is used together with ***link_function***.
+Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma", "tweedie" and "group_gaussian". This is used together with ***link_function***. When ***family*** is "group_gaussian", the ***group*** argument in the ***fit*** method must be provided; APLR will then try to minimize group MSE when training the model.
 
 #### link_function (default = "identity")
 Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***family*** "gaussian" and ***link_function*** "identity". For logistic regression use ***family*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma" or "tweedie" ***family***, depending on the data. The ***family*** "poisson", "gamma" or "tweedie" should only be used with the "log" ***link_function***. Inappropriate combinations of ***family*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.

@@ -55,7 +55,7 @@ Specifies the variance power for the "tweedie" ***family***.
 #### validation_tuning_metric (default = "default")
 Specifies which metric to use for validating the model and tuning ***m***. Available options are "default" (using the same methodology as when calculating the training error), "mse", "mae", "negative_gini" and "rankability". The default is often a choice that fits well with respect to the ***family*** chosen. However, if you want to use ***family*** or ***tweedie_power*** as tuning parameters then the default is not suitable. "rankability" uses a methodology similar to the one described in https://towardsdatascience.com/how-to-calculate-roc-auc-score-for-regression-models-c0be4fdf76bb except that the metric is inverted and can be weighted by sample weights.
 
-## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[])
+## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[], group:npt.ArrayLike = np.empty(0))
 
 ***This method fits the model to data.***
 

@@ -82,6 +82,9 @@ An optional list of integers specifying the indexes of predictors (columns) in *
 #### monotonic_constraints
 An optional list of integers specifying monotonic constraints on model terms. For example, if there are three predictors in ***X***, then monotonic_constraints = [1,0,-1] means that 1) the first predictor in ***X*** cannot be used in interaction terms and all terms using the first predictor in ***X*** as a main effect must have positive regression coefficients, 2) there are no monotonic constraints on terms using the second predictor in ***X***, and 3) the third predictor in ***X*** cannot be used in interaction terms and all terms using the third predictor in ***X*** as a main effect must have negative regression coefficients.
 
+#### group
+A numpy vector of integers that must be provided when ***family*** is "group_gaussian". For example, ***group*** may represent years, which can be useful in a time series model.
+
 
 ## Method: predict(X:npt.ArrayLike, cap_predictions_to_minmax_in_training:bool=True)
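Usage illustration (not part of the commit): a minimal Python sketch of fitting with the new "group_gaussian" family on made-up data, assuming that ***family*** is passed to the APLRRegressor constructor like the other constructor parameters documented above.

```python
# Hypothetical example; the data and the constructor call are illustrative only.
import numpy as np
from aplr import APLRRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=200)
group = rng.integers(2018, 2022, size=200)  # e.g. the year of each observation

model = APLRRegressor(family="group_gaussian")
model.fit(X, y, group=group)  # group is required when family is "group_gaussian"
predictions = model.predict(X)
```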

aplr/aplr.py

Lines changed: 2 additions & 2 deletions

@@ -48,9 +48,9 @@ def __set_params_cpp(self):
         self.APLRRegressor.tweedie_power=self.tweedie_power
         self.APLRRegressor.validation_tuning_metric=self.validation_tuning_metric
 
-    def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[]):
+    def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[], monotonic_constraints:List[int]=[],group:npt.ArrayLike = np.empty(0)):
         self.__set_params_cpp()
-        self.APLRRegressor.fit(X,y,sample_weight,X_names,validation_set_indexes,prioritized_predictors_indexes,monotonic_constraints)
+        self.APLRRegressor.fit(X,y,sample_weight,X_names,validation_set_indexes,prioritized_predictors_indexes,monotonic_constraints,group)
 
     def predict(self, X:npt.ArrayLike, cap_predictions_to_minmax_in_training:bool=True)->npt.ArrayLike:
         return self.APLRRegressor.predict(X, cap_predictions_to_minmax_in_training)
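The wrapper simply forwards ***group*** positionally to the C++ core, so its np.empty(0) default plays the same role as VectorXi(0) on the C++ side: an empty vector signals that no groups were provided.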

cpp/APLRRegressor.h

Lines changed: 59 additions & 10 deletions

@@ -46,15 +46,20 @@ class APLRRegressor
     std::vector<size_t> predictor_indexes;
     std::vector<size_t> prioritized_predictors_indexes;
     std::vector<int> monotonic_constraints;
+    VectorXi group_train;
+    VectorXi group_validation;
+    std::set<int> unique_groups_train;
+    std::set<int> unique_groups_validation;
 
     //Methods
     void validate_input_to_fit(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight,const std::vector<std::string> &X_names,
         const std::vector<size_t> &validation_set_indexes, const std::vector<size_t> &prioritized_predictors_indexes,
-        const std::vector<int> &monotonic_constraints);
+        const std::vector<int> &monotonic_constraints, const VectorXi &group);
     void throw_error_if_validation_set_indexes_has_invalid_indexes(const VectorXd &y, const std::vector<size_t> &validation_set_indexes);
     void throw_error_if_prioritized_predictors_indexes_has_invalid_indexes(const MatrixXd &X, const std::vector<size_t> &prioritized_predictors_indexes);
     void throw_error_if_monotonic_constraints_has_invalid_indexes(const MatrixXd &X, const std::vector<int> &monotonic_constraints);
-    void define_training_and_validation_sets(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight, const std::vector<size_t> &validation_set_indexes);
+    void define_training_and_validation_sets(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight,
+        const std::vector<size_t> &validation_set_indexes, const VectorXi &group);
     void initialize(const std::vector<size_t> &prioritized_predictors_indexes, const std::vector<int> &monotonic_constraints);
     bool check_if_base_term_has_only_one_unique_value(size_t base_term);
     void add_term_to_terms_eligible_current(Term &term);

@@ -146,7 +151,8 @@ class APLRRegressor
     APLRRegressor(const APLRRegressor &other);
     ~APLRRegressor();
     void fit(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight=VectorXd(0),const std::vector<std::string> &X_names={},const std::vector<size_t> &validation_set_indexes={},
-        const std::vector<size_t> &prioritized_predictors_indexes={}, const std::vector<int> &monotonic_constraints={});
+        const std::vector<size_t> &prioritized_predictors_indexes={}, const std::vector<int> &monotonic_constraints={},
+        const VectorXi &group=VectorXi(0));
     VectorXd predict(const MatrixXd &X, bool cap_predictions_to_minmax_in_training=true);
     void set_term_names(const std::vector<std::string> &X_names);
     MatrixXd calculate_local_feature_importance(const MatrixXd &X);

@@ -207,13 +213,13 @@ APLRRegressor::~APLRRegressor()
 //invalidating validation_ratio. The rest of indexes are used to train.
 void APLRRegressor::fit(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight,const std::vector<std::string> &X_names,
     const std::vector<size_t> &validation_set_indexes,const std::vector<size_t> &prioritized_predictors_indexes,
-    const std::vector<int> &monotonic_constraints)
+    const std::vector<int> &monotonic_constraints, const VectorXi &group)
 {
     throw_error_if_family_does_not_exist();
     throw_error_if_link_function_does_not_exist();
     throw_error_if_tweedie_power_is_invalid();
-    validate_input_to_fit(X,y,sample_weight,X_names,validation_set_indexes,prioritized_predictors_indexes,monotonic_constraints);
-    define_training_and_validation_sets(X,y,sample_weight,validation_set_indexes);
+    validate_input_to_fit(X,y,sample_weight,X_names,validation_set_indexes,prioritized_predictors_indexes,monotonic_constraints,group);
+    define_training_and_validation_sets(X,y,sample_weight,validation_set_indexes,group);
     scale_training_observations_if_using_log_link_function();
     initialize(prioritized_predictors_indexes, monotonic_constraints);
     execute_boosting_steps();

@@ -239,7 +245,9 @@ void APLRRegressor::throw_error_if_family_does_not_exist()
     else if(family=="gamma")
         family_exists=true;
     else if(family=="tweedie")
-        family_exists=true;
+        family_exists=true;
+    else if(family=="group_gaussian")
+        family_exists=true;
     if(!family_exists)
         throw std::runtime_error("Family "+family+" is not available in APLR.");
 }

@@ -268,7 +276,7 @@ void APLRRegressor::throw_error_if_tweedie_power_is_invalid()
 
 void APLRRegressor::validate_input_to_fit(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight,
     const std::vector<std::string> &X_names, const std::vector<size_t> &validation_set_indexes,
-    const std::vector<size_t> &prioritized_predictors_indexes, const std::vector<int> &monotonic_constraints)
+    const std::vector<size_t> &prioritized_predictors_indexes, const std::vector<int> &monotonic_constraints, const VectorXi &group)
 {
     if(X.rows()!=y.size()) throw std::runtime_error("X and y must have the same number of rows.");
     if(X.rows()<2) throw std::runtime_error("X and y cannot have less than two rows.");

@@ -281,6 +289,8 @@ void APLRRegressor::validate_input_to_fit(const MatrixXd &X,const VectorXd &y,co
     throw_error_if_monotonic_constraints_has_invalid_indexes(X, monotonic_constraints);
     throw_error_if_response_contains_invalid_values(y);
     throw_error_if_sample_weight_contains_invalid_values(y, sample_weight);
+    bool group_is_of_incorrect_size{family=="group_gaussian" && group.rows()!=y.rows()};
+    if(group_is_of_incorrect_size) throw std::runtime_error("When family is group_gaussian then y and group must have the same number of rows.");
 }
 
 void APLRRegressor::throw_error_if_validation_set_indexes_has_invalid_indexes(const VectorXd &y, const std::vector<size_t> &validation_set_indexes)

@@ -381,7 +391,8 @@ void APLRRegressor::throw_error_if_sample_weight_contains_invalid_values(const V
     }
 }
 
-void APLRRegressor::define_training_and_validation_sets(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight, const std::vector<size_t> &validation_set_indexes)
+void APLRRegressor::define_training_and_validation_sets(const MatrixXd &X,const VectorXd &y,const VectorXd &sample_weight,
+    const std::vector<size_t> &validation_set_indexes, const VectorXi &group)
 {
     size_t y_size{static_cast<size_t>(y.size())};
     std::vector<size_t> train_indexes;

@@ -440,6 +451,16 @@ void APLRRegressor::define_training_and_validation_sets(const MatrixXd &X,const
             sample_weight_train[i]=sample_weight[train_indexes[i]];
         }
     }
+    bool groups_are_provided{group.size()>0};
+    if(groups_are_provided)
+    {
+        group_train.resize(train_indexes.size());
+        for (size_t i = 0; i < train_indexes.size(); ++i)
+        {
+            group_train[i]=group[train_indexes[i]];
+        }
+        unique_groups_train = get_unique_integers(group_train);
+    }
     //Populating test matrices
     for (size_t i = 0; i < validation_indexes.size(); ++i)
     {

@@ -454,6 +475,15 @@ void APLRRegressor::define_training_and_validation_sets(const MatrixXd &X,const
             sample_weight_validation[i]=sample_weight[validation_indexes[i]];
         }
     }
+    if(groups_are_provided)
+    {
+        group_validation.resize(validation_indexes.size());
+        for (size_t i = 0; i < validation_indexes.size(); ++i)
+        {
+            group_validation[i]=group[validation_indexes[i]];
+        }
+        unique_groups_validation = get_unique_integers(group_validation);
+    }
 }
 
 void APLRRegressor::scale_training_observations_if_using_log_link_function()

@@ -561,6 +591,21 @@ VectorXd APLRRegressor::calculate_neg_gradient_current()
         output=(y_train.array() - predictions_current.array()) / predictions_current.array() / predictions_current.array();
     else if(family=="tweedie")
         output=(y_train.array()-predictions_current.array()).array() * predictions_current.array().pow(-tweedie_power);
+    else if(family=="group_gaussian")
+    {
+        GroupData group_residuals_and_count{calculate_group_errors_and_count(y_train,predictions_current,group_train,unique_groups_train)};
+
+        for(int unique_group_value:unique_groups_train)
+        {
+            group_residuals_and_count.error[unique_group_value] /= group_residuals_and_count.count[unique_group_value];
+        }
+
+        output = VectorXd(y_train.rows());
+        for (Eigen::Index i = 0; i < y_train.size(); ++i)
+        {
+            output[i] = group_residuals_and_count.error[group_train[i]];
+        }
+    }
 
     if(link_function!="identity")
         output=output.array()*differentiate_predictions().array();

@@ -1004,7 +1049,7 @@ void APLRRegressor::calculate_and_validate_validation_error(size_t boosting_step
 void APLRRegressor::calculate_validation_error(size_t boosting_step, const VectorXd &predictions)
 {
     if(validation_tuning_metric=="default")
-        validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,family,tweedie_power),sample_weight_validation);
+        validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,family,tweedie_power,group_validation,unique_groups_validation),sample_weight_validation);
     else if(validation_tuning_metric=="mse")
         validation_error_steps[boosting_step]=calculate_mean_error(calculate_errors(y_validation,predictions,sample_weight_validation,FAMILY_GAUSSIAN),sample_weight_validation);
     else if(validation_tuning_metric=="mae")

@@ -1274,6 +1319,10 @@ void APLRRegressor::cleanup_after_fit()
     predictor_indexes.clear();
     prioritized_predictors_indexes.clear();
     monotonic_constraints.clear();
+    group_train.resize(0);
+    group_validation.resize(0);
+    unique_groups_train.clear();
+    unique_groups_validation.clear();
 }
 
 VectorXd APLRRegressor::predict(const MatrixXd &X, bool cap_predictions_to_minmax_in_training)
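Not part of the commit: a NumPy sketch of what calculate_neg_gradient_current does for the "group_gaussian" family. Every observation receives the mean raw residual of its group, which is, up to a constant factor, the negative gradient of the group MSE with respect to the predictions.

```python
import numpy as np

def group_gaussian_neg_gradient(y, predictions, group):
    # Mirrors the C++ logic: sum the residuals per group, divide by the
    # group count, and assign each observation its group's mean residual.
    residuals = y - predictions
    output = np.empty_like(residuals)
    for g in np.unique(group):
        in_group = group == g
        output[in_group] = residuals[in_group].mean()
    return output
```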

cpp/functions.h

Lines changed: 55 additions & 3 deletions

@@ -1,6 +1,5 @@
 #pragma once
 #include <limits>
-#include "../dependencies/eigen-master/Eigen/Dense"
 #include <numeric> //std::iota
 #include <algorithm> //std::sort, std::stable_sort
 #include <vector>

@@ -9,6 +8,9 @@
 #include <thread>
 #include <future>
 #include <random>
+#include <set>
+#include <map>
+#include "../dependencies/eigen-master/Eigen/Dense"
 #include "constants.h"
 
 using namespace Eigen;

@@ -37,6 +39,12 @@ static bool is_approximately_zero(TReal a, TReal tolerance = std::numeric_limits
     return false;
 }
 
+std::set<int> get_unique_integers(const VectorXi &int_vector)
+{
+    std::set<int> unique_integers{int_vector.begin(),int_vector.end()};
+    return unique_integers;
+}
+
 double set_error_to_infinity_if_invalid(double error)
 {
     bool error_is_invalid{!std::isfinite(error)};

@@ -77,7 +85,49 @@ VectorXd calculate_tweedie_errors(const VectorXd &y,const VectorXd &predicted,do
     return errors;
 }
 
-VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &family="gaussian",double tweedie_power=1.5)
+struct GroupData
+{
+    std::map<int,double> error;
+    std::map<int,size_t> count;
+};
+
+GroupData calculate_group_errors_and_count(const VectorXd &y,const VectorXd &predicted,const VectorXi &group, const std::set<int> &unique_groups)
+{
+    GroupData group_data;
+    for(int unique_group_value:unique_groups)
+    {
+        group_data.error[unique_group_value]=0.0;
+        group_data.count[unique_group_value]=0;
+    }
+    for (Eigen::Index i = 0; i < group.size(); ++i)
+    {
+        group_data.error[group[i]] += y[i] - predicted[i];
+        group_data.count[group[i]] += 1;
+    }
+    return group_data;
+}
+
+VectorXd calculate_group_gaussian_errors(const VectorXd &y,const VectorXd &predicted,const VectorXi &group, const std::set<int> &unique_groups)
+{
+    GroupData group_residuals_and_count{calculate_group_errors_and_count(y,predicted,group,unique_groups)};
+
+    for(int unique_group_value:unique_groups)
+    {
+        group_residuals_and_count.error[unique_group_value] *= group_residuals_and_count.error[unique_group_value];
+        group_residuals_and_count.error[unique_group_value] /= group_residuals_and_count.count[unique_group_value];
+    }
+
+    VectorXd errors(y.rows());
+    for (Eigen::Index i = 0; i < y.size(); ++i)
+    {
+        errors[i] = group_residuals_and_count.error[group[i]];
+    }
+
+    return errors;
+}
+
+VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const VectorXd &sample_weight=VectorXd(0),const std::string &family="gaussian",double tweedie_power=1.5, const VectorXi &group=VectorXi(0), const std::set<int> &unique_groups={})
 {
     VectorXd errors;
     if(family=="gaussian")

@@ -90,7 +140,9 @@ VectorXd calculate_errors(const VectorXd &y,const VectorXd &predicted,const Vect
         errors=calculate_gamma_errors(y,predicted);
     else if(family=="tweedie")
         errors=calculate_tweedie_errors(y,predicted,tweedie_power);
-
+    else if(family=="group_gaussian")
+        errors=calculate_group_gaussian_errors(y,predicted,group,unique_groups);
+
     if(sample_weight.size()>0)
         errors=errors.array()*sample_weight.array();
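Not part of the commit, only a summary of what calculate_group_gaussian_errors computes: with residuals $r_j = y_j - \hat{y}_j$ and $n_g$ observations in group $g$, every observation $i$ belonging to group $g$ is assigned the error

$$e_i = \frac{\left(\sum_{j \in g} r_j\right)^2}{n_g}$$

With ***validation_tuning_metric*** "default", the validation error is then the (optionally sample-weighted) mean of these $e_i$, which is the group MSE that the "group_gaussian" family minimizes.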

cpp/pythonbinding.cpp

Lines changed: 2 additions & 1 deletion

@@ -22,7 +22,8 @@ PYBIND11_MODULE(aplr_cpp, m) {
         )
         .def("fit", &APLRRegressor::fit,py::arg("X"),py::arg("y"),py::arg("sample_weight")=VectorXd(0),py::arg("X_names")=std::vector<std::string>(),
             py::arg("validation_set_indexes")=std::vector<size_t>(),py::arg("prioritized_predictors_indexes")=std::vector<size_t>(),
-            py::arg("monotonic_constraints")=std::vector<int>(),py::call_guard<py::scoped_ostream_redirect,py::scoped_estream_redirect>())
+            py::arg("monotonic_constraints")=std::vector<int>(),py::arg("group")=VectorXi(0),
+            py::call_guard<py::scoped_ostream_redirect,py::scoped_estream_redirect>())
         .def("predict", &APLRRegressor::predict,py::arg("X"),py::arg("bool cap_predictions_to_minmax_in_training")=true)
         .def("set_term_names", &APLRRegressor::set_term_names,py::arg("X_names"))
         .def("calculate_local_feature_importance",&APLRRegressor::calculate_local_feature_importance,py::arg("X"))
