Commit 1a9f5b4

Merge pull request #13 from ottenbreit-data-science/f
prioritized predictors
2 parents 6a71069 + c160156 commit 1a9f5b4

9 files changed: +224 -172 lines changed

API_REFERENCE.md

Lines changed: 4 additions & 1 deletion
@@ -56,7 +56,7 @@ Specifies the variance power for the "tweedie" ***family***.
 APLR calculates a tuning metric, mean squared error for groups of observations in the validation set. This metric is provided by the method ***get_validation_group_mse()***. The metric may be useful for tuning ***tweedie_power*** and to some extent ***family*** or ***link_function***. The reasoning behind this is that mean squared error (MSE) is often appropriate for evaluating goodness of fit on approximately normally distributed data. The mean of a group of observations is approximately normally distributed according to the Central Limit Theorem (CLT) if there are enough observations in the group, regardless of how individual observations are distributed. Ideally, ***group_size_for_validation_group_mse*** should be large enough so that the Central Limit Theorem holds (at least 30, but the default of 100 is a safer choice). Also, the number of observations in the validation set should be substantially higher than ***group_size_for_validation_group_mse***.
 
 
-## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[])
+## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[])
 
 ***This method fits the model to data.***
 
@@ -77,6 +77,9 @@ An optional list of strings containing names for each predictor in ***X***. Nami
 #### validation_set_indexes
 
 An optional list of integers specifying the indexes of observations to be used for validation instead of training. If this is specified then ***validation_ratio*** is not used. Specifying ***validation_set_indexes*** may be useful for example when modelling time series data (you can place more recent observations in the validation set).
 
+#### prioritized_predictors_indexes
+
+An optional list of integers specifying the indexes of predictors (columns) in ***X*** that should be prioritized. Terms of the prioritized predictors will enter the model as long as they reduce the training error and do not contain too few effective observations. They will also be updated more often.
 
 ## Method: predict(X:npt.ArrayLike, cap_predictions_to_minmax_in_training:bool=True)
 
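A hedged usage sketch of the extended ***fit*** signature follows. The toy data, predictor names, and index choices are illustrative only; just the keyword names come from the signature documented above.

```python
import random

# Toy data: 200 observations, 5 predictors (made up for illustration;
# a real X would typically be a numpy array or pandas DataFrame).
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
y = [row[0] * 2.0 + random.gauss(0.0, 1.0) for row in X]

X_names = ["x1", "x2", "x3", "x4", "x5"]

# For time series data, the most recent observations can be placed in the
# validation set instead of relying on a random split via validation_ratio.
validation_set_indexes = list(range(150, 200))

# Prioritize the first and third predictors (column indexes into X).
prioritized_predictors_indexes = [0, 2]

# With the aplr package installed, the updated fit call would look like:
# from aplr import APLRRegressor
# model = APLRRegressor()
# model.fit(X, y, X_names=X_names,
#           validation_set_indexes=validation_set_indexes,
#           prioritized_predictors_indexes=prioritized_predictors_indexes)
```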

aplr/aplr.py

Lines changed: 2 additions & 2 deletions
@@ -48,9 +48,9 @@ def __set_params_cpp(self):
         self.APLRRegressor.tweedie_power=self.tweedie_power
         self.APLRRegressor.group_size_for_validation_group_mse=self.group_size_for_validation_group_mse
 
-    def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[]):
+    def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[], prioritized_predictors_indexes:List[int]=[]):
         self.__set_params_cpp()
-        self.APLRRegressor.fit(X,y,sample_weight,X_names,validation_set_indexes)
+        self.APLRRegressor.fit(X,y,sample_weight,X_names,validation_set_indexes,prioritized_predictors_indexes)
 
     def predict(self, X:npt.ArrayLike, cap_predictions_to_minmax_in_training:bool=True)->npt.ArrayLike:
         return self.APLRRegressor.predict(X, cap_predictions_to_minmax_in_training)

cpp/APLRRegressor.h

Lines changed: 187 additions & 151 deletions
Large diffs are not rendered by default.

cpp/pythonbinding.cpp

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ PYBIND11_MODULE(aplr_cpp, m) {
         py::arg("tweedie_power")=1.5,
         py::arg("group_size_for_validation_group_mse")=100)
     .def("fit", &APLRRegressor::fit,py::arg("X"),py::arg("y"),py::arg("sample_weight")=VectorXd(0),py::arg("X_names")=std::vector<std::string>(),
-        py::arg("validation_set_indexes")=std::vector<size_t>(),py::call_guard<py::scoped_ostream_redirect,py::scoped_estream_redirect>())
+        py::arg("validation_set_indexes")=std::vector<size_t>(),py::arg("prioritized_predictors_indexes")=std::vector<size_t>(),
+        py::call_guard<py::scoped_ostream_redirect,py::scoped_estream_redirect>())
     .def("predict", &APLRRegressor::predict,py::arg("X"),py::arg("bool cap_predictions_to_minmax_in_training")=true)
     .def("set_term_names", &APLRRegressor::set_term_names,py::arg("X_names"))
     .def("calculate_local_feature_importance",&APLRRegressor::calculate_local_feature_importance,py::arg("X"))

cpp/term.h

Lines changed: 19 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -685,16 +685,10 @@ size_t Term::get_interaction_level(size_t previous_int_level)
 }
 
 
-//Distribution of terms to multiple cores
-std::vector<std::vector<size_t>> distribute_terms_to_cores(std::vector<Term> &terms,size_t n_jobs)
+std::vector<std::vector<size_t>> distribute_terms_indexes_to_cores(std::vector<size_t> &term_indexes,size_t n_jobs)
 {
     //Determining number of terms actually eligible
-    size_t num_eligible_terms{0};
-    for (size_t i = 0; i < terms.size(); ++i)
-    {
-        if(terms[i].ineligible_boosting_steps==0)
-            ++num_eligible_terms;
-    }
+    size_t num_eligible_terms{term_indexes.size()};
 
     //Determining how many items to evaluate per core
     size_t available_cores{static_cast<size_t>(std::thread::hardware_concurrency())};
@@ -713,13 +707,10 @@ std::vector<std::vector<size_t>> distribute_terms_to_cores(std::vector<Term> &te
     //Distributing
     size_t core{0};
     size_t count{0};
-    for (size_t i = 0; i < terms.size(); ++i) //for each term
+    for (size_t i = 0; i < term_indexes.size(); ++i) //for each term
     {
-        if(terms[i].ineligible_boosting_steps==0) //if can be distributed to cores
-        {
-            output[core].push_back(i);
-            ++count;
-        }
+        output[core].push_back(i);
+        ++count;
         if(count>=units_per_core)
         {
             if(core<available_cores-1)
@@ -737,4 +728,18 @@ std::vector<std::vector<size_t>> distribute_terms_to_cores(std::vector<Term> &te
     }
 
     return output;
+}
+
+std::vector<size_t> create_term_indexes(std::vector<Term> &terms)
+{
+    std::vector<size_t> term_indexes;
+    term_indexes.reserve(terms.size());
+    for (size_t i = 0; i < terms.size(); ++i)
+    {
+        bool term_is_eligible{terms[i].ineligible_boosting_steps==0};
+        if(term_is_eligible)
+            term_indexes.push_back(i);
+    }
+    term_indexes.shrink_to_fit();
+    return term_indexes;
 }
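In outline, this refactoring separates eligibility filtering (the new ***create_term_indexes***) from the round-robin distribution over cores (***distribute_terms_indexes_to_cores***). A simplified Python sketch of the new flow follows; details the hunks truncate, such as how ***available_cores*** is derived and the reset of ***count***, are assumptions.

```python
import math

def create_term_indexes(ineligible_boosting_steps):
    # Collect indexes of terms that are currently eligible for boosting
    # (a term is eligible when its ineligible_boosting_steps counter is 0),
    # mirroring the new create_term_indexes() helper.
    return [i for i, steps in enumerate(ineligible_boosting_steps) if steps == 0]

def distribute_term_indexes_to_cores(term_indexes, available_cores):
    # Chunk positions within term_indexes evenly over cores. Taking
    # available_cores as a parameter is a simplification; the C++ version
    # derives it from hardware_concurrency() and n_jobs.
    units_per_core = max(1, math.ceil(len(term_indexes) / available_cores))
    output = [[] for _ in range(available_cores)]
    core = 0
    count = 0
    for i in range(len(term_indexes)):
        output[core].append(i)  # positions into term_indexes, as in the C++
        count += 1
        if count >= units_per_core:
            if core < available_cores - 1:
                core += 1
            count = 0  # assumed reset; the hunk cuts off here
    return output

# Terms 0, 2 and 3 are eligible; their positions are split over two cores.
chunks = distribute_term_indexes_to_cores(create_term_indexes([0, 3, 0, 0, 1]), 2)
# chunks == [[0, 1], [2]]
```

The payoff of the split is that the ineligibility check runs once up front instead of inside every distribution pass.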

cpp/test APLRRegressor.cpp

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,8 @@ int main()
     //Fitting
     //model.fit(X_train,y_train);
     //model.fit(X_train,y_train,sample_weight);
-    model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)});
+    //model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)});
+    model.fit(X_train,y_train,sample_weight,{},{0,1,2,3,4,5,10,static_cast<size_t>(y_train.size()-1)},{1,8});
     std::cout<<"feature importance\n"<<model.feature_importance<<"\n\n";
 
     VectorXd predictions{model.predict(X_test)};
@@ -51,7 +52,7 @@ int main()
     save_data("data/output.csv",predictions);
 
     std::cout<<predictions.mean()<<"\n\n";
-    tests.push_back(is_approximately_equal(predictions.mean(),23.7858,0.00001));
+    tests.push_back(is_approximately_equal(predictions.mean(),23.6889,0.00001));
 
     //std::cout<<model.validation_error_steps<<"\n\n";
 

examples/train_aplr_cross_validation.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,9 @@
 #This means that if you have missing values in the data then you need to either drop rows with missing data or impute them.
 #This also means that if you have a categorical text variable then you need to convert it to for example dummy variables for each category.
 
+#Please also note that APLR may be vulnerable to outliers in predictor values. If you experience this problem then please consider winsorising
+#the predictors (or similar methods) before passing them to APLR.
+
 #Randomly splitting data into training and test sets
 data_train, data_test = train_test_split(data, test_size=0.3, random_state=random_state)
 del data
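The winsorising advice added above can be sketched in plain Python. The helper name and the percentile cut-offs are illustrative choices, not part of APLR; in practice a library routine such as `scipy.stats.mstats.winsorize` would typically be used.

```python
def winsorise_column(values, lower_pct=1.0, upper_pct=99.0):
    # Clip one predictor column to its empirical percentile bounds, using
    # linear interpolation between closest ranks. Values outside the bounds
    # are pulled in rather than dropped, so no observations are lost.
    ordered = sorted(values)
    def percentile(p):
        k = (len(ordered) - 1) * p / 100.0
        f = int(k)
        c = min(f + 1, len(ordered) - 1)
        return ordered[f] + (ordered[c] - ordered[f]) * (k - f)
    lo, hi = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, lo), hi) for v in values]

# 100.0 is an outlier; clipping at the 75th percentile pulls it in.
clipped = winsorise_column([1.0, 2.0, 3.0, 100.0], lower_pct=0.0, upper_pct=75.0)
# clipped == [1.0, 2.0, 3.0, 27.25]
```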

examples/train_aplr_validation.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,9 @@
 #This means that if you have missing values in the data then you need to either drop rows with missing data or impute them.
 #This also means that if you have a categorical text variable then you need to convert it to for example dummy variables for each category.
 
+#Please also note that APLR may be vulnerable to outliers in predictor values. If you experience this problem then please consider winsorising
+#the predictors (or similar methods) before passing them to APLR.
+
 #Randomly splitting data into training and test sets
 data_train, data_test = train_test_split(data, test_size=0.3, random_state=random_state)
 del data

setup.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
 
 setuptools.setup(
     name='aplr',
-    version='1.9.0',
+    version='1.10.0',
     description='Automatic Piecewise Linear Regression',
     ext_modules=[sfc_module],
     author="Mathias von Ottenbreit",

0 commit comments