
Commit 343fe99

Added a general Tweedie family and link function instead of a couple of specific implementations. Also added an API reference.
1 parent f8fa60d commit 343fe99

11 files changed (+270 −97 lines)

API_REFERENCE.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# APLRRegressor

## class aplr.APLRRegressor(m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5)

### Constructor parameters

#### m (default = 1000)
The maximum number of boosting steps. If the validation error has not flattened out by the ***m***th boosting step, then try increasing ***m*** (or alternatively increasing the learning rate).

#### v (default = 0.1)
The learning rate. Must be greater than zero and at most one. The higher the learning rate, the faster the algorithm learns and the fewer boosting steps (***m***) are required. However, empirical evidence suggests that ***v <= 0.1*** gives better results. If the algorithm learns too fast (requires few boosting steps to converge), then try lowering the learning rate. Computational costs can be reduced by increasing the learning rate while simultaneously decreasing ***m***, potentially at the expense of predictiveness.

#### random_state (default = 0)
Used to randomly split training observations into training and validation sets if ***validation_set_indexes*** is not specified when fitting.

#### family (default = "gaussian")
Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma" and "tweedie". This is used together with ***link_function***.

#### link_function (default = "identity")
Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit", "log", "inverse" and "tweedie". These are canonical link functions for the "gaussian", "binomial", "poisson", "gamma" and "tweedie" ***family*** respectively. Canonical links usually work fine given that the data is appropriate for the selected combination of ***family*** and ***link_function***. Other combinations of ***family*** and ***link_function*** may or may not work (the model may fit poorly to the data if the wrong combination is used).

#### n_jobs (default = 0)
Multi-threading parameter. If ***0*** then uses all available cores for multi-threading. Any other positive integer specifies the number of cores to use (***1*** means single-threading).

#### validation_ratio (default = 0.2)
The ratio of training observations to use for validation instead of training. The number of boosting steps is automatically tuned to minimize validation error.

#### intercept (default = nan)
Specifies the intercept term of the model, which makes it possible to predict before doing any training. However, when the ***fit*** method is run, the intercept is estimated from the data, overwriting whatever was specified as ***intercept*** when instantiating ***APLRRegressor***.

#### bins (default = 300)
Specifies the maximum number of bins to discretize the data into when searching for the best split. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### max_interaction_level (default = 1)
Specifies the maximum allowed depth of interaction terms. ***0*** means that interactions are not allowed. This hyperparameter should be tuned. Please note that a value that is too high can occasionally produce a model that performs poorly on an independent test set despite looking good when tuning hyperparameters. If this happens, reduce ***max_interaction_level*** until the problem disappears.

#### max_interactions (default = 100000)
The maximum number of interactions allowed. A lower value may be used to reduce computational time.

#### min_observations_in_split (default = 20)
The minimum effective number of observations that a term in the model must rely on. This hyperparameter should be tuned. Larger values are more appropriate for larger datasets. Larger values result in more robust models (lower variance), potentially at the expense of increased bias.

#### ineligible_boosting_steps_added (default = 10)
Controls how many boosting steps a term that becomes ineligible has to remain ineligible. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### max_eligible_terms (default = 5)
Limits 1) the number of terms already in the model that can be considered as interaction partners in a boosting step and 2) how many terms remain eligible in the next boosting step. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### verbosity (default = 0)
***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.

#### tweedie_power (default = 1.5)
Specifies the variance power for the "tweedie" ***family*** and ***link_function***.

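For illustration, a minimal sketch of instantiating the model for a Tweedie-distributed response. The parameter choices here are hypothetical, not recommendations:

```python
from aplr import APLRRegressor

# Tweedie family with its matching link function; tweedie_power=1.5
# lies between the Poisson (1.0) and gamma (2.0) special cases of the
# Tweedie distribution.
model = APLRRegressor(
    family="tweedie",
    link_function="tweedie",
    tweedie_power=1.5,
)
```
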
## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[])

***This method fits the model to data.***

### Parameters

#### X
A numpy matrix with predictor values.

#### y
A numpy vector with response values.

#### sample_weight
An optional numpy vector with sample weights. If not specified then the observations are weighted equally.

#### X_names
An optional list of strings containing names for each predictor in ***X***. Naming predictors may increase model readability because model terms get names based on ***X_names***.

#### validation_set_indexes
An optional list of integers specifying the indexes of observations to be used for validation instead of training. If this is specified then ***validation_ratio*** is not used. Specifying ***validation_set_indexes*** may be useful, for example, when modelling time series data (you can place more recent observations in the validation set), as in the sketch below.

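A hedged sketch of the time-series use case just mentioned; the data and predictor names are made up for illustration:

```python
import numpy as np
from aplr import APLRRegressor

# Hypothetical time-ordered data: 1000 observations, 3 predictors.
X = np.random.rand(1000, 3)
y = np.random.rand(1000)

model = APLRRegressor()

# Validate on the most recent 20% of observations instead of a random split.
model.fit(
    X,
    y,
    X_names=["x1", "x2", "x3"],
    validation_set_indexes=list(range(800, 1000)),
)
```
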
## Method: predict(X:npt.ArrayLike)

***Returns a numpy vector containing predictions for the data in X. Requires that the model has been fitted with the fit method.***

### Parameters

#### X
A numpy matrix with predictor values.

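Continuing the hypothetical fit example above, a sketch of prediction:

```python
# New observations with the same number of predictors as in training.
X_new = np.random.rand(10, 3)

# One prediction per row of X_new.
predictions = model.predict(X_new)
print(predictions)
```
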
## Method: set_term_names(X_names:List[str])

***This method sets the names of terms based on X_names.***

### Parameters

#### X_names
A list of strings containing names for each predictor in the ***X*** matrix that the model was trained on.

## Method: calculate_local_feature_importance(X:npt.ArrayLike)

***Returns a numpy matrix containing local feature importance for new data by each predictor in X.***

### Parameters

#### X
A numpy matrix with predictor values.

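A sketch continuing the same hypothetical example; the exact layout of the returned matrix is an assumption (one row per observation, one column per predictor):

```python
# Local feature importance for the new observations.
local_importance = model.calculate_local_feature_importance(X_new)

# Assumed layout: rows correspond to observations in X_new,
# columns to predictors.
print(local_importance.shape)
```
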
## Method: calculate_local_feature_importance_for_terms(X:npt.ArrayLike)

***Returns a numpy matrix containing local feature importance for new data by each term in the model.***

### Parameters

#### X
A numpy matrix with predictor values.

## Method: calculate_terms(X:npt.ArrayLike)

***Returns a numpy matrix containing values of model terms calculated on X.***

### Parameters

#### X
A numpy matrix with predictor values.

## Method: get_term_names()

***Returns a list of strings containing term names.***

## Method: get_term_coefficients()

***Returns a numpy vector containing term regression coefficients.***

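Together these two methods expose the fitted model's structure. A sketch that prints each term with its coefficient, assuming the two return values align one-to-one:

```python
# Pair each term name with its regression coefficient (assumed aligned).
for name, coefficient in zip(model.get_term_names(), model.get_term_coefficients()):
    print(f"{name}: {coefficient}")
```
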
## Method: get_term_coefficient_steps(term_index:int)

***Returns a numpy vector containing the regression coefficient at each boosting step for the selected term.***

### Parameters

#### term_index
The index of the term selected: ***0*** is the first term, ***1*** is the second term, and so on.

## Method: get_validation_error_steps()

***Returns a numpy vector containing the validation error by boosting step. Use this to determine if the maximum number of boosting steps (m) or learning rate (v) should be changed.***

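For example, a sketch of this diagnostic, assuming the returned vector covers every boosting step attempted:

```python
import numpy as np

validation_errors = model.get_validation_error_steps()
best_step = int(np.argmin(validation_errors))

# If the minimum sits at the very end, validation error may still have been
# improving when boosting stopped; consider increasing m (or the learning
# rate v).
if best_step >= len(validation_errors) - 1:
    print("Consider increasing m or v.")
```
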
## Method: get_feature_importance()

***Returns a numpy vector containing the feature importance (estimated on the validation set) of each predictor.***

## Method: get_intercept()

***Returns the regression coefficient of the intercept term.***

## Method: get_intercept_steps()

***Returns a numpy vector containing the regression coefficients of the intercept term by boosting step.***

## Method: get_m()

***Returns the number of boosting steps in the model (the value that minimized validation error).***

README.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -2,19 +2,19 @@
 Automatic Piecewise Linear Regression.

 # About
-Build predictive and interpretable parametric machine learning models in Python based on the Automatic Piecewise Linear Regression (APLR) methodology developed by Mathias von Ottenbreit. APLR is often able to compete with Random Forest on predictiveness, but unlike Random Forest and other tree-based methods APLR is interpretable. See the documentation folder for more information.
+Build predictive and interpretable parametric machine learning models in Python based on the Automatic Piecewise Linear Regression (APLR) methodology developed by Mathias von Ottenbreit. APLR is often able to compete with Random Forest on predictiveness, but unlike Random Forest and other tree-based methods APLR is interpretable. See the ***documentation*** folder for more information.

 # How to install
-pip install aplr
+***pip install aplr***

 # Availability
 Currently available for Windows and most Linux distributions.

 # How to use
-Please see the two example Python scripts in the examples folder. They cover common use cases, but not all of the functionality in this package. For example, fitting with user-specified observation weights is possible but the example scripts do not use this functionality.
+Please see the two example Python scripts in the ***examples*** folder. They cover common use cases, but not all of the functionality in this package. For example, fitting with user-specified observation weights is possible but the example scripts do not use this functionality.

 # Sponsorship
 Please consider sponsoring Ottenbreit Data Science by clicking on the Sponsor button. Sufficient funding will enable maintenance of APLR and further development.

 # API reference
-A thorough API reference will be provided in the future.
+Please see ***API_REFERENCE.md***
```

aplr/aplr.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -5,7 +5,7 @@


 class APLRRegressor():
-    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=100, max_interactions:int=0, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0):
+    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5):
         self.m=m
         self.v=v
         self.random_state=random_state
@@ -21,6 +21,7 @@ def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaus
         self.ineligible_boosting_steps_added=ineligible_boosting_steps_added
         self.max_eligible_terms=max_eligible_terms
         self.verbosity=verbosity
+        self.tweedie_power=tweedie_power

         #Creating aplr_cpp and setting parameters
         self.APLRRegressor=aplr_cpp.APLRRegressor()
@@ -43,6 +44,7 @@ def __set_params_cpp(self):
         self.APLRRegressor.ineligible_boosting_steps_added=self.ineligible_boosting_steps_added
         self.APLRRegressor.max_eligible_terms=self.max_eligible_terms
         self.APLRRegressor.verbosity=self.verbosity
+        self.APLRRegressor.tweedie_power=self.tweedie_power

     def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[]):
         self.__set_params_cpp()
@@ -89,7 +91,7 @@ def get_m(self)->int:

     #For sklearn
     def get_params(self, deep=True):
-        return {"m": self.m, "v": self.v,"random_state":self.random_state,"family":self.family,"link_function":self.link_function,"n_jobs":self.n_jobs,"validation_ratio":self.validation_ratio,"intercept":self.intercept,"bins":self.bins,"max_interaction_level":self.max_interaction_level,"max_interactions":self.max_interactions,"verbosity":self.verbosity,"min_observations_in_split":self.min_observations_in_split,"ineligible_boosting_steps_added":self.ineligible_boosting_steps_added,"max_eligible_terms":self.max_eligible_terms}
+        return {"m": self.m, "v": self.v,"random_state":self.random_state,"family":self.family,"link_function":self.link_function,"n_jobs":self.n_jobs,"validation_ratio":self.validation_ratio,"intercept":self.intercept,"bins":self.bins,"max_interaction_level":self.max_interaction_level,"max_interactions":self.max_interactions,"verbosity":self.verbosity,"min_observations_in_split":self.min_observations_in_split,"ineligible_boosting_steps_added":self.ineligible_boosting_steps_added,"max_eligible_terms":self.max_eligible_terms,"tweedie_power":self.tweedie_power}

     #For sklearn
     def set_params(self, **parameters):
```
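Because ***get_params*** and ***set_params*** follow the scikit-learn estimator convention, hyperparameter search with scikit-learn tools should plausibly work. A minimal sketch, assuming scikit-learn is installed and that cross-validated scoring via ***fit***/***predict*** is sufficient; the data and grid values are synthetic illustrations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from aplr import APLRRegressor

# Synthetic illustration data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# Grid over two hyperparameters that the API reference suggests tuning.
param_grid = {
    "max_interaction_level": [0, 1, 2],
    "min_observations_in_split": [20, 50, 100],
}
search = GridSearchCV(
    APLRRegressor(m=1000, v=0.1),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```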
