
Commit 343fe99

Added a general Tweedie family and link function instead of a couple of specific implementations. Also added an API reference.
1 parent f8fa60d commit 343fe99

11 files changed (+270 −97 lines)

API_REFERENCE.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# APLRRegressor

## class aplr.APLRRegressor(m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5)

### Constructor parameters

#### m (default = 1000)
The maximum number of boosting steps. If the validation error has not flattened out by the ***m***th boosting step, then try increasing ***m*** (or alternatively increasing the learning rate).

#### v (default = 0.1)
The learning rate. Must be greater than zero and at most one. The higher the learning rate, the faster the algorithm learns and the fewer boosting steps (***m***) are required. However, empirical evidence suggests that ***v <= 0.1*** gives better results. If the algorithm learns too fast (requires few boosting steps to converge), then try lowering the learning rate. Computational costs can be reduced by increasing the learning rate while simultaneously decreasing ***m***, potentially at the expense of predictiveness.

#### random_state (default = 0)
Used to randomly split training observations into training and validation sets if ***validation_set_indexes*** is not specified when fitting.

#### family (default = "gaussian")
Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma" and "tweedie". This is used together with ***link_function***.

#### link_function (default = "identity")
Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit", "log", "inverse" and "tweedie". These are canonical link functions for the "gaussian", "binomial", "poisson", "gamma" and "tweedie" ***family*** respectively. Canonical links usually work fine given that the data is appropriate for the selected combination of ***family*** and ***link_function***. Other combinations of ***family*** and ***link_function*** may or may not work (the model may fit poorly to the data if the wrong combination is used).

#### n_jobs (default = 0)
Multi-threading parameter. If ***0*** then uses all available cores for multi-threading. Any other positive integer specifies the number of cores to use (***1*** means single-threading).

#### validation_ratio (default = 0.2)
The ratio of training observations to use for validation instead of training. The number of boosting steps is automatically tuned to minimize validation error.

#### intercept (default = nan)
Specifies the intercept term of the model, which makes it possible to predict before doing any training. However, when the ***fit*** method is run, the intercept is estimated from the data, overwriting whatever was specified as ***intercept*** when instantiating ***APLRRegressor***.

#### bins (default = 300)
Specifies the maximum number of bins to discretize the data into when searching for the best split. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### max_interaction_level (default = 1)
Specifies the maximum allowed depth of interaction terms. ***0*** means that interactions are not allowed. This hyperparameter should be tuned. Please note that a value that is too high can occasionally produce a model that performs poorly on an independent test set despite looking good when tuning hyperparameters. If this happens, reduce ***max_interaction_level*** until the problem disappears.

#### max_interactions (default = 100000)
The maximum number of interactions allowed. A lower value may be used to reduce computational time.

#### min_observations_in_split (default = 20)
The minimum effective number of observations that a term in the model must rely on. This hyperparameter should be tuned. Larger values are more appropriate for larger datasets. Larger values result in more robust models (lower variance), potentially at the expense of increased bias.

#### ineligible_boosting_steps_added (default = 10)
Controls how many boosting steps a term that becomes ineligible has to remain ineligible. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### max_eligible_terms (default = 5)
Limits 1) the number of terms already in the model that can be considered as interaction partners in a boosting step and 2) how many terms remain eligible in the next boosting step. The default value works well according to empirical results. This hyperparameter is intended for reducing computational costs.

#### verbosity (default = 0)
***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.

#### tweedie_power (default = 1.5)
Specifies the variance power for the "tweedie" ***family*** and ***link_function***.

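For illustration, a minimal sketch of instantiating the model for a Tweedie-distributed response. The parameter choices here are hypothetical, not recommendations:

```python
from aplr import APLRRegressor

# Tweedie family with its matching link function; tweedie_power=1.5
# lies between the Poisson (1.0) and gamma (2.0) special cases of the
# Tweedie distribution.
model = APLRRegressor(
    family="tweedie",
    link_function="tweedie",
    tweedie_power=1.5,
)
```
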
## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[])

***This method fits the model to data.***

### Parameters

#### X
A numpy matrix with predictor values.

#### y
A numpy vector with response values.

#### sample_weight
An optional numpy vector with sample weights. If not specified then the observations are weighted equally.

#### X_names
An optional list of strings containing names for each predictor in ***X***. Naming predictors may increase model readability because model terms get names based on ***X_names***.

#### validation_set_indexes
An optional list of integers specifying the indexes of observations to be used for validation instead of training. If this is specified then ***validation_ratio*** is not used. Specifying ***validation_set_indexes*** may be useful, for example, when modelling time series data (you can place more recent observations in the validation set), as in the sketch below.

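A hedged sketch of the time-series use case just mentioned; the data and predictor names are made up for illustration:

```python
import numpy as np
from aplr import APLRRegressor

# Hypothetical time-ordered data: 1000 observations, 3 predictors.
X = np.random.rand(1000, 3)
y = np.random.rand(1000)

model = APLRRegressor()

# Validate on the most recent 20% of observations instead of a random split.
model.fit(
    X,
    y,
    X_names=["x1", "x2", "x3"],
    validation_set_indexes=list(range(800, 1000)),
)
```
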
## Method: predict(X:npt.ArrayLike)

***Returns a numpy vector containing predictions for the data in X. Requires that the model has been fitted with the fit method.***

### Parameters

#### X
A numpy matrix with predictor values.

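Continuing the hypothetical fit example above, a sketch of prediction:

```python
# New observations with the same number of predictors as in training.
X_new = np.random.rand(10, 3)

# One prediction per row of X_new.
predictions = model.predict(X_new)
print(predictions)
```
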
## Method: set_term_names(X_names:List[str])

***This method sets the names of terms based on X_names.***

### Parameters

#### X_names
A list of strings containing names for each predictor in the ***X*** matrix that the model was trained on.

## Method: calculate_local_feature_importance(X:npt.ArrayLike)

***Returns a numpy matrix containing local feature importance for new data by each predictor in X.***

### Parameters

#### X
A numpy matrix with predictor values.

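A sketch continuing the same hypothetical example; the exact layout of the returned matrix is an assumption (one row per observation, one column per predictor):

```python
# Local feature importance for the new observations.
local_importance = model.calculate_local_feature_importance(X_new)

# Assumed layout: rows correspond to observations in X_new,
# columns to predictors.
print(local_importance.shape)
```
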
## Method: calculate_local_feature_importance_for_terms(X:npt.ArrayLike)

***Returns a numpy matrix containing local feature importance for new data by each term in the model.***

### Parameters

#### X
A numpy matrix with predictor values.

## Method: calculate_terms(X:npt.ArrayLike)

***Returns a numpy matrix containing values of model terms calculated on X.***

### Parameters

#### X
A numpy matrix with predictor values.

## Method: get_term_names()

***Returns a list of strings containing term names.***

## Method: get_term_coefficients()

***Returns a numpy vector containing term regression coefficients.***

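Together these two methods expose the fitted model's structure. A sketch that prints each term with its coefficient, assuming the two return values align one-to-one:

```python
# Pair each term name with its regression coefficient (assumed aligned).
for name, coefficient in zip(model.get_term_names(), model.get_term_coefficients()):
    print(f"{name}: {coefficient}")
```
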
## Method: get_term_coefficient_steps(term_index:int)

***Returns a numpy vector containing the regression coefficient at each boosting step for the selected term.***

### Parameters

#### term_index
The index of the term selected: ***0*** is the first term, ***1*** is the second term, and so on.

## Method: get_validation_error_steps()

***Returns a numpy vector containing the validation error by boosting step. Use this to determine if the maximum number of boosting steps (m) or learning rate (v) should be changed.***

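For example, a sketch of this diagnostic, assuming the returned vector covers every boosting step attempted:

```python
import numpy as np

validation_errors = model.get_validation_error_steps()
best_step = int(np.argmin(validation_errors))

# If the minimum sits at the very end, validation error may still have been
# improving when boosting stopped; consider increasing m (or the learning
# rate v).
if best_step >= len(validation_errors) - 1:
    print("Consider increasing m or v.")
```
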
## Method: get_feature_importance()

***Returns a numpy vector containing the feature importance (estimated on the validation set) of each predictor.***

## Method: get_intercept()

***Returns the regression coefficient of the intercept term.***

## Method: get_intercept_steps()

***Returns a numpy vector containing the regression coefficients of the intercept term by boosting step.***

## Method: get_m()

***Returns the number of boosting steps in the model (the value that minimized validation error).***

README.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -2,19 +2,19 @@
 Automatic Piecewise Linear Regression.

 # About
-Build predictive and interpretable parametric machine learning models in Python based on the Automatic Piecewise Linear Regression (APLR) methodology developed by Mathias von Ottenbreit. APLR is often able to compete with Random Forest on predictiveness, but unlike Random Forest and other tree-based methods APLR is interpretable. See the documentation folder for more information.
+Build predictive and interpretable parametric machine learning models in Python based on the Automatic Piecewise Linear Regression (APLR) methodology developed by Mathias von Ottenbreit. APLR is often able to compete with Random Forest on predictiveness, but unlike Random Forest and other tree-based methods APLR is interpretable. See the ***documentation*** folder for more information.

 # How to install
-pip install aplr
+***pip install aplr***

 # Availability
 Currently available for Windows and most Linux distributions.

 # How to use
-Please see the two example Python scripts in the examples folder. They cover common use cases, but not all of the functionality in this package. For example, fitting with user-specified observation weights is possible but the example scripts do not use this functionality.
+Please see the two example Python scripts in the ***examples*** folder. They cover common use cases, but not all of the functionality in this package. For example, fitting with user-specified observation weights is possible but the example scripts do not use this functionality.

 # Sponsorship
 Please consider sponsoring Ottenbreit Data Science by clicking on the Sponsor button. Sufficient funding will enable maintenance of APLR and further development.

 # API reference
-A thorough API reference will be provided in the future.
+Please see ***API_REFERENCE.md***
```

aplr/aplr.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -5,7 +5,7 @@


 class APLRRegressor():
-    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=100, max_interactions:int=0, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0):
+    def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaussian", link_function:str="identity", n_jobs:int=0, validation_ratio:float=0.2, intercept:float=np.nan, bins:int=300, max_interaction_level:int=1, max_interactions:int=100000, min_observations_in_split:int=20, ineligible_boosting_steps_added:int=10, max_eligible_terms:int=5, verbosity:int=0, tweedie_power:float=1.5):
         self.m=m
         self.v=v
         self.random_state=random_state
@@ -21,6 +21,7 @@ def __init__(self, m:int=1000, v:float=0.1, random_state:int=0, family:str="gaus
         self.ineligible_boosting_steps_added=ineligible_boosting_steps_added
         self.max_eligible_terms=max_eligible_terms
         self.verbosity=verbosity
+        self.tweedie_power=tweedie_power

         #Creating aplr_cpp and setting parameters
         self.APLRRegressor=aplr_cpp.APLRRegressor()
@@ -43,6 +44,7 @@ def __set_params_cpp(self):
         self.APLRRegressor.ineligible_boosting_steps_added=self.ineligible_boosting_steps_added
         self.APLRRegressor.max_eligible_terms=self.max_eligible_terms
         self.APLRRegressor.verbosity=self.verbosity
+        self.APLRRegressor.tweedie_power=self.tweedie_power

     def fit(self, X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[]):
         self.__set_params_cpp()
@@ -89,7 +91,7 @@ def get_m(self)->int:

     #For sklearn
     def get_params(self, deep=True):
-        return {"m": self.m, "v": self.v,"random_state":self.random_state,"family":self.family,"link_function":self.link_function,"n_jobs":self.n_jobs,"validation_ratio":self.validation_ratio,"intercept":self.intercept,"bins":self.bins,"max_interaction_level":self.max_interaction_level,"max_interactions":self.max_interactions,"verbosity":self.verbosity,"min_observations_in_split":self.min_observations_in_split,"ineligible_boosting_steps_added":self.ineligible_boosting_steps_added,"max_eligible_terms":self.max_eligible_terms}
+        return {"m": self.m, "v": self.v,"random_state":self.random_state,"family":self.family,"link_function":self.link_function,"n_jobs":self.n_jobs,"validation_ratio":self.validation_ratio,"intercept":self.intercept,"bins":self.bins,"max_interaction_level":self.max_interaction_level,"max_interactions":self.max_interactions,"verbosity":self.verbosity,"min_observations_in_split":self.min_observations_in_split,"ineligible_boosting_steps_added":self.ineligible_boosting_steps_added,"max_eligible_terms":self.max_eligible_terms,"tweedie_power":self.tweedie_power}

     #For sklearn
     def set_params(self, **parameters):
```
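Because ***get_params*** and ***set_params*** follow the scikit-learn estimator convention, hyperparameter search with scikit-learn tools should plausibly work. A minimal sketch, assuming scikit-learn is installed and that cross-validated scoring via ***fit***/***predict*** is sufficient; the data and grid values are synthetic illustrations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from aplr import APLRRegressor

# Synthetic illustration data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# Grid over two hyperparameters that the API reference suggests tuning.
param_grid = {
    "max_interaction_level": [0, 1, 2],
    "min_observations_in_split": [20, 50, 100],
}
search = GridSearchCV(
    APLRRegressor(m=1000, v=0.1),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```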
