Updated API reference and code example

mathias-von-ottenbreit · mathias-von-ottenbreit · commit 544d6947e2dd · 2022-12-23T20:50:18.000+01:00
diff --git a/API_REFERENCE.md b/API_REFERENCE.md
@@ -14,10 +14,10 @@ The learning rate. Must be greater than zero and not more than one. The higher t
 Used to randomly split training observations into training and validation if ***validation_set_indexes*** is not specified when fitting.
 
 #### family (default = "gaussian")
-Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma" and "tweedie". This is used together with ***link_function***. ***family*** is not intended to be a tuning parameter because it defines how the loss function is calculated. However, if you wish to tune it then the method ***get_validation_group_mse()*** provides a useful tuning metric.
+Determines the loss function used. Allowed values are "gaussian", "binomial", "poisson", "gamma" and "tweedie". This is used together with ***link_function***. 
 
 #### link_function (default = "identity")
-Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***family*** "gaussian" and ***link_function*** "identity". For logistic regression use ***family*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma" or "tweedie" ***family***, depending on the data. The ***family*** "poisson", "gamma" or "tweedie" should only be used with the "log" ***link_function***. Inappropriate combinations of ***family*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge. ***link_function*** is not intended to be a tuning parameter because it defines the model structure. However, if you wish to tune it then the method ***get_validation_group_mse()*** provides a useful tuning metric.
+Determines how the linear predictor is transformed to predictions. Allowed values are "identity", "logit" and "log". For an ordinary regression model use ***family*** "gaussian" and ***link_function*** "identity". For logistic regression use ***family*** "binomial" and ***link_function*** "logit". For a multiplicative model use the "log" ***link_function***. The "log" ***link_function*** often works best with a "poisson", "gamma" or "tweedie" ***family***, depending on the data. The ***family*** "poisson", "gamma" or "tweedie" should only be used with the "log" ***link_function***. Inappropriate combinations of ***family*** and ***link_function*** may result in a warning message when fitting the model and/or a poor model fit. Please note that values other than "identity" typically require a significantly higher ***m*** (or ***v***) in order to converge.
 
 #### n_jobs (default = 0)
 Multi-threading parameter. If ***0*** then uses all available cores for multi-threading. Any other positive integer specifies the number of cores to use (***1*** means single-threading).
@@ -50,10 +50,10 @@ Limits 1) the number of terms already in the model that can be considered as int
 ***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.
 
 #### tweedie_power (default = 1.5)
-Species the variance power for the "tweedie" ***family***. It can be useful to tune this hyperparameter. The method ***get_validation_group_mse()*** provides a tuning metric that ***tweedie_power*** can be tuned on.
+Specifies the variance power for the "tweedie" ***family***. It can be useful to tune this hyperparameter. The method ***get_validation_group_mse()*** provides an experimental tuning metric for this.
 
 #### group_size_for_validation_group_mse (default = 100)
-APLR calculates mean squared error on grouped data in the validation set. This can be useful for comparing models that have different ***family*** or ***tweedie_power*** parameters. The maximum number of observations in each group is specified by  ***group_size_for_validation_group_mse***. Some of the observations with the lowest or highest response values will belong to groups with less than   ***group_size_for_validation_group_mse*** observations. The minimum number of observations in a group is ***group_size_for_validation_group_mse/2***. If ***group_size_for_validation_group_mse*** is equal to or higher than the number of observations in the validation set, then there will only be one group (in this case the grouped validation MSE is not so useful). ***group_size_for_validation_group_mse*** should be large enough so that the Central Limit Theorem holds (at least 60, but 100 is a safer choice). Also, the number of observations in the validation set should be substantially higher than ***group_size_for_validation_group_mse*** for group validation MSE to be useful.
+APLR calculates an experimental tuning metric, mean squared error on grouped data in the validation set. This may be useful for tuning ***tweedie_power***. The maximum number of observations in a group is specified by ***group_size_for_validation_group_mse***. The minimum number of observations in a group is approximately half of that. If ***group_size_for_validation_group_mse*** is equal to or higher than the number of observations in the validation set, then there will only be one group (in this case the grouped validation MSE is less useful). ***group_size_for_validation_group_mse*** should be large enough so that the Central Limit Theorem holds (at least 60, but 100 is a safer choice). Also, the number of observations in the validation set should be substantially higher than ***group_size_for_validation_group_mse***.
 
 
 ## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[])
diff --git a/examples/train_aplr_validation.py b/examples/train_aplr_validation.py
@@ -38,7 +38,7 @@
     model = APLRRegressor(random_state=random_state,verbosity=3,m=1000,v=0.1,family=family,link_function=link_function,**params) 
     model.fit(data_train[predictors].values,data_train[response].values,X_names=predictors)
     validation_error_for_this_model=np.min(model.get_validation_error_steps())
-    #validation_error_for_this_model=model.get_validation_group_mse() #Use this if you wish to tune tweedie_power, family or link_function.
+    #validation_error_for_this_model=model.get_validation_group_mse() #You may try this experimental metric to tune tweedie_power
     validation_results_for_this_model=pd.DataFrame(model.get_params(),index=[0])
     validation_results_for_this_model["validation_error"]=validation_error_for_this_model
     validation_results=pd.concat([validation_results,validation_results_for_this_model])