documentation

mathias-von-ottenbreit · mathias-von-ottenbreit · commit 3460c31b1f3d · 2022-12-24T12:50:29.000+01:00
diff --git a/API_REFERENCE.md b/API_REFERENCE.md
@@ -50,10 +50,10 @@ Limits 1) the number of terms already in the model that can be considered as int
 ***0*** does not print progress reports during fitting. ***1*** prints a summary after running the ***fit*** method. ***2*** prints a summary after each boosting step.
 
 #### tweedie_power (default = 1.5)
-Species the variance power for the "tweedie" ***family***. It can be useful to tune this hyperparameter. The method ***get_validation_group_mse()*** provides an experimental tuning metric for this.
+Specifies the variance power for the "tweedie" ***family***.
 
 #### group_size_for_validation_group_mse (default = 100)
-APLR calculates an experimental tuning metric, mean squared error on grouped data in the validation set. This may be useful for tuning ***tweedie_power***. The maximum number of observations in a group is specified by ***group_size_for_validation_group_mse***. The minimum number of observations in a group is approximately half of that. If ***group_size_for_validation_group_mse*** is equal to or higher than the number of observations in the validation set, then there will only be one group (in this case the grouped validation MSE is less useful). ***group_size_for_validation_group_mse*** should be large enough so that the Central Limit Theorem holds (at least 60, but 100 is a safer choice). Also, the number of observations in the validation set should be substantially higher than ***group_size_for_validation_group_mse***.
+APLR calculates a tuning metric: the mean squared error (MSE) for groups of observations in the validation set. The method ***get_validation_group_mse()*** returns this metric. It may be useful for tuning ***tweedie_power*** and, to some extent, ***family*** or ***link_function***. The reasoning is that although MSE can be inappropriate for evaluating, for example, Tweedie-distributed responses, it is often appropriate for normally distributed data. According to the Central Limit Theorem (CLT), the summed response of a group of observations is approximately normally distributed if the group contains enough observations, even if the response of an individual observation follows a different probability distribution. Ideally, ***group_size_for_validation_group_mse*** should be large enough for the CLT to hold (at least 30, but the default of 100 is a safer choice). Also, the number of observations in the validation set should be substantially higher than ***group_size_for_validation_group_mse***.
 
 
 ## Method: fit(X:npt.ArrayLike, y:npt.ArrayLike, sample_weight:npt.ArrayLike = np.empty(0), X_names:List[str]=[], validation_set_indexes:List[int]=[])
diff --git a/examples/train_aplr_validation.py b/examples/train_aplr_validation.py
@@ -38,7 +38,7 @@
     model = APLRRegressor(random_state=random_state,verbosity=3,m=1000,v=0.1,family=family,link_function=link_function,**params) 
     model.fit(data_train[predictors].values,data_train[response].values,X_names=predictors)
     validation_error_for_this_model=np.min(model.get_validation_error_steps())
-    #validation_error_for_this_model=model.get_validation_group_mse() #Experimental metric for tuning tweedie_power
+    #validation_error_for_this_model=model.get_validation_group_mse() #Metric that may be useful for tuning tweedie_power, family or link_function.
     validation_results_for_this_model=pd.DataFrame(model.get_params(),index=[0])
     validation_results_for_this_model["validation_error"]=validation_error_for_this_model
     validation_results=pd.concat([validation_results,validation_results_for_this_model])
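
The grouped-MSE idea described in the API_REFERENCE.md hunk above can be sketched roughly as follows. This is only an illustration and not APLR's implementation: the helper name `grouped_mse`, forming groups from consecutive observations in the order given, and comparing per-group sums of the response and the predictions are all assumptions made for the example. The diff does not say how APLR actually forms its groups.

```python
import numpy as np


def grouped_mse(y_true, y_pred, group_size=100):
    """Hypothetical helper: MSE between per-group sums of y_true and y_pred.

    Assumption: groups are consecutive chunks of the inputs in the order given.
    APLR's own grouping logic may differ.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    # At least one group; if group_size >= number of observations, exactly one.
    n_groups = max(1, int(np.ceil(y_true.size / group_size)))

    # np.array_split yields groups whose sizes differ by at most one.
    true_sums = np.array([g.sum() for g in np.array_split(y_true, n_groups)])
    pred_sums = np.array([g.sum() for g in np.array_split(y_pred, n_groups)])

    # By the CLT, group sums are closer to normal than individual responses,
    # which is the motivation for applying MSE at the group level.
    return float(np.mean((true_sums - pred_sums) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.gamma(shape=2.0, scale=1.0, size=1000)  # skewed, non-normal response
    pred = np.full(1000, 2.0)                       # constant prediction at the true mean
    print(grouped_mse(y, pred, group_size=100))
```

With a validation set that is not much larger than `group_size`, the function averages over very few groups, which is consistent with the advice that the validation set should be substantially larger than ***group_size_for_validation_group_mse***.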