
Commit e20fde0

FAQ (minor modifications 3)
[skip ci]
1 parent 7d2e54b commit e20fde0

README.md

Lines changed: 38 additions & 58 deletions
@@ -5,7 +5,7 @@
 [![PyPI pyversions](https://img.shields.io/pypi/pyversions/vecstack.svg)](https://pypi.python.org/pypi/vecstack/)
 
 # vecstack
-Python package for stacking featuring lightweight ***functional API*** and fully compatible ***scikit-learn API***
+Python package for stacking (stacked generalization) featuring lightweight ***functional API*** and fully compatible ***scikit-learn API***
 Convenient way to automate OOF computation, prediction and bagging using any number of models
 
 * [Functional API](https://github.com/vecxoz/vecstack#usage-functional-api):
@@ -35,7 +35,7 @@ Convenient way to automate OOF computation, prediction and bagging using any num
 * [Scikit-learn API](https://github.com/vecxoz/vecstack#usage-scikit-learn-api)
 * Tutorials:
   * [Stacking concept + Pictures + Stacking implementation from scratch](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
-* Examples:
+* Examples (all examples are valid for both APIs with only a little [difference in parameters](https://github.com/vecxoz/vecstack#21-how-do-parameters-of-stacking-function-and-stackingtransformer-correspond)):
   * Functional API:
     * [Regression](https://github.com/vecxoz/vecstack/blob/master/examples/01_regression.ipynb)
     * [Classification with class labels](https://github.com/vecxoz/vecstack/blob/master/examples/02_classification_with_class_labels.ipynb)
@@ -113,14 +113,14 @@ S_test = stack.transform(X_test)
 4. [What is stacking?](https://github.com/vecxoz/vecstack#4-what-is-stacking)
 5. [What about stacking name?](https://github.com/vecxoz/vecstack#5-what-about-stacking-name)
 6. [Do I need stacking at all?](https://github.com/vecxoz/vecstack#6-do-i-need-stacking-at-all)
-7. [Can you explain stacking in 10 lines of code?](https://github.com/vecxoz/vecstack#7-can-you-explain-stacking-in-10-lines-of-code)
+7. [Can you explain stacking (stacked generalization) in 10 lines of code?](https://github.com/vecxoz/vecstack#7-can-you-explain-stacking-stacked-generalization-in-10-lines-of-code)
 8. [Why do I need complicated inner procedure for stacking?](https://github.com/vecxoz/vecstack#8-why-do-i-need-complicated-inner-procedure-for-stacking)
-9. [I want to implement stacking from scratch. Can you help me?](https://github.com/vecxoz/vecstack#9-i-want-to-implement-stacking-from-scratch-can-you-help-me)
+9. [I want to implement stacking (stacked generalization) from scratch. Can you help me?](https://github.com/vecxoz/vecstack#9-i-want-to-implement-stacking-stacked-generalization-from-scratch-can-you-help-me)
 10. [What is OOF?](https://github.com/vecxoz/vecstack#10-what-is-oof)
 11. [What are *estimator*, *learner*, *model*?](https://github.com/vecxoz/vecstack#11-what-are-estimator-learner-model)
 12. [What is *blending*? How is it related to stacking?](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking)
-13. [How to optimize weights for blending?](https://github.com/vecxoz/vecstack#13-how-to-optimize-weights-for-blending)
-14. [What is better: *blending* (weighted average) or *stacking* (2nd level model)?](https://github.com/vecxoz/vecstack#14-what-is-better-blending-weighted-average-or-stacking-2nd-level-model)
+13. [How to optimize weights for weighted average?](https://github.com/vecxoz/vecstack#13-how-to-optimize-weights-for-weighted-average)
+14. [What is better: weighted average for current level or additional level?](https://github.com/vecxoz/vecstack#14-what-is-better-weighted-average-for-current-level-or-additional-level)
 15. [What is *bagging*? How is it related to stacking?](https://github.com/vecxoz/vecstack#15-what-is-bagging-how-is-it-related-to-stacking)
 16. [How many models should I use on a given stacking level?](https://github.com/vecxoz/vecstack#16-how-many-models-should-i-use-on-a-given-stacking-level)
 17. [How many stacking levels should I use?](https://github.com/vecxoz/vecstack#17-how-many-stacking-levels-should-i-use)
@@ -162,21 +162,25 @@ Just give me a star in the top right corner of the repository page.
 
 ### 4. What is stacking?
 
-Stacking is a machine learning ensembling technique.
+Stacking (stacked generalization) is a machine learning ensembling technique.
 The main idea is to use predictions as features.
-More specifically we predict train set (in CV-like fashion) and test set using some 1st level model(s), and then use these predictions as features for 2nd level model. You can find more details (concept, pictures, code) in [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb). Also check out Wikipedia article about [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking).
+More specifically, we predict the train set (in CV-like fashion) and the test set using some 1st level model(s), and then use these predictions as features for the 2nd level model. You can find more details (concept, pictures, code) in the [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb).
+Also make sure to check out:
+* [Ensemble Learning](https://en.wikipedia.org/wiki/Ensemble_learning) ([Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) in Wikipedia
+* The classical [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)
+* The [Stacked Generalization](https://www.researchgate.net/publication/222467943_Stacked_Generalization) paper by David H. Wolpert
 
 ### 5. What about stacking name?
 
-Sometimes it is also called *stacked generalization*. The term is derived from the verb *to stack* (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on predictions of other models. From another point of view we can say that we stack predictions in order to use them as features.
+It is often also called *stacked generalization*. The term is derived from the verb *to stack* (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on the predictions of other models. From another point of view, we can say that we stack predictions in order to use them as features.
 
 ### 6. Do I need stacking at all?
 
 It depends on the specific business case. The main thing to know about stacking is that it requires ***significant computing resources***. The [No Free Lunch Theorem](https://en.wikipedia.org/wiki/There_ain%27t_no_such_thing_as_a_free_lunch) applies as always. Stacking can give you an improvement, but at a certain price (deployment, computation, maintenance). Only an experiment for the given business case will tell you whether it is worth the effort and money.
 
-At current point large part of stacking users are participants of machine learning competitions. On Kaggle you can't go too far without stacking or [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking). I can secretly tell you that at least top half of leaderboard in pretty much any competition uses stacking or [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking) is some way. Stacking is less popular in production due to time and resource constraints, but I think it gains popularity.
+At this point a large part of stacking users are participants of machine learning competitions. On Kaggle you can't go too far without ensembling. I can secretly tell you that at least the top half of the leaderboard in pretty much any competition uses stacking in some way. Stacking is less popular in production due to time and resource constraints, but I think it is gaining popularity.
 
-### 7. Can you explain stacking in 10 lines of code?
+### 7. Can you explain stacking (stacked generalization) in 10 lines of code?
 
 [Of course](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
 
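For orientation before the next hunk: the "code above" it refers to is a naive two-level setup along these lines (an illustrative editor's sketch only, assuming scikit-learn estimators and numpy arrays `X_train`, `y_train`, `X_test`; not the exact README code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression

# 1st level: fit on the train set and predict the SAME train set (this is the flaw)
models_L1 = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]
S_train = np.column_stack([model.fit(X_train, y_train).predict(X_train) for model in models_L1])
S_test = np.column_stack([model.predict(X_test) for model in models_L1])

# 2nd level: train on the 1st level predictions
model_L2 = LinearRegression()
model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
```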
@@ -196,93 +200,69 @@ final_prediction = model_L2.predict(S_test)
 
 The code above will give a meaningless result. If we fit on `X_train` we can't just predict `X_train`, because our 1st level model has already seen `X_train`, and its prediction will be overfitted. To avoid overfitting we perform a cross-validation procedure, and in each fold we predict the out-of-fold (OOF) part of `X_train`. You can find more details (concept, pictures, code) in the [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb).
 
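A minimal sketch of that OOF procedure for a single 1st level model (editor's illustration using scikit-learn's `KFold`; not the package's actual implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_train = np.zeros(X_train.shape[0])  # OOF predictions for the whole train set
test_preds = []                         # per-fold predictions for the test set

for fit_idx, oof_idx in kf.split(X_train):
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    oof_train[oof_idx] = model.predict(X_train[oof_idx])  # predict only rows this model has not seen
    test_preds.append(model.predict(X_test))

oof_test = np.mean(test_preds, axis=0)  # average the per-fold test predictions
```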
-### 9. I want to implement stacking from scratch. Can you help me?
+### 9. I want to implement stacking (stacked generalization) from scratch. Can you help me?
 
 [Not a problem](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
 
 ### 10. What is OOF?
 
-OOF is abbreviation for out-of-fold prediction. It's also known as *OOF features*, *stacked features*, *stacking features*, etc. Basically it means predictions on the part of data that model haven't seen during training.
+OOF is an abbreviation for out-of-fold prediction. It's also known as *OOF features*, *stacked features*, *stacking features*, etc. Basically it means predictions for the part of the train data that the model hasn't seen during training.
 
 ### 11. What are *estimator*, *learner*, *model*?
 
 Basically it is the same thing, meaning *machine learning algorithm*. Often these terms are used interchangeably.
 Speaking about inner stacking mechanics, you should remember that when you have a *single 1st level model* there will be at least `n_folds` separate models *trained in each CV fold* on different subsets of data. See [Q23](https://github.com/vecxoz/vecstack#23-how-to-estimate-stacking-training-time-and-number-of-models-which-will-be-built) for more details.
 
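As a quick illustration of that bookkeeping (hypothetical numbers):

```python
n_models = 3   # 1st level estimators
n_folds = 5    # CV folds
# Each estimator is re-fitted once per fold, so the real number of trained models is:
total_fitted = n_models * n_folds  # 15
```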
 ### 12. What is *blending*? How is it related to stacking?
-
-Basically it is the same thing. Both approaches use predictions as features, but final prediction on the 2nd (final) level is obtained differently.
-* In *stacking* we train 2nd (final) level model (e.g. Linear Regression or Logistic Regression) using predictions of 1st level models as features.
-* In *blending* we compute weighted average of predictions of 1st level models. Of course you can view weighted average as a model too.
 
-Let's look at example.
+Basically it is the same thing. Both approaches use predictions as features.
+Often these terms are used interchangeably.
+The difference is how we generate the features (predictions) for the next level:
+* *stacking*: perform a cross-validation procedure and predict each part of the train set (OOF)
+* *blending*: predict a fixed holdout set
 
-```python
-# Using two 1st level models
-models_L1 = [
-    RandomForestRegressor(),
-    XGBRegressor(),
-]
-
-# Stacking
-S_train, S_test = stacking(models_L1, X_train, y_train, X_test)
-model_L2 = LinearRegression()
-model_L2 = model_L2.fit(S_train, y_train)
-final_predicition = model_L2.predict(S_test)
-
-# Simple blending (bagging) (all weights are equal, can be done without OOF)
-_, S_test = stacking(models_L1, X_train, y_train, X_test, mode='pred')
-final_predicition = np.mean(S_test, axis=1)
-
-# Advanced blending (weights are chosen based on OOF, i.e. optimized)
-# In the code below different weight combinations are computed by hand, but in practice
-# we use special optimization routines like scipy.optimize.minimize
-S_train, S_test = stacking(models_L1, X_train, y_train, X_test)
-# Combine predictions and compute score (columns correspond to 1st level models)
-y_pred_train = S_train[:, 0] * 0.4 + S_train[:, 1] * 0.6
-print(mean_absolute_error(y_train, y_pred_train))
-# Repeat with other weights
-y_pred_train = S_train[:, 0] * 0.3 + S_train[:, 1] * 0.7
-print(mean_absolute_error(y_train, y_pred_train))
-# Final prediction is computed with weights corresponding to the best score on train set
-final_predicition = S_test[:, 0] * 0.3 + S_test[:, 1] * 0.7
-```
+The *vecstack* package supports only *stacking*, i.e. the cross-validation approach. For a given `random_state` value (e.g. 42) the folds (splits) will be the same across all estimators. See also [Q30](https://github.com/vecxoz/vecstack#30-do-folds-splits-have-to-be-the-same-across-estimators-and-stacking-levels-how-does-random_state-work).
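The two feature-generation schemes side by side (an illustrative editor's sketch, assuming scikit-learn utilities and numpy arrays `X_train`, `y_train`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split

# Stacking: OOF predictions eventually cover the WHOLE train set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
S_train = np.zeros(X_train.shape[0])
for fit_idx, oof_idx in kf.split(X_train):
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    S_train[oof_idx] = model.predict(X_train[oof_idx])

# Blending: predictions cover only a FIXED holdout part of the train set
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=0)
model.fit(X_fit, y_fit)
S_holdout = model.predict(X_holdout)  # the next-level model would be trained on this holdout only
```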
 
-### 13. How to optimize weights for blending?
+### 13. How to optimize weights for weighted average?
 
 You can use, for example:
 
 * `scipy.optimize.minimize`
 * `scipy.optimize.differential_evolution`
 
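For instance, a sketch with `scipy.optimize.minimize` (assumes OOF predictions `S_train` and test predictions `S_test` with one column per model, as in the `stacking` output, and a regression metric):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_absolute_error

def loss(weights):
    # MAE of the weighted average of OOF predictions (columns correspond to models)
    return mean_absolute_error(y_train, S_train.dot(weights))

n_models = S_train.shape[1]
w0 = np.full(n_models, 1.0 / n_models)  # start from equal weights
result = minimize(loss, w0, method='SLSQP',
                  bounds=[(0.0, 1.0)] * n_models,                              # non-negative weights
                  constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0})  # weights sum to 1
final_prediction = S_test.dot(result.x)  # apply the optimized weights to test predictions
```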
-### 14. What is better: *blending* (weighted average) or *stacking* (2nd level model)?
+### 14. What is better: weighted average for current level or additional level?
 
-By default you can start from blending. It is easier to apply and more chances that it will give good result. Then you can try stacking which potentially can outperform blending (but not always and not in an easy way). Experiment is your friend.
+By default you can start from a weighted average. It is easier to apply and has more chances to give a good result. Then you can try an additional level, which can potentially outperform the weighted average (but not always and not in an easy way). Experiment is your friend.
 
 ### 15. What is *bagging*? How is it related to stacking?
 
-[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) or Bootstrap aggregating works as follows: generate subsets of training set, train estimator on these subsets and then find average of predictions. When we train several different algorithms on the same data and then find average we can call this bagging as well. See [simple blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking).
+[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) or Bootstrap aggregating works as follows: generate subsets of the training set, train models on these subsets, and then average the predictions.
+Also the term *bagging* is often used to describe the following approaches:
+* train several different models on the same data and average the predictions
+* train the same model with different random seeds on the same data and average the predictions
+
+So if we run stacking and just average the predictions, it is *bagging*.
 
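A sketch of the seed-averaging flavor (illustrative; any scikit-learn regressor would do):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same model, different seeds: fit on the same data, then average the predictions
predictions = [
    RandomForestRegressor(random_state=seed).fit(X_train, y_train).predict(X_test)
    for seed in range(5)
]
final_prediction = np.mean(predictions, axis=0)
```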
 ### 16. How many models should I use on a given stacking level?
 
 ***Note 1:*** The best architecture can be found only by experiment.
-***Note 2:*** Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (blending) is diversity - low correlation between models.
+***Note 2:*** Always remember that a higher number of levels or models does NOT guarantee a better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.
 
 It depends on many factors like type of problem, type of data, quality of models, correlation of models, expected result, etc.
 Some example configurations are listed below (see also the sketch after this answer).
 * Reasonable starting point:
-  * `L1: 2-10 models -> L2: blend (weighted average) or single model`
+  * `L1: 2-10 models -> L2: weighted (rank) average or single model`
 * Then try to add more 1st level models and an additional level:
-  * `L1: 10-50 models -> L2: 2-10 models -> L3: blend (weighted average)`
+  * `L1: 10-50 models -> L2: 2-10 models -> L3: weighted (rank) average`
 * If you're crunching numbers at Kaggle and decided to go wild:
-  * `L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: blend (weighted average)`
+  * `L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: weighted (rank) average`
 
 You can also find some winning stacking architectures on the [Kaggle blog](http://blog.kaggle.com/), e.g.: [1st place in Homesite Quote Conversion](http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/)
 
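The "reasonable starting point" above, sketched with the functional API (the `stacking` call mirrors the usage shown elsewhere in this README; the model choices are just placeholders):

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from vecstack import stacking

# L1: a few diverse models -> OOF features for the next level
models_L1 = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]
S_train, S_test = stacking(models_L1, X_train, y_train, X_test)

# L2: a single simple model on top of the L1 predictions
model_L2 = LinearRegression()
model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
```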
 ### 17. How many stacking levels should I use?
 
 ***Note 1:*** The best architecture can be found only by experiment.
-***Note 2:*** Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (blending) is diversity - low correlation between models.
+***Note 2:*** Always remember that a higher number of levels or models does NOT guarantee a better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.
 
 For some example configurations see [Q16](https://github.com/vecxoz/vecstack#16-how-many-models-should-i-use-on-a-given-stacking-level)
 
@@ -292,7 +272,7 @@ Based on experiments and correlation (e.g. Pearson). Less correlated models give
 
 ### 19. I am trying hard but still can't beat my best single model with stacking. What is wrong?
 
-Nothing is wrong. Stacking is advanced complicated technique. It's hard to make it work. ***Solution:*** Try [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking) first. Blending is much easier to apply and in most cases it will surely outperform your best model. If still no luck - then probably your models are highly correlated.
+Nothing is wrong. Stacking is an advanced, complicated technique. It's hard to make it work. ***Solution:*** make sure to try a weighted (rank) average first instead of an additional level with some advanced models. The average is much easier to apply and in most cases it will surely outperform your best model. If there is still no luck, then probably your models are highly correlated.
 
 ### 20. What should I choose: functional API (`stacking` function) or Scikit-learn API (`StackingTransformer`)?
 