
Commit e20fde0

FAQ (minor modifications 3)
[skip ci]
1 parent 7d2e54b commit e20fde0

README.md

Lines changed: 38 additions & 58 deletions
@@ -5,7 +5,7 @@
 [![PyPI pyversions](https://img.shields.io/pypi/pyversions/vecstack.svg)](https://pypi.python.org/pypi/vecstack/)
 
 # vecstack
-Python package for stacking featuring lightweight ***functional API*** and fully compatible ***scikit-learn API***
+Python package for stacking (stacked generalization) featuring lightweight ***functional API*** and fully compatible ***scikit-learn API***
 Convenient way to automate OOF computation, prediction and bagging using any number of models
 
 * [Functional API](https://github.com/vecxoz/vecstack#usage-functional-api):
@@ -35,7 +35,7 @@ Convenient way to automate OOF computation, prediction and bagging using any num
 * [Scikit-learn API](https://github.com/vecxoz/vecstack#usage-scikit-learn-api)
 * Tutorials:
   * [Stacking concept + Pictures + Stacking implementation from scratch](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
-* Examples:
+* Examples (all examples are valid for both APIs with only a little [difference in parameters](https://github.com/vecxoz/vecstack#21-how-do-parameters-of-stacking-function-and-stackingtransformer-correspond)):
   * Functional API:
     * [Regression](https://github.com/vecxoz/vecstack/blob/master/examples/01_regression.ipynb)
     * [Classification with class labels](https://github.com/vecxoz/vecstack/blob/master/examples/02_classification_with_class_labels.ipynb)
@@ -113,14 +113,14 @@ S_test = stack.transform(X_test)
 4. [What is stacking?](https://github.com/vecxoz/vecstack#4-what-is-stacking)
 5. [What about stacking name?](https://github.com/vecxoz/vecstack#5-what-about-stacking-name)
 6. [Do I need stacking at all?](https://github.com/vecxoz/vecstack#6-do-i-need-stacking-at-all)
-7. [Can you explain stacking in 10 lines of code?](https://github.com/vecxoz/vecstack#7-can-you-explain-stacking-in-10-lines-of-code)
+7. [Can you explain stacking (stacked generalization) in 10 lines of code?](https://github.com/vecxoz/vecstack#7-can-you-explain-stacking-stacked-generalization-in-10-lines-of-code)
 8. [Why do I need complicated inner procedure for stacking?](https://github.com/vecxoz/vecstack#8-why-do-i-need-complicated-inner-procedure-for-stacking)
-9. [I want to implement stacking from scratch. Can you help me?](https://github.com/vecxoz/vecstack#9-i-want-to-implement-stacking-from-scratch-can-you-help-me)
+9. [I want to implement stacking (stacked generalization) from scratch. Can you help me?](https://github.com/vecxoz/vecstack#9-i-want-to-implement-stacking-stacked-generalization-from-scratch-can-you-help-me)
 10. [What is OOF?](https://github.com/vecxoz/vecstack#10-what-is-oof)
 11. [What are *estimator*, *learner*, *model*?](https://github.com/vecxoz/vecstack#11-what-are-estimator-learner-model)
 12. [What is *blending*? How is it related to stacking?](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking)
-13. [How to optimize weights for blending?](https://github.com/vecxoz/vecstack#13-how-to-optimize-weights-for-blending)
-14. [What is better: *blending* (weighted average) or *stacking* (2nd level model)?](https://github.com/vecxoz/vecstack#14-what-is-better-blending-weighted-average-or-stacking-2nd-level-model)
+13. [How to optimize weights for weighted average?](https://github.com/vecxoz/vecstack#13-how-to-optimize-weights-for-weighted-average)
+14. [What is better: weighted average for current level or additional level?](https://github.com/vecxoz/vecstack#14-what-is-better-weighted-average-for-current-level-or-additional-level)
 15. [What is *bagging*? How is it related to stacking?](https://github.com/vecxoz/vecstack#15-what-is-bagging-how-is-it-related-to-stacking)
 16. [How many models should I use on a given stacking level?](https://github.com/vecxoz/vecstack#16-how-many-models-should-i-use-on-a-given-stacking-level)
 17. [How many stacking levels should I use?](https://github.com/vecxoz/vecstack#17-how-many-stacking-levels-should-i-use)
@@ -162,21 +162,25 @@ Just give me a star in the top right corner of the repository page.
 
 ### 4. What is stacking?
 
-Stacking is a machine learning ensembling technique.
+Stacking (stacked generalization) is a machine learning ensembling technique.
 The main idea is to use predictions as features.
-More specifically we predict train set (in CV-like fashion) and test set using some 1st level model(s), and then use these predictions as features for 2nd level model. You can find more details (concept, pictures, code) in [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb). Also check out Wikipedia article about [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking).
+More specifically, we predict the train set (in CV-like fashion) and the test set using some 1st level model(s), and then use these predictions as features for the 2nd level model. You can find more details (concept, pictures, code) in the [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb).
+Also make sure to check out:
+* [Ensemble Learning](https://en.wikipedia.org/wiki/Ensemble_learning) ([Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) in Wikipedia
+* The classical [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)
+* The [Stacked Generalization](https://www.researchgate.net/publication/222467943_Stacked_Generalization) paper by David H. Wolpert
 
 ### 5. What about stacking name?
 
-Sometimes it is also called *stacked generalization*. The term is derived from the verb *to stack* (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on predictions of other models. From another point of view we can say that we stack predictions in order to use them as features.
+It is often also called *stacked generalization*. The term is derived from the verb *to stack* (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on the predictions of other models. From another point of view, we can say that we stack predictions in order to use them as features.
 
 ### 6. Do I need stacking at all?
 
 It depends on the specific business case. The main thing to know about stacking is that it requires ***significant computing resources***. The [No Free Lunch Theorem](https://en.wikipedia.org/wiki/There_ain%27t_no_such_thing_as_a_free_lunch) applies as always. Stacking can give you an improvement, but at a certain price (deployment, computation, maintenance). Only an experiment for the given business case will tell you whether it is worth the effort and money.
 
-At current point large part of stacking users are participants of machine learning competitions. On Kaggle you can't go too far without stacking or [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking). I can secretly tell you that at least top half of leaderboard in pretty much any competition uses stacking or [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking) is some way. Stacking is less popular in production due to time and resource constraints, but I think it gains popularity.
+At this point a large part of stacking users are participants of machine learning competitions. On Kaggle you can't go too far without ensembling. I can secretly tell you that at least the top half of the leaderboard in pretty much any competition uses stacking in some way. Stacking is less popular in production due to time and resource constraints, but I think it is gaining popularity.
 
-### 7. Can you explain stacking in 10 lines of code?
+### 7. Can you explain stacking (stacked generalization) in 10 lines of code?
 
 [Of course](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
 
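For orientation before the next hunk: the "code above" it refers to is a naive two-level setup along these lines (an illustrative editor's sketch only, assuming scikit-learn estimators and numpy arrays `X_train`, `y_train`, `X_test`; not the exact README code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression

# 1st level: fit on the train set and predict the SAME train set (this is the flaw)
models_L1 = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]
S_train = np.column_stack([model.fit(X_train, y_train).predict(X_train) for model in models_L1])
S_test = np.column_stack([model.predict(X_test) for model in models_L1])

# 2nd level: train on the 1st level predictions
model_L2 = LinearRegression()
model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
```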
@@ -196,93 +200,69 @@ final_prediction = model_L2.predict(S_test)
 
 The code above will give a meaningless result. If we fit on `X_train` we can't just predict `X_train`, because our 1st level model has already seen `X_train`, and its prediction will be overfitted. To avoid overfitting we perform a cross-validation procedure, and in each fold we predict the out-of-fold (OOF) part of `X_train`. You can find more details (concept, pictures, code) in the [stacking tutorial](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb).
 
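A minimal sketch of that OOF procedure for a single 1st level model (editor's illustration using scikit-learn's `KFold`; not the package's actual implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_train = np.zeros(X_train.shape[0])  # OOF predictions for the whole train set
test_preds = []                         # per-fold predictions for the test set

for fit_idx, oof_idx in kf.split(X_train):
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    oof_train[oof_idx] = model.predict(X_train[oof_idx])  # predict only rows this model has not seen
    test_preds.append(model.predict(X_test))

oof_test = np.mean(test_preds, axis=0)  # average the per-fold test predictions
```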
-### 9. I want to implement stacking from scratch. Can you help me?
+### 9. I want to implement stacking (stacked generalization) from scratch. Can you help me?
 
 [Not a problem](https://github.com/vecxoz/vecstack/blob/master/examples/00_stacking_concept_pictures_code.ipynb)
 
 ### 10. What is OOF?
 
-OOF is abbreviation for out-of-fold prediction. It's also known as *OOF features*, *stacked features*, *stacking features*, etc. Basically it means predictions on the part of data that model haven't seen during training.
+OOF is an abbreviation for out-of-fold prediction. It's also known as *OOF features*, *stacked features*, *stacking features*, etc. Basically it means predictions for the part of the train data that the model hasn't seen during training.
 
 ### 11. What are *estimator*, *learner*, *model*?
 
 Basically it is the same thing, meaning *machine learning algorithm*. Often these terms are used interchangeably.
 Speaking about inner stacking mechanics, you should remember that when you have a *single 1st level model* there will be at least `n_folds` separate models *trained in each CV fold* on different subsets of data. See [Q23](https://github.com/vecxoz/vecstack#23-how-to-estimate-stacking-training-time-and-number-of-models-which-will-be-built) for more details.
 
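As a quick illustration of that bookkeeping (hypothetical numbers):

```python
n_models = 3   # 1st level estimators
n_folds = 5    # CV folds
# Each estimator is re-fitted once per fold, so the real number of trained models is:
total_fitted = n_models * n_folds  # 15
```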
 ### 12. What is *blending*? How is it related to stacking?
-
-Basically it is the same thing. Both approaches use predictions as features, but final prediction on the 2nd (final) level is obtained differently.
-* In *stacking* we train 2nd (final) level model (e.g. Linear Regression or Logistic Regression) using predictions of 1st level models as features.
-* In *blending* we compute weighted average of predictions of 1st level models. Of course you can view weighted average as a model too.
 
-Let's look at example.
+Basically it is the same thing. Both approaches use predictions as features.
+Often these terms are used interchangeably.
+The difference is how we generate the features (predictions) for the next level:
+* *stacking*: perform a cross-validation procedure and predict each part of the train set (OOF)
+* *blending*: predict a fixed holdout set
 
-```python
-# Using two 1st level models
-models_L1 = [
-    RandomForestRegressor(),
-    XGBRegressor(),
-]
-
-# Stacking
-S_train, S_test = stacking(models_L1, X_train, y_train, X_test)
-model_L2 = LinearRegression()
-model_L2 = model_L2.fit(S_train, y_train)
-final_predicition = model_L2.predict(S_test)
-
-# Simple blending (bagging) (all weights are equal, can be done without OOF)
-_, S_test = stacking(models_L1, X_train, y_train, X_test, mode='pred')
-final_predicition = np.mean(S_test, axis=1)
-
-# Advanced blending (weights are chosen based on OOF, i.e. optimized)
-# In the code below different weight combinations are computed by hand, but in practice
-# we use special optimization routines like scipy.optimize.minimize
-S_train, S_test = stacking(models_L1, X_train, y_train, X_test)
-# Combine predictions and compute score (columns correspond to 1st level models)
-y_pred_train = S_train[:, 0] * 0.4 + S_train[:, 1] * 0.6
-print(mean_absolute_error(y_train, y_pred_train))
-# Repeat with other weights
-y_pred_train = S_train[:, 0] * 0.3 + S_train[:, 1] * 0.7
-print(mean_absolute_error(y_train, y_pred_train))
-# Final prediction is computed with weights corresponding to the best score on train set
-final_predicition = S_test[:, 0] * 0.3 + S_test[:, 1] * 0.7
-```
+The *vecstack* package supports only *stacking*, i.e. the cross-validation approach. For a given `random_state` value (e.g. 42) the folds (splits) will be the same across all estimators. See also [Q30](https://github.com/vecxoz/vecstack#30-do-folds-splits-have-to-be-the-same-across-estimators-and-stacking-levels-how-does-random_state-work).
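The two feature-generation schemes side by side (an illustrative editor's sketch, assuming scikit-learn utilities and numpy arrays `X_train`, `y_train`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split

# Stacking: OOF predictions eventually cover the WHOLE train set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
S_train = np.zeros(X_train.shape[0])
for fit_idx, oof_idx in kf.split(X_train):
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    S_train[oof_idx] = model.predict(X_train[oof_idx])

# Blending: predictions cover only a FIXED holdout part of the train set
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=0)
model.fit(X_fit, y_fit)
S_holdout = model.predict(X_holdout)  # the next-level model would be trained on this holdout only
```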
 
-### 13. How to optimize weights for blending?
+### 13. How to optimize weights for weighted average?
 
 You can use, for example:
 
 * `scipy.optimize.minimize`
 * `scipy.optimize.differential_evolution`
 
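For instance, a sketch with `scipy.optimize.minimize` (assumes OOF predictions `S_train` and test predictions `S_test` with one column per model, as in the `stacking` output, and a regression metric):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import mean_absolute_error

def loss(weights):
    # MAE of the weighted average of OOF predictions (columns correspond to models)
    return mean_absolute_error(y_train, S_train.dot(weights))

n_models = S_train.shape[1]
w0 = np.full(n_models, 1.0 / n_models)  # start from equal weights
result = minimize(loss, w0, method='SLSQP',
                  bounds=[(0.0, 1.0)] * n_models,                              # non-negative weights
                  constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0})  # weights sum to 1
final_prediction = S_test.dot(result.x)  # apply the optimized weights to test predictions
```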
-### 14. What is better: *blending* (weighted average) or *stacking* (2nd level model)?
+### 14. What is better: weighted average for current level or additional level?
 
-By default you can start from blending. It is easier to apply and more chances that it will give good result. Then you can try stacking which potentially can outperform blending (but not always and not in an easy way). Experiment is your friend.
+By default you can start from a weighted average. It is easier to apply and has more chances to give a good result. Then you can try an additional level, which can potentially outperform the weighted average (but not always and not in an easy way). Experiment is your friend.
 
 ### 15. What is *bagging*? How is it related to stacking?
 
-[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) or Bootstrap aggregating works as follows: generate subsets of training set, train estimator on these subsets and then find average of predictions. When we train several different algorithms on the same data and then find average we can call this bagging as well. See [simple blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking).
+[Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) or Bootstrap aggregating works as follows: generate subsets of the training set, train models on these subsets, and then average the predictions.
+Also the term *bagging* is often used to describe the following approaches:
+* train several different models on the same data and average the predictions
+* train the same model with different random seeds on the same data and average the predictions
+
+So if we run stacking and just average the predictions, it is *bagging*.
 
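A sketch of the seed-averaging flavor (illustrative; any scikit-learn regressor would do):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same model, different seeds: fit on the same data, then average the predictions
predictions = [
    RandomForestRegressor(random_state=seed).fit(X_train, y_train).predict(X_test)
    for seed in range(5)
]
final_prediction = np.mean(predictions, axis=0)
```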
 ### 16. How many models should I use on a given stacking level?
 
 ***Note 1:*** The best architecture can be found only by experiment.
-***Note 2:*** Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (blending) is diversity - low correlation between models.
+***Note 2:*** Always remember that a higher number of levels or models does NOT guarantee a better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.
 
 It depends on many factors like type of problem, type of data, quality of models, correlation of models, expected result, etc.
 Some example configurations are listed below (see also the sketch after this answer).
 * Reasonable starting point:
-  * `L1: 2-10 models -> L2: blend (weighted average) or single model`
+  * `L1: 2-10 models -> L2: weighted (rank) average or single model`
 * Then try to add more 1st level models and an additional level:
-  * `L1: 10-50 models -> L2: 2-10 models -> L3: blend (weighted average)`
+  * `L1: 10-50 models -> L2: 2-10 models -> L3: weighted (rank) average`
 * If you're crunching numbers at Kaggle and decided to go wild:
-  * `L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: blend (weighted average)`
+  * `L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: weighted (rank) average`
 
 You can also find some winning stacking architectures on the [Kaggle blog](http://blog.kaggle.com/), e.g.: [1st place in Homesite Quote Conversion](http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/)
 
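The "reasonable starting point" above, sketched with the functional API (the `stacking` call mirrors the usage shown elsewhere in this README; the model choices are just placeholders):

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from vecstack import stacking

# L1: a few diverse models -> OOF features for the next level
models_L1 = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]
S_train, S_test = stacking(models_L1, X_train, y_train, X_test)

# L2: a single simple model on top of the L1 predictions
model_L2 = LinearRegression()
model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)
```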
 ### 17. How many stacking levels should I use?
 
 ***Note 1:*** The best architecture can be found only by experiment.
-***Note 2:*** Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (blending) is diversity - low correlation between models.
+***Note 2:*** Always remember that a higher number of levels or models does NOT guarantee a better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.
 
 For some example configurations see [Q16](https://github.com/vecxoz/vecstack#16-how-many-models-should-i-use-on-a-given-stacking-level)
 
@@ -292,7 +272,7 @@ Based on experiments and correlation (e.g. Pearson). Less correlated models give
 
 ### 19. I am trying hard but still can't beat my best single model with stacking. What is wrong?
 
-Nothing is wrong. Stacking is advanced complicated technique. It's hard to make it work. ***Solution:*** Try [blending](https://github.com/vecxoz/vecstack#12-what-is-blending-how-is-it-related-to-stacking) first. Blending is much easier to apply and in most cases it will surely outperform your best model. If still no luck - then probably your models are highly correlated.
+Nothing is wrong. Stacking is an advanced, complicated technique. It's hard to make it work. ***Solution:*** make sure to try a weighted (rank) average first instead of an additional level with some advanced models. The average is much easier to apply and in most cases it will surely outperform your best model. If there is still no luck, then probably your models are highly correlated.
 
 ### 20. What should I choose: functional API (`stacking` function) or Scikit-learn API (`StackingTransformer`)?
 