- The goal of the competition is to predict which ad will be clicked on
- See https://www.kaggle.com/c/outbrain-click-prediction for more details
- This is ololo's part of the 13th place solution to the challenge (team "diaman & ololo")
- The presentation of the solution: http://www.slideshare.net/AlexeyGrigorev/outbrain-click-prediction-71724151
- diaman's solution can be found at https://github.com/dselivanov/kaggle-outbrain

This part of the solution is a combination of 5 models:
- SVM and FTRL on basic features:
  - event features: user id, document id, platform id, day, hour and geo
  - ad features: ad document id, campaign, advertiser id
- XGB and ET on MTV (Mean Target Value) features:
  - all categorical features that the previous models used
  - document features like publisher, source, top category, topic and entity
  - interactions between these features
  - also, the document similarity features: the cosine between the ad doc and the page with the ad
- FFM with the following features:
  - all categorical features from the above, except document similarity, categories, topics and entities
  - XGB leaves from the previous step (see slide 9 of the solution presentation for the description of the idea)
- The models are combined with an XGB model (`rank:pairwise` objective)

To get the 13th position, models from diaman should also be added.
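
As a rough illustration of the MTV (Mean Target Value) idea used for the XGB and ET models, the sketch below encodes a categorical feature with the mean of the target computed on a different fold. It is only a minimal sketch in plain Python: the column names (`campaign_id`, `clicked`), the helper names and the global-mean fallback for unseen categories are assumptions made for the example, not necessarily what `4_mean_target_value.py` does.

```python
from collections import defaultdict


def fit_mtv(rows, feature, target="clicked"):
    """Return ({category: mean target}, global mean) computed on `rows`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[feature]] += row[target]
        counts[row[feature]] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    return {cat: sums[cat] / counts[cat] for cat in counts}, global_mean


def apply_mtv(rows, feature, mapping, default):
    """Encode `feature` with means learned elsewhere; unseen categories get `default`."""
    return [mapping.get(row[feature], default) for row in rows]


# Toy data with hypothetical column names, arranged as two folds.
fold_0 = [
    {"campaign_id": "a", "clicked": 1},
    {"campaign_id": "a", "clicked": 0},
    {"campaign_id": "b", "clicked": 0},
    {"campaign_id": "b", "clicked": 0},
]
fold_1 = [
    {"campaign_id": "a", "clicked": 0},
    {"campaign_id": "c", "clicked": 1},
]

# Means are learned on fold 0 and applied to fold 1,
# so no row is encoded using its own label.
mapping, global_mean = fit_mtv(fold_0, "campaign_id")
fold_1_mtv = apply_mtv(fold_1, "campaign_id", mapping, default=global_mean)
```

Computing the means on one fold and applying them to the other is what keeps the encoded feature from leaking the target into the models trained on it.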

- `0_prepare_splits.py` splits the training dataset into two folds
- `1_svm_data.py` prepares the data for SVM and FTRL
- `1_train_ftrl.py` and `1_train_svm.py` train models on data from `1_svm_data.py`
- `2_extract_leaked_docs.py` and `2_leak_features.py` extract the leak
- `3_doc_similarity_features.py` calculates TF-IDF similarity between the document the user is on and the ad document
- `4_categorical_data_join.py` and `4_categorical_data_unwrap_columnwise.py` prepare data for MTV feature calculation
- `4_mean_target_value.py` calculates MTV for all features from `categorical_features.txt`
- `5_best_mtv_features_xgb.py` builds an XGB on a small part of the data and selects the best features to be used for XGB and ET
- `5_mtv_et.py` trains an ET model on MTV features
- `5_mtv_xgb.py` trains an XGB model on MTV features and creates leaf features to be used in FFM
- `6_1_generate_ffm_data.py` creates the input file to be read by libffm
- `6_2_split_ffm_to_subfolds.py` splits each fold into two subfolds (the original folds can't be used because the leaf features are not transferable between folds)
- `6_3_run_ffm.sh` runs libffm for training FFM models
- `6_4_put_ffm_subfolds_together.py` puts FFM predictions from each fold/subfold together
- `7_ensemble_data_prep.py` puts all the features and model predictions together for ensembling
- `7_ensemble_xgb.py` trains the second-level XGB model on top of all these features

The files should be run in the above order.
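
For readers unfamiliar with the input that libffm expects, each line has the form `label field:feature:value`. The sketch below builds such a line from a dictionary of categorical values. It is an illustration only: the `ffm_line` helper, the field names (including the `xgb_tree_0` leaf field), the CRC32 hashing trick and the bin count are assumptions for the example, and `6_1_generate_ffm_data.py` may index features differently.

```python
import zlib


def ffm_line(label, row, fields, n_bins=2 ** 20):
    """Format one example as a libffm line: 'label field:feature:value ...'."""
    tokens = [str(label)]
    for field_idx, name in enumerate(fields):
        # Hash "name=value" into a fixed number of bins to get a feature index.
        raw = "{}={}".format(name, row[name]).encode("utf-8")
        feature_idx = zlib.crc32(raw) % n_bins
        # Categorical features are one-hot, so the value is always 1.
        tokens.append("{}:{}:1".format(field_idx, feature_idx))
    return " ".join(tokens)


# Hypothetical fields: two categorical features plus one XGB leaf index.
fields = ["campaign_id", "publisher_id", "xgb_tree_0"]
row = {"campaign_id": 17, "publisher_id": 401, "xgb_tree_0": 5}

line = ffm_line(1, row, fields)
```

Treating the leaf index of each tree as its own categorical field is one way to feed XGB leaves into FFM.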
diaman's features should be included in 7_ensemble_data_prep.py; the rest can stay unchanged.
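
The second-level XGB model uses the `rank:pairwise` objective, which needs to know which rows belong to the same ranking group. The sketch below computes the group sizes from a column of group ids. It assumes that rows are already sorted so that each group is contiguous and that the group key is the competition's `display_id` column; the parameter values are placeholders, not the ones used in `7_ensemble_xgb.py`.

```python
from itertools import groupby


def group_sizes(group_ids):
    """Sizes of consecutive runs of equal ids (rows must be grouped already)."""
    return [sum(1 for _ in run) for _, run in groupby(group_ids)]


# Toy example: three displays with 3, 2 and 4 candidate ads.
display_ids = [1, 1, 1, 2, 2, 3, 3, 3, 3]
sizes = group_sizes(display_ids)

# Placeholder parameters; only the objective comes from the description above.
params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 6}

# With xgboost installed, the sizes would be attached to the training matrix:
#   dtrain = xgboost.DMatrix(X, label=y)
#   dtrain.set_group(sizes)
#   model = xgboost.train(params, dtrain, num_boost_round=100)
```

With groups set this way, the model is trained to order the ads within each display rather than to predict click probabilities across displays.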