# Learning Click-Through Rate at Scale with TensorFlow on Spark

## Introduction
This project demonstrates how to learn a click-through rate (CTR) model at scale using TensorFlowOnSpark.
Criteo released a 1TB dataset: http://labs.criteo.com/2013/12/download-terabyte-click-logs/
To promote Google Cloud technology, Google published a solution that trains a model at scale on their
proprietary platform: https://cloud.google.com/blog/big-data/2017/02/using-google-cloud-machine-learning-to-predict-clicks-at-scale

Instead, we propose a solution based on open-source technology that can be leveraged on any cloud,
or on any private cluster running Spark.

We demonstrate how TensorFlowOnSpark (https://github.com/yahoo/TensorFlowOnSpark) can be used to reach the state of the art in predicting the probability of a click at scale.
Note that the goal here is not to produce the best possible pCTR predictor, but rather to establish an open method that still matches the best performance published so far on this dataset.
Hence, our solution remains very simple: it relies solely on basic feature extraction, cross-features, and hashing, all fed into a logistic regression, as sketched below.
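To make the approach concrete, here is a minimal single-machine sketch of the modeling idea (not the distributed training code): raw fields and a few cross-features are hashed into a fixed-size index space, and a logistic regression is trained on the resulting sparse indices with SGD on the log-loss. The hash-space size, learning rate, and choice of crosses below are illustrative assumptions.

```
# Minimal sketch: hashing trick + cross-features + logistic regression.
import math

D = 2 ** 20          # hashed feature space size (assumption)
w = [0.0] * D        # logistic regression weights

def hashed_indices(ints, cats):
    """Map raw fields, plus a few crosses, to indices in [0, D)."""
    idx = [hash(('int', k, v)) % D for k, v in enumerate(ints)]
    idx += [hash(('cat', k, v)) % D for k, v in enumerate(cats)]
    # illustrative cross-features over the first few categorical fields
    idx += [hash(('cross', a, b)) % D for a in cats[:2] for b in cats[2:4]]
    return idx

def predict(idx):
    z = sum(w[i] for i in idx)                   # sparse dot product
    return 1.0 / (1.0 + math.exp(-max(-35.0, min(35.0, z))))

def sgd_step(idx, label, lr=0.05):
    g = predict(idx) - label                     # d(log-loss)/d(logit)
    for i in idx:
        w[i] -= lr * g
```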

## Install and test TensorFlowOnSpark
Before using this code, please make sure you can install TensorFlowOnSpark on your cluster and
run the MNIST example as illustrated here:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN
In doing so, make sure that you have set up the following variables correctly:

```
export JAVA_HOME=
export HADOOP_HOME=
export SPARK_HOME=
export HADOOP_HDFS_HOME=
export PYTHON_ROOT=./Python
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HDFS_HOME}/bin:${SPARK_HOME}/bin:${PYTHON_ROOT}/bin
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python"
export QUEUE=default
export LIB_HDFS=
export LIB_JVM=
```
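Once these variables are in place, a quick way to confirm that the distributed Python environment works is to run a trivial Spark job that imports TensorFlow on each executor. This is a hypothetical sanity check, not part of the TensorFlowOnSpark examples; it assumes you submit it with the same `--archives hdfs:///user/${USER}/Python.zip#Python` option used in the training command below.

```
# tf_import_check.py: verify that every executor can import TensorFlow.
from pyspark import SparkContext

sc = SparkContext(appName="tf_import_check")
versions = (sc.parallelize(range(4), 4)
              .map(lambda _: __import__("tensorflow").__version__)
              .collect())
print(versions)  # expect one identical TF version string per partition
sc.stop()
```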

## Data set

The raw data can be accessed here: http://labs.criteo.com/2013/12/download-terabyte-click-logs/
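Each line of the raw files is tab-separated: a click label (1 for a click, 0 otherwise), 13 integer count features, and 26 hashed categorical features, where empty fields denote missing values. A minimal parsing sketch:

```
# Parse one raw line of the Criteo terabyte logs:
# <label> \t <13 integer features> \t <26 categorical features>,
# with empty fields meaning missing values.
def parse_line(line):
    fields = line.rstrip('\n').split('\t')
    label = int(fields[0])
    ints = [int(x) if x else None for x in fields[1:14]]
    cats = [x if x else None for x in fields[14:40]]
    return label, ints, cats
```
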
### Download the data set
```
for i in $(seq 0 23); do
  curl -O http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_${i}.gz
  aws s3 mv day_${i}.gz s3://criteo-display-ctr-dataset/released/
done
```

### Upload the training data to AWS S3 using Pig

```
%declare awskey yourkey
%declare awssecretkey yoursecretkey
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
train_data = load 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/released/day_{[0-9],1[0-9],2[0-2]}.gz';
train_data = FOREACH (GROUP train_data BY ROUND(10000 * RANDOM()) PARALLEL 10000) GENERATE FLATTEN(train_data);
store train_data into 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/data/training/' using PigStorage();
```
Here we shuffle the training data (days 0-22) into 10,000 random chunks, which allows TensorFlowOnSpark to keep its memory usage low.
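To verify the chunking, a small PySpark job (hypothetical; it assumes your cluster is configured with credentials to read the bucket) can count the resulting partitions and rows:

```
# Count the chunk files (seen as partitions) and rows written by the Pig job.
from pyspark import SparkContext

sc = SparkContext(appName="chunk_check")
data = sc.textFile("s3n://criteo-display-ctr-dataset/data/training/")
print("partitions:", data.getNumPartitions())  # roughly one per chunk file
print("rows:", data.count())
sc.stop()
```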

### Upload the validation data to AWS S3 using Pig
```
%declare awskey yourkey
%declare awssecretkey yoursecretkey
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
validation_data = load 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/released/day_23.gz';
validation_data = FOREACH (GROUP validation_data BY ROUND(100 * RANDOM()) PARALLEL 100) GENERATE FLATTEN(validation_data);
store validation_data into 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/data/validation' using PigStorage();
```
The last day (day 23) is held out and split into 100 chunks for validation.
## Running the example

Set up the task variables (assuming the chunked data has been copied from S3 to HDFS, e.g. with hadoop distcp):
```
export TRAINING_DATA=hdfs_path_to_training_data_directory
export VALIDATION_DATA=hdfs_path_to_validation_data_directory
export MODEL_OUTPUT=hdfs://default/tmp/criteo_ctr_prediction
```
Run the training job:

```
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 12 \
--executor-memory 27G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/criteo/spark/criteo_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$LIB_HDFS:$LIB_JVM" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HDFS_HOME" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
TensorFlowOnSpark/examples/criteo/spark/criteo_spark.py \
--mode train \
--data ${TRAINING_DATA} \
--validation ${VALIDATION_DATA} \
--steps 1000000 \
--model ${MODEL_OUTPUT} --tensorboard \
--tensorboardlogdir ${MODEL_OUTPUT}
```
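After training completes, you can sanity-check what was written to `MODEL_OUTPUT`. The sketch below assumes the example saves standard TensorFlow checkpoints under `--model` (as the TensorFlowOnSpark MNIST example does) and that your TensorFlow build can read HDFS paths given the environment variables above:

```
# Inspect the trained model directory and resolve the latest checkpoint.
import tensorflow as tf

model_dir = "hdfs://default/tmp/criteo_ctr_prediction"   # same as MODEL_OUTPUT
print(tf.gfile.ListDirectory(model_dir))                 # checkpoint + event files
print(tf.train.latest_checkpoint(model_dir))             # e.g. .../model.ckpt-<step>
```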
## TensorBoard tracking

By connecting to the Web UI tracker of your application,
you will be able to retrieve the TensorBoard URL from the stdout of the driver:

```
TensorBoard running at: http://10.4.112.234:36911
```

You can then track the training and validation losses.