
Commit 37f38b4

Author: Amin Mantrach
add criteo example
1 parent 92a5eae commit 37f38b4

File tree

5 files changed: +489 -0 lines changed


examples/criteo/README.md

Lines changed: 124 additions & 0 deletions
# Learning Click-Through Rate at Scale with TensorFlow on Spark

## Introduction

This project shows how to learn a click-through-rate (CTR) model at scale using TensorFlowOnSpark.
Criteo released a 1TB dataset: http://labs.criteo.com/2013/12/download-terabyte-click-logs/
To promote Google Cloud technology, Google published a solution for training a model at scale on their
proprietary platform: https://cloud.google.com/blog/big-data/2017/02/using-google-cloud-machine-learning-to-predict-clicks-at-scale

Instead, we propose a solution based on open source technology that can be leveraged on any cloud,
or on a private cluster running Spark.

We demonstrate how TensorFlowOnSpark (https://github.com/yahoo/TensorFlowOnSpark) can be used to reach the state of the art in predicting the probability of a click at scale.
Note that the goal here is not to produce the best pCTR predictor, but rather to establish an open method that still matches the best performance published so far on this dataset.
Hence, our solution remains very simple and relies solely on basic feature extraction, cross-features, and feature hashing, all fed into a logistic regression model, as sketched below.
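
To make that recipe concrete, here is a rough, self-contained sketch of what hashing, cross-features, and logistic regression can look like in TensorFlow 1.x. It is not the code shipped with this example (see `criteo_dist.py` for that); the bucket count, the single illustrative cross-feature, and the learning rate are assumptions made for the sketch.

```
import tensorflow as tf

NUM_BUCKETS = 2 ** 24  # assumed hash-space size, not taken from this repository

def hashed_feature_ids(int_feats, cat_feats):
    """Map one example's raw features plus a sample cross-feature to bucket ids.

    The terabyte logs have 13 integer and 26 categorical columns; the single
    cross-feature below is only an example.
    NB: Python's hash() is not stable across processes; a real pipeline would
    use a deterministic hash such as MurmurHash.
    """
    raw = ["I%d=%s" % (i, v) for i, v in enumerate(int_feats)]
    raw += ["C%d=%s" % (i, v) for i, v in enumerate(cat_feats)]
    raw.append("C0xC1=%s_%s" % (cat_feats[0], cat_feats[1]))  # cross-feature
    return [hash(f) % NUM_BUCKETS for f in raw]

# Logistic regression over the hashed buckets (TensorFlow 1.x graph style).
NUM_FEATURES = 13 + 26 + 1                      # raw columns plus one cross
ids = tf.placeholder(tf.int64, [None, NUM_FEATURES])
labels = tf.placeholder(tf.float32, [None, 1])

weights = tf.Variable(tf.zeros([NUM_BUCKETS]))
bias = tf.Variable(0.0)

# The logit of an example is the sum of the weights of its active buckets.
logits = tf.reduce_sum(tf.gather(weights, ids), axis=1, keep_dims=True) + bias
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdagradOptimizer(0.05).minimize(loss)
pctr = tf.sigmoid(logits)  # predicted probability of a click
```

In practice the hashed ids would be computed in the Spark input pipeline and fed to the graph in mini-batches.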

## Install and test TensorFlowOnSpark

Before using this code, please make sure you can install TensorFlowOnSpark on your cluster and
run the MNIST example as illustrated here:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN
In doing so, make sure you have set up the following environment variables correctly:

```
export JAVA_HOME=
export HADOOP_HOME=
export SPARK_HOME=
export HADOOP_HDFS_HOME=
export PYTHON_ROOT=./Python
export PATH=${PATH}:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${HADOOP_HDFS_HOME}/bin:${PYTHON_ROOT}/bin
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python"
export QUEUE=default
export LIB_HDFS=
export LIB_JVM=
```

## Data set

The raw data can be accessed here: http://labs.criteo.com/2013/12/download-terabyte-click-logs/

### Download the data set

```
for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23; do
  curl -O http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_${i}.gz
  aws s3 mv day_${i}.gz s3://criteo-display-ctr-dataset/released/
done
```

### Upload the training data to your AWS S3 bucket using Pig

```
%declare awskey yourkey
%declare awssecretkey yoursecretkey
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
train_data = load 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/released/day_{[0-9],1[0-9],2[0-2]}.gz';
train_data = FOREACH (GROUP train_data BY ROUND(10000 * RANDOM()) PARALLEL 10000) GENERATE FLATTEN(train_data);
store train_data into 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/data/training/' using PigStorage();
```
Here we split the training data into 10,000 chunks, which allows TensorFlowOnSpark to reduce its memory usage.
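
For reference, each record in these chunks is one tab-separated line of the terabyte logs: the click label, followed by 13 integer features and 26 hashed categorical features, with missing values left empty. A minimal parsing helper (the function name and the zero/empty-string defaults are our own choices, not taken from this repository) might look like:

```
def parse_criteo_line(line):
    """Split one tab-separated record of the Criteo terabyte click logs.

    Layout: <label> TAB 13 integer features TAB 26 categorical features;
    missing values appear as empty strings and are defaulted to 0 here.
    """
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    integer_feats = [int(v) if v else 0 for v in fields[1:14]]
    categorical_feats = fields[14:40]
    return label, integer_feats, categorical_feats
```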

### Upload the validation data to your AWS S3 bucket using Pig

```
%declare awskey yourkey
%declare awssecretkey yoursecretkey
SET mapred.output.compress 'true';
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
validation_data = load 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/released/day_23.gz';
validation_data = FOREACH (GROUP validation_data BY ROUND(100 * RANDOM()) PARALLEL 100) GENERATE FLATTEN(validation_data);
store validation_data into 's3n://${awskey}:${awssecretkey}@criteo-display-ctr-dataset/data/validation' using PigStorage();
```

## Running the example

Set up the task variables:
```
export TRAINING_DATA=hdfs_path_to_training_data_directory
export VALIDATION_DATA=hdfs_path_to_validation_data_directory
export MODEL_OUTPUT=hdfs://default/tmp/criteo_ctr_prediction
```

Run command:

```
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue ${QUEUE} \
  --num-executors 12 \
  --executor-memory 27G \
  --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/criteo/spark/criteo_dist.py \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --archives hdfs:///user/${USER}/Python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH="$LIB_HDFS:$LIB_JVM" \
  --conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HDFS_HOME" \
  --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
  TensorFlowOnSpark/examples/criteo/spark/criteo_spark.py \
  --mode train \
  --data ${TRAINING_DATA} \
  --validation ${VALIDATION_DATA} \
  --steps 1000000 \
  --model ${MODEL_OUTPUT} --tensorboard \
  --tensorboardlogdir ${MODEL_OUTPUT}
```

## TensorBoard tracking

By connecting to the Web UI tracker of your application,
you will be able to retrieve the TensorBoard URL from the stdout of the driver:

```
TensorBoard running at: http://10.4.112.234:36911
```

You can then track the training loss and validation loss:

![Alt Text](resources/data/TensorBoard-TFonSpark-Criteo-04.png)

examples/criteo/spark/__init__.py

Whitespace-only changes.
