# TF Model Garden Ranking Models

## Overview
This is an implementation of [DLRM](https://arxiv.org/abs/1906.00091) and
[DCN v2](https://arxiv.org/abs/2008.13535) ranking models that can be used for
tasks such as CTR prediction.

The model inputs are numerical and categorical features, and the output is a
scalar (for example, a click probability).
The model can be trained and evaluated on GPU, TPU and CPU. Deep ranking
models are both memory intensive (for embedding tables and lookups) and compute
intensive (for the deep MLP networks). CPUs are best suited for large sparse
embedding lookups, GPUs for fast compute, and TPUs are designed for both.

When training on TPUs we use the
[TPUEmbedding layer](https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/layers/embedding/tpu_embedding_layer.py)
for categorical features. TPU embedding supports large embedding tables with
fast lookup; the size of the embedding tables scales linearly with the size of
the TPU pod. We can have up to 90 GB of embedding tables for TPU v3-8, 5.6 TB
for v3-512, and 22.4 TB for a TPU Pod v3-2048.
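
As a minimal sketch, the snippet below shows how a single categorical feature
might be wired into this layer. The feature name, table name, vocabulary size
and optimizer are illustrative assumptions, not values taken from this
repository.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

# Hypothetical single categorical feature; a real model configures one table
# per categorical feature, with sizes taken from model.vocab_sizes.
table = tf.tpu.experimental.embedding.TableConfig(
    vocabulary_size=39_884_406,  # example vocabulary size
    dim=128,                     # embedding dimension
    combiner='mean',
    initializer=tf.initializers.TruncatedNormal(mean=0.0, stddev=0.02),
    name='cat_feature_0_table')

feature_config = {
    'cat_feature_0': tf.tpu.experimental.embedding.FeatureConfig(table=table),
}

# Under a TPUStrategy the layer shards its tables across the TPU pod and
# performs the lookups on the embedding hardware.
embedding_layer = tfrs.layers.embedding.TPUEmbedding(
    feature_config=feature_config,
    optimizer=tf.tpu.experimental.embedding.Adagrad(learning_rate=0.05))
```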

The model code lives in the
[TensorFlow Recommenders](https://github.com/tensorflow/recommenders/tree/main/tensorflow_recommenders/experimental/models)
library, while the input pipeline, configuration and training loop are here.

## Prerequisites
To get started, download the code from the TensorFlow models GitHub repository
or use the pre-installed Google Cloud VM.

```bash
git clone https://github.com/tensorflow/models.git
export PYTHONPATH=$PYTHONPATH:$(pwd)/models
```

We also need to install the
[TensorFlow Recommenders](https://www.tensorflow.org/recommenders) library.
If you are using [tf-nightly](https://pypi.org/project/tf-nightly/), make
sure to install
[tensorflow-recommenders](https://pypi.org/project/tensorflow-recommenders/)
without its dependencies by passing the `--no-deps` argument.

For tf-nightly:
```bash
pip install tensorflow-recommenders --no-deps
```

For stable TensorFlow 2.4+ [releases](https://pypi.org/project/tensorflow/):
```bash
pip install tensorflow-recommenders
```

## Dataset

The models can be trained on various datasets. Two commonly used ones are the
[Criteo Terabyte](https://labs.criteo.com/2013/12/download-terabyte-click-logs/)
and [Criteo Kaggle](https://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/)
datasets.
We can also train on synthetic data by setting the flag `use_synthetic_data=True`.

### Download

The dataset is the Terabyte click logs dataset provided by Criteo. Follow the
[instructions](https://labs.criteo.com/2013/12/download-terabyte-click-logs/) at
the Criteo website to download the data.

Note that the dataset is large (~1TB).

### Preprocess the data

Data preprocessing steps are summarized below.

Integer feature processing steps, applied sequentially:

1. Missing values are replaced with zeros.
2. Negative values are replaced with zeros.
3. Integer features are transformed by log(x+1) and are hence `tf.float32`.

Categorical features:

1. Categorical data is bucketized to `tf.int32`.
2. Optionally, the resulting integers are hashed to a lower dimensionality.
   This is necessary to reduce the sizes of the large tables. A simple hashing
   function such as a modulus is sufficient, e.g. `feature_value % MAX_INDEX`.

The vocabulary sizes resulting from pre-processing are passed to the model
trainer via the `model.vocab_sizes` config.
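
For illustration, the sketch below applies the integer and categorical
transformations described above. The helper function names and the `MAX_INDEX`
value are hypothetical; they are not part of any official preprocessing script.

```python
import numpy as np

MAX_INDEX = 10_000_000  # hypothetical hash-bucket count for one large table

def transform_integer_features(values):
  """Missing -> 0, negative -> 0, then log(x + 1), kept as float32."""
  x = np.nan_to_num(values.astype(np.float32), nan=0.0)
  x = np.maximum(x, 0.0)
  return np.log1p(x)

def transform_categorical_feature(bucketized_ids):
  """Optionally hashes bucketized ids into a smaller vocabulary via modulus."""
  return bucketized_ids.astype(np.int64) % MAX_INDEX

print(transform_integer_features(np.array([np.nan, -3.0, 7.0])))  # [0. 0. 2.0794415]
print(transform_categorical_feature(np.array([123456789, 42])))   # [3456789 42]
```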

The full dataset is composed of 24 directories. Partition the data into
training and eval sets, for example days 1-23 for training and day 24 for
evaluation.

Training and eval datasets are expected to be saved as many tab-separated
values (TSV) files in the following format: numerical features, categorical
features and the label.

On each row of a TSV file the first `num_dense_features` values are numerical
features, the next values are the categorical features (one per entry of
`vocab_sizes`) and the last value is the label (either 0 or 1). The i-th
categorical feature is expected to be an integer in the range
`[0, vocab_sizes[i])`.
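
For illustration, one way such TSV files could be read with `tf.data` is
sketched below. The file path is hypothetical, and the feature counts are
assumptions matching the Criteo layout used in the configs further down
(13 dense features, 26 categorical features); the repository's own input
pipeline may differ.

```python
import tensorflow as tf

NUM_DENSE_FEATURES = 13   # assumed; matches model.num_dense_features below
NUM_SPARSE_FEATURES = 26  # assumed; matches len(model.vocab_sizes) below

# Per-row layout: dense floats, then categorical ints, then the 0/1 label.
record_defaults = ([tf.constant(0.0, dtype=tf.float32)] * NUM_DENSE_FEATURES +
                   [tf.constant(0, dtype=tf.int32)] * (NUM_SPARSE_FEATURES + 1))

def parse_row(*fields):
  dense = tf.stack(fields[:NUM_DENSE_FEATURES])
  sparse = tf.stack(fields[NUM_DENSE_FEATURES:-1])
  label = fields[-1]
  return {'dense_features': dense, 'sparse_features': sparse}, label

dataset = tf.data.experimental.CsvDataset(
    'gs://my_dlrm_bucket/data/train/part-00000.tsv',  # hypothetical file
    record_defaults=record_defaults,
    field_delim='\t').map(parse_row)
```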

## Train and Evaluate

To train the DLRM model we use the dot product feature interaction, i.e.
`interaction: 'dot'`; to train the DCN v2 model we use `interaction: 'cross'`.
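
These options refer to the DLRM-style dot interaction and the DCN v2 cross
network. The trainer wires the chosen interaction up from the config; the
standalone sketch below only illustrates the corresponding TensorFlow
Recommenders layers, with arbitrary batch size, slot count and embedding
dimension.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

batch_size, num_slots, embedding_dim = 32, 4, 128
# Each slot is a [batch_size, embedding_dim] tensor (for example the bottom-MLP
# output or one embedding lookup result).
slots = [tf.random.normal([batch_size, embedding_dim]) for _ in range(num_slots)]

# interaction: 'dot' takes pairwise dot products between feature slots (DLRM).
dot_out = tfrs.layers.feature_interaction.DotInteraction()(slots)

# interaction: 'cross' uses the DCN v2 cross layer on the concatenated slots.
cross_out = tfrs.layers.dcn.Cross()(tf.concat(slots, axis=1))
```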

### Training on TPU

```shell
export TPU_NAME=my-dlrm-tpu
export EXPERIMENT_NAME=my_experiment_name
export BUCKET_NAME="gs://my_dlrm_bucket"
export DATA_DIR="${BUCKET_NAME}/data"

python3 models/official/recommendation/ranking/train.py --mode=train_and_eval \
--model_dir=${BUCKET_NAME}/model_dirs/${EXPERIMENT_NAME} --params_override="
runtime:
  distribution_strategy: 'tpu'
task:
  use_synthetic_data: false
  train_data:
    input_path: '${DATA_DIR}/train/*'
    global_batch_size: 16384
  validation_data:
    input_path: '${DATA_DIR}/eval/*'
    global_batch_size: 16384
  model:
    num_dense_features: 13
    bottom_mlp: [512,256,128]
    embedding_dim: 128
    top_mlp: [1024,1024,512,256,1]
    interaction: 'dot'
    vocab_sizes: [39884406, 39043, 17289, 7420, 20263, 3, 7120, 1543, 63,
                  38532951, 2953546, 403346, 10, 2208, 11938, 155, 4, 976, 14,
                  39979771, 25641295, 39664984, 585935, 12972, 108, 36]
trainer:
  use_orbit: true
  validation_interval: 90000
  checkpoint_interval: 100000
  validation_steps: 5440
  train_steps: 256054
  steps_per_loop: 1000
"
```

The data directory should have two subdirectories:

* $DATA_DIR/train
* $DATA_DIR/eval


### Training on GPU

Training on GPUs is similar to TPU training. Only the distribution strategy
needs to be updated and the number of GPUs provided (here, 4 GPUs):
| 155 | + |
| 156 | +```shell |
| 157 | +python3 official/recommendation/ranking/main.py --mode=train_and_eval \ |
| 158 | +--model_dir=${BUCKET_NAME}/model_dirs/${EXPERIMENT_NAME} --params_override=" |
| 159 | +runtime: |
| 160 | + distribution_strategy: 'mirrored' |
| 161 | + num_gpus: 4 |
| 162 | +... |
| 163 | +" |
| 164 | +``` |