Add an example for DeepFM estimator in deepctr (#2526)

workingloong · web-flow · commit 5217774bdda5 · 2021-09-06T16:57:54.000+08:00
* Add an example for DeepFM estimator in deepctr

* Add a hook to report batch done

* Add a tutorial for DeepCTR estimator models

* Fix by comments

* Fix by comments
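
A minimal sketch of how the shard-driven generator and the batch-report hook added here fit together, for trying the flow without a cluster. `Shard`, `FakeShardService`, and `report_after_each_batch` are hypothetical stand-ins written for this note; they are not part of `elasticai_api`, and the real `DataShardService` talks to the ElasticDL master rather than holding shards in memory.

```python
from collections import namedtuple

# Hypothetical stand-in for a shard; only the `start` and `end`
# attributes used by deepfm_estimator.py are modeled.
Shard = namedtuple("Shard", ["start", "end"])


class FakeShardService:
    """In-memory stand-in for DataShardService (an assumption, not the real API)."""

    def __init__(self, dataset_size, shard_size):
        # Split [0, dataset_size) into consecutive index ranges.
        self._shards = [
            Shard(start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        ]
        self.batches_reported = 0

    def fetch_shard(self):
        # Return the next shard, or None once every shard has been handed out.
        return self._shards.pop(0) if self._shards else None

    def report_batch_done(self):
        self.batches_reported += 1


def train_generator(rows, shard_service):
    # Same loop shape as train_generator in deepfm_estimator.py, but over
    # rows that are already in memory.
    while True:
        shard = shard_service.fetch_shard()
        if not shard:
            break
        for i in range(shard.start, shard.end):
            yield rows[i]


def report_after_each_batch(shard_service, num_batches):
    # Mirrors what ElasticDataShardReportHook.after_run does once per step:
    # report the batch, and log instead of raising if the report fails.
    for _ in range(num_batches):
        try:
            shard_service.report_batch_done()
        except Exception as ex:
            print("report batch done failed: %s" % ex)


rows = ["row-%d" % i for i in range(10)]
service = FakeShardService(dataset_size=10, shard_size=4)
samples = list(train_generator(rows, service))
report_after_each_batch(service, num_batches=3)
```

In the actual job, the generator is wrapped by `tf.data.Dataset.from_generator` and the report call is made by the hook passed to `tf.estimator.TrainSpec`, so nothing like `report_after_each_batch` is called by hand.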
diff --git a/docs/tutorials/elasticdl_deepctr_estimator.md b/docs/tutorials/elasticdl_deepctr_estimator.md
@@ -0,0 +1,173 @@
+# Distributed Training of DeepCTR Estimator using ElasticDL on Kubernetes
+
+This document shows how to run a distributed training job for a DeepCTR
+estimator model (DeepFM) using [ElasticDL](https://github.com/sql-machine-learning/elasticdl)
+on Kubernetes.
+
+## Prerequisites
+
+1. Install Minikube, preferably >= v1.11.0, following the installation
+   [guide](https://kubernetes.io/docs/tasks/tools/install-minikube).  Minikube
+   runs a single-node Kubernetes cluster in a virtual machine on your personal
+   computer.
+
+1. Install Docker CE, preferably >= 18.x, following the
+   [guide](https://docs.docker.com/docker-for-mac/install/) for building Docker
+   images containing user-defined models and the ElasticDL framework.
+
+1. Install Python, preferably >= 3.6, because the ElasticDL command-line tool is
+   written in Python.
+
+## Models
+
+In this tutorial, we use the [DeepFM estimator](https://github.com/shenweichen/DeepCTR/blob/master/deepctr/estimator/models/deepfm.py)
+model from DeepCTR. The complete training program, including the
+dataset definition, is in the [ElasticDL model zoo](https://github.com/sql-machine-learning/elasticdl/tree/develop/model_zoo/deepctr).
+
+## Dataset
+
+In this tutorial, we use the [Criteo sample dataset](https://github.com/shenweichen/DeepCTR/blob/master/examples/criteo_sample.txt)
+from the DeepCTR examples.
+
+```bash
+mkdir ./data
+wget https://raw.githubusercontent.com/shenweichen/DeepCTR/master/examples/criteo_sample.txt -O ./data/criteo_sample.txt
+```
+
+## The Kubernetes Cluster
+
+The following command starts a Kubernetes cluster locally using Minikube.  It
+uses [VirtualBox](https://www.virtualbox.org/), a hypervisor available for
+macOS, to create the virtual machine that hosts the cluster.
+
+```bash
+minikube start --vm-driver=virtualbox \
+  --cpus 2 --memory 6144 --disk-size=50gb
+eval $(minikube docker-env)
+```
+
+The command `minikube docker-env` prints a set of Bash environment variables
+that configure your local shell to reuse the Docker daemon inside
+the Minikube instance.
+
+The following command is necessary to enable
+[RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) of
+Kubernetes.
+
+```bash
+kubectl apply -f \
+  https://raw.githubusercontent.com/sql-machine-learning/elasticdl/develop/elasticdl/manifests/elasticdl-rbac.yaml
+```
+
+If `raw.githubusercontent.com` is not reachable from your network,
+clone the above repository with Git and apply the YAML file from the local copy.
+
+## Install ElasticDL Client Tool
+
+The following command installs the command line tool `elasticdl`, which talks to
+the Kubernetes cluster and operates ElasticDL jobs.
+
+```bash
+pip install elasticdl_client
+```
+
+## Build the Docker Image with Model Definition
+
+Kubernetes runs Docker containers, so we need to put user-defined models,
+the ElasticDL API package, and all dependencies into a Docker image.
+
+This tutorial uses a complete training program from the ElasticDL repository, built
+around the DeepFM estimator model of DeepCTR. To retrieve the source code, run the following command.
+
+```bash
+git clone https://github.com/sql-machine-learning/elasticdl
+```
+
+The complete code is in the directory [elasticdl/model_zoo/deepctr](https://github.com/sql-machine-learning/elasticdl/tree/develop/model_zoo/deepctr).
+
+We build the image on top of `tensorflow/tensorflow:1.13.2-py3`, using the
+following Dockerfile:
+
+```text
+FROM tensorflow/tensorflow:1.13.2-py3 as base
+
+RUN pip install elasticdl_api
+RUN pip install deepctr
+
+COPY ./model_zoo model_zoo
+```
+
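+Then, from the root of the cloned repository, we use Docker to build the image (`${deepctr_dockerfile}` is the path to the Dockerfile above):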
+
+```bash
+docker build -t elasticdl:deepctr_estimator -f ${deepctr_dockerfile} .
+```
+
+## Submit the Training Job
+
+The following command submits a training job:
+
+```bash
+elasticdl train \
+  --image_name=elasticdl/elasticdl:1.0.0 \
+  --worker_image=elasticdl:deepctr_estimator \
+  --ps_image=elasticdl:deepctr_estimator \
+  --job_command="python -m model_zoo.deepctr.deepfm_estimator --training_data=/data/criteo_sample.txt --validation_data=/data/criteo_sample.txt" \
+  --num_workers=1 \
+  --num_ps=1 \
+  --num_evaluator=1 \
+  --master_resource_request="cpu=0.2,memory=1024Mi" \
+  --master_resource_limit="cpu=1,memory=2048Mi" \
+  --ps_resource_request="cpu=0.2,memory=1024Mi" \
+  --ps_resource_limit="cpu=1,memory=2048Mi" \
+  --worker_resource_request="cpu=0.3,memory=1024Mi" \
+  --worker_resource_limit="cpu=1,memory=2048Mi" \
+  --chief_resource_request="cpu=0.3,memory=1024Mi" \
+  --chief_resource_limit="cpu=1,memory=2048Mi" \
+  --evaluator_resource_request="cpu=0.3,memory=1024Mi" \
+  --evaluator_resource_limit="cpu=1,memory=2048Mi" \
+  --job_name=test-deepctr-estimator \
+  --distribution_strategy=ParameterServerStrategy \
+  --need_tf_config=true \
+  --volume="host_path={criteo_data_path},mount_path=/data"
+```
+
+`--image_name` is the image used to launch the ElasticDL master, which
+is independent of the estimator model. The ElasticDL master is
+responsible for launching pods and assigning data shards to workers with
+elasticity and fault tolerance.
+
+`{criteo_data_path}` is the absolute path of the `./data` directory that contains
+`criteo_sample.txt`. The option `--volume="host_path={criteo_data_path},mount_path=/data"`
+bind-mounts it into the pods at `/data`.
+
+The option `--num_workers=1` tells the master to start a worker pod.
+The option `--num_ps=1` tells the master to start a ps pod.
+The option `--num_evaluator=1` tells the master to start an evaluator pod.
+
+The master also starts a chief worker for a TensorFlow estimator model by default.
+
+### Check Job Status
+
+After the job submission, we can run the command `kubectl get pods` to list
+the related pods.
+
+```text
+NAME                                     READY   STATUS    RESTARTS   AGE
+elasticdl-test-deepctr-estimator-master     1/1     Running   0          9s
+test-deepctr-estimator-edljob-chief-0       1/1     Running   0          6s
+test-deepctr-estimator-edljob-evaluator-0   0/1     Pending   0          6s
+test-deepctr-estimator-edljob-ps-0          1/1     Running   0          7s
+test-deepctr-estimator-edljob-worker-0      1/1     Running   0          6s
+```
+
+We can view the log of the chief worker with `kubectl logs test-deepctr-estimator-edljob-chief-0`.
+
+```text
+INFO:tensorflow:global_step/sec: 4.84156
+INFO:tensorflow:global_step/sec: 4.84156
+INFO:tensorflow:Saving checkpoints for 203 into /data/ckpts/model.ckpt.
+INFO:tensorflow:Saving checkpoints for 203 into /data/ckpts/model.ckpt.
+INFO:tensorflow:global_step/sec: 7.05433
+INFO:tensorflow:global_step/sec: 7.05433
+```
diff --git a/docs/tutorials/elasticdl_deepctr_keras.md b/docs/tutorials/elasticdl_deepctr_keras.md
diff --git a/elasticai_api/tensorflow/hooks.py b/elasticai_api/tensorflow/hooks.py
@@ -0,0 +1,27 @@
+# Copyright 2021 The ElasticDL Authors. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import tensorflow as tf
+
+from elasticai_api.util.log_utils import default_logger as logger
+
+
+class ElasticDataShardReportHook(tf.train.SessionRunHook):
+    def __init__(self, data_shard_service) -> None:
+        self._data_shard_service = data_shard_service
+
+    def after_run(self, run_context, run_values):
+        try:
+            self._data_shard_service.report_batch_done()
+        except Exception as ex:
+            logger.error("elastic_ai: report batch done failed: %s", ex)
diff --git a/model_zoo/deepctr/deepfm_estimator.py b/model_zoo/deepctr/deepfm_estimator.py
@@ -0,0 +1,172 @@
+# Copyright 2021 The ElasticDL Authors. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import csv
+import os
+
+import tensorflow as tf
+from deepctr.estimator.models import DeepFMEstimator
+
+from elasticai_api.common.data_shard_service import DataShardService
+from elasticai_api.tensorflow.hooks import ElasticDataShardReportHook
+
+tf.logging.set_verbosity(tf.logging.INFO)
+
+
+def read_csv(file_path):
+    rows = []
+    with open(file_path) as csvfile:
+        reader = csv.reader(csvfile)
+        for line_no, row in enumerate(reader):
+            if line_no > 0:  # Skip the header row.
+                row_values = []
+                row_values.append(int(row[0]))  # Label.
+                for col in range(1, 14):  # Dense features; empty means 0.
+                    value = row[col] if row[col] else 0
+                    row_values.append(float(value))
+                row_values.extend(row[14:])  # Sparse features stay strings.
+                rows.append(row_values)
+    return rows
+
+
+def train_generator(data_path, shard_service):
+    rows = read_csv(data_path)
+    while True:
+        # Read samples by the range of the shard from
+        # the data shard service.
+        shard = shard_service.fetch_shard()
+        if not shard:
+            break
+        for i in range(shard.start, shard.end):
+            yield tuple(rows[i])
+
+
+def eval_generator(data_path):
+    rows = read_csv(data_path)
+    for row in rows:
+        yield tuple(row)
+
+
+def input_fn(sample_generator, batch_size, dense_features, sparse_features):
+    output_types = tuple(
+        [tf.int32]
+        + [tf.float32 for _ in dense_features]
+        + [tf.string for _ in sparse_features]
+    )
+    dataset = tf.data.Dataset.from_generator(
+        sample_generator, output_types=output_types,
+    )
+    dataset = dataset.shuffle(100).batch(batch_size)
+    values = dataset.make_one_shot_iterator().get_next()
+
+    label_value = values[0]
+    feature_values = {}
+    feature_index = 1
+    for feature in dense_features:
+        feature_values[feature] = values[feature_index]
+        feature_index += 1
+
+    for feature in sparse_features:
+        feature_values[feature] = values[feature_index]
+        feature_index += 1
+    return feature_values, label_value
+
+
+def arg_parser():
+    parser = argparse.ArgumentParser(description="Process training parameters")
+    parser.add_argument("--training_data", type=str, required=True)
+    parser.add_argument(
+        "--validation_data", type=str, default="", required=False
+    )
+    return parser
+
+
+if __name__ == "__main__":
+    parser = arg_parser()
+    args = parser.parse_args()
+
+    training_data = args.training_data
+    validation_data = args.validation_data
+
+    model_dir = "/data/ckpts/"
+    os.makedirs(model_dir, exist_ok=True)
+
+    sparse_features = ["C" + str(i) for i in range(1, 27)]
+    dense_features = ["I" + str(i) for i in range(1, 14)]
+
+    dnn_feature_columns = []
+    linear_feature_columns = []
+
+    for i, feat in enumerate(sparse_features):
+        dnn_feature_columns.append(
+            tf.feature_column.embedding_column(
+                tf.feature_column.categorical_column_with_hash_bucket(
+                    feat, 1000
+                ),
+                4,
+            )
+        )
+        linear_feature_columns.append(
+            tf.feature_column.categorical_column_with_hash_bucket(feat, 1000)
+        )
+    for feat in dense_features:
+        dnn_feature_columns.append(tf.feature_column.numeric_column(feat))
+        linear_feature_columns.append(tf.feature_column.numeric_column(feat))
+
+    batch_size = 64
+
+    config = tf.estimator.RunConfig(
+        model_dir=model_dir, save_checkpoints_steps=100, keep_checkpoint_max=3
+    )
+    model = DeepFMEstimator(
+        linear_feature_columns,
+        dnn_feature_columns,
+        task="binary",
+        config=config,
+    )
+
+    # Create a data shard service which can split the dataset
+    # into shards.
+    rows = read_csv(training_data)
+    training_data_shard_svc = DataShardService(
+        batch_size=batch_size,
+        num_epochs=100,
+        dataset_size=len(rows),
+        num_minibatches_per_shard=1,
+        dataset_name="criteo_training_data",
+    )
+
+    def train_input_fn():
+        return input_fn(
+            lambda: train_generator(training_data, training_data_shard_svc),
+            batch_size,
+            dense_features,
+            sparse_features,
+        )
+
+    def eval_input_fn():
+        return input_fn(
+            lambda: eval_generator(validation_data),
+            batch_size,
+            dense_features,
+            sparse_features,
+        )
+
+    hooks = [
+        ElasticDataShardReportHook(training_data_shard_svc),
+    ]
+    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, hooks=hooks)
+    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
+
+    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)