# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: Currently a work in progress. We cannot match "AutoAugment (AA)" in [the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

For more information about other types of models, please refer to this
[README file](../../README.md).

## Before you begin

Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).
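
For example, assuming the repository was cloned to `/path/to/models` (a placeholder path), a minimal way to do this on Linux is:

```bash
# Assumes the Model Garden repository was cloned to /path/to/models
# (placeholder path); see the linked instructions for the full setup steps.
export PYTHONPATH=$PYTHONPATH:/path/to/models
```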

### ImageNet preparation

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
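
As a rough illustration, the script can convert a local ImageNet download into TFRecords on local disk without uploading to GCS; the flag names below follow the linked README at the time of writing, and both paths are placeholders:

```bash
# Illustrative sketch only -- consult the linked README for the full set of
# options. /data/imagenet-raw and /data/imagenet-tfrecord are placeholder paths.
python3 imagenet_to_gcs.py \
  --raw_data_dir=/data/imagenet-raw \
  --local_scratch_dir=/data/imagenet-tfrecord \
  --nogcs_upload
```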

### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.
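
For reference, one possible way to create both, assuming the `ctpu` utility described in the quickstart, is sketched below; the name, zone, TPU size, and TensorFlow version are placeholders, not required values:

```bash
# Sketch assuming the ctpu utility from the Cloud TPU quickstart; the name,
# zone, TPU size, and TF version are placeholders -- adjust to your project.
ctpu up \
  --name=my-tpu \
  --zone=us-central1-b \
  --tpu-size=v3-8 \
  --tf-version=2.1
```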

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python3 mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download
```

Note: the `--download` flag is only required the first time you run the model.

## Classifier Trainer

The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments should be provided as
YAML configuration files; see [configs/examples](./configs/examples) for
example configurations.
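
For example, a minimal sketch of running a custom experiment is to copy one of the provided configs, edit it, and pass your copy via `--config_file` (`my_experiment.yaml` is a placeholder file name):

```bash
# Hypothetical workflow: copy an example config, edit it, then point the
# trainer at your copy. my_experiment.yaml is a placeholder file name.
cp configs/examples/resnet/imagenet/gpu.yaml my_experiment.yaml
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=my_experiment.yaml
```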

The provided configuration files use a per-replica batch size that is scaled
by the number of devices. For instance, if the batch size is 64, then for 1 GPU
the global batch size is 64 * 1 = 64. For 8 GPUs, the global batch size is
64 * 8 = 512. Similarly, for a v3-8 TPU the global batch size is 64 * 8 = 512,
and for a v3-32 it is 64 * 32 = 2048.

### ResNet50

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet

**Note: EfficientNet development is a work in progress.**

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line using
`--params_override`. The TPU does not need this override, as the device is fixed
by providing the TPU address or name with the `--tpu` flag.