Commit de31fd8
Author: Allen Wang
Internal change
PiperOrigin-RevId: 302937425
1 parent: ee3dae7

27 files changed (+5,010 / -134 lines); this file: 80 additions, 134 deletions.

# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [ResNet](#resnet)
* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
  compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: Currently a work in progress. We cannot yet match the "AutoAugment (AA)" results reported in [the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).

For more information about other types of models, please refer to this
[README file](../../README.md).

## ResNet

Similar to the [estimator implementation](../../r1/resnet), the Keras
implementation has code for the ImageNet dataset. The ImageNet
version uses a ResNet50 model implemented in
[`resnet_model.py`](./resnet/resnet_model.py).

## Before you begin

Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).
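
For example, assuming the repository is checked out at `/path/to/models` (a
placeholder path), the folder can be added to the Python path for the current
shell with:

```bash
export PYTHONPATH=$PYTHONPATH:/path/to/models
```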

### Pretrained Models

* [ResNet50 Checkpoints](https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz)

* ResNet50 TFHub: [feature vector](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
  and [classification](https://tfhub.dev/tensorflow/resnet_50/classification/1)
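
For instance, the checkpoint archive linked above can be fetched and unpacked
locally with standard tools:

```bash
# Download and extract the pretrained ResNet50 checkpoint.
wget https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz
tar -xzf resnet50.tar.gz
```
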
### ImageNet Training

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
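
As an illustration only, a typical invocation of that script might look roughly
like the sketch below; the flag names follow the linked README and should be
treated as assumptions, so check that README for the authoritative list:

```bash
# Convert raw ImageNet to TFRecords and upload them to a GCS bucket.
python3 imagenet_to_gcs.py \
  --project=$GCP_PROJECT \
  --gcs_output_path=gs://$BUCKET/imagenet \
  --local_scratch_dir=./imagenet \
  --imagenet_username=$IMAGENET_USERNAME \
  --imagenet_access_key=$IMAGENET_ACCESS_KEY
```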

Once your dataset is ready, you can begin training the model as follows:

```bash
python resnet/resnet_imagenet_main.py
```

If you did not download the data to the default directory, specify the
location with the `--data_dir` flag:

```bash
python resnet/resnet_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flag options you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data is used rather than
  real data;
- `--batch_size`: the batch size used for the model;
- `--model_dir`: the directory to save the model checkpoint;
- `--train_epochs`: number of epochs to run for training the model;
- `--train_steps`: number of steps to run for training the model. Currently,
  only values smaller than the number of batches in an epoch are supported;
- `--skip_eval`: when set to true, evaluation as well as validation during
  training is skipped.

For example, here is a typical command line for running on ImageNet data with
a batch size of 128 per GPU:

```bash
python resnet/resnet_imagenet_main.py \
  --model_dir=/tmp/model_dir/something \
  --num_gpus=2 \
  --batch_size=128 \
  --train_epochs=90 \
  --train_steps=10 \
  --use_synthetic_data=false
```

See [`common.py`](common.py) for the full list of options.

### Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy` API.
You can read more about distribution strategies in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In this example, we have made it easier to use with just a command-line flag,
`--num_gpus`. By default this flag is 1 if TensorFlow is compiled with CUDA,
and 0 otherwise.

- `--num_gpus=0`: Uses `tf.distribute.OneDeviceStrategy` with CPU as the device.
- `--num_gpus=1`: Uses `tf.distribute.OneDeviceStrategy` with GPU as the device.
- `--num_gpus=2+`: Uses `tf.distribute.MirroredStrategy` to run synchronous
  distributed training across the GPUs.

If you wish to run without `tf.distribute.Strategy`, you can do so by setting
`--distribution_strategy=off`.

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
`tf.distribute.Strategy`.

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
appropriately at each host. For example, to run using `MultiWorkerMirroredStrategy`
on 2 hosts, the `cluster` in `TF_CONFIG` should have 2 `host:port` entries, and
host `i` should have the `task` in `TF_CONFIG` set to `{"type": "worker",
"index": i}`. `MultiWorkerMirroredStrategy` will automatically use all the
available GPUs at each host.
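
As a concrete sketch with placeholder host addresses, each of the two workers
might export a `TF_CONFIG` like the following before launching training:

```bash
# On host 0 (use "index": 1 on host 1); host names and ports are placeholders.
export TF_CONFIG='{
  "cluster": {"worker": ["host0.example.com:2222", "host1.example.com:2222"]},
  "task": {"type": "worker", "index": 0}
}'
```
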
### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.

To run the ResNet model on a TPU, you must set `--distribution_strategy=tpu` and
`--tpu=$TPU_NAME`, where `$TPU_NAME` is the name of your TPU in the Cloud Console.
From a GCE VM, you can run the following command to train ResNet for one epoch
on a v2-8 or v3-8 TPU:

```bash
python resnet/resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=1 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true
```

To train the ResNet to convergence, run it for 90 epochs:

```bash
python resnet/resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=90 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true
```

Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python3 mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download
```

Note: the `--download` flag is only required the first time you run the model.

## Classifier Trainer

The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments should be provided in the
form of YAML files; some examples are included within the configs/examples
folder. Please see [configs/examples](./configs/examples) for more example
configurations.

The provided configuration files use a per-replica batch size, which is scaled
by the number of devices. For instance, if the batch size is 64, then for 1 GPU
the global batch size would be 64 * 1 = 64. For 8 GPUs, the global batch size
would be 64 * 8 = 512. Similarly, for a v3-8 TPU, the global batch size would
be 64 * 8 = 512, and for a v3-32, the global batch size is 64 * 32 = 2048.
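
As a quick sanity check, the global batch size is just the per-replica batch
size multiplied by the number of replicas; for example:

```bash
# Per-replica batch size from the config, times the number of replicas.
PER_REPLICA_BATCH_SIZE=64
NUM_REPLICAS=32   # e.g. the 32 cores of a v3-32 TPU slice
echo $((PER_REPLICA_BATCH_SIZE * NUM_REPLICAS))   # prints 2048
```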

### ResNet50

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet

**Note: EfficientNet development is a work in progress.**

#### On GPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:

```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line using
`--params_override`. TPU runs do not need this override, as the device is fixed
by providing the TPU address or name with the `--tpu` flag.
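
For example, to run the ResNet50 GPU configuration above on 8 GPUs, the same
command can be reused with the override value spelled out; this simply repeats
the dotted-key syntax shown in the commands above:

```bash
# Same ResNet50 GPU command as above, overriding the GPU count to 8.
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override="runtime.num_gpus=8"
```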
