# Dynamic Sharding Support Reading Original Images

Currently, users have to convert images into RecordIO format before training
with ElasticDL. In practice, many users read the original images directly to
train their models, so we need to support reading the original data from
storage. However, the original data comes in various file formats, and we
need a common shard definition that covers the different formats.
## Different Ways to Store Images and Annotations
1. All images are in the same folder.

   ```txt
   |-- images
      |--0001.png
      |--0002.png
      |--0003.png
      |--0004.png
      |--0005.png
   ```

   In this format, users may not need labels, for example in unsupervised
   tasks such as image compression.
|
2. Images with the same label are in the same folder, as in ImageNet.

   ```txt
   |-- images
      |--0
         |--0001.png
         |--0002.png
         |--0003.png
      |--1
         |--0004.png
         |--0005.png
         |--0006.png
   ```

   Besides the images, there is usually a file that lists all filenames and
   labels, like:

   ```csv
   0001.png,0
   0002.png,0
   0003.png,0
   0004.png,1
   0005.png,1
   0006.png,1
   ```

   Users read this file to locate and load the images from storage.
|
3. The description of images, labels, and annotations is in a JSON or XML
   file.

   For example, the image description is in a JSON file for the COCO
   dataset and in an XML file for the Pascal VOC dataset. An example of a
   COCO image description is:

   ```json
   {
       "license": 3,
       "file_name": "COCO_val2014_000000391895.jpg",
       "coco_url": "http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg",
       "height": 360,
       "width": 640,
       "date_captured": "2013-11-14 11:18:45",
       "flickr_url": "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
       "id": 391895
   }
   ```

   and an example of a caption annotation is:

   ```json
   {
       "image_id": 203564,
       "id": 37,
       "caption": "A bicycle replica with a clock as the front wheel."
   }
   ```

In elastic training, we need to split the training data into shards and
assign those shards to workers. When a worker fails, we need to reassign
its uncompleted shards to other alive workers.
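For the second layout, the index file can be generated by walking the folder
structure. Below is a minimal sketch of this step; the function name
`build_index_csv` and the paths are illustrative, not part of ElasticDL:

```python
import csv
import os

def build_index_csv(images_dir, output_csv):
    """Walk label subfolders like images/0/0001.png and write
    `filename,label` rows, sorted so record indices are stable."""
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for label in sorted(os.listdir(images_dir)):
            label_dir = os.path.join(images_dir, label)
            if not os.path.isdir(label_dir):
                continue
            for filename in sorted(os.listdir(label_dir)):
                writer.writerow([filename, label])
```

Sorting both the label folders and the filenames matters: every worker must
agree on which record a given line number refers to.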
|
## Shard Definition
|
In ElasticDL, we define the shard information `start`, `end` and `shard_name`
in the task. We can define the shard independently and expose the shard for
users to create their datasets.

```proto
message Shard {
  // The storage name for the data, such as a MaxCompute table or
  // a CSV file with image paths.
  string name = 1;

  // Starting and ending (non-inclusive) record number.
  int64 start = 2;
  int64 end = 3;
}
```
In order to split the training data into shards, we must first get the size
of the training data. We can get the size by `len(os.listdir("images"))` for
case 1 and by counting the lines of the CSV file for case 2. For simplicity,
we can also store the image names in a CSV file for the first layout. Then,
the name of the shard is the CSV filename, and the start and end indices are
line numbers in the CSV file. When a worker gets a shard, it can read the
images according to those lines in the CSV file.
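For example, a worker that receives a shard over such an index CSV could
read its records like this. This is a minimal sketch: `read_shard_records`
is a hypothetical helper, and the `[start, end)` convention follows the
shard definition above.

```python
import csv

def read_shard_records(csv_path, start, end):
    """Read rows [start, end) of an index CSV mapping image names to labels."""
    records = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= end:
                break
            if i >= start:
                records.append((row[0], int(row[1])))
    return records
```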
For case 3, it is difficult to get the size because we don't know the format
of the description. So, users need to indicate the size of the training
data. We can design a function in the model definition that users implement
to return the size.

```python
from pycocotools.coco import COCO

def get_training_size():
    coco = COCO("annotations/captions_val2014.json")
    return len(coco.anns.keys())
```
Then, users should define `training_data` as the Python file that contains
the function:

```bash
--training_data="/test_data/coco/train/create_shard.py"
```
The `PythonCustomReader` in ElasticDL can load the function to get the total
size in the master.

```python
class PythonCustomReader(AbstractDataReader):
    def __init__(self, **kwargs):
        """
        Args:
            kwargs: should contain "filename" and "records_per_task".
        """
        AbstractDataReader.__init__(self, **kwargs)
        self._kwargs = kwargs
        self._filename = self._kwargs["filename"]
        self._records_per_task = self._kwargs["records_per_task"]
        self._get_size_fn = None

    def load_get_size_fn(self, fn_name="get_training_size"):
        # Load the user-defined size function from the Python file.
        module = load_module(self._filename)
        self._get_size_fn = module[fn_name]

    def get_size(self):
        if self._get_size_fn is None:
            raise RuntimeError("The get_size function is not loaded.")
        return self._get_size_fn()

    def create_shards(self):
        shard_name_prefix = "shard_"
        size = self.get_size()
        shards = {}
        num_shards = size // self._records_per_task
        start_ind = 0
        for shard_id in range(num_shards):
            # Each shard is a (start index, record count) pair.
            shards[shard_name_prefix + str(shard_id)] = (
                start_ind,
                self._records_per_task,
            )
            start_ind += self._records_per_task
        # Put the leftover records into a final, smaller shard.
        if start_ind < size:
            shards[shard_name_prefix + str(num_shards)] = (
                start_ind,
                size - start_ind,
            )
        return shards
```
Then, the master will call the function to get the size and split the
training data into shards accordingly. The shard message will only contain
the start and end indices. Users need to read the image information
according to the indices by themselves.
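To make the splitting step concrete, here is a small standalone sketch of
the logic. The function name and signature are illustrative, not the actual
ElasticDL API; leftover records go into a smaller final shard:

```python
def create_shards(size, records_per_task, prefix="shard_"):
    """Split `size` records into shards of (start index, record count)."""
    shards = {}
    start, shard_id = 0, 0
    while start < size:
        # The last shard may hold fewer than records_per_task records.
        count = min(records_per_task, size - start)
        shards[prefix + str(shard_id)] = (start, count)
        start += count
        shard_id += 1
    return shards

# 25 records with 10 records per task yields:
# {"shard_0": (0, 10), "shard_1": (10, 10), "shard_2": (20, 5)}
```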
## The Worker Creates the Dataset Using Shards

### APIs to Fetch Shards
```python
import os

class DataShardService(object):
    def __init__(self, batch_size, master_client=None):
        self._batch_size = batch_size
        self._mc = master_client
        if not self._mc:
            master_addr = os.getenv("MASTER_ADDR")
            worker_id = os.getenv("WORKER_ID")
            self._mc = MasterClient(
                build_channel(master_addr), worker_id
            )
        self._pending_tasks = []
        self.record_count = 0

    def fetch_shard(self):
        # Sketch: request the next task from the master over gRPC and
        # return its shard; return None when no task is left.
        task = self._mc.get_task()
        if not task:
            return None
        self._pending_tasks.append(task)
        self.record_count += task.shard.end - task.shard.start
        return task.shard

    def report_batch_done(self):
        # Sketch: once all records of the oldest pending task have been
        # consumed, report the task to the master as done.
        if self._pending_tasks:
            task = self._pending_tasks.pop(0)
            self._mc.report_task_result(task.task_id)
```
### Create Dataset Using TensorFlow
```python
import numpy as np
import tensorflow as tf

data_shard_service = DataShardService(batch_size=256)

class DynamicShardingHook(tf.train.SessionRunHook):
    def __init__(self, batch_size):
        self._local_step = 0
        self._batch_size = batch_size

    def after_run(self, run_context, run_values):
        self._local_step += 1
        # Report progress so the master can mark consumed shards as done.
        if self._local_step * self._batch_size > data_shard_service.record_count:
            data_shard_service.report_batch_done()

def get_dataset(shuffle=False):
    def _record_generator():
        while True:
            shard = data_shard_service.fetch_shard()
            if not shard:
                break
            # read_records is user-defined and loads records in [start, end).
            records = read_records(shard.start, shard.end)
            if shuffle:
                np.random.shuffle(records)
            for record in records:
                yield record

    # The output types depend on the user's record format.
    return tf.data.Dataset.from_generator(
        _record_generator, output_types=tf.string
    )
```
### Create Dataset Using PyTorch

Here, we create the dataset using the COCO dataset.
```python
import cv2
import numpy as np
import torch
from pycocotools.coco import COCO
from torch.utils.data import DataLoader

data_shard_service = DataShardService(batch_size=32)

coco = COCO("annotations/captions_val2014.json")
ids = list(coco.anns.keys())

def read_images(shard):
    # Load (image, caption) pairs for the records in [start, end).
    images = []
    for index in range(shard.start, shard.end):
        ann_id = ids[index]
        caption = coco.anns[ann_id]['caption']
        img_id = coco.anns[ann_id]['image_id']
        path = coco.loadImgs(img_id)[0]['file_name']
        image = cv2.imread(path)
        images.append((image, caption))
    return images


class ImageDataset(torch.utils.data.IterableDataset):

    def __init__(self, shuffle=False):
        self._shuffle = shuffle

    def __iter__(self):
        while True:
            shard = data_shard_service.fetch_shard()
            if shard:
                images = read_images(shard)
                if self._shuffle:
                    np.random.shuffle(images)
                for image in images:
                    yield image
            else:
                break

dataset = ImageDataset()
data_loader = DataLoader(dataset=dataset, batch_size=32)
```