
Commit 4edfde5

Design for Dynamic Sharding to Support Reading Original Images (#2341)
* Dynamic Sharding Support Reading Original Images
* Fix by comments
* Polish by comments
* Add a demo to create a dataloader using COCO dataset
1 parent 447ac93 commit 4edfde5

1 file changed: 271 additions, 0 deletions

# Dynamic Sharding Support Reading Original Images

Currently, users have to convert images into RecordIO files for ElasticDL. In
practice, many users read the original images to train their models, so we
need to support reading the original data from storage. However, the original
data comes in various file formats, and we need to design a common definition
that works for all of them.

## Different Ways to Store Images and Annotations

1. All images are in the same folder.

```txt
|-- images
    |--0001.png
    |--0002.png
    |--0003.png
    |--0004.png
    |--0005.png
```

In this format, users may not need labels, for example for tasks such as
image compression.

2. Images with the same label are in the same folder, as in ImageNet.

```txt
|-- images
    |--0
        |--0001.png
        |--0002.png
        |--0003.png
    |--1
        |--0004.png
        |--0005.png
        |--0006.png
```

Besides the images, there is usually a file that stores all filenames and
labels, like:

```csv
0001.png,0
0002.png,0
0003.png,0
0004.png,1
0005.png,1
0006.png,1
```

Users read this file to locate and load the images from storage (a sketch of
reading images by shard indices from such a file appears in the Shard
Definition section below).

3. The description of images, labels, and annotations is in a JSON or XML file.

For example, the description of an image is in a JSON file for the COCO
dataset and in an XML file for the Pascal VOC dataset. An example of a COCO
image description is

```json
{
    "license": 3,
    "file_name": "COCO_val2014_000000391895.jpg",
    "coco_url": "http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg",
    "height": 360,
    "width": 640,
    "date_captured": "2013-11-14 11:18:45",
    "flickr_url": "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
    "id": 391895
}
```

and an example of a caption annotation is

```json
{
    "image_id": 203564,
    "id": 37,
    "caption": "A bicycle replica with a clock as the front wheel."
}
```

In elastic training, we need to split the training data into shards and assign
those shards to workers. When a worker fails, we need to reassign its
uncompleted shards to other alive workers.

## Shard Definition

In ElasticDL, we define the shard information `start`, `end`, and `shard_name`
in the task. We can define the shard independently and expose it for users to
create their datasets.

```proto
message shard {
    // The storage name for data, such as a MaxCompute table or a
    // CSV file with image paths.
    string name = 1;

    // Starting and ending (non-inclusive) record number.
    int64 start = 2;
    int64 end = 3;
}
```

In order to split the training data into shards, we must first get the size of
the training data. We can get the size by `len(os.listdir("images"))` for
case 1 and by reading the CSV file for case 2. For simplicity, we can also
store the image names in a CSV file for the first layout, so the name of the
shard is the CSV file and the start and end indices are line numbers of the
CSV file. When a worker gets a shard, it can read the images referenced by
those lines of the CSV file, as sketched below.

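The following is a minimal sketch of this idea. The CSV path, the helper name
`read_images_from_csv`, and the exact CSV layout are illustrative assumptions,
not part of ElasticDL.

```python
import csv

import cv2


def read_images_from_csv(csv_file, start, end):
    """Read the images referenced by lines [start, end) of a CSV file.

    Each line is assumed to be "<image file name>,<label>" as in case 2 above.
    """
    samples = []
    with open(csv_file) as f:
        for i, row in enumerate(csv.reader(f)):
            if i < start:
                continue
            if i >= end:
                break
            file_name, label = row[0], int(row[1])
            # Assumes all images live in a single "images" folder.
            image = cv2.imread("images/" + file_name)
            samples.append((image, label))
    return samples


# For example, a shard ("images.csv", 0, 3) covers the first three lines.
samples = read_images_from_csv("images.csv", 0, 3)
```
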
For case 3, it is difficult to get the size because we don't know the format
of the description, so users need to tell ElasticDL the size of the training
data. We can design a function in the model definition, and users implement
the function to return the size.

```python
from pycocotools.coco import COCO


def get_training_size():
    # The number of caption annotations is the number of training records.
    coco = COCO("annotations/captions_val2014.json")
    return len(coco.anns.keys())
```

Then, users should set `training_data` to the Python file that contains the
function.

```bash
--training_data="/test_data/coco/train/create_shard.py"
```

The `PythonCustomReader` in ElasticDL can load the function to get the total
size of the training data in the master.

```python
class PythonCustomReader(AbstractDataReader):
    def __init__(self, **kwargs):
        """
        Args:
            kwargs: should contain "filename" and "records_per_task".
        """
        AbstractDataReader.__init__(self, **kwargs)
        self._filename = self._kwargs["filename"]
        self._records_per_task = self._kwargs["records_per_task"]
        self._get_size_fn = None

    def load_get_size_fn(self, fn_name="get_training_size"):
        # Load the user-defined function from the Python file set
        # by --training_data.
        module = load_module(self._filename)
        self._get_size_fn = module[fn_name]

    def get_size(self):
        if self._get_size_fn:
            return self._get_size_fn()
        return 0

    def create_shards(self):
        # Split the records into shards, each with `records_per_task` records.
        shard_name_prefix = "shard_"
        size = self.get_size()
        shards = {}
        num_shards = size // self._records_per_task
        start_ind = 0
        for shard_id in range(num_shards):
            shards[shard_name_prefix + str(shard_id)] = (
                start_ind,
                self._records_per_task,
            )
            start_ind += self._records_per_task
        return shards
```

Then, the master will call the function to get the size and split the training
data into shards by the size. The shard message will only contain the start
and end index, and users need to read the image information according to the
index by themselves.

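As an illustration, the master could use the reader roughly as follows; the
`records_per_task` value below is an assumption for this example, not part of
the design.

```python
# Hypothetical driver code on the master, using the reader defined above.
reader = PythonCustomReader(
    filename="/test_data/coco/train/create_shard.py",
    records_per_task=1000,
)
reader.load_get_size_fn()

# Produces shards such as {"shard_0": (0, 1000), "shard_1": (1000, 1000), ...},
# where each value is (start index, number of records).
shards = reader.create_shards()
```
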
## The Worker Creates the Dataset Using Shards

### APIs to Fetch Shards

```python
class DataShardService(object):
    def __init__(self, batch_size, master_client=None):
        self._batch_size = batch_size
        self._mc = master_client
        if not self._mc:
            # Connect to the master using the address and worker id
            # injected into the worker's environment.
            master_addr = os.getenv("MASTER_ADDR")
            worker_id = os.getenv("WORKER_ID")
            self._mc = MasterClient(
                build_channel(master_addr), worker_id
            )
        self._pending_tasks = []
        self.record_count = 0

    def fetch_shard(self):
        # Ask the master for a new task, remember it as pending, and
        # return its shard (or None when no shard is left).
        ...

    def report_batch_done(self):
        # Report the consumed batch to the master; once all records of a
        # pending task have been consumed, report the task as done.
        ...
```

### Create Dataset Using TensorFlow

```python
import numpy as np
import tensorflow as tf

data_shard_service = DataShardService(batch_size=256)


class DynamicShardingHook(tf.train.SessionRunHook):
    def __init__(self, batch_size=256):
        self._local_step = 0
        self._batch_size = batch_size

    def after_run(self, run_context, run_values):
        # After each step, report the consumed batch so the master can
        # mark shards as completed.
        self._local_step += 1
        if self._local_step * self._batch_size > data_shard_service.record_count:
            data_shard_service.report_batch_done()


def get_dataset(shuffle=False):
    def _record_generator():
        # Keep fetching shards from the master until no shard is left.
        while True:
            shard = data_shard_service.fetch_shard()
            if not shard:
                break
            # read_records is a user-defined function that reads the
            # records in [start, end) from storage.
            records = read_records(shard.start, shard.end)
            if shuffle:
                np.random.shuffle(records)
            for record in records:
                yield record

    # Assume each record is a string here; the output type depends on
    # what read_records returns.
    return tf.data.Dataset.from_generator(_record_generator, tf.string)
```

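As a usage illustration only (`model_fn` is a hypothetical user-defined model
function and the batch size is an assumption), the dataset and hook might be
wired into Estimator-style training like this:

```python
# Hypothetical training driver using the dataset and hook defined above.
estimator = tf.estimator.Estimator(model_fn=model_fn)
estimator.train(
    input_fn=lambda: get_dataset(shuffle=True).batch(256),
    hooks=[DynamicShardingHook(batch_size=256)],
)
```
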
### Create Dataset Using PyTorch

Here, we create a dataset using the COCO dataset.

```python
import cv2
import numpy as np
import torch
from pycocotools.coco import COCO
from torch.utils.data import DataLoader

data_shard_service = DataShardService(batch_size=32)

coco = COCO("annotations/captions_val2014.json")
ids = list(coco.anns.keys())


def read_images(shard):
    # Read the images and captions indexed by [shard.start, shard.end).
    images = []
    for index in range(shard.start, shard.end):
        ann_id = ids[index]
        caption = coco.anns[ann_id]['caption']
        img_id = coco.anns[ann_id]['image_id']
        path = coco.loadImgs(img_id)[0]['file_name']
        image = cv2.imread(path)
        images.append((image, caption))
    return images


class ImageDataset(torch.utils.data.IterableDataset):
    def __init__(self, shuffle=False):
        self._shuffle = shuffle

    def __iter__(self):
        # Keep fetching shards from the master until no shard is left.
        while True:
            shard = data_shard_service.fetch_shard()
            if shard:
                images = read_images(shard)
                if self._shuffle:
                    np.random.shuffle(images)
                for image in images:
                    yield image
            else:
                break


dataset = ImageDataset(shuffle=True)
data_loader = DataLoader(dataset=dataset, batch_size=32)
```
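
For completeness, a hypothetical training loop over the `DataLoader` could
report progress after each batch; `train_step` is an assumed user-defined
function.

```python
# Sketch of the worker's training loop; after each batch the worker reports
# progress so the master can track which shards are completed.
for step, batch in enumerate(data_loader):
    train_step(batch)
    data_shard_service.report_batch_done()
```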
