forked from open-edge-platform/datumaro
Open
Description
I checked how efficient the COCO optimizations and streaming are with respect to subset access speed, using the following script:
Details
import re
import sys
from argparse import ArgumentParser
from typing import List, Optional, Type
from timeit import timeit

from datumaro.components.dataset import Dataset, StreamDataset
from datumaro.components.environment import Environment


def parse_dataset_pathspec(s: str, env: Optional[Environment] = None, *, dataset_class: Type[Dataset] = Dataset) -> Dataset:
    """
    Parses Dataset paths. The syntax is:
        - <dataset path>[ :<format> ]

    Returns: a dataset from the parsed path
    """
    match = re.fullmatch(
        r"""
        (?P<dataset_path>(?: [^:] | :[/\\] )+)
        (:(?P<format>.+))?
        """,
        s,
        flags=re.VERBOSE,
    )
    if not match:
        raise ValueError("Failed to recognize dataset pathspec in '%s'" % s)
    match = match.groupdict()

    path = match["dataset_path"]
    format = match["format"]
    return dataset_class.import_from(path, format, env=env)


def iterate_subsets(dataset: Dataset):
    # Visit every item of every subset; only the iteration itself is timed
    c = 0
    for subset_name, subset in dataset.subsets().items():
        for item in subset:
            c += 1

    # for item in dataset:
    #     c += 1

    # print(c)


def main(args: Optional[List[str]] = None) -> int:
    parser = ArgumentParser()
    parser.add_argument("src_path")
    parsed_args = parser.parse_args(args)

    input_dataset = parse_dataset_pathspec(parsed_args.src_path, dataset_class=StreamDataset)
    print(input_dataset.subsets())

    n = 1
    time = timeit(
        "iterate_subsets(input_dataset)",
        globals={**globals(), **locals()},
        number=n,
    )
    print(f"Time taken to iterate subsets: {time:.4f} seconds")
    print(f"Average time per iteration: {time / n:.4f} seconds")

    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))

When testing it on COCO annotations (2 subsets, train2017 and val2017, in 1 directory), I got the following results:
$ python test_stream_iter.py "datasets/coco/:coco_instances"
Time taken to iterate subsets: 196.3650 seconds
Average time per iteration: 39.2730 seconds
My observations:
- COCO's extractor stream_items() and get_dataset_item() should include page_mapper access in the lazy annotation loading
- annotation types for a COCO subset kill the lazy annotation loading optimizations made, as they load annotations after the item is iterated
- simple iteration over a COCO dataset takes about 3.5s (without ann types and ann loading). If there are many subsets, this time can be multiplied by the number of subsets. Maybe the changes from Remove KEEPS_SUBSETS_INTACT and extra logic for streaming subsets #102 should be revisited, and direct subset access should be restored if provided by the source extractor (as opposed to iterating over the whole dataset and filtering).
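
To illustrate the last point, here is a toy model of why per-subset iteration cost can scale with the number of subsets. It is only a sketch under assumptions: ToySource, stream_all(), stream_subset() and the two helper functions are hypothetical names made up for this example, not Datumaro APIs, and the "reads" counter stands in for the real parsing cost.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Item = Tuple[str, str]  # (item id, subset name)


class ToySource:
    """Hypothetical stand-in for a streaming extractor (not a Datumaro class)."""

    def __init__(self, items: List[Item]) -> None:
        self._items = items
        self._by_subset: Dict[str, List[Item]] = defaultdict(list)
        for item in items:
            self._by_subset[item[1]].append(item)
        self.reads = 0  # number of items the source had to produce

    def subset_names(self) -> List[str]:
        return list(self._by_subset)

    def stream_all(self) -> Iterator[Item]:
        # Streams every item, regardless of subset
        for item in self._items:
            self.reads += 1
            yield item

    def stream_subset(self, name: str) -> Iterator[Item]:
        # Direct subset access: only items of the requested subset are produced
        for item in self._by_subset[name]:
            self.reads += 1
            yield item


def iterate_by_filtering(source: ToySource) -> int:
    """Each subset is obtained by scanning the whole dataset and filtering."""
    count = 0
    for name in source.subset_names():
        for item in source.stream_all():
            if item[1] == name:
                count += 1
    return count


def iterate_directly(source: ToySource) -> int:
    """Each subset is read through direct subset access."""
    count = 0
    for name in source.subset_names():
        for _ in source.stream_subset(name):
            count += 1
    return count


if __name__ == "__main__":
    items = [(f"train_{i}", "train2017") for i in range(6)]
    items += [(f"val_{i}", "val2017") for i in range(2)]

    filtered = ToySource(items)
    iterate_by_filtering(filtered)
    direct = ToySource(items)
    iterate_directly(direct)
    print("reads with filtering:", filtered.reads)
    print("reads with direct access:", direct.reads)
```

With 2 subsets and 8 items in total, the filtering variant produces 16 items (8 per subset pass) while direct access produces 8, which matches the "multiplied by the number of subsets" behaviour described above.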