Revisit subset access in streaming dataset / coco #109

@zhiltsov-max

Description

I checked the efficiency of the COCO optimizations and streaming with respect to subset access speed, using the following script:

Details
import re
import sys
from argparse import ArgumentParser
from typing import List, Optional, Type

from timeit import timeit

from datumaro.components.dataset import Dataset, StreamDataset
from datumaro.components.environment import Environment


def parse_dataset_pathspec(s: str, env: Optional[Environment] = None, *, dataset_class: Type[Dataset] = Dataset) -> Dataset:
    """
    Parses Dataset paths. The syntax is:
        - <dataset path>[ :<format> ]

    Returns: a dataset from the parsed path
    """

    match = re.fullmatch(
        r"""
        (?P<dataset_path>(?: [^:] | :[/\\] )+)
        (:(?P<format>.+))?
        """,
        s,
        flags=re.VERBOSE,
    )
    if not match:
        raise ValueError("Failed to recognize dataset pathspec in '%s'" % s)
    match = match.groupdict()

    path = match["dataset_path"]
    format = match["format"]
    return dataset_class.import_from(path, format, env=env)

def iterate_subsets(dataset: Dataset):
    c = 0

    for subset_name, subset in dataset.subsets().items():
        for item in subset:
            c += 1

    # for item in dataset:
    #     c += 1

    # print(c)

def main(args: Optional[List[str]] = None) -> int:
    parser = ArgumentParser()
    parser.add_argument("src_path")
    parsed_args = parser.parse_args(args)

    input_dataset = parse_dataset_pathspec(parsed_args.src_path, dataset_class=StreamDataset)
    print(input_dataset.subsets())

    n = 1
    time = timeit(
        "iterate_subsets(input_dataset)",
        globals={**globals(), **locals()},
        number=n,
    )
    print(f"Time taken to iterate subsets: {time:.4f} seconds")
    print(f"Average time per iteration: {time / n:.4f} seconds")

    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
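As a side note, the pathspec regex used in the script can be exercised standalone. A minimal sketch (the Windows-style path below is a made-up example, not from the benchmark) showing that drive-letter colons followed by a slash stay part of the path:

```python
# Standalone check of the pathspec regex from the script above.
import re

PATHSPEC = re.compile(
    r"""
    (?P<dataset_path>(?: [^:] | :[/\\] )+)
    (:(?P<format>.+))?
    """,
    flags=re.VERBOSE,
)

def split_pathspec(s):
    """Split '<dataset path>[ :<format> ]' into (path, format-or-None)."""
    m = PATHSPEC.fullmatch(s)
    if not m:
        raise ValueError(f"Failed to recognize dataset pathspec in '{s}'")
    return m["dataset_path"], m["format"]

print(split_pathspec("datasets/coco/:coco_instances"))
# A colon followed by a slash (drive letter) is not treated as the format separator:
print(split_pathspec(r"C:\data\coco:coco_instances"))
```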

While testing it on COCO annotations (two subsets, train2017 and val2017, in one directory), I got the following results:

$  python test_stream_iter.py "datasets/coco/:coco_instances"

Time taken to iterate subsets: 196.3650 seconds
Average time per iteration: 39.2730 seconds

My observations:

  • COCO's extractor stream_items() and get_dataset_item() should include page_mapper access in the lazy annotation loading
  • computing annotation types for a COCO subset defeats the lazy annotation loading optimizations, as annotations are loaded as soon as each item is iterated
  • simple iteration over a COCO dataset takes about 3.5 s (without annotation types and annotation loading); with many subsets, this cost is multiplied by the number of subsets. Maybe the changes from Remove KEEPS_SUBSETS_INTACT and extra logic for streaming subsets #102 should be revisited, and direct subset access restored when the source extractor provides it (as opposed to iterating over the whole dataset and filtering).
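The last point can be illustrated with a plain-Python sketch (made-up data, not the datumaro API): filtering the full stream once per subset scales with the number of subsets, while direct subset access touches each item only once.

```python
# Hypothetical sketch of the cost difference between per-subset filtering
# over the full stream and direct subset access.
from collections import defaultdict

# Stand-in for a streamed dataset: (subset_name, item) pairs.
items = [("train2017", i) for i in range(1000)] + [("val2017", i) for i in range(500)]
subset_names = ["train2017", "val2017"]

# Filtering approach: each subset re-reads the whole stream, O(subsets * items).
reads_filtering = 0
for name in subset_names:
    for subset, _item in items:
        reads_filtering += 1  # every item is touched once per requested subset

# Direct access: the source yields only the requested subset's items, O(items).
by_subset = defaultdict(list)
for subset, item in items:
    by_subset[subset].append(item)
reads_direct = sum(len(v) for v in by_subset.values())

print(reads_filtering, reads_direct)  # 3000 vs 1500 reads
```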
