Revisit subset access in streaming dataset / coco #109

@zhiltsov-max

Description

I checked the efficiency of the COCO optimizations and streaming with respect to subset access speed, using the following script:

Details
import re
import sys
from argparse import ArgumentParser
from typing import List, Optional, Type

from timeit import timeit

from datumaro.components.dataset import Dataset, StreamDataset
from datumaro.components.environment import Environment


def parse_dataset_pathspec(s: str, env: Optional[Environment] = None, *, dataset_class: Type[Dataset] = Dataset) -> Dataset:
    """
    Parses Dataset paths. The syntax is:
        - <dataset path>[ :<format> ]

    Returns: a dataset from the parsed path
    """

    match = re.fullmatch(
        r"""
        (?P<dataset_path>(?: [^:] | :[/\\] )+)
        (:(?P<format>.+))?
        """,
        s,
        flags=re.VERBOSE,
    )
    if not match:
        raise ValueError("Failed to recognize dataset pathspec in '%s'" % s)
    match = match.groupdict()

    path = match["dataset_path"]
    format = match["format"]
    return dataset_class.import_from(path, format, env=env)

def iterate_subsets(dataset: Dataset):
    c = 0

    for subset_name, subset in dataset.subsets().items():
        for item in subset:
            c += 1

    # for item in dataset:
    #     c += 1

    # print(c)

def main(args: Optional[List[str]] = None) -> int:
    parser = ArgumentParser()
    parser.add_argument("src_path")
    parsed_args = parser.parse_args(args)

    input_dataset = parse_dataset_pathspec(parsed_args.src_path, dataset_class=StreamDataset)
    print(input_dataset.subsets())

    n = 1
    time = timeit(
        "iterate_subsets(input_dataset)",
        globals={**globals(), **locals()},
        number=n,
    )
    print(f"Time taken to iterate subsets: {time:.4f} seconds")
    print(f"Average time per iteration: {time / n:.4f} seconds")

    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
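As a side note, the pathspec regex used in the script can be exercised standalone. A minimal sketch (the Windows-style path below is a made-up example, not from the benchmark) showing that drive-letter colons followed by a slash stay part of the path:

```python
# Standalone check of the pathspec regex from the script above.
import re

PATHSPEC = re.compile(
    r"""
    (?P<dataset_path>(?: [^:] | :[/\\] )+)
    (:(?P<format>.+))?
    """,
    flags=re.VERBOSE,
)

def split_pathspec(s):
    """Split '<dataset path>[ :<format> ]' into (path, format-or-None)."""
    m = PATHSPEC.fullmatch(s)
    if not m:
        raise ValueError(f"Failed to recognize dataset pathspec in '{s}'")
    return m["dataset_path"], m["format"]

print(split_pathspec("datasets/coco/:coco_instances"))
# A colon followed by a slash (drive letter) is not treated as the format separator:
print(split_pathspec(r"C:\data\coco:coco_instances"))
```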

While testing it on COCO annotations (two subsets, train2017 and val2017, in one directory), I got the following results:

$  python test_stream_iter.py "datasets/coco/:coco_instances"

Time taken to iterate subsets: 196.3650 seconds
Average time per iteration: 39.2730 seconds

My observations:

  • COCO's extractor stream_items() and get_dataset_item() should include page_mapper access in the lazy annotation loading
  • computing annotation types for a COCO subset defeats the lazy annotation loading optimizations, as annotations are loaded as soon as each item is iterated
  • simple iteration over a COCO dataset takes about 3.5 s (without annotation types and annotation loading); with many subsets, this cost is multiplied by the number of subsets. Maybe the changes from Remove KEEPS_SUBSETS_INTACT and extra logic for streaming subsets #102 should be revisited, and direct subset access restored when the source extractor provides it (as opposed to iterating over the whole dataset and filtering).
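The last point can be illustrated with a plain-Python sketch (made-up data, not the datumaro API): filtering the full stream once per subset scales with the number of subsets, while direct subset access touches each item only once.

```python
# Hypothetical sketch of the cost difference between per-subset filtering
# over the full stream and direct subset access.
from collections import defaultdict

# Stand-in for a streamed dataset: (subset_name, item) pairs.
items = [("train2017", i) for i in range(1000)] + [("val2017", i) for i in range(500)]
subset_names = ["train2017", "val2017"]

# Filtering approach: each subset re-reads the whole stream, O(subsets * items).
reads_filtering = 0
for name in subset_names:
    for subset, _item in items:
        reads_filtering += 1  # every item is touched once per requested subset

# Direct access: the source yields only the requested subset's items, O(items).
by_subset = defaultdict(list)
for subset, item in items:
    by_subset[subset].append(item)
reads_direct = sum(len(v) for v in by_subset.values())

print(reads_filtering, reads_direct)  # 3000 vs 1500 reads
```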
