open-mmlab
diff --git a/docs/en/migration/dataset.md b/docs/en/migration/dataset.md
Lines changed: 16 additions & 28 deletions
diff --git a/docs/zh_cn/migration/dataset.md b/docs/zh_cn/migration/dataset.md
Lines changed: 4 additions & 18 deletions
diff --git a/mmocr/datasets/recog_lmdb_dataset.py b/mmocr/datasets/recog_lmdb_dataset.py
Lines changed: 76 additions & 96 deletions
diff --git a/mmocr/datasets/transforms/__init__.py b/mmocr/datasets/transforms/__init__.py
Lines changed: 5 additions & 6 deletions
@@ -230,38 +230,26 @@ Specifically, we provide three dataset classes [IcdarDataset](mmocr.datasets.Icd
         parser_cfg=dict(
             type='LineJsonParser',
             keys=['filename', 'text'],
-        pipeline=[])
+        pipeline=[]))
    ```
 
-3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports LMDB format annotations for text recognition. You just need to add a new dataset config to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following example shows how to configure and load the **label-only lmdb** `label.lmdb` from the toy dataset.
-
-   ```python
-    data_root = 'tests/data/rec_toy_dataset/'
-
-    lmdb_dataset = dict(
-        type='RecogLMDBDataset',
-        data_root=data_root,
-        ann_file='label.lmdb',
-        data_prefix=dict(img_path='imgs'),
-        pipeline=[])
-   ```
+3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports LMDB-format datasets (images and labels) for text recognition. You just need to add a new dataset config to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following shows how to configure and load `imgs.lmdb`, which contains **both labels and images**, from the toy dataset.
 
-   When the `lmdb` file contains **both labels and images**, in addition to setting the dataset type to `RecogLMDBDataset` as in the above example, you also need to replace the [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) with [`LoadImageFromLMDB`](mmocr.datasets.transforms.LoadImageFromLMDB) in the data pipelines.
+- set the dataset type to `RecogLMDBDataset`
 
-   ```python
-   # Specify the dataset type as RecogLMDBDataset
-    data_root = 'tests/data/rec_toy_dataset/'
+```python
+# Specify the dataset type as RecogLMDBDataset
+ data_root = 'tests/data/rec_toy_dataset/'
 
-    lmdb_dataset = dict(
-        type='RecogLMDBDataset',
-        data_root=data_root,
-        ann_file='imgs.lmdb',
-        data_prefix=dict(img_path='imgs.lmdb'), # setting the img_path as the lmdb name
-        pipeline=[])
-   ```
+ lmdb_dataset = dict(
+     type='RecogLMDBDataset',
+     data_root=data_root,
+     ann_file='imgs.lmdb',
+     pipeline=None)
+```
 
-   Also, replace the image loading transforms in `train_pipeline` and `test_pipeline`, for example:
+- replace [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) with [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray) in `train_pipeline` and `test_pipeline`, for example:
 
-   ```python
-    train_pipeline = [dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True)]
-   ```
+```python
+ train_pipeline = [dict(type='LoadImageFromNDArray')]
+```
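
Taken together, the two documented steps might look like the sketch below. It is only an illustration built from plain Python dicts: `img_color_type` comes from the new `RecogLMDBDataset.__init__` signature in this PR, while the `train_dataset` name and the way the pipeline is attached to the base config are assumptions made for the example.

```python
# Sketch of a recognition dataset config after this change. Paths are the toy
# dataset paths used in the docs diff above.
data_root = 'tests/data/rec_toy_dataset/'

# Base dataset entry: the pipeline is left unset here and attached later.
lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root=data_root,
    ann_file='imgs.lmdb',
    img_color_type='grayscale',  # passed to mmcv.imfrombytes as `flag`
    pipeline=None)

# The dataset now decodes the image bytes itself, so the pipeline starts from
# an in-memory array instead of reading from the LMDB file.
train_pipeline = [
    dict(type='LoadImageFromNDArray'),
    # ... the remaining recognition transforms would follow here
]

# Hypothetical name: a copy of the base entry with the pipeline filled in.
train_dataset = dict(lmdb_dataset, pipeline=train_pipeline)
```

Because `color_type` no longer lives on the loading transform, grayscale decoding is requested on the dataset rather than in the pipeline.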
@@ -232,20 +232,7 @@ python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH
         pipeline=[])
    ```
 
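-3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports the `LMDB` annotation format of 0.x text recognition tasks. You only need to add a new dataset config file to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following shows how to configure and load `label.lmdb` from the toy dataset; this `lmdb` file **contains labels only**.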
-
-   ```python
-    data_root = 'tests/data/rec_toy_dataset/'
-
-    lmdb_dataset = dict(
-        type='RecogLMDBDataset',
-        data_root=data_root,
-        ann_file='label.lmdb',
-        data_prefix=dict(img_path='imgs'),
-        pipeline=[])
-   ```
-
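-   When the `lmdb` file contains both labels and images, in addition to setting the dataset type to `RecogLMDBDataset`, you also need to replace the image loading transform [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) in the data pipeline with [`LoadImageFromLMDB`](mmocr.datasets.transforms.LoadImageFromLMDB).
+3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports the **image + text** `LMDB` annotation format of 0.x text recognition tasks. You only need to add a new dataset config file to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following shows how to configure and load `imgs.lmdb` from the toy dataset; this `lmdb` file **contains both labels and images**.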
 
    ```python
    # Set the dataset type to RecogLMDBDataset
@@ -255,12 +242,11 @@ python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH
         type='RecogLMDBDataset',
         data_root=data_root,
         ann_file='imgs.lmdb',
-        data_prefix=dict(img_path='imgs.lmdb'), # set img_path to the lmdb file name
-        pipeline=[])
+        pipeline=None)
    ```
 
-   You also need to replace the data loading transforms in `train_pipeline` and `test_pipeline`:
+   You also need to replace data loading transforms such as [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) in `train_pipeline` and `test_pipeline` with [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray):
 
    ```python
-    train_pipeline = [dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True)]
+    train_pipeline = [dict(type='LoadImageFromNDArray')]
    ```
@@ -1,37 +1,35 @@
 # Copyright (c) OpenMMLab. All rights reserved.
-import json
-import os.path as osp
-import warnings
-from typing import Callable, List, Optional, Sequence, Union
+from typing import Any, Callable, List, Optional, Sequence, Tuple, Union
 
+import mmcv
 from mmengine.dataset import BaseDataset
-from mmengine.utils import is_abs
 
-from mmocr.registry import DATASETS, TASK_UTILS
+from mmocr.registry import DATASETS
 
 
 @DATASETS.register_module()
 class RecogLMDBDataset(BaseDataset):
     r"""RecogLMDBDataset for text recognition.
 
-    The annotation format should be in lmdb format. We support two lmdb
-    formats, one is the lmdb file with only labels generated by txt2lmdb
-    (deprecated), and another one is the lmdb file generated by recog2lmdb.
+    The annotation format should be in lmdb format. The lmdb file should
+    contain three kinds of keys: 'num-samples', 'label-xxxxxxxxx' and
+    'image-xxxxxxxxx', where 'xxxxxxxxx' is the 1-based index of the image,
+    zero-padded to nine digits. The value of 'num-samples' is the total number
+    of images. The value of 'label-xxxxxxxxx' is the text label of the image,
+    and the value of 'image-xxxxxxxxx' is the image data.
 
-    The former format stores string in `filename text` format directly in lmdb,
-    while the latter uses `image_key` as well as `label_key` for querying.
+    Each item fetched from this dataset will be a dict containing the
+    following keys:
+
+        - img (ndarray): The loaded image.
+        - img_path (str): The image key.
+        - instances (list[dict]): The list of annotations for the image.
 
     Args:
         ann_file (str): Annotation file path. Defaults to ''.
-        parse_cfg (dict, optional): Config of parser for parsing annotations.
-            Use ``LineJsonParser`` when the annotation file is in jsonl format
-            with keys of ``filename`` and ``text``. The keys in parse_cfg
-            should be consistent with the keys in jsonl annotations. The first
-            key in parse_cfg should be the key of the path in jsonl
-            annotations. The second key in parse_cfg should be the key of the
-            text in jsonl Use ``LineStrParser`` when the annotation file is in
-            txt format. Defaults to
-            ``dict(type='LineJsonParser', keys=['filename', 'text'])``.
+        img_color_type (str): The flag argument for :func:`mmcv.imfrombytes`,
+            which determines how the image bytes will be parsed. Defaults to
+            'color'.
         metainfo (dict, optional): Meta information for dataset, such as class
             information. Defaults to None.
         data_root (str): The root directory for ``data_prefix`` and
@@ -60,50 +58,21 @@ class RecogLMDBDataset(BaseDataset):
             image. Defaults to 1000.
     """
 
-    def __init__(self,
-                 ann_file: str = '',
-                 parser_cfg: Optional[dict] = dict(
-                     type='LineJsonParser', keys=['filename', 'text']),
-                 metainfo: Optional[dict] = None,
-                 data_root: Optional[str] = '',
-                 data_prefix: dict = dict(img_path=''),
-                 filter_cfg: Optional[dict] = None,
-                 indices: Optional[Union[int, Sequence[int]]] = None,
-                 serialize_data: bool = True,
-                 pipeline: List[Union[dict, Callable]] = [],
-                 test_mode: bool = False,
-                 lazy_init: bool = False,
-                 max_refetch: int = 1000) -> None:
-        if parser_cfg['type'] != 'LineJsonParser':
-            raise ValueError('We only support using LineJsonParser '
-                             'to parse lmdb file. Please use LineJsonParser '
-                             'in the dataset config')
-        self.parser = TASK_UTILS.build(parser_cfg)
-        self.ann_file = ann_file
-        self.deprecated_format = False
-        env = self._get_env(root=data_root)
-        with env.begin(write=False) as txn:
-            try:
-                self.total_number = int(
-                    txn.get(b'num-samples').decode('utf-8'))
-            except AttributeError:
-                warnings.warn(
-                    'DeprecationWarning: The lmdb dataset generated with '
-                    'txt2lmdb will be deprecate, please use the latest '
-                    'tools/data/utils/recog2lmdb to generate lmdb dataset. '
-                    'See https://mmocr.readthedocs.io/en/latest/tools.html#'
-                    'convert-text-recognition-dataset-to-lmdb-format for '
-                    'details.', UserWarning)
-                self.total_number = int(
-                    txn.get(b'total_number').decode('utf-8'))
-                self.deprecated_format = True
-            # The lmdb file may contain only the label, or it may contain both
-            # the label and the image, so we use image_key here for probing.
-            image_key = f'image-{1:09d}'
-            if txn.get(image_key.encode('utf-8')) is None:
-                self.label_only = True
-            else:
-                self.label_only = False
+    def __init__(
+        self,
+        ann_file: str = '',
+        img_color_type: str = 'color',
+        metainfo: Optional[dict] = None,
+        data_root: Optional[str] = '',
+        data_prefix: dict = dict(img_path=''),
+        filter_cfg: Optional[dict] = None,
+        indices: Optional[Union[int, Sequence[int]]] = None,
+        serialize_data: bool = True,
+        pipeline: List[Union[dict, Callable]] = [],
+        test_mode: bool = False,
+        lazy_init: bool = False,
+        max_refetch: int = 1000,
+    ) -> None:
 
         super().__init__(
             ann_file=ann_file,
@@ -118,40 +87,34 @@ def __init__(self,
             lazy_init=lazy_init,
             max_refetch=max_refetch)
 
+        self.color_type = img_color_type
+
     def load_data_list(self) -> List[dict]:
         """Load annotations from an annotation file named as ``self.ann_file``
 
         Returns:
             List[dict]: A list of annotation.
         """
         if not hasattr(self, 'env'):
-            self.env = self._get_env()
+            self._make_env()
+            with self.env.begin(write=False) as txn:
+                self.total_number = int(
+                    txn.get(b'num-samples').decode('utf-8'))
 
         data_list = []
         with self.env.begin(write=False) as txn:
             for i in range(self.total_number):
-                if self.deprecated_format:
-                    line = txn.get(str(i).encode('utf-8')).decode('utf-8')
-                    filename, text = line.strip('/n').split(' ')
-                    line = json.dumps(
-                        dict(filename=filename, text=text), ensure_ascii=False)
-                else:
-                    i = i + 1
-                    label_key = f'label-{i:09d}'
-                    if self.label_only:
-                        line = txn.get(
-                            label_key.encode('utf-8')).decode('utf-8')
-                    else:
-                        img_key = f'image-{i:09d}'
-                        text = txn.get(
-                            label_key.encode('utf-8')).decode('utf-8')
-                        line = json.dumps(
-                            dict(filename=img_key, text=text),
-                            ensure_ascii=False)
+                idx = i + 1
+                label_key = f'label-{idx:09d}'
+                img_key = f'image-{idx:09d}'
+                text = txn.get(label_key.encode('utf-8')).decode('utf-8')
+                line = (img_key, text)
                 data_list.append(self.parse_data_info(line))
         return data_list
 
-    def parse_data_info(self, raw_anno_info: str) -> Union[dict, List[dict]]:
+    def parse_data_info(self,
+                        raw_anno_info: Tuple[Optional[str],
+                                             str]) -> Union[dict, List[dict]]:
         """Parse raw annotation to target format.
 
         Args:
@@ -162,16 +125,32 @@ def parse_data_info(self, raw_anno_info: str) -> Union[dict, List[dict]]:
             (dict): Parsed annotation.
         """
         data_info = {}
-        parsed_anno = self.parser(raw_anno_info)
-        img_path = osp.join(self.data_prefix['img_path'],
-                            parsed_anno[self.parser.keys[0]])
-
-        data_info['img_path'] = img_path
-        data_info['instances'] = [dict(text=parsed_anno[self.parser.keys[1]])]
+        img_key, text = raw_anno_info
+        data_info['img_path'] = img_key
+        data_info['instances'] = [dict(text=text)]
         return data_info
 
-    def _get_env(self, root=''):
-        """Get lmdb environment from self.ann_file.
+    def prepare_data(self, idx) -> Any:
+        """Get data processed by ``self.pipeline``.
+
+        Args:
+            idx (int): The index of ``data_info``.
+
+        Returns:
+            Any: Depends on ``self.pipeline``.
+        """
+        data_info = self.get_data_info(idx)
+        with self.env.begin(write=False) as txn:
+            img_bytes = txn.get(data_info['img_path'].encode('utf-8'))
+            if img_bytes is None:
+                return None
+            data_info['img'] = mmcv.imfrombytes(
+                img_bytes, flag=self.color_type)
+        return self.pipeline(data_info)
+
+    def _make_env(self):
+        """Create lmdb environment from self.ann_file and save it to
+        ``self.env``.
 
         Returns:
             Lmdb environment.
@@ -181,10 +160,11 @@ def _get_env(self, root=''):
         except ImportError:
             raise ImportError(
                 'Please install lmdb to enable RecogLMDBDataset.')
-        lmdb_path = self.ann_file if is_abs(self.ann_file) else osp.join(
-            root, self.ann_file)
-        return lmdb.open(
-            lmdb_path,
+        if hasattr(self, 'env'):
+            return
+
+        self.env = lmdb.open(
+            self.ann_file,
             max_readers=1,
             readonly=True,
             lock=False,
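
The key layout that the new `load_data_list` relies on can be illustrated without an actual LMDB file. The sketch below uses a plain `dict` as a stand-in for an lmdb read transaction. `build_fake_store` and the placeholder image bytes are hypothetical, and the loop mirrors the logic in this diff rather than calling MMOCR itself.

```python
from typing import Dict, List


def build_fake_store(labels: List[str]) -> Dict[bytes, bytes]:
    """Hypothetical stand-in for an lmdb transaction (bytes -> bytes)."""
    store = {b'num-samples': str(len(labels)).encode('utf-8')}
    for idx, text in enumerate(labels, start=1):  # keys are 1-based
        store[f'label-{idx:09d}'.encode('utf-8')] = text.encode('utf-8')
        # Real files hold encoded image bytes (e.g. JPEG) under this key.
        store[f'image-{idx:09d}'.encode('utf-8')] = b'<encoded image bytes>'
    return store


def load_data_list(store: Dict[bytes, bytes]) -> List[dict]:
    """Mirror of the loop in RecogLMDBDataset.load_data_list above."""
    total_number = int(store.get(b'num-samples').decode('utf-8'))
    data_list = []
    for i in range(total_number):
        idx = i + 1
        label_key = f'label-{idx:09d}'
        img_key = f'image-{idx:09d}'
        text = store.get(label_key.encode('utf-8')).decode('utf-8')
        # Only the image key is stored; the bytes are fetched lazily later.
        data_list.append(dict(img_path=img_key, instances=[dict(text=text)]))
    return data_list


store = build_fake_store(['hello', 'world'])
data_list = load_data_list(store)
```

Note that `img_path` holds the LMDB key rather than a file path. In this PR `prepare_data` looks that key up and decodes the bytes with `mmcv.imfrombytes` before the pipeline runs.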
 
@@ -1,9 +1,8 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from .adapters import MMDet2MMOCR, MMOCR2MMDet
 from .formatting import PackKIEInputs, PackTextDetInputs, PackTextRecogInputs
-from .loading import (LoadImageFromFile, LoadImageFromLMDB,
-                      LoadImageFromNDArray, LoadKIEAnnotations,
-                      LoadOCRAnnotations)
+from .loading import (LoadImageFromFile, LoadImageFromNDArray,
+                      LoadKIEAnnotations, LoadOCRAnnotations)
 from .ocr_transforms import (FixInvalidPolygon, RandomCrop, RandomRotate,
                              RemoveIgnored, Resize)
 from .textdet_transforms import (BoundedScaleAspectJitter, RandomFlip,
@@ -21,7 +20,7 @@
     'PackTextRecogInputs', 'RescaleToHeight', 'PadToWidth',
     'ShortScaleAspectJitter', 'RandomFlip', 'BoundedScaleAspectJitter',
     'PackKIEInputs', 'LoadKIEAnnotations', 'FixInvalidPolygon', 'MMDet2MMOCR',
-    'MMOCR2MMDet', 'LoadImageFromLMDB', 'LoadImageFromFile',
-    'LoadImageFromNDArray', 'CropHeight', 'TextRecogGeneralAug',
-    'ImageContentJitter', 'ReversePixels', 'RemoveIgnored', 'ConditionApply'
+    'MMOCR2MMDet', 'LoadImageFromFile', 'LoadImageFromNDArray', 'CropHeight',
+    'TextRecogGeneralAug', 'ImageContentJitter', 'ReversePixels',
+    'RemoveIgnored', 'ConditionApply'
 ]
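
Since `LoadImageFromLMDB` is no longer exported, existing configs that reference it need updating. The helper below is a hypothetical sketch of that rewrite on plain dicts, not part of MMOCR. It assumes that `color_type` should move to the dataset's `img_color_type`, that `ignore_empty` has no direct replacement and is dropped, and that `data_prefix` can be removed as in the updated docs.

```python
from copy import deepcopy
from typing import List, Tuple


def migrate_lmdb_config(dataset_cfg: dict,
                        pipeline: List) -> Tuple[dict, List]:
    """Hypothetical helper: rewrite an old-style config for this PR."""
    new_cfg = deepcopy(dataset_cfg)
    new_pipeline = []
    for step in deepcopy(pipeline):
        if isinstance(step, dict) and step.get('type') == 'LoadImageFromLMDB':
            # Assumption: the decoding flag now lives on the dataset.
            if 'color_type' in step:
                new_cfg['img_color_type'] = step['color_type']
            # Other options such as ignore_empty are dropped.
            step = dict(type='LoadImageFromNDArray')
        new_pipeline.append(step)
    # The updated docs no longer point data_prefix at the lmdb file.
    new_cfg.pop('data_prefix', None)
    return new_cfg, new_pipeline


old_cfg = dict(
    type='RecogLMDBDataset',
    data_root='tests/data/rec_toy_dataset/',
    ann_file='imgs.lmdb',
    data_prefix=dict(img_path='imgs.lmdb'),
    pipeline=[])
old_pipeline = [
    dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True)
]
new_cfg, new_pipeline = migrate_lmdb_config(old_cfg, old_pipeline)
```

The inputs are deep-copied, so the original config objects stay unchanged and can be compared against the migrated ones.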