Commit 6992923

[Enhancement] Discard deprecated lmdb dataset format and only support img+label now (#1681)
* [Enhance] Discard deprecated lmdb dataset format and only support img+label now
* rename
* update
* add ut
* update document
* update docs
* update test
* update test
* Update dataset.md

Co-authored-by: liukuikun
1 parent b64565c commit 6992923

7 files changed, +108 -387 lines changed


docs/en/migration/dataset.md

Lines changed: 16 additions & 28 deletions
@@ -230,38 +230,26 @@ Specifically, we provide three dataset classes [IcdarDataset](mmocr.datasets.Icd
        parser_cfg=dict(
            type='LineJsonParser',
            keys=['filename', 'text'],
-       pipeline=[])
+       pipeline=[]))
    ```
 
-3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports LMDB format annotations for text recognition. You just need to add a new dataset config to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following example shows how to configure and load the **label-only lmdb** `label.lmdb` from the toy dataset.
-
-   ```python
-   data_root = 'tests/data/rec_toy_dataset/'
-
-   lmdb_dataset = dict(
-       type='RecogLMDBDataset',
-       data_root=data_root,
-       ann_file='label.lmdb',
-       data_prefix=dict(img_path='imgs'),
-       pipeline=[])
-   ```
+3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports LMDB format datasets (img+labels) for text recognition. You just need to add a new dataset config to `configs/textrecog/_base_/datasets` and specify its dataset type as `RecogLMDBDataset`. For example, the following shows how to configure and load `imgs.lmdb`, which contains **both labels and images**, from the toy dataset.
 
-   When the `lmdb` file contains **both labels and images**, in addition to setting the dataset type to `RecogLMDBDataset` as in the above example, you also need to replace the [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) with [`LoadImageFromLMDB`](mmocr.datasets.transforms.LoadImageFromLMDB) in the data pipelines.
+   - Set the dataset type to `RecogLMDBDataset`:
 
-   ```python
-   # Specify the dataset type as RecogLMDBDataset
-   data_root = 'tests/data/rec_toy_dataset/'
+     ```python
+     # Specify the dataset type as RecogLMDBDataset
+     data_root = 'tests/data/rec_toy_dataset/'
 
-   lmdb_dataset = dict(
-       type='RecogLMDBDataset',
-       data_root=data_root,
-       ann_file='imgs.lmdb',
-       data_prefix=dict(img_path='imgs.lmdb'), # setting the img_path as the lmdb name
-       pipeline=[])
-   ```
+     lmdb_dataset = dict(
+         type='RecogLMDBDataset',
+         data_root=data_root,
+         ann_file='imgs.lmdb',
+         pipeline=None)
+     ```
 
-   Also, replacing the image loading transforms in `train_pipeline` and `test_pipeline`, for example:
+   - Replace [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) with [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray) in `train_pipeline` and `test_pipeline`, for example:
 
-   ```python
-   train_pipeline = [dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True)]
-   ```
+     ```python
+     train_pipeline = [dict(type='LoadImageFromNDArray')]
+     ```
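
Reviewer note: taken together, the two steps in the updated English docs amount to a config roughly like the one below. This is a minimal sketch using plain dicts and the toy dataset paths from the diff. Passing `train_pipeline` directly to the dataset is a simplification (the docs set `pipeline=None` in the base dataset config), and `uses_file_loader` is a hypothetical helper added only to check that no file-based loader remains.

```python
# Sketch of a dataset config after this commit (plain dicts, no MMOCR import).
data_root = 'tests/data/rec_toy_dataset/'

# The dataset now decodes the image itself, so the pipeline starts from an
# in-memory array rather than from a file path.
train_pipeline = [dict(type='LoadImageFromNDArray')]

lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root=data_root,
    ann_file='imgs.lmdb',
    pipeline=train_pipeline)  # simplification: the docs use pipeline=None here


def uses_file_loader(pipeline):
    """Hypothetical check: True if any step still loads images from disk/LMDB."""
    old_loaders = ('LoadImageFromFile', 'LoadImageFromLMDB')
    return any(step.get('type') in old_loaders for step in pipeline)


print(uses_file_loader(lmdb_dataset['pipeline']))
```

Note that `data_prefix` no longer appears in the config, matching the `+` lines of the hunk.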

docs/zh_cn/migration/dataset.md

Lines changed: 4 additions & 18 deletions
@@ -232,20 +232,7 @@ python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH
        pipeline=[])
    ```
 
-3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports the `LMDB` annotation format used for text recognition in version 0.x. Simply add a new dataset config file under `configs/textrecog/_base_/datasets` and set its dataset type to `RecogLMDBDataset`. For example, the following shows how to configure and load `label.lmdb` from the toy dataset; this `lmdb` file **contains labels only**.
-
-   ```python
-   data_root = 'tests/data/rec_toy_dataset/'
-
-   lmdb_dataset = dict(
-       type='RecogLMDBDataset',
-       data_root=data_root,
-       ann_file='label.lmdb',
-       data_prefix=dict(img_path='imgs'),
-       pipeline=[])
-   ```
-
-   When the `lmdb` file contains both labels and images, in addition to setting the dataset type to `RecogLMDBDataset`, the image loading transform in the data pipeline must also be changed from [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile) to [`LoadImageFromLMDB`](mmocr.datasets.transforms.LoadImageFromLMDB).
+3. [RecogLMDBDataset](mmocr.datasets.RecogLMDBDataset) supports the **image + text** `LMDB` annotation format used for text recognition in version 0.x. Simply add a new dataset config file under `configs/textrecog/_base_/datasets` and set its dataset type to `RecogLMDBDataset`. For example, the following shows how to configure and load `imgs.lmdb` from the toy dataset; this `lmdb` file **contains both labels and images**.
 
    ```python
    # Set the dataset type to RecogLMDBDataset
@@ -255,12 +242,11 @@ python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH
        type='RecogLMDBDataset',
        data_root=data_root,
        ann_file='imgs.lmdb',
-       data_prefix=dict(img_path='imgs.lmdb'), # set img_path to the lmdb file name
-       pipeline=[])
+       pipeline=None)
    ```
 
-   The data loading transforms in `train_pipeline` and `test_pipeline` also need to be replaced:
+   The data loading transforms in `train_pipeline` and `test_pipeline`, such as [`LoadImageFromFile`](mmocr.datasets.transforms.LoadImageFromFile), also need to be replaced with [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray):
 
    ```python
-   train_pipeline = [dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True)]
+   train_pipeline = [dict(type='LoadImageFromNDArray')]
    ```

mmocr/datasets/recog_lmdb_dataset.py

Lines changed: 76 additions & 96 deletions
@@ -1,37 +1,35 @@
 # Copyright (c) OpenMMLab. All rights reserved.
-import json
-import os.path as osp
-import warnings
-from typing import Callable, List, Optional, Sequence, Union
+from typing import Any, Callable, List, Optional, Sequence, Tuple, Union
 
+import mmcv
 from mmengine.dataset import BaseDataset
-from mmengine.utils import is_abs
 
-from mmocr.registry import DATASETS, TASK_UTILS
+from mmocr.registry import DATASETS
 
 
 @DATASETS.register_module()
 class RecogLMDBDataset(BaseDataset):
     r"""RecogLMDBDataset for text recognition.
 
-    The annotation format should be in lmdb format. We support two lmdb
-    formats, one is the lmdb file with only labels generated by txt2lmdb
-    (deprecated), and another one is the lmdb file generated by recog2lmdb.
+    The annotation format should be in lmdb format. The lmdb file should
+    contain three keys: 'num-samples', 'label-xxxxxxxxx' and 'image-xxxxxxxxx',
+    where 'xxxxxxxxx' is the index of the image. The value of 'num-samples' is
+    the total number of images. The value of 'label-xxxxxxx' is the text label
+    of the image, and the value of 'image-xxxxxxx' is the image data.
 
-    The former format stores string in `filename text` format directly in lmdb,
-    while the latter uses `image_key` as well as `label_key` for querying.
+    Each item fetched from this dataset will be a dict containing the
+    following keys:
+
+    - img (ndarray): The loaded image.
+    - img_path (str): The image key.
+    - instances (list[dict]): The list of annotations for the image.
 
     Args:
         ann_file (str): Annotation file path. Defaults to ''.
-        parse_cfg (dict, optional): Config of parser for parsing annotations.
-            Use ``LineJsonParser`` when the annotation file is in jsonl format
-            with keys of ``filename`` and ``text``. The keys in parse_cfg
-            should be consistent with the keys in jsonl annotations. The first
-            key in parse_cfg should be the key of the path in jsonl
-            annotations. The second key in parse_cfg should be the key of the
-            text in jsonl Use ``LineStrParser`` when the annotation file is in
-            txt format. Defaults to
-            ``dict(type='LineJsonParser', keys=['filename', 'text'])``.
+        img_color_type (str): The flag argument for :func:``mmcv.imfrombytes``,
+            which determines how the image bytes will be parsed. Defaults to
+            'color'.
         metainfo (dict, optional): Meta information for dataset, such as class
             information. Defaults to None.
         data_root (str): The root directory for ``data_prefix`` and
@@ -60,50 +58,21 @@ class RecogLMDBDataset(BaseDataset):
             image. Defaults to 1000.
     """
 
-    def __init__(self,
-                 ann_file: str = '',
-                 parser_cfg: Optional[dict] = dict(
-                     type='LineJsonParser', keys=['filename', 'text']),
-                 metainfo: Optional[dict] = None,
-                 data_root: Optional[str] = '',
-                 data_prefix: dict = dict(img_path=''),
-                 filter_cfg: Optional[dict] = None,
-                 indices: Optional[Union[int, Sequence[int]]] = None,
-                 serialize_data: bool = True,
-                 pipeline: List[Union[dict, Callable]] = [],
-                 test_mode: bool = False,
-                 lazy_init: bool = False,
-                 max_refetch: int = 1000) -> None:
-        if parser_cfg['type'] != 'LineJsonParser':
-            raise ValueError('We only support using LineJsonParser '
-                             'to parse lmdb file. Please use LineJsonParser '
-                             'in the dataset config')
-        self.parser = TASK_UTILS.build(parser_cfg)
-        self.ann_file = ann_file
-        self.deprecated_format = False
-        env = self._get_env(root=data_root)
-        with env.begin(write=False) as txn:
-            try:
-                self.total_number = int(
-                    txn.get(b'num-samples').decode('utf-8'))
-            except AttributeError:
-                warnings.warn(
-                    'DeprecationWarning: The lmdb dataset generated with '
-                    'txt2lmdb will be deprecate, please use the latest '
-                    'tools/data/utils/recog2lmdb to generate lmdb dataset. '
-                    'See https://mmocr.readthedocs.io/en/latest/tools.html#'
-                    'convert-text-recognition-dataset-to-lmdb-format for '
-                    'details.', UserWarning)
-                self.total_number = int(
-                    txn.get(b'total_number').decode('utf-8'))
-                self.deprecated_format = True
-            # The lmdb file may contain only the label, or it may contain both
-            # the label and the image, so we use image_key here for probing.
-            image_key = f'image-{1:09d}'
-            if txn.get(image_key.encode('utf-8')) is None:
-                self.label_only = True
-            else:
-                self.label_only = False
+    def __init__(
+        self,
+        ann_file: str = '',
+        img_color_type: str = 'color',
+        metainfo: Optional[dict] = None,
+        data_root: Optional[str] = '',
+        data_prefix: dict = dict(img_path=''),
+        filter_cfg: Optional[dict] = None,
+        indices: Optional[Union[int, Sequence[int]]] = None,
+        serialize_data: bool = True,
+        pipeline: List[Union[dict, Callable]] = [],
+        test_mode: bool = False,
+        lazy_init: bool = False,
+        max_refetch: int = 1000,
+    ) -> None:
 
         super().__init__(
             ann_file=ann_file,
@@ -118,40 +87,34 @@ def __init__(self,
             lazy_init=lazy_init,
             max_refetch=max_refetch)
 
+        self.color_type = img_color_type
+
     def load_data_list(self) -> List[dict]:
         """Load annotations from an annotation file named as ``self.ann_file``
 
         Returns:
             List[dict]: A list of annotation.
         """
         if not hasattr(self, 'env'):
-            self.env = self._get_env()
+            self._make_env()
+            with self.env.begin(write=False) as txn:
+                self.total_number = int(
+                    txn.get(b'num-samples').decode('utf-8'))
 
         data_list = []
         with self.env.begin(write=False) as txn:
             for i in range(self.total_number):
-                if self.deprecated_format:
-                    line = txn.get(str(i).encode('utf-8')).decode('utf-8')
-                    filename, text = line.strip('/n').split(' ')
-                    line = json.dumps(
-                        dict(filename=filename, text=text), ensure_ascii=False)
-                else:
-                    i = i + 1
-                    label_key = f'label-{i:09d}'
-                    if self.label_only:
-                        line = txn.get(
-                            label_key.encode('utf-8')).decode('utf-8')
-                    else:
-                        img_key = f'image-{i:09d}'
-                        text = txn.get(
-                            label_key.encode('utf-8')).decode('utf-8')
-                        line = json.dumps(
-                            dict(filename=img_key, text=text),
-                            ensure_ascii=False)
+                idx = i + 1
+                label_key = f'label-{idx:09d}'
+                img_key = f'image-{idx:09d}'
+                text = txn.get(label_key.encode('utf-8')).decode('utf-8')
+                line = [img_key, text]
                 data_list.append(self.parse_data_info(line))
         return data_list
 
-    def parse_data_info(self, raw_anno_info: str) -> Union[dict, List[dict]]:
+    def parse_data_info(self,
+                        raw_anno_info: Tuple[Optional[str],
+                                             str]) -> Union[dict, List[dict]]:
         """Parse raw annotation to target format.
 
         Args:
@@ -162,16 +125,32 @@ def parse_data_info(self, raw_anno_info: str) -> Union[dict, List[dict]]:
             (dict): Parsed annotation.
         """
         data_info = {}
-        parsed_anno = self.parser(raw_anno_info)
-        img_path = osp.join(self.data_prefix['img_path'],
-                            parsed_anno[self.parser.keys[0]])
-
-        data_info['img_path'] = img_path
-        data_info['instances'] = [dict(text=parsed_anno[self.parser.keys[1]])]
+        img_key, text = raw_anno_info
+        data_info['img_path'] = img_key
+        data_info['instances'] = [dict(text=text)]
         return data_info
 
-    def _get_env(self, root=''):
-        """Get lmdb environment from self.ann_file.
+    def prepare_data(self, idx) -> Any:
+        """Get data processed by ``self.pipeline``.
+
+        Args:
+            idx (int): The index of ``data_info``.
+
+        Returns:
+            Any: Depends on ``self.pipeline``.
+        """
+        data_info = self.get_data_info(idx)
+        with self.env.begin(write=False) as txn:
+            img_bytes = txn.get(data_info['img_path'].encode('utf-8'))
+            if img_bytes is None:
+                return None
+            data_info['img'] = mmcv.imfrombytes(
+                img_bytes, flag=self.color_type)
+        return self.pipeline(data_info)
+
+    def _make_env(self):
+        """Create lmdb environment from self.ann_file and save it to
+        ``self.env``.
 
         Returns:
             Lmdb environment.
@@ -181,10 +160,11 @@ def _get_env(self, root=''):
         except ImportError:
             raise ImportError(
                 'Please install lmdb to enable RecogLMDBDataset.')
-        lmdb_path = self.ann_file if is_abs(self.ann_file) else osp.join(
-            root, self.ann_file)
-        return lmdb.open(
-            lmdb_path,
+        if hasattr(self, 'env'):
+            return
+
+        self.env = lmdb.open(
+            self.ann_file,
             max_readers=1,
             readonly=True,
             lock=False,
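
Reviewer note: the key layout described in the new docstring (`num-samples`, `label-<9 digits>`, `image-<9 digits>`, 1-based) can be tried without an LMDB file. The sketch below mirrors the key construction in the new `load_data_list` and `parse_data_info`, but reads from a plain `dict` standing in for the LMDB transaction. `build_fake_store` and the sample bytes are hypothetical and not part of MMOCR.

```python
from typing import Dict, List, Tuple


def build_fake_store(samples: List[Tuple[bytes, str]]) -> Dict[bytes, bytes]:
    """Hypothetical stand-in for an img+label LMDB file (a dict of bytes)."""
    store = {b'num-samples': str(len(samples)).encode('utf-8')}
    for idx, (img_bytes, text) in enumerate(samples, start=1):
        store[f'image-{idx:09d}'.encode('utf-8')] = img_bytes
        store[f'label-{idx:09d}'.encode('utf-8')] = text.encode('utf-8')
    return store


def load_data_list(store: Dict[bytes, bytes]) -> List[dict]:
    """Build the data list the same way the new dataset builds its keys."""
    total_number = int(store[b'num-samples'].decode('utf-8'))
    data_list = []
    for i in range(total_number):
        idx = i + 1  # keys in the store are 1-based
        label_key = f'label-{idx:09d}'
        img_key = f'image-{idx:09d}'
        text = store[label_key.encode('utf-8')].decode('utf-8')
        # 'img_path' holds the image key, not a filesystem path.
        data_list.append(dict(img_path=img_key, instances=[dict(text=text)]))
    return data_list


store = build_fake_store([(b'\x00', 'hello'), (b'\x01', 'world')])
data_list = load_data_list(store)
print(data_list[0]['img_path'])
```

In the real class the image bytes are fetched lazily in `prepare_data` using the stored `img_path` key, which the sketch's `store[key]` lookup imitates.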

mmocr/datasets/transforms/__init__.py

Lines changed: 5 additions & 6 deletions
@@ -1,9 +1,8 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 from .adapters import MMDet2MMOCR, MMOCR2MMDet
 from .formatting import PackKIEInputs, PackTextDetInputs, PackTextRecogInputs
-from .loading import (LoadImageFromFile, LoadImageFromLMDB,
-                      LoadImageFromNDArray, LoadKIEAnnotations,
-                      LoadOCRAnnotations)
+from .loading import (LoadImageFromFile, LoadImageFromNDArray,
+                      LoadKIEAnnotations, LoadOCRAnnotations)
 from .ocr_transforms import (FixInvalidPolygon, RandomCrop, RandomRotate,
                              RemoveIgnored, Resize)
 from .textdet_transforms import (BoundedScaleAspectJitter, RandomFlip,
@@ -21,7 +20,7 @@
     'PackTextRecogInputs', 'RescaleToHeight', 'PadToWidth',
     'ShortScaleAspectJitter', 'RandomFlip', 'BoundedScaleAspectJitter',
     'PackKIEInputs', 'LoadKIEAnnotations', 'FixInvalidPolygon', 'MMDet2MMOCR',
-    'MMOCR2MMDet', 'LoadImageFromLMDB', 'LoadImageFromFile',
-    'LoadImageFromNDArray', 'CropHeight', 'TextRecogGeneralAug',
-    'ImageContentJitter', 'ReversePixels', 'RemoveIgnored', 'ConditionApply'
+    'MMOCR2MMDet', 'LoadImageFromFile', 'LoadImageFromNDArray', 'CropHeight',
+    'TextRecogGeneralAug', 'ImageContentJitter', 'ReversePixels',
+    'RemoveIgnored', 'ConditionApply'
 ]
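
Reviewer note: since `LoadImageFromLMDB` is no longer exported, existing configs that name it would need updating. The helper below is a hypothetical migration sketch over plain config dicts, not something this commit provides. It swaps the removed transform for `LoadImageFromNDArray` and drops the old arguments. The second pipeline step is illustrative only.

```python
from copy import deepcopy


def migrate_pipeline(pipeline):
    """Hypothetical helper: replace LoadImageFromLMDB steps in a pipeline."""
    migrated = []
    for step in deepcopy(pipeline):  # leave the caller's config untouched
        if step.get('type') == 'LoadImageFromLMDB':
            # Old arguments such as color_type are dropped here; after this
            # commit the dataset's img_color_type controls image decoding.
            step = dict(type='LoadImageFromNDArray')
        migrated.append(step)
    return migrated


old_pipeline = [
    dict(type='LoadImageFromLMDB', color_type='grayscale', ignore_empty=True),
    dict(type='LoadOCRAnnotations', with_text=True),  # illustrative step
]
new_pipeline = migrate_pipeline(old_pipeline)
print(new_pipeline[0])
```

A config that previously relied on `color_type='grayscale'` would presumably pass `img_color_type='grayscale'` to `RecogLMDBDataset` instead, per the new constructor signature in this commit.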
