
Commit e54417c

feat: add LMDB support for multimodal resources
- Implement LMDB database integration for efficient loading of large multimodal datasets
- Add caching mechanism for LMDB environments and transactions to improve performance
- Update documentation with LMDB usage examples for both Chinese and English
- Update type hints in vision_utils.py to reflect new functionality
- Add graceful handling for environments without LMDB installed
1 parent fa3d2d6 commit e54417c

3 files changed: +57 -4 lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 8 additions & 2 deletions
@@ -136,20 +136,26 @@ alpaca format:

 ### Multimodal

-Multimodal datasets use the same format as the tasks above. The difference is the additional keys `images`, `videos`, and `audios`, which hold the URLs or paths of the multimodal resources (absolute paths are recommended). The tags `<image>`, `<video>`, and `<audio>` mark where an image, video, or audio clip is inserted, and ms-swift supports multiple images, videos, and audio clips. These special tokens are replaced during preprocessing; see [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The four examples below show the data format for plain text and for data containing images, videos, and audio.
+Multimodal datasets use the same format as the tasks above. The difference is the additional keys `images`, `videos`, and `audios`, which hold the URLs or paths of the multimodal resources (absolute paths are recommended). The tags `<image>`, `<video>`, and `<audio>` mark where an image, video, or audio clip is inserted, and ms-swift supports multiple images, videos, and audio clips. These special tokens are replaced during preprocessing; see [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below show the data format for plain text and for data containing images, videos, and audio.
+
+SWIFT supports loading multimodal resources from LMDB databases, using the format `lmdb://key@path_to_lmdb`. This is effective for storing and accessing large numbers of images, videos, audio files, and other resources, and is especially suited to handling large-scale multimodal datasets during training and inference. Make sure LMDB is installed before use: `pip install lmdb`

 Pre-training:
-```
+```jsonl
 {"messages": [{"role": "assistant", "content": "Pre-training text goes here"}]}
 {"messages": [{"role": "assistant", "content": "<image>is a puppy, <image>is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<image>is a rabbit loaded from LMDB"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "assistant", "content": "<audio>describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
 {"messages": [{"role": "assistant", "content": "<image>is an elephant, <video>is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "assistant", "content": "<video>shows galaxies in space"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
 ```

 Fine-tuning:
 ```jsonl
 {"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
 {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<image>What is this animal?"}, {"role": "assistant", "content": "This is a brown panda, a very rare species."}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
+{"messages": [{"role": "user", "content": "<image>and<image>What is the difference between these two animals?"}, {"role": "assistant", "content": "The first image is a tiger, and the second image is a lion."}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
 {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
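Every example in this file pairs each `<image>`, `<video>`, or `<audio>` tag with exactly one entry in the matching key. A quick sanity check for that pairing could look like the sketch below. It is not part of the commit: `check_row` and `TAG_TO_KEY` are hypothetical names, and the one-tag-per-resource rule is an assumption drawn from the examples, so the library itself may handle mismatches differently.

```python
import json
from typing import List

# Placeholder tag -> dataset key that holds the matching resources.
TAG_TO_KEY = {'<image>': 'images', '<video>': 'videos', '<audio>': 'audios'}


def check_row(line: str) -> List[str]:
    """Return mismatch descriptions for one JSONL line (empty list if consistent)."""
    row = json.loads(line)
    # Join all message contents; only plain-string contents are considered.
    text = ''.join(m['content'] for m in row['messages'] if isinstance(m.get('content'), str))
    problems = []
    for tag, key in TAG_TO_KEY.items():
        n_tags = text.count(tag)
        n_resources = len(row.get(key, []))
        if n_tags != n_resources:
            problems.append(f'{tag}: {n_tags} tag(s) but {n_resources} item(s) in "{key}"')
    return problems


good = ('{"messages": [{"role": "user", "content": "<image>What is this animal?"}], '
        '"images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}')
bad = ('{"messages": [{"role": "user", "content": "<image><image>What is the difference?"}], '
       '"images": ["/xxx/x.jpg"]}')

print(check_row(good))  # []
print(check_row(bad))   # ['<image>: 2 tag(s) but 1 item(s) in "images"']
```

The check treats `lmdb://` references like any other path, since they occupy the same slots in `images`, `videos`, and `audios`.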

docs/source_en/Customization/Custom-dataset.md

Lines changed: 6 additions & 1 deletion
@@ -143,22 +143,27 @@ Please refer to [Reranker training document](../BestPractices/Reranker.md#datase

 ### Multimodal

-For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The four examples below respectively demonstrate the data format for plain text, as well as formats containing image, video, and audio data.
+For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below demonstrate the data format for plain text, as well as formats containing image, video, and audio data.

+SWIFT supports loading multimodal resources from LMDB databases using the format `lmdb://key@path_to_lmdb`. This is highly effective for storing and accessing large collections of images, videos, audio files, and other resources, especially when training and inferencing with large-scale multimodal datasets. Make sure to install LMDB first: `pip install lmdb`.

 Pre-training:
 ```jsonl
 {"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
 {"messages": [{"role": "assistant", "content": "<image>is a puppy, <image>is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<image>is a rabbit loaded from LMDB"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "assistant", "content": "<audio>describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
 {"messages": [{"role": "assistant", "content": "<image>is an elephant, <video>is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "assistant", "content": "<video>shows galaxies in space"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
 ```

 Supervised Fine-tuning:

 ```jsonl
 {"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
 {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<image>What is this animal?"}, {"role": "assistant", "content": "This is a brown panda, a very rare species."}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
+{"messages": [{"role": "user", "content": "<image>and<image>What's the difference between these two animals?"}, {"role": "assistant", "content": "The first image is a tiger, and the second image is a lion."}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
 {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
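The documentation shows how to reference an LMDB entry but not how such a database might be produced. The sketch below is one possible way to do it, not something this commit ships. `make_lmdb_uri`, `write_resources`, and `sft_row` are hypothetical helpers, and the 1 GiB `map_size` is an arbitrary choice, since py-lmdb's default map size of 10 MiB is too small for most media collections. Keys are UTF-8 encoded and values are the raw file bytes, because the loader in this commit calls `key.encode()` and wraps the stored value in `BytesIO`.

```python
import json
from typing import Dict, List

try:
    import lmdb  # optional dependency: pip install lmdb
except ImportError:
    lmdb = None


def make_lmdb_uri(key: str, lmdb_dir: str) -> str:
    """Build an `lmdb://key@path_to_lmdb` reference, rejecting ambiguous parts."""
    # The loader splits on the first '@' and rejects any further '@',
    # so neither the key nor the directory may contain one.
    if not key or not lmdb_dir or '@' in key or '@' in lmdb_dir:
        raise ValueError('key and lmdb_dir must be non-empty and must not contain "@"')
    return f'lmdb://{key}@{lmdb_dir}'


def write_resources(lmdb_dir: str, items: Dict[str, bytes], map_size: int = 1 << 30) -> None:
    """Store raw file bytes under UTF-8 encoded keys (hypothetical helper)."""
    if lmdb is None:
        raise ImportError("Please install it with 'pip install lmdb'.")
    env = lmdb.open(lmdb_dir, map_size=map_size)
    try:
        # The write transaction commits when the `with` block exits cleanly.
        with env.begin(write=True) as txn:
            for key, payload in items.items():
                txn.put(key.encode(), payload)
    finally:
        env.close()


def sft_row(question: str, answer: str, image_uris: List[str]) -> str:
    """Serialize one fine-tuning sample with one <image> tag per resource."""
    row = {
        'messages': [
            {'role': 'user', 'content': '<image>' * len(image_uris) + question},
            {'role': 'assistant', 'content': answer},
        ],
        'images': image_uris,
    }
    return json.dumps(row, ensure_ascii=False)


uri = make_lmdb_uri('tiger_img', '/path/to/animals_lmdb')
print(sft_row('What is this animal?', 'This is a tiger.', [uri]))
```

A call such as `write_resources('/path/to/animals_lmdb', {'tiger_img': open('tiger.jpg', 'rb').read()})` would populate the database before training. The paths and file names here are placeholders.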

swift/llm/template/vision_utils.py

Lines changed: 43 additions & 1 deletion
@@ -4,7 +4,7 @@
 import os
 import re
 from io import BytesIO
-from typing import Any, Callable, List, TypeVar, Union
+from typing import Any, Callable, Dict, List, Optional, TypeVar, Union

 import numpy as np
 import requests
@@ -13,6 +13,13 @@

 from swift.utils import get_env_args

+# Try to import lmdb, but don't fail if it's not available
+try:
+    import lmdb
+    LMDB_AVAILABLE = True
+except ImportError:
+    LMDB_AVAILABLE = False
+
 # >>> internvl
 IMAGENET_MEAN = (0.485, 0.456, 0.406)
 IMAGENET_STD = (0.229, 0.224, 0.225)
@@ -99,6 +106,9 @@ def rescale_image(img: Image.Image, max_pixels: int) -> Image.Image:

 _T = TypeVar('_T')

+# Cache for LMDB environments and read transactions to avoid reopening
+_LMDB_ENV_CACHE: Dict[str, Any] = {}
+_LMDB_TXN_CACHE: Dict[str, Any] = {}

 def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
     res = path
@@ -111,6 +121,38 @@ def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
                 request_kwargs['timeout'] = timeout
             content = requests.get(path, **request_kwargs).content
             res = BytesIO(content)
+        elif path.startswith('lmdb://'):
+            if not LMDB_AVAILABLE:
+                raise ImportError(
+                    "LMDB support requires the 'lmdb' package to be installed. "
+                    "Please install it with 'pip install lmdb'."
+                )
+            # Parse LMDB path format: lmdb://key@path_to_lmdb
+            _, _, lmdb_url = path.partition('lmdb://')
+            key, sep, lmdb_dir = lmdb_url.partition('@')
+
+            # Verify format validity with a single check
+            if not sep or not key or not lmdb_dir or '@' in lmdb_dir:
+                raise ValueError("LMDB path must be in format: lmdb://key@path_to_lmdb (with exactly one '@')")
+
+            # Use cached environment or create a new one
+            env = _LMDB_ENV_CACHE.get(lmdb_dir)
+            if env is None:
+                env = lmdb.open(lmdb_dir, readonly=True, lock=False, max_readers=1024, max_spare_txns=2)
+                _LMDB_ENV_CACHE[lmdb_dir] = env
+
+            # Get or create read transaction
+            txn = _LMDB_TXN_CACHE.get(lmdb_dir)
+            if txn is None:
+                txn = env.begin(write=False)
+                _LMDB_TXN_CACHE[lmdb_dir] = txn
+
+            # Get data using the cached transaction
+            encoded_key = key.encode()
+            data = txn.get(encoded_key)
+            if data is None:
+                raise KeyError(f"Key '{key}' not found in LMDB at '{lmdb_dir}'")
+            res = BytesIO(data)
         elif os.path.exists(path) or (not path.startswith('data:') and len(path) <= 200):
             path = os.path.abspath(os.path.expanduser(path))
             with open(path, 'rb') as f:
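
The parsing and per-directory caching in the new `lmdb://` branch can be exercised without an actual database. The sketch below mirrors that logic under stated assumptions: `parse_lmdb_path`, `load_from_store`, `_STORE_CACHE`, and `fake_open` are hypothetical names, and a plain dict stands in for the cached LMDB environment and read transaction. It is an illustration of the control flow, not the code from `vision_utils.py`.

```python
from io import BytesIO
from typing import Callable, Dict, List, Mapping, Tuple

PREFIX = 'lmdb://'


def parse_lmdb_path(path: str) -> Tuple[str, str]:
    """Split `lmdb://key@path_to_lmdb` into (key, lmdb_dir)."""
    if not path.startswith(PREFIX):
        raise ValueError(f"Not an LMDB path: '{path}'")
    # Split on the first '@'; any further '@' in the directory part is rejected.
    key, sep, lmdb_dir = path[len(PREFIX):].partition('@')
    if not sep or not key or not lmdb_dir or '@' in lmdb_dir:
        raise ValueError("LMDB path must be in format: lmdb://key@path_to_lmdb (with exactly one '@')")
    return key, lmdb_dir


# One opened store per directory, analogous to the env/txn caches in the diff.
_STORE_CACHE: Dict[str, Mapping[bytes, bytes]] = {}


def load_from_store(path: str, open_store: Callable[[str], Mapping[bytes, bytes]]) -> BytesIO:
    """Look up a key, opening each directory at most once."""
    key, lmdb_dir = parse_lmdb_path(path)
    store = _STORE_CACHE.get(lmdb_dir)
    if store is None:
        store = open_store(lmdb_dir)
        _STORE_CACHE[lmdb_dir] = store
    data = store.get(key.encode())
    if data is None:
        raise KeyError(f"Key '{key}' not found in LMDB at '{lmdb_dir}'")
    return BytesIO(data)


opened: List[str] = []


def fake_open(lmdb_dir: str) -> Mapping[bytes, bytes]:
    """Dict-backed stand-in for opening a database; records each open."""
    opened.append(lmdb_dir)
    return {b'tiger_img': b'tiger-bytes', b'lion_img': b'lion-bytes'}


tiger = load_from_store('lmdb://tiger_img@/path/to/animals_lmdb', fake_open)
lion = load_from_store('lmdb://lion_img@/path/to/animals_lmdb', fake_open)
print(tiger.getvalue(), lion.getvalue(), opened)
# b'tiger-bytes' b'lion-bytes' ['/path/to/animals_lmdb']
```

Two consequences of the format follow directly from the split: a key that contains `@` cannot be addressed, and a database directory whose path contains `@` is rejected.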
