feat: add LMDB support for multimodal resources

xiaotinghe · xiaotinghe · commit e54417c726da · 2025-08-20T18:49:53.000Z
- Implement LMDB database integration for efficient loading of large multimodal datasets
- Add caching mechanism for LMDB environments and transactions to improve performance
- Update both the Chinese and English documentation with LMDB usage examples
- Update type hints in vision_utils.py to reflect new functionality
- Add graceful handling for environments without LMDB installed
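
For reviewers, the reference format `lmdb://key@path_to_lmdb` can be sketched in isolation as below. This is a minimal illustration that mirrors the partition-based check added to `load_file`; the helper name `parse_lmdb_uri` and the constant `LMDB_SCHEME` are hypothetical and are not part of this commit, which inlines the logic.

```python
from typing import Tuple

LMDB_SCHEME = "lmdb://"  # hypothetical constant, used only in this sketch


def parse_lmdb_uri(uri: str) -> Tuple[str, str]:
    """Split 'lmdb://key@path_to_lmdb' into (key, lmdb_dir).

    Hypothetical helper for illustration; the commit inlines this logic in load_file.
    """
    if not uri.startswith(LMDB_SCHEME):
        raise ValueError(f"not an LMDB reference: {uri!r}")
    # Everything after the scheme is 'key@path'; partition splits on the first '@'.
    key, sep, lmdb_dir = uri[len(LMDB_SCHEME):].partition("@")
    # Reject a missing '@', an empty key, an empty path, or a second '@'.
    if not sep or not key or not lmdb_dir or "@" in lmdb_dir:
        raise ValueError("expected lmdb://key@path_to_lmdb with exactly one '@'")
    return key, lmdb_dir


print(parse_lmdb_uri("lmdb://rabbit_img@/path/to/animals_lmdb"))
```

One consequence of this rule is that neither the key nor the database path may contain an `@` character.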
diff --git a/docs/source/Customization/自定义数据集.md b/docs/source/Customization/自定义数据集.md
@@ -136,20 +136,26 @@ alpaca格式:
 
 ### 多模态
 
-对于多模态数据集，和上述任务的格式相同。区别在于增加了`images`, `videos`, `audios`几个key，分别代表多模态资源的url或者path（推荐使用绝对路径），`<image>` `<video>` `<audio>`标签代表了插入图片/视频/音频的位置，ms-swift支持多图片/视频/音频的情况。这些特殊tokens将在预处理的时候进行替换，参考[这里](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198)。下面给出的四条示例分别展示了纯文本，以及包含图像、视频和音频数据的数据格式。
+对于多模态数据集，和上述任务的格式相同。区别在于增加了`images`, `videos`, `audios`几个key，分别代表多模态资源的url或者path（推荐使用绝对路径），`<image>` `<video>` `<audio>`标签代表了插入图片/视频/音频的位置，ms-swift支持多图片/视频/音频的情况。这些特殊tokens将在预处理的时候进行替换，参考[这里](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198)。下面给出的示例分别展示了纯文本，以及包含图像、视频和音频数据的数据格式。
+
+SWIFT 支持从 LMDB 数据库加载多模态资源，使用格式为 `lmdb://key@path_to_lmdb`。这对于存储和访问大量图像、视频、音频等资源非常有效，特别适合训练和推理时处理大规模多模态数据集。使用前请确保已安装 LMDB：`pip install lmdb`。
 
 预训练：
-```
+```jsonl
 {"messages": [{"role": "assistant", "content": "预训练的文本在这里"}]}
 {"messages": [{"role": "assistant", "content": "<image>是一只小狗，<image>是一只小猫"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<image>是一只从LMDB加载的小兔子"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "assistant", "content": "<audio>描述了今天天气真不错"}], "audios": ["/xxx/x.wav"]}
 {"messages": [{"role": "assistant", "content": "<image>是一个大象，<video>是一只狮子在跑步"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "assistant", "content": "<video>展示了太空中的星系"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
 ```
 
 微调：
 ```jsonl
 {"messages": [{"role": "user", "content": "浙江的省会在哪？"}, {"role": "assistant", "content": "浙江的省会在杭州。"}]}
 {"messages": [{"role": "user", "content": "<image><image>两张图片有什么区别"}, {"role": "assistant", "content": "前一张是小猫，后一张是小狗"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<image>这个动物是什么？"}, {"role": "assistant", "content": "这是一只棕色的熊猫，很罕见的物种。"}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
+{"messages": [{"role": "user", "content": "<image>和<image>这两种动物有什么区别？"}, {"role": "assistant", "content": "第一张图是老虎，第二张图是狮子。"}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "user", "content": "<audio>语音说了什么"}, {"role": "assistant", "content": "今天天气真好呀"}], "audios": ["/xxx/x.mp3"]}
 {"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>图片中是什么，<video>视频中是什么"}, {"role": "assistant", "content": "图片中是一个大象，视频中是一只小狗在草地上奔跑"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
@@ -143,22 +143,27 @@ Please refer to [Reranker training document](../BestPractices/Reranker.md#datase
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The four examples below respectively demonstrate the data format for plain text, as well as formats containing image, video, and audio data.
+For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below demonstrate the data format for plain text, as well as formats containing image, video, and audio data.
 
+MS-Swift supports loading multimodal resources from LMDB databases, referenced with the format `lmdb://key@path_to_lmdb`. This is an efficient way to store and access large collections of images, videos, audio files, and other resources, especially when training or running inference on large-scale multimodal datasets. Install the `lmdb` package first: `pip install lmdb`.
 
 Pre-training:
 ```jsonl
 {"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
 {"messages": [{"role": "assistant", "content": "<image>is a puppy, <image>is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<image>is a rabbit loaded from LMDB"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "assistant", "content": "<audio>describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
 {"messages": [{"role": "assistant", "content": "<image>is an elephant, <video>is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "assistant", "content": "<video>shows galaxies in space"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
 ```
 
 Supervised Fine-tuning:
 
 ```jsonl
 {"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
 {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<image>What is this animal?"}, {"role": "assistant", "content": "This is a brown panda, a very rare species."}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
+{"messages": [{"role": "user", "content": "<image>and<image>What's the difference between these two animals?"}, {"role": "assistant", "content": "The first image is a tiger, and the second image is a lion."}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
 {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
diff --git a/swift/llm/template/vision_utils.py b/swift/llm/template/vision_utils.py
@@ -4,7 +4,7 @@
 import os
 import re
 from io import BytesIO
-from typing import Any, Callable, List, TypeVar, Union
+from typing import Any, Callable, Dict, List, Optional, TypeVar, Union
 
 import numpy as np
 import requests
@@ -13,6 +13,13 @@
 
 from swift.utils import get_env_args
 
+# Try to import lmdb, but don't fail if it's not available
+try:
+    import lmdb
+    LMDB_AVAILABLE = True
+except ImportError:
+    LMDB_AVAILABLE = False
+
 # >>> internvl
 IMAGENET_MEAN = (0.485, 0.456, 0.406)
 IMAGENET_STD = (0.229, 0.224, 0.225)
@@ -99,6 +106,9 @@ def rescale_image(img: Image.Image, max_pixels: int) -> Image.Image:
 
 _T = TypeVar('_T')
 
+# Caches of LMDB environments and read transactions, keyed by LMDB directory, to avoid reopening
+_LMDB_ENV_CACHE: Dict[str, Any] = {}
+_LMDB_TXN_CACHE: Dict[str, Any] = {}
 
 def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
     res = path
@@ -111,6 +121,38 @@ def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
                 request_kwargs['timeout'] = timeout
             content = requests.get(path, **request_kwargs).content
             res = BytesIO(content)
+        elif path.startswith('lmdb://'):
+            if not LMDB_AVAILABLE:
+                raise ImportError(
+                    "LMDB support requires the 'lmdb' package to be installed. "
+                    "Please install it with 'pip install lmdb'."
+                )
+            # Parse LMDB path format: lmdb://key@path_to_lmdb
+            _, _, lmdb_url = path.partition('lmdb://')
+            key, sep, lmdb_dir = lmdb_url.partition('@')
+
+            # Reject a missing '@', an empty key, an empty path, or any additional '@'
+            if not sep or not key or not lmdb_dir or '@' in lmdb_dir:
+                raise ValueError("LMDB path must be in format: lmdb://key@path_to_lmdb (with exactly one '@')")
+
+            # Use cached environment or create a new one
+            env = _LMDB_ENV_CACHE.get(lmdb_dir)
+            if env is None:
+                env = lmdb.open(lmdb_dir, readonly=True, lock=False, max_readers=1024, max_spare_txns=2)
+                _LMDB_ENV_CACHE[lmdb_dir] = env
+
+            # Reuse one long-lived read transaction per environment (it keeps the snapshot from when it was opened)
+            txn = _LMDB_TXN_CACHE.get(lmdb_dir)
+            if txn is None:
+                txn = env.begin(write=False)
+                _LMDB_TXN_CACHE[lmdb_dir] = txn
+
+            # Get data using the cached transaction
+            encoded_key = key.encode()
+            data = txn.get(encoded_key)
+            if data is None:
+                raise KeyError(f"Key '{key}' not found in LMDB at '{lmdb_dir}'")
+            res = BytesIO(data)
         elif os.path.exists(path) or (not path.startswith('data:') and len(path) <= 200):
             path = os.path.abspath(os.path.expanduser(path))
             with open(path, 'rb') as f: