[Feature] Add Molmo2 model and template support #9063
Kagura-0001 wants to merge 2 commits into modelscope:main
Conversation
Code Review
This pull request introduces support for the Molmo2 model family, including registration of 4B, 8B, and O-7B variants, along with a dedicated template for image and video understanding. Key additions include the Molmo2Loader with compatibility patches for transformers and vision attention, and the Molmo2Template for handling multi-modal inputs. Feedback focuses on correcting a version requirement typo, improving the robustness of module splitting logic, preventing potential division-by-zero errors in FPS calculation, and replacing assertions with explicit value errors for input validation.
| model_arch=ModelArch.molmo, | ||
| architectures=['Molmo2ForConditionalGeneration'], | ||
| tags=['vision', 'video'], | ||
| requires=['transformers>=4.57.1', 'decord'], |
The version requirement transformers>=4.57.1 appears to be a typo, as this version does not exist yet (the current stable version is around 4.48). Molmo models typically require transformers>=4.45.0.
| requires=['transformers>=4.57.1', 'decord'], | |
| requires=['transformers>=4.45.0', 'decord'], |
I kept transformers>=4.57.1 here. 4.57.1 is a real released version now, and the local Molmo2 smoke validation for this PR succeeded with transformers==4.57.3. I would prefer to keep the newer minimum for the current Molmo2 processor/runtime path rather than relax it to 4.45.0 without additional compatibility coverage.
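To make the version discussion concrete, here is a minimal sketch of comparing an installed version against a minimum such as 4.57.1. The helper names `release_tuple` and `meets_minimum` are hypothetical and not part of ms-swift or transformers. The sketch looks only at the leading numeric release components and ignores pre-release suffixes; `packaging.version` would be the more robust choice where it is available.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release components, e.g. '4.57.1' -> (4, 57, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so '4.100.0' ranks above '4.57.1' (string comparison would not)."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    # The version used for the local smoke validation satisfies the declared minimum.
    print(meets_minimum("4.57.3", "4.57.1"))
    # The version the reviewer assumed to be current does not.
    print(meets_minimum("4.48.0", "4.57.1"))
```

Comparing integer tuples rather than raw strings is the point of the sketch: lexicographic string comparison would order "4.100.0" below "4.57.1".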
swift/model/models/molmo2.py
Outdated
| def get_model(self, model_dir: str, *args, **kwargs) -> PreTrainedModel: | ||
| from transformers import AutoModelForImageTextToText | ||
| model_cls = get_class_from_dynamic_module('modeling_molmo2.Molmo2ForConditionalGeneration', model_dir) | ||
| model_cls._no_split_modules = getattr(model_cls, '_no_split_modules', []) or ['MolmoSequentialBlock'] |
The current logic for setting _no_split_modules will overwrite the default list if it's empty, but it won't append to it if it already contains other modules. It's safer to ensure MolmoSequentialBlock is included in the list without discarding existing entries.
| model_cls._no_split_modules = getattr(model_cls, '_no_split_modules', []) or ['MolmoSequentialBlock'] | |
| no_split_modules = getattr(model_cls, '_no_split_modules', []) or [] | |
| if 'MolmoSequentialBlock' not in no_split_modules: | |
|     model_cls._no_split_modules = no_split_modules + ['MolmoSequentialBlock'] |
Updated in commit 6eaf502: I now preserve existing _no_split_modules entries and only append MolmoSequentialBlock when it is missing.
swift/template/templates/molmo2.py
Outdated
| if len(timestamps) <= 1: | ||
|     fps = 1.0 | ||
| else: | ||
|     fps = 1.0 / float(np.median(np.diff(timestamps))) |
Calculating fps by dividing by the median of timestamp differences can lead to a ZeroDivisionError if the input timestamps are identical or if the median difference is zero. It's safer to add a check or a fallback value.
| fps = 1.0 / float(np.median(np.diff(timestamps))) | |
| median_diff = np.median(np.diff(timestamps)) | |
| fps = 1.0 / float(median_diff) if median_diff > 0 else 1.0 |
Updated in commit 6eaf502: I added a guard for zero/non-positive median timestamp deltas and fall back to fps = 1.0 in that case.
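As a self-contained illustration of the guarded calculation, the sketch below estimates FPS from the median gap between timestamps and falls back to a default when that gap is not positive. It is not the PR's code: the function name `estimate_fps` and the `default` parameter are hypothetical, and the standard-library `statistics.median` is swapped in for `np.median(np.diff(...))` so the example has no third-party dependency.

```python
from statistics import median
from typing import Sequence


def estimate_fps(timestamps: Sequence[float], default: float = 1.0) -> float:
    """Estimate frames per second from frame timestamps given in seconds."""
    if len(timestamps) <= 1:
        # A single frame (or none) carries no spacing information.
        return default
    # Consecutive differences, the stdlib equivalent of np.diff.
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    median_delta = median(deltas)
    if median_delta <= 0:
        # Identical or non-increasing timestamps would otherwise divide by zero
        # or yield a negative rate.
        return default
    return 1.0 / median_delta


if __name__ == "__main__":
    print(estimate_fps([0.0, 0.5, 1.0, 1.5]))  # evenly spaced, 0.5 s apart
    print(estimate_fps([1.0, 1.0, 1.0]))       # identical timestamps hit the fallback
```

Using the median rather than the mean keeps a single irregular gap from skewing the estimate.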
swift/template/templates/molmo2.py
Outdated
| media_inputs.update(image_inputs) | ||
| if inputs.videos: | ||
| assert len(inputs.videos) == 1, 'Molmo2 currently only supports single-video samples.' |
Using assert for input validation is discouraged as it can be optimized away in production (when running with -O). It is better to raise a ValueError to provide a clear error message to the user.
| assert len(inputs.videos) == 1, 'Molmo2 currently only supports single-video samples.' | |
| if len(inputs.videos) != 1: | |
|     raise ValueError('Molmo2 currently only supports single-video samples.') |
Updated in commit 6eaf502: I replaced the assert with an explicit ValueError so the validation is preserved in optimized runs as well.
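The difference between the two validation styles can be shown with a small sketch. The helper `require_single_video` is hypothetical and stands in for the check inside the template; the error message is taken from the diff above, with the observed count added for illustration.

```python
from typing import Sequence


def require_single_video(videos: Sequence[object]) -> object:
    """Return the only video in the sample, or raise ValueError.

    Unlike an assert statement, this check still runs under `python -O`,
    which strips assert statements from the compiled bytecode.
    """
    if len(videos) != 1:
        raise ValueError(
            f"Molmo2 currently only supports single-video samples, got {len(videos)}.")
    return videos[0]


if __name__ == "__main__":
    print(require_single_video(["clip.mp4"]))
    try:
        require_single_video(["a.mp4", "b.mp4"])
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Raising ValueError also lets callers catch invalid samples specifically, without also swallowing unrelated AssertionError failures.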
PR type
PR information
This PR adds initial Molmo2 support to ms-swift.
What changed
- molmo2 as a new MLLM model type and template type
- swift/model/models/molmo2.py
- swift/template/templates/molmo2.py
- allenai/Molmo2-4B
- allenai/Molmo2-8B
- allenai/Molmo2-O-7B
Notes
- flash_attention_2 to sdpa when needed to avoid padded video batch failures
Validation
- pre-commit run --all-files
- python tests/run.py --test_dir tests/general --pattern test_model.py
- /mnt/bn/strategy-mllm-train/user/weisong/repo/motion_benchmarks/pretrained_models/Molmo2-4B confirmed get_model_processor(..., load_model=False) and template.encode(...) succeed
Related issue
Experiment results
N/A