[feat] NanoVLM Training support #134

Merged
kcz358 merged 4 commits into EvolvingLMMs-Lab:main from Purshow:main on Feb 12, 2026

[feat] NanoVLM Training support#134
kcz358 merged 4 commits intoEvolvingLMMs-Lab:mainfrom
Purshow:main

Conversation

Purshow (Contributor) commented Feb 9, 2026

No description provided.

Comment on lines 3 to 6
DATASET_PATH="/mnt/umm/users/pufanyi/workspace/Show/lmms-engine/data/llava_next.yaml"
PROCESSOR_NAME="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
MODEL_PATH="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
SIGLIP_PROCESSOR="/mnt/umm/users/pufanyi/workspace/Show/CKPT/google/siglip2-so400m-patch16-naflex"
Collaborator

Later, changing these to public paths (in the repo) or HF Hub paths would be better.
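
For instance, a minimal sketch of loading the same checkpoints by their Hub IDs (the IDs are inferred from the local directory names above; assuming both resolve on the Hugging Face Hub):

    from transformers import AutoModelForCausalLM, AutoProcessor

    # Hub IDs in place of the /mnt/... checkpoint copies, so any user can resolve them.
    language_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
    siglip_processor = AutoProcessor.from_pretrained(
        "google/siglip2-so400m-patch16-naflex"
    )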

Comment on lines 271 to 293
def _normalize_messages_for_template(self, hf_messages):
    normalized = []
    for message in hf_messages:
        content = message.get("content")
        if isinstance(content, list):
            # Flatten structured content parts into one string, replacing
            # multimodal items with fixed placeholder tokens.
            parts = []
            for item in content:
                if not isinstance(item, dict):
                    parts.append(str(item))
                    continue
                item_type = item.get("type")
                if item_type in ["image", "image_url"] or "image" in item:
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>\n")
                elif item_type in ["video", "video_url"] or "video" in item:
                    parts.append("<|vision_start|><|video_pad|><|vision_end|>\n")
                elif item_type in ["audio", "audio_url"] or "audio" in item:
                    parts.append("<|AUDIO|>\n")
                elif "text" in item:
                    parts.append(item["text"])
            normalized.append({"role": message["role"], "content": "".join(parts)})
        else:
            # Plain-string content passes through unchanged.
            normalized.append(message)
    return normalized
Collaborator

Okay for using a placeholder, but this seems a bit hardcoded for the visual tokens. Examples such as https://github.com/EvolvingLMMs-Lab/lmms-engine/blob/main/src/lmms_engine/datasets/processor/qwen3_vl_processor.py don't do this; they rely on the chat template instead: applying the chat template to {"type": "image"} directly generates the special image tokens, which are then expanded. But I'm fine with this if it's necessary.
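
For reference, a minimal sketch of that chat-template approach (the checkpoint name is illustrative; the assumption is that its chat template expands {"type": "image"} parts into the special vision tokens by itself):

    from transformers import AutoProcessor

    # Illustrative checkpoint; its chat template emits
    # <|vision_start|><|image_pad|><|vision_end|> for image parts on its own.
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # No placeholder strings are hardcoded here; the template inserts them.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)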

Comment on lines 32 to 50
        vision_model: Optional[PreTrainedModel] = None,
        language_model: Optional[PreTrainedModel] = None,
        **kwargs,
    ):
        super().__init__(config)
        attn_implementation = kwargs.pop("attn_implementation", None)
        torch_dtype = kwargs.pop("torch_dtype", None)

        if language_model is None:
            language_model = AutoModelForCausalLM.from_pretrained(
                config.llm_model_name,
                attn_implementation=attn_implementation,
                torch_dtype=torch_dtype,
            )
        if vision_model is None:
            vision_model = Siglip2VisionModel.from_pretrained(
                config.vision_model_name,
                torch_dtype=torch_dtype,
            )
Collaborator

Usually in a transformers-style model we don't pass in an object and call from_pretrained in __init__; otherwise we do a double from_pretrained when using the model class. Current transformers practice, I think, is to init submodules from the config only and then call AutoXXX.from_config or something similar.

Examples:
https://github.com/huggingface/transformers/blob/7769f660935b5d48b73bf6711d0a78b6f8f98739/src/transformers/models/llava_onevision/modeling_llava_onevision.py#L267-L283
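
A minimal sketch of that pattern (the class name and the vision_config/text_config sub-config attributes are assumptions for illustration):

    from transformers import AutoModel, AutoModelForCausalLM, PreTrainedModel

    class NanoVLMModel(PreTrainedModel):  # hypothetical class name
        def __init__(self, config):
            super().__init__(config)
            # Build submodules from sub-configs only; no weights are loaded
            # here, so the composite from_pretrained loads everything once.
            self.vision_model = AutoModel.from_config(config.vision_config)
            self.language_model = AutoModelForCausalLM.from_config(config.text_config)
            self.post_init()

With this shape, NanoVLMModel.from_pretrained(path) restores all weights in a single pass instead of nesting from_pretrained calls inside __init__.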

@@ -1,12 +1,13 @@
#!/bin/bash
export PYTHONPATH=/mnt/afs/niuyuwei/Job/lmms-engine/src:$PYTHONPATH
Collaborator

Hardcoded local Python path.

MODEL_PATH="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
SIGLIP_PROCESSOR="/mnt/umm/users/pufanyi/workspace/Show/CKPT/google/siglip2-so400m-patch16-naflex"

DATASET_PATH="./data/llava_next.yaml"
Collaborator

Feel free to add an example or an instruction/script for preparing the data to the examples, so that the speedrun can be set up conveniently.

kcz358 (Collaborator) left a comment

LGTM. I left a few comments for some small revisions. Also, you can use pre-commit to pass the lint check. Thanks!

kcz358 merged commit 42dd631 into EvolvingLMMs-Lab:main on Feb 12, 2026
1 check passed