[feat] NanoVLM Training support #134
examples/nanovlm/nanovlm_train.sh (Outdated)
```bash
DATASET_PATH="/mnt/umm/users/pufanyi/workspace/Show/lmms-engine/data/llava_next.yaml"
PROCESSOR_NAME="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
MODEL_PATH="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
SIGLIP_PROCESSOR="/mnt/umm/users/pufanyi/workspace/Show/CKPT/google/siglip2-so400m-patch16-naflex"
```
It would be better to later change these to public paths (checked into the repo) or Hugging Face Hub paths.
```python
def _normalize_messages_for_template(self, hf_messages):
    normalized = []
    for message in hf_messages:
        content = message.get("content")
        if isinstance(content, list):
            parts = []
            for item in content:
                if not isinstance(item, dict):
                    parts.append(str(item))
                    continue
                item_type = item.get("type")
                if item_type in ["image", "image_url"] or "image" in item:
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>\n")
                elif item_type in ["video", "video_url"] or "video" in item:
                    parts.append("<|vision_start|><|video_pad|><|vision_end|>\n")
                elif item_type in ["audio", "audio_url"] or "audio" in item:
                    parts.append("<|AUDIO|>\n")
                elif "text" in item:
                    parts.append(item["text"])
            normalized.append({"role": message["role"], "content": "".join(parts)})
        else:
            normalized.append(message)
    return normalized
```
Using a placeholder is okay, but the visual tokens seem a bit hardcoded here. Examples such as https://github.com/EvolvingLMMs-Lab/lmms-engine/blob/main/src/lmms_engine/datasets/processor/qwen3_vl_processor.py don't do this; they rely on the chat template instead: applying the chat template to {"type": "image"} directly generates the special image tokens, which are then expanded. That said, I'm fine with this approach if it's necessary. See the sketch below.
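For reference, a minimal sketch of the template-driven alternative, assuming a Qwen-VL-style processor whose chat template already knows how to render `{"type": "image"}` content entries into the model's special vision tokens (the checkpoint name below is illustrative only, not the one used in this PR):

```python
from transformers import AutoProcessor

# Illustrative checkpoint; any processor whose chat template understands
# {"type": "image"} content entries would behave the same way.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# The chat template itself emits the special vision tokens
# (<|vision_start|><|image_pad|><|vision_end|> for Qwen-style models), so no
# hand-written placeholder string is needed; the processor later expands the
# image pad token to match the number of vision patches.
text = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
print(text)
```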
```python
        vision_model: Optional[PreTrainedModel] = None,
        language_model: Optional[PreTrainedModel] = None,
        **kwargs,
    ):
        super().__init__(config)
        attn_implementation = kwargs.pop("attn_implementation", None)
        torch_dtype = kwargs.pop("torch_dtype", None)

        if language_model is None:
            language_model = AutoModelForCausalLM.from_pretrained(
                config.llm_model_name,
                attn_implementation=attn_implementation,
                torch_dtype=torch_dtype,
            )
        if vision_model is None:
            vision_model = Siglip2VisionModel.from_pretrained(
                config.vision_model_name,
                torch_dtype=torch_dtype,
            )
```
In a transformers-style model we usually don't pass in a module object or call from_pretrained inside __init__; otherwise from_pretrained ends up being called twice when the model class itself is loaded with from_pretrained. I believe the current transformers practice is to initialize sub-modules from the config only and call AutoXXX.from_config or something similar. A rough sketch of that pattern is below.
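For illustration only, a sketch of the config-only pattern; the class name and the nested sub-config attribute names (text_config / vision_config) are placeholders and may not match this PR's actual config:

```python
from transformers import AutoModel, AutoModelForCausalLM, PreTrainedModel


class NanoVLMModelSketch(PreTrainedModel):
    # Hypothetical sketch: sub-modules are built from sub-configs only, so
    # weights are loaded exactly once by the outer from_pretrained() call.
    def __init__(self, config):
        super().__init__(config)
        # Assumes the composite config nests the sub-model configs; the real
        # attribute names in this PR (e.g. llm_model_name) may differ.
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        self.vision_model = AutoModel.from_config(config.vision_config)
        self.post_init()
```

Pretrained sub-module weights would then be loaded once up front (e.g. copied into the composed model before saving the initial checkpoint) instead of calling from_pretrained inside __init__.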
examples/nanovlm/nanovlm_train.sh (Outdated)
```diff
@@ -1,12 +1,13 @@
 #!/bin/bash
 export PYTHONPATH=/mnt/afs/niuyuwei/Job/lmms-engine/src:$PYTHONPATH
```
examples/nanovlm/nanovlm_train.sh (Outdated)
```bash
MODEL_PATH="/mnt/umm/users/pufanyi/workspace/Show/CKPT/Qwen/Qwen3-0.6B"
SIGLIP_PROCESSOR="/mnt/umm/users/pufanyi/workspace/Show/CKPT/google/siglip2-so400m-patch16-naflex"

DATASET_PATH="./data/llava_next.yaml"
```
Feel free to add an example or an instruction/script for preparing the data to the example directory, so the speedrun can be set up conveniently.