This folder (vlm/) contains the code for our Llama Vision-Language Model (VLM) setup used in Forewarn.
We remove the original vision encoder and instead project world-model (WM) latents into the language space for finetuning and inference.
The main code lives in `llama-recipes/`:

- 🧩 Llama Recipes: `llama-recipes/`
  Scripts and utilities related to Llama VLM training/inference.
- 📦 Datasets: `recipes/quickstart/finetuning/datasets/`
  Place the datasets for the cup / bag / fork tasks here, along with the dataset-loading code.
- 🧠 Model: `mllama_model.py`
  Modified Llama model that:
  - removes the original vision encoder
  - projects WM latents into the LLM language embedding space
- 🔥 Training: `finetuning_wm.py`
  Script to run finetuning (a hedged example invocation follows this list).
- 🔎 Inference: `llama_wm_infer.py`
  Script to run inference for:
  - behavior description generation (Stage 1)
  - behavior selection (Stage 2)
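This README does not spell out a training invocation, so the command below is only a sketch: it assumes `finetuning_wm.py` keeps the standard llama-recipes finetuning flags (`--use_peft`, `--peft_method`, `--num_epochs`, `--output_dir`) plus the `--custom_dataset.*` options used by the inference commands later in this file. Verify every flag against the script before running; the dataset file, data path, and output directory are illustrative (Cup task).

```bash
# Hedged sketch of a LoRA finetuning run on the cup dataset.
# Adjust the path to finetuning_wm.py and verify flag names against the script itself.
CUDA_VISIBLE_DEVICES=0 python finetuning_wm.py \
  --model_name "mllama/Llama-3.2-11B-Vision-Instruct/custom" \
  --dataset "custom_dataset" \
  --custom_dataset.file "recipes/quickstart/finetuning/datasets/realcup_dataset_latent.py" \
  --custom_dataset.data_path "realcup_data" \
  --use_peft \
  --peft_method lora \
  --batch_size_training 10 \
  --num_epochs 3 \
  --output_dir /data/finetuned_models/my_cup_ckpt   # illustrative output path
```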
Download the dataset from Hugging Face and put it under
`./llama-recipes/recipes/quickstart/finetuning/datasets/`:

```bash
huggingface-cli download yilin-wu/Forewarn_VLM_data --local-dir ./llama-recipes/recipes/quickstart/finetuning/datasets
```

Download the modified Llama-3.2-11B-Vision-Instruct base model from Hugging Face (`yilin-wu/Forewarn_VLMs`).
This base model is modified to:
- remove the original vision encoder
- add a randomly-initialized projection layer
Put it under `llama-recipes/mllama/`:

```bash
mkdir -p llama-recipes/mllama
huggingface-cli download yilin-wu/Forewarn_VLMs --local-dir ./llama-recipes/mllama --include "Llama-3.2-11B-Vision-Instruct/**"
```

If you want to try our finetuned PEFT models (cup/bag/fork), download the `*ckpt` folders and store them anywhere.
Example:

```bash
mkdir -p /data/finetuned_models
huggingface-cli download yilin-wu/Forewarn_VLMs --local-dir /data/finetuned_models --include "*ckpt/**"
```

✅ When running inference, pass your local path via `--peft_model_name <peft_model_path>`.
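After these downloads, the layout should look roughly like the sketch below. Only the paths mentioned in this README are shown; the PEFT checkpoint directory is wherever you chose to store it.

```
vlm/
└── llama-recipes/
    ├── mllama/Llama-3.2-11B-Vision-Instruct/            # modified base model (see --model_name)
    └── recipes/quickstart/
        ├── finetuning/datasets/                         # Forewarn_VLM_data (cup / bag / fork)
        └── inference/local_inference/llama_wm_infer.py  # inference entrypoint
```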
After downloading the dataset and base model:

```bash
cd llama-recipes
```

Run one of the scripts below:

- ☕ Cup task: `bash run_exp_cup.sh`
- 👜 Bag task: `bash run_exp_bag.sh`
- 🍴 Fork task: `bash run_exp_fork_all.sh`

Update the paths inside these `.sh` files to match your local environment (dataset/model/checkpoint dirs).
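To see which paths need editing, a quick approximate search can help; the pattern below is only a heuristic, and the variable names inside the scripts may differ.

```bash
# Surface dataset/model/checkpoint paths referenced by the experiment scripts
grep -nE "data|model|ckpt|dir" run_exp_cup.sh run_exp_bag.sh run_exp_fork_all.sh
```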
We use the same entrypoint for both stages:
`recipes/quickstart/inference/local_inference/llama_wm_infer.py`

There are two modes:

- ✍️ Stage 1: generate behavior descriptions (`answer_type="open-word"`)
- ✅ Stage 2: select behavior descriptions (`answer_type="text"`) with a scenario key
✍️ Stage 1 (behavior description generation). Template command:

```bash
CUDA_VISIBLE_DEVICES=0 python recipes/quickstart/inference/local_inference/llama_wm_infer.py \
  --dataset "custom_dataset" \
  --custom_dataset.file <file_path> \
  --custom_dataset.data_path <data_path> \
  --custom_dataset.answer_type "open-word" \
  --custom_dataset.num_images 16 \
  --custom_dataset.sample_size 16 \
  --custom_dataset.num_history_images 1 \
  --custom_dataset.imagined_steps 63 \
  --custom_dataset.latent_mode "all" \
  --model_name "mllama/Llama-3.2-11B-Vision-Instruct/custom" \
  --batch_size_training 10 \
  --custom_dataset.test_split "test" \
  --custom_dataset.start_index <start_idx> \
  --peft_model_name <peft_model_path> \
  --use_sentence True \
  --print-labels-predictions True
```

Fork (stage 1):

```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realfork_dataset_latent.py"
--custom_dataset.data_path "realfork_data"
--custom_dataset.start_index 60
--peft_model_name /data/finetuned_models/Forewarn_VLMs/fork_ckpt
```

Bag (stage 1):

```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realbag_dataset_latent.py"
--custom_dataset.data_path "realbag_data"
--custom_dataset.start_index 0
--peft_model_name /data/finetuned_models/Forewarn_VLMs/bag_ckpt
```

Cup (stage 1):

```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realcup_dataset_latent.py"
--custom_dataset.data_path "realcup_data"
--custom_dataset.start_index 35
--peft_model_name /data/finetuned_models/Forewarn_VLMs/cup_ckpt
```
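For example, substituting the Cup arguments into the Stage 1 template gives the full command below (the PEFT path follows the example download location above; point it at wherever you stored the checkpoints).

```bash
CUDA_VISIBLE_DEVICES=0 python recipes/quickstart/inference/local_inference/llama_wm_infer.py \
  --dataset "custom_dataset" \
  --custom_dataset.file "recipes/quickstart/finetuning/datasets/realcup_dataset_latent.py" \
  --custom_dataset.data_path "realcup_data" \
  --custom_dataset.answer_type "open-word" \
  --custom_dataset.num_images 16 \
  --custom_dataset.sample_size 16 \
  --custom_dataset.num_history_images 1 \
  --custom_dataset.imagined_steps 63 \
  --custom_dataset.latent_mode "all" \
  --model_name "mllama/Llama-3.2-11B-Vision-Instruct/custom" \
  --batch_size_training 10 \
  --custom_dataset.test_split "test" \
  --custom_dataset.start_index 35 \
  --peft_model_name /data/finetuned_models/Forewarn_VLMs/cup_ckpt \
  --use_sentence True \
  --print-labels-predictions True
```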
✅ Stage 2 (behavior selection). In this mode, you set a scenario via `--custom_dataset.question_key`. Template command:

```bash
CUDA_VISIBLE_DEVICES=0 python recipes/quickstart/inference/local_inference/llama_wm_infer.py \
  --temperature 0.01 \
  --top_p 0.9 \
  --dataset "custom_dataset" \
  --custom_dataset.file <file_path> \
  --custom_dataset.data_path <data_path> \
  --custom_dataset.answer_type "text" \
  --custom_dataset.num_images 16 \
  --custom_dataset.sample_size 16 \
  --custom_dataset.num_history_images 1 \
  --custom_dataset.imagined_steps 63 \
  --custom_dataset.latent_mode "all" \
  --model_name "mllama/Llama-3.2-11B-Vision-Instruct/custom" \
  --batch_size_training 10 \
  --custom_dataset.test_split "test" \
  --custom_dataset.start_index 0 \
  --peft_model_name <peft_model_path> \
  --use_sentence True \
  --custom_dataset.question_key <key> \
  --print-labels-predictions True
```

Cup (stage 2): scenarios = handle / interior
```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realcup_dataset_text.py"
--custom_dataset.data_path "realcup_data"
--custom_dataset.question_key "handle"
--peft_model_name /data/finetuned_models/Forewarn_VLMs/cup_ckpt
```

Bag (stage 2): scenarios = edge / middle
```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realbag_dataset_text.py"
--custom_dataset.data_path "realbag_data"
--custom_dataset.question_key "middle"
--peft_model_name /data/finetuned_models/Forewarn_VLMs/bag_ckpt
```

Fork (stage 2): scenarios = grasp-handle / grasp-tines / place-bowl
```
--custom_dataset.file "recipes/quickstart/finetuning/datasets/realfork_dataset_text.py"
--custom_dataset.data_path "realfork_data"
--custom_dataset.question_key "grasp-handle"
--peft_model_name /data/finetuned_models/Forewarn_VLMs/fork_ckpt
```
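For example, substituting the Fork arguments (with the `grasp-handle` scenario) into the Stage 2 template gives the command below; as above, the PEFT path is only the example download location.

```bash
CUDA_VISIBLE_DEVICES=0 python recipes/quickstart/inference/local_inference/llama_wm_infer.py \
  --temperature 0.01 \
  --top_p 0.9 \
  --dataset "custom_dataset" \
  --custom_dataset.file "recipes/quickstart/finetuning/datasets/realfork_dataset_text.py" \
  --custom_dataset.data_path "realfork_data" \
  --custom_dataset.answer_type "text" \
  --custom_dataset.num_images 16 \
  --custom_dataset.sample_size 16 \
  --custom_dataset.num_history_images 1 \
  --custom_dataset.imagined_steps 63 \
  --custom_dataset.latent_mode "all" \
  --model_name "mllama/Llama-3.2-11B-Vision-Instruct/custom" \
  --batch_size_training 10 \
  --custom_dataset.test_split "test" \
  --custom_dataset.start_index 0 \
  --peft_model_name /data/finetuned_models/Forewarn_VLMs/fork_ckpt \
  --use_sentence True \
  --custom_dataset.question_key "grasp-handle" \
  --print-labels-predictions True
```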
- Confirm `--peft_model_name` points to the correct `*_ckpt` directory on your machine.
- If you relocate datasets/models, update the paths in both:
  - `run_exp_*.sh`
  - the inference commands above