Hi there,
First of all, thank you for sharing the GUI-Odyssey dataset and code. I've been studying your ICCV 2025 paper with great interest. While I was analyzing the code to understand the multimodal capabilities of OdysseyAgent, I came across a few points where I would appreciate some clarification.

**Image Processing in Training (`src/finetune.py`)**

I noticed that images are loaded in `LazySupervisedDataset`, but it seems the pixel data is not passed to the `preprocess` function or to the model's training loop. Is the current implementation intended to focus primarily on the text-based conversations, or is there a specific part I should look at to see how the visual features are integrated?
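
For reference, this is roughly the kind of `__getitem__` I expected to find, with the screenshot loaded and its pixel values returned alongside the tokenized conversation so a collator could batch them. All the names here (`image_processor`, `pixel_values`, the record layout) are placeholders of my own, not something I found in the repo:

```python
from PIL import Image
from torch.utils.data import Dataset


class LazySupervisedDatasetSketch(Dataset):
    """Hypothetical variant of LazySupervisedDataset that also returns image tensors."""

    def __init__(self, records, tokenizer, image_processor):
        self.records = records                    # dicts with "image" and "conversations" keys
        self.tokenizer = tokenizer
        self.image_processor = image_processor    # e.g. a CLIP/ViT-style preprocessor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]

        # Tokenize the conversation text (stand-in for the repo's preprocess()).
        text = "\n".join(turn["value"] for turn in rec["conversations"])
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True)

        # Load the screenshot and convert it to pixel tensors -- the step I could
        # not find in the current training pipeline.
        image = Image.open(rec["image"]).convert("RGB")
        pixel_values = self.image_processor(image, return_tensors="pt")["pixel_values"][0]

        return dict(
            input_ids=tokens["input_ids"][0],
            attention_mask=tokens["attention_mask"][0],
            labels=tokens["input_ids"][0].clone(),
            pixel_values=pixel_values,            # would still need to be consumed in forward()
        )
```

If the image information is instead injected through `<img>` tags in the conversation text (Qwen-VL style), a pointer to where that happens would already answer my question.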

**Model Forward Logic (`OdysseyAgent/modeling_qwen.py`)**

In `modeling_qwen.py` (lines 757-760), I see that `torch.zeros` is used to create placeholder images during training. Could you provide some guidance on how to modify this part so that it uses the actual screenshot data from the dataset, as described in the paper?
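
To make the question concrete: in the upstream Qwen-VL code this file appears to be derived from, the forward pass decodes the screenshot paths embedded between the image marker tokens and runs them through the visual encoder. Is something along the lines of the sketch below what should replace the `torch.zeros` placeholder, or is the visual path handled elsewhere? (This is only my reading of upstream Qwen-VL; the attribute names may differ in OdysseyAgent.)

```python
# Sketch based on upstream Qwen-VL's QWenModel.forward(); the attribute names
# (self.visual, config.visual["image_start_id"]) are assumptions about this fork.
img_start = self.config.visual["image_start_id"]
img_end = img_start + 1
img_pad = img_start + 2

if torch.any(input_ids == img_start):
    bos_pos = torch.where(input_ids == img_start)
    eos_pos = torch.where(input_ids == img_end)
    img_pos = torch.stack((bos_pos[0], bos_pos[1], eos_pos[1]), dim=1)

    image_paths = []
    for batch_idx, start, end in img_pos:
        # The token ids between the markers encode the screenshot path as bytes,
        # padded with img_pad; strip the padding and decode the path.
        path_ids = input_ids[batch_idx][start + 1 : end].tolist()
        path_ids = [t for t in path_ids if t < img_pad]
        image_paths.append(bytes(path_ids).decode("utf-8"))

    # Encode the real screenshots instead of building a zero tensor.
    images = self.visual.encode(image_paths)
else:
    images = None
```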

**Visual Inputs during Evaluation (`src/eval_mm/evaluate_GUIOdyssey.py`)**

I also noticed that the evaluation script doesn't seem to pass any visual inputs when calling `model.generate`. What would be the best way to feed the screenshot images into the model during inference to reproduce the multimodal navigation results reported in the paper?
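
For what it's worth, this is how I would currently try to feed a screenshot at inference time, assuming the tokenizer still exposes Qwen-VL's `from_list_format` helper and the `<img>...</img>` convention (an assumption on my part, not something I verified in this repo). Is this the intended usage, or does the evaluation pipeline expect the images to enter through another path?

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: the trained OdysseyAgent checkpoint and one evaluation sample.
checkpoint = "path/to/OdysseyAgent-checkpoint"
screenshot_path = "data/screenshots/example_step.png"
instruction = "Open the settings app and enable dark mode."

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

# Embed the screenshot reference in the prompt the Qwen-VL way (<img>path</img>).
query = tokenizer.from_list_format([
    {"image": screenshot_path},
    {"text": instruction},
])

inputs = tokenizer(query, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Keep only the newly generated action tokens.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```
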
Thank you again for your contribution to the community. I would be very grateful if you could provide some guidance or point me toward the code responsible for the actual multimodal training and inference.
Looking forward to your response!