Hi there,
First of all, thank you for sharing the GUI-Odyssey dataset and code. I've been studying your ICCV 2025 paper with great interest. While I was analyzing the code to understand the multimodal capabilities of OdysseyAgent, I came across a few points where I would appreciate some clarification.

**Image Processing in Training (`src/finetune.py`)**

I noticed that images are loaded in `LazySupervisedDataset`, but it seems the pixel data is not passed to the `preprocess` function or to the model's training loop. Is the current implementation intended to focus primarily on the text-based conversations, or is there a specific part I should look at to see how the visual features are integrated?
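
For reference, this is roughly the kind of `__getitem__` I expected to find, with the screenshot loaded and its pixel values returned alongside the tokenized conversation so a collator could batch them. All the names here (`image_processor`, `pixel_values`, the record layout) are placeholders of my own, not something I found in the repo:

```python
from PIL import Image
from torch.utils.data import Dataset


class LazySupervisedDatasetSketch(Dataset):
    """Hypothetical variant of LazySupervisedDataset that also returns image tensors."""

    def __init__(self, records, tokenizer, image_processor):
        self.records = records                    # dicts with "image" and "conversations" keys
        self.tokenizer = tokenizer
        self.image_processor = image_processor    # e.g. a CLIP/ViT-style preprocessor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]

        # Tokenize the conversation text (stand-in for the repo's preprocess()).
        text = "\n".join(turn["value"] for turn in rec["conversations"])
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True)

        # Load the screenshot and convert it to pixel tensors -- the step I could
        # not find in the current training pipeline.
        image = Image.open(rec["image"]).convert("RGB")
        pixel_values = self.image_processor(image, return_tensors="pt")["pixel_values"][0]

        return dict(
            input_ids=tokens["input_ids"][0],
            attention_mask=tokens["attention_mask"][0],
            labels=tokens["input_ids"][0].clone(),
            pixel_values=pixel_values,            # would still need to be consumed in forward()
        )
```

If the image information is instead injected through `<img>` tags in the conversation text (Qwen-VL style), a pointer to where that happens would already answer my question.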

**Model Forward Logic (`OdysseyAgent/modeling_qwen.py`)**

In `modeling_qwen.py` (lines 757-760), I see that `torch.zeros` is used to create placeholder images during training. Could you provide some guidance on how to modify this part so that it uses the actual screenshot data from the dataset, as described in the paper?
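
To make the question concrete: in the upstream Qwen-VL code this file appears to be derived from, the forward pass decodes the screenshot paths embedded between the image marker tokens and runs them through the visual encoder. Is something along the lines of the sketch below what should replace the `torch.zeros` placeholder, or is the visual path handled elsewhere? (This is only my reading of upstream Qwen-VL; the attribute names may differ in OdysseyAgent.)

```python
# Sketch based on upstream Qwen-VL's QWenModel.forward(); the attribute names
# (self.visual, config.visual["image_start_id"]) are assumptions about this fork.
img_start = self.config.visual["image_start_id"]
img_end = img_start + 1
img_pad = img_start + 2

if torch.any(input_ids == img_start):
    bos_pos = torch.where(input_ids == img_start)
    eos_pos = torch.where(input_ids == img_end)
    img_pos = torch.stack((bos_pos[0], bos_pos[1], eos_pos[1]), dim=1)

    image_paths = []
    for batch_idx, start, end in img_pos:
        # The token ids between the markers encode the screenshot path as bytes,
        # padded with img_pad; strip the padding and decode the path.
        path_ids = input_ids[batch_idx][start + 1 : end].tolist()
        path_ids = [t for t in path_ids if t < img_pad]
        image_paths.append(bytes(path_ids).decode("utf-8"))

    # Encode the real screenshots instead of building a zero tensor.
    images = self.visual.encode(image_paths)
else:
    images = None
```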

**Visual Inputs during Evaluation (`src/eval_mm/evaluate_GUIOdyssey.py`)**

I also noticed that the evaluation script doesn't seem to pass any visual inputs when calling `model.generate`. What would be the best way to feed the screenshot images into the model during inference to reproduce the multimodal navigation results reported in the paper?
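
For what it's worth, this is how I would currently try to feed a screenshot at inference time, assuming the tokenizer still exposes Qwen-VL's `from_list_format` helper and the `<img>...</img>` convention (an assumption on my part, not something I verified in this repo). Is this the intended usage, or does the evaluation pipeline expect the images to enter through another path?

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: the trained OdysseyAgent checkpoint and one evaluation sample.
checkpoint = "path/to/OdysseyAgent-checkpoint"
screenshot_path = "data/screenshots/example_step.png"
instruction = "Open the settings app and enable dark mode."

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

# Embed the screenshot reference in the prompt the Qwen-VL way (<img>path</img>).
query = tokenizer.from_list_format([
    {"image": screenshot_path},
    {"text": instruction},
])

inputs = tokenizer(query, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Keep only the newly generated action tokens.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```
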
Thank you again for your contribution to the community. I would be very grateful if you could provide some guidance or point me toward the code responsible for the actual multimodal training and inference.
Looking forward to your response!