Description
I have been testing inference_demo.py (specifically on Visual Probe tasks) and found two major issues.
- Logic Error in Conversation History (Fix Verified)
The Problem: The original script causes the model to get stuck in an infinite loop or fail to generate a final answer. This is caused by two bugs in the while loop:
a. Missing Context: The model's generated response_message (Thought/Action) is not appended to the chat history.
b. Wrong Role: The code execution result is appended with "role": "assistant", causing the model to treat the observation as its own past hallucination rather than new feedback.
The Fix: I modified the code to:
a. Append the response_message to chat_message before tool execution.
b. Change the execution result message role from "assistant" to "user".
Result: After these changes, the agent correctly recognizes its previous actions, stops repeating itself, and successfully generates a final answer.
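For reference, here is a minimal sketch of the loop with both changes applied. It is not the actual code from inference_demo.py: `generate`, `execute`, and `is_final` are hypothetical callables standing in for the model call, the tool/code runner, and the stop check, and only `chat_message` and `response_message` are names taken from the script.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_agent_loop(
    chat_message: List[Message],
    generate: Callable[[List[Message]], str],   # hypothetical: calls the model
    execute: Callable[[str], str],              # hypothetical: runs the tool code
    is_final: Callable[[str], bool],            # hypothetical: detects the final answer
    max_turns: int = 8,
) -> List[Message]:
    """Run the Thought/Action loop while keeping the history consistent."""
    for _ in range(max_turns):
        response_message = generate(chat_message)

        # Fix (a): record the model's own turn before running any tool,
        # so the next turn can see what was already tried.
        chat_message.append({"role": "assistant", "content": response_message})

        if is_final(response_message):
            break

        observation = execute(response_message)

        # Fix (b): the execution result is new feedback, not model output,
        # so it goes in with the "user" role instead of "assistant".
        chat_message.append({"role": "user", "content": observation})

    return chat_message
```

The `max_turns` cap is an extra safeguard I added for the sketch, so that a model that never produces a final answer cannot loop forever.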
- Persistent Visual Grounding / Bbox Accuracy Issues
The Problem: Even with the logic fix above, the model frequently fails to crop the correct Region of Interest (ROI) in the first turn. The predicted bounding boxes are often significantly offset from the target.
Debugging Attempts: I suspected maybe_resize_image was downscaling the image too aggressively.
Result: I disabled maybe_resize_image (passing the original full-res image), but the Bbox accuracy did not improve. The coordinates remain inaccurate.
Questions: Since image resizing is not the cause, could you clarify the expected input format for coordinate generation?
a. Coordinate Space: Does the model output coordinates in a normalized 0-1000 space (requiring conversion) or absolute pixel values? The current prompt/script seems to assume pixel values.
b. Prompt Engineering: Is there a specific prompt trigger (e.g., asking for image.size first) required to ground the model before it generates crop coordinates?
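To make question (a) concrete: if the model does emit boxes in a normalized 0-1000 space, the script would need a conversion along these lines before cropping. This is only a sketch under that assumption. The function name `normalized_to_pixel_bbox`, the `(x1, y1, x2, y2)` ordering, and the 1000 scale are my guesses, not something confirmed by the repo.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def normalized_to_pixel_bbox(
    bbox: BBox,
    image_size: Tuple[int, int],
    scale: float = 1000.0,  # assumption: coordinates are normalized to 0-1000
) -> Tuple[int, int, int, int]:
    """Map an (x1, y1, x2, y2) box from normalized space to pixel space."""
    x1, y1, x2, y2 = bbox
    width, height = image_size  # same (width, height) order as PIL's image.size

    def to_px(value: float, extent: int) -> int:
        # Scale to pixels, then clamp so the crop stays inside the image.
        return max(0, min(extent, round(value / scale * extent)))

    # Sort each axis so the box is valid even if the corners come out swapped.
    left, right = sorted((to_px(x1, width), to_px(x2, width)))
    top, bottom = sorted((to_px(y1, height), to_px(y2, height)))
    return left, top, right, bottom
```

For example, on a 2000x1000 image the normalized box `(250, 500, 750, 1000)` maps to the pixel box `(500, 500, 1500, 1000)`. If the model already outputs absolute pixels, applying this conversion would itself produce offset boxes, which is why I would like to confirm the convention.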
Looking forward to your advice on improving the grounding accuracy.