-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix: Handle normalized coordinates from local grounding models (e.g. InternVL) #159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: Handle normalized coordinates from local grounding models (e.g. InternVL) #159
Conversation
WalkthroughThe Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
gui_agents/s3/agents/grounding.py (2)
244-249: Regex pattern may miss decimals without leading digits.The pattern
r"\d+\.?\d*"correctly captures most floating-point numbers but will not match numbers starting with a decimal point (e.g., ".456"). While most models output "0.456" format, consider usingr"\d+\.?\d*|\.\d+"for complete coverage.🔎 More robust regex pattern
- # Regex to find floating point numbers (0.xxxx) or integers - numericals = re.findall(r"\d+\.?\d*", response) + # Regex to find floating point numbers (0.xxxx, .xxxx) or integers + numericals = re.findall(r"\d+\.?\d*|\.\d+", response)
242-242: Consider using logger instead of print for debugging.The code uses
print()for debugging output. For consistency with the rest of the codebase (which useslogging.getLogger("desktopenv.agent")at line 16), consider using the logger instead.🔎 Proposed change
- print("RAW GROUNDING MODEL RESPONSE:", response) + logger.debug(f"RAW GROUNDING MODEL RESPONSE: {response}")
| # If coordinates are normalized (0-1), scale them up | ||
| if x <= 1.0 and y <= 1.0: | ||
| x = int(x * self.engine_params_for_grounding["grounding_width"]) | ||
| y = int(y * self.engine_params_for_grounding["grounding_height"]) | ||
| else: | ||
| x = int(x) | ||
| y = int(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add defensive checks for required engine parameters.
The code directly accesses self.engine_params_for_grounding["grounding_width"] and ["grounding_height"] without verifying these keys exist. If these parameters are missing from the configuration, the code will raise a KeyError and crash the agent.
🔎 Proposed fix with defensive checks
# If coordinates are normalized (0-1), scale them up
if x <= 1.0 and y <= 1.0:
- x = int(x * self.engine_params_for_grounding["grounding_width"])
- y = int(y * self.engine_params_for_grounding["grounding_height"])
+ grounding_width = self.engine_params_for_grounding.get("grounding_width")
+ grounding_height = self.engine_params_for_grounding.get("grounding_height")
+ if grounding_width is None or grounding_height is None:
+ raise ValueError("grounding_width and grounding_height must be specified in engine_params_for_grounding when using normalized coordinates")
+ x = int(x * grounding_width)
+ y = int(y * grounding_height)
else:
x = int(x)
y = int(y)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # If coordinates are normalized (0-1), scale them up | |
| if x <= 1.0 and y <= 1.0: | |
| x = int(x * self.engine_params_for_grounding["grounding_width"]) | |
| y = int(y * self.engine_params_for_grounding["grounding_height"]) | |
| else: | |
| x = int(x) | |
| y = int(y) | |
| # If coordinates are normalized (0-1), scale them up | |
| if x <= 1.0 and y <= 1.0: | |
| grounding_width = self.engine_params_for_grounding.get("grounding_width") | |
| grounding_height = self.engine_params_for_grounding.get("grounding_height") | |
| if grounding_width is None or grounding_height is None: | |
| raise ValueError("grounding_width and grounding_height must be specified in engine_params_for_grounding when using normalized coordinates") | |
| x = int(x * grounding_width) | |
| y = int(y * grounding_height) | |
| else: | |
| x = int(x) | |
| y = int(y) |
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 251 to 257, the code accesses
self.engine_params_for_grounding["grounding_width"] and ["grounding_height"]
directly which can raise KeyError; add defensive checks: verify the keys exist
and their values are numeric and >0 before using them (e.g., use dict.get with a
sensible default or explicitly raise a clear ValueError if missing/invalid),
convert to int only after validation, and ensure the fallback behavior for
normalized coordinates still makes sense when defaults are applied.
Consider edge cases in normalized coordinate detection.
The condition x <= 1.0 and y <= 1.0 may misidentify absolute coordinates as normalized in edge cases:
- Absolute coordinates like (0, 0), (1, 1), (0, 500), or (1, 768) where one or both values are ≤ 1.0 would trigger incorrect scaling
- Mixed coordinates (one normalized, one absolute) like (0.5, 500) would be treated as absolute, potentially causing incorrect clicks
While these edge cases are unlikely in typical usage, consider adding a more robust heuristic, such as checking if both coordinates are in the range [0.0, 1.0] and the model is known to output normalized coordinates (via a configuration flag).
🔎 Suggested improvement with configuration flag
Add a configuration parameter to explicitly indicate coordinate format:
# In __init__ or configuration
self.uses_normalized_coords = engine_params_for_grounding.get("normalized_coordinates", False)Then update the detection logic:
- # If coordinates are normalized (0-1), scale them up
- if x <= 1.0 and y <= 1.0:
+ # If coordinates are normalized (0-1), scale them up
+ should_scale = (x <= 1.0 and y <= 1.0) if self.uses_normalized_coords else (x < 1.0 and y < 1.0)
+ if should_scale:
x = int(x * self.engine_params_for_grounding["grounding_width"])
y = int(y * self.engine_params_for_grounding["grounding_height"])
else:Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 251-257, the current check (x
<= 1.0 and y <= 1.0) can misclassify absolute coordinates as normalized; add a
configuration flag self.uses_normalized_coords =
engine_params_for_grounding.get("normalized_coordinates", False) (initialized in
__init__) and change the detection to only scale when
self.uses_normalized_coords is True and both x and y are within [0.0, 1.0];
otherwise treat as absolute coordinates (convert to int) and optionally emit a
debug/warning if one coordinate looks normalized and the other absolute to aid
debugging.
Problem
When using local grounding models like
OpenGVLab/InternVL3_5-4Bvia vLLM, the model may return normalized coordinates (0.0 - 1.0) instead of absolute pixel coordinates.The previous implementation only extracted integers (
re.findall(r"\d+", response)), which caused two issues:0.4518became0and4518).ycoordinate often became a large integer (e.g.,4518), causingpyautoguito trigger aFailSafeExceptionby hitting the screen edge.Solution
Updated
gui_agents/s3/agents/grounding.pyto:re.findall(r"\d+\.?\d*", response).grounding_widthandgrounding_heightto get correct screen pixel values.Testing
Tested with
agent_susingvllmprovider andInternVL3_5-4B. The agent now correctly clicks on target elements instead of crashing withpyautogui.FailSafeException.Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.