Welcome to OmniParser Discussions! #203
Hello, everyone. How are you? Can you point me in the right direction for integrating OmniParser with ChatGPT? Thanks!

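Not an official answer, but one way this could look: a minimal sketch, assuming the repo's omniparserserver is running locally and exposes a `/parse/` endpoint that accepts a base64-encoded screenshot and returns a `parsed_content_list` (the URL, endpoint path, and JSON field names here are assumptions and may differ in your setup). The parsed UI elements are then handed to an OpenAI chat model as plain text.

```python
import base64
import requests
from openai import OpenAI

# Assumption: omniparserserver (from omnitool/omniparserserver) is running locally.
# The port, path, and field names below are hypothetical and may differ in your version.
OMNIPARSER_URL = "http://localhost:8000/parse/"


def parse_screenshot(path: str) -> list:
    """Send a screenshot to omniparserserver and return the parsed UI elements."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(OMNIPARSER_URL, json={"base64_image": image_b64})
    resp.raise_for_status()
    return resp.json().get("parsed_content_list", [])


def ask_chatgpt_about_screen(path: str, question: str) -> str:
    """Describe the parsed screen to a chat model and ask a question about it."""
    elements = parse_screenshot(path)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are given a list of UI elements parsed from a screenshot."},
            {"role": "user",
             "content": f"UI elements: {elements}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content


# Example: print(ask_chatgpt_about_screen("screen.png", "Which button submits the form?"))
```
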
Hi!

I built a connector to use omniparserserver with this project, which uses AI to assist with video games: https://github.com/ShipBit/wingman-ai. Are you open to a PR that makes it possible to use OmniParser with the host computer (instead of a VM)? I assume not, since it would have been easy enough for you to offer this option from the start, but let me know if I'm wrong.

Are there any hardware configuration requirements for running OmniParser V2? What is the minimum configuration?

I just want to express how sad I am that you are using Conda.

Hi, I have deployed OmniParser V2 + OmniTool for a production PoC at my company, but the OS automation does not seem to be working ideally. For example, with the instruction "enter gmail.com in the URL bar and hit enter to access the website", it cannot finish the command. I guess I may not be giving it a good prompt for this task? Any suggestions? Otherwise, I think we do need a prompt-refinement process to improve the app.

Great, great work! One question: can the model also detect activation states in a UI? Specifically, activation indicated by highlighting, icons (like dots, arrows, etc.), color, edges, and so on? I ran a couple of tests and, interestingly, the model always managed to assign the same coloured box to the activated button. Is this a coincidence, or can I somehow access this in the output?

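For what it's worth, activation state is not (as far as I know) an explicit field in the output, but you can inspect what the parser does return per element. A minimal sketch, assuming you already have a `parsed_content_list` (e.g. from the demo or omniparserserver) and assuming per-element fields named `type`, `bbox`, `interactivity`, and `content`; these names are based on the demo output and may differ in your version.

```python
# Minimal sketch: inspect OmniParser's parsed output for a particular element.
# `parsed_content_list` is assumed to come from the demo or omniparserserver;
# the field names used below are assumptions and may differ in your version.

def find_elements(parsed_content_list: list, label: str) -> list:
    """Return parsed elements whose caption/OCR text mentions `label`."""
    return [
        el for el in parsed_content_list
        if label.lower() in str(el.get("content", "")).lower()
    ]


def describe(el: dict) -> str:
    """Summarize the fields OmniParser attaches to an element (no explicit activation state)."""
    return (
        f"type={el.get('type')} "
        f"interactive={el.get('interactivity')} "
        f"bbox={el.get('bbox')} "
        f"content={el.get('content')!r}"
    )


# Example usage with a result already obtained from the parser:
# for el in find_elements(parsed_content_list, "Save"):
#     print(describe(el))
```

If any activation signal exists, it would most likely show up in the caption text or in your own pixel-level check inside the element's bounding box, rather than as a dedicated field.
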
Hi, I'm a computer science student at KAIST. First of all, thank you for hosting this discussion session! It really motivated me to start writing down my small idea. As I recall, OmniParser v1 has three key limitations. Regarding the first two, would it be possible to address them by incorporating the context of the UI screen during training? From my perspective, each UI element is similar to a word in natural language: it varies in position and meaning. Given that modern LLMs effectively capture word context, could a similar approach be applied to UI elements? Specifically, I was thinking of using this kind of contextual information as input data for training. Would this approach help mitigate these limitations? I'd love to hear your thoughts!

👋 Welcome!
We're using Discussions as a place to connect with other members of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others and are open-minded. Remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.