AI-powered browser automation with vision capabilities.
An intelligent agent that can see, understand, and interact with web pages like a human. Uses multimodal LLMs to analyze screenshots and perform actions.
| Feature | Description |
|---|---|
| 🔍 Vision-Based Navigation | Takes screenshots and uses AI to understand page layout |
| 🖱️ Coordinate Clicking | Clicks elements by x,y position, not fragile CSS selectors |
| ⌨️ Keyboard Input | Types text and presses keys (Enter, Tab, etc.) |
| 📜 Scroll Support | Scrolls up/down to reveal hidden content |
| 🔄 Multi-Step Tasks | Executes complex workflows with up to 100 steps |
| 💬 Interactive Chat | Clean terminal UI for real-time task execution |
| 📸 Smart Screenshot Management | Auto-trims history to stay within model limits |
| 🧠 Persistent Memory | Remembers context across tasks within a session |
- Python 3.11+
- uv package manager
- Groq API key (get one free)
# Clone the repo
git clone https://github.com/RaheesAhmed/browseragent.git
cd browseragent
# Install dependencies
uv sync
# Install Playwright browsers
uv run playwright install
# Set your API key
echo "GROQ_API_KEY=your_key_here" > .envuv run python main.py❯ go to google.com and search for "LangChain agents"
❯ navigate to github.com and find the trending repositories
❯ go to chatgpt.com and type "Hello, how are you?"
❯ visit hackernews and tell me the top 3 stories

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    User     │ ──▶  │     LLM     │ ──▶  │  Playwright │
│    Task     │      │  (Vision)   │      │   Browser   │
└─────────────┘      └──────┬──────┘      └──────┬──────┘
                            │                    │
                   Analyzes screenshot     Executes action
                            │                    │
                            ◀────────────────────┘
                              Returns screenshot
- Navigate → Agent goes to the URL
- Screenshot → Captures the page
- Analyze → Vision LLM understands the layout
- Act → Clicks, types, scrolls based on visual understanding
- Repeat → Until the task is complete (a minimal sketch of this loop follows below)
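This loop is the whole design. Below is a minimal sketch of it, assuming a browser object that exposes screenshot/execute helpers and an agent method that asks the vision LLM for the next action; the method names (`take_screenshot`, `decide_next_action`, `execute`) are illustrative, not the actual API of `src/agent.py` or `src/browser_manager.py`.

```python
# Sketch of the navigate → screenshot → analyze → act loop.
# Method names are illustrative, not the project's real API.

def run_task(agent, browser, task: str, max_steps: int = 100) -> str:
    for step in range(max_steps):
        screenshot = browser.take_screenshot()                 # capture the current page
        action = agent.decide_next_action(task, screenshot)    # vision LLM picks an action
        if action.name == "done":                              # LLM reports the task is finished
            return action.result
        browser.execute(action)                                # click / type / press / scroll / navigate
    return "Stopped: reached max_steps without completing the task"
```

Capping the loop at `max_steps` (100 by default) keeps a confused model from clicking indefinitely.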
Edit src/agent.py line 34:
# Default - Llama 4 Scout (vision)
model = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct")
# Alternatives
model = ChatGroq(model="llama-3.2-90b-vision-preview") # Vision
model = ChatGroq(model="llama-3.3-70b-versatile") # Text-only
model = ChatGroq(model="qwen/qwen3-32b") # Text-onlyNote: Vision models see screenshots. Text-only models rely on extracted page text.
Edit src/browser_manager.py:
browser = BrowserManager(headless=True)  # Run without visible browser
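For reference, the sketch below shows one plausible way `BrowserManager` could forward that flag to Playwright's `launch()`; the class body here is an assumption, not the actual contents of `src/browser_manager.py`.

```python
from playwright.sync_api import sync_playwright

class BrowserManager:
    """Sketch only: a thin wrapper that forwards headless to Playwright."""

    def __init__(self, headless: bool = False):
        self._playwright = sync_playwright().start()
        # headless=True launches Chromium without a visible window.
        self._browser = self._playwright.chromium.launch(headless=headless)
        self.page = self._browser.new_page()

    def close(self) -> None:
        self._browser.close()
        self._playwright.stop()
```

Headless mode is handy on servers and in CI, where there is no display to attach a window to.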
Edit src/agent.py line 94:

max_steps: int = 100  # Increase for complex tasks

| Command | Action |
|---|---|
| `exit` / `quit` / `q` | Close the agent |
| `clear` | Reset the screen |
| `help` | Show available commands |
| `memory` | Show memory stats (messages/tasks) |
| `forget` | Clear conversation memory |
browseragent/
├── main.py # Terminal UI entry point
├── src/
│ ├── agent.py # LLM agent logic & prompts
│ ├── browser_manager.py # Playwright automation
│ ├── memory_manager.py # Persistent conversation memory
│ └── schemas.py # Action type definitions
├── .env # API keys (create this)
└── pyproject.toml # Dependencies
| Action | Parameters | Description |
|---|---|---|
| `navigate` | `url` | Go to a URL |
| `click` | `x`, `y` | Click at coordinates |
| `type` | `text` | Type text on focused element |
| `press` | `key` | Press a key (Enter, Tab, Escape) |
| `scroll` | `direction` | Scroll up or down |
| `done` | `result` | Complete task with result |
MIT
Built with ❤️ using LangChain, Playwright, and Groq
