AI-powered browser automation with vision capabilities.
An intelligent agent that can see, understand, and interact with web pages like a human. Uses multimodal LLMs to analyze screenshots and perform actions.
| Feature | Description |
|---|---|
| 🔍 Vision-Based Navigation | Takes screenshots and uses AI to understand page layout |
| 🖱️ Coordinate Clicking | Clicks elements by x,y position, not fragile CSS selectors |
| ⌨️ Keyboard Input | Types text and presses keys (Enter, Tab, etc.) |
| 📜 Scroll Support | Scrolls up/down to reveal hidden content |
| 🔄 Multi-Step Tasks | Executes complex workflows with up to 100 steps |
| 💬 Interactive Chat | Clean terminal UI for real-time task execution |
| 📸 Smart Screenshot Management | Auto-trims history to stay within model limits |
| 🧠 Persistent Memory | Remembers context across tasks within a session |
- Python 3.11+
- uv package manager
- Groq API key (get one free)
# Clone the repo
git clone https://github.com/RaheesAhmed/browseragent.git
cd browseragent
# Install dependencies
uv sync
# Install Playwright browsers
uv run playwright install
# Set your API key
echo "GROQ_API_KEY=your_key_here" > .envuv run python main.py❯ go to google.com and search for "LangChain agents"
❯ navigate to github.com and find the trending repositories
❯ go to chatgpt.com and type "Hello, how are you?"
❯ visit hackernews and tell me the top 3 stories

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    User     │ ──▶  │     LLM     │ ──▶  │  Playwright │
│    Task     │      │  (Vision)   │      │   Browser   │
└─────────────┘      └──────┬──────┘      └──────┬──────┘
                            │                    │
                   Analyzes screenshot     Executes action
                            │                    │
                            ◀────────────────────┘
                              Returns screenshot
- Navigate → Agent goes to the URL
- Screenshot → Captures the page
- Analyze → Vision LLM understands the layout
- Act → Clicks, types, scrolls based on visual understanding
- Repeat → Until the task is complete (a minimal sketch of this loop follows below)
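This loop is the whole design. Below is a minimal sketch of it, assuming a browser object that exposes screenshot/execute helpers and an agent method that asks the vision LLM for the next action; the method names (`take_screenshot`, `decide_next_action`, `execute`) are illustrative, not the actual API of `src/agent.py` or `src/browser_manager.py`.

```python
# Sketch of the navigate → screenshot → analyze → act loop.
# Method names are illustrative, not the project's real API.

def run_task(agent, browser, task: str, max_steps: int = 100) -> str:
    for step in range(max_steps):
        screenshot = browser.take_screenshot()                 # capture the current page
        action = agent.decide_next_action(task, screenshot)    # vision LLM picks an action
        if action.name == "done":                              # LLM reports the task is finished
            return action.result
        browser.execute(action)                                # click / type / press / scroll / navigate
    return "Stopped: reached max_steps without completing the task"
```

Capping the loop at `max_steps` (100 by default) keeps a confused model from clicking indefinitely.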
Edit src/agent.py line 34:
# Default - Llama 4 Scout (vision)
model = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct")
# Alternatives
model = ChatGroq(model="llama-3.2-90b-vision-preview") # Vision
model = ChatGroq(model="llama-3.3-70b-versatile") # Text-only
model = ChatGroq(model="qwen/qwen3-32b") # Text-onlyNote: Vision models see screenshots. Text-only models rely on extracted page text.
Edit src/browser_manager.py:
browser = BrowserManager(headless=True)  # Run without visible browser
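For reference, the sketch below shows one plausible way `BrowserManager` could forward that flag to Playwright's `launch()`; the class body here is an assumption, not the actual contents of `src/browser_manager.py`.

```python
from playwright.sync_api import sync_playwright

class BrowserManager:
    """Sketch only: a thin wrapper that forwards headless to Playwright."""

    def __init__(self, headless: bool = False):
        self._playwright = sync_playwright().start()
        # headless=True launches Chromium without a visible window.
        self._browser = self._playwright.chromium.launch(headless=headless)
        self.page = self._browser.new_page()

    def close(self) -> None:
        self._browser.close()
        self._playwright.stop()
```

Headless mode is handy on servers and in CI, where there is no display to attach a window to.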
Edit src/agent.py line 94:

max_steps: int = 100  # Increase for complex tasks

| Command | Action |
|---|---|
| `exit` / `quit` / `q` | Close the agent |
| `clear` | Reset the screen |
| `help` | Show available commands |
| `memory` | Show memory stats (messages/tasks) |
| `forget` | Clear conversation memory |
browseragent/
├── main.py # Terminal UI entry point
├── src/
│ ├── agent.py # LLM agent logic & prompts
│ ├── browser_manager.py # Playwright automation
│ ├── memory_manager.py # Persistent conversation memory
│ └── schemas.py # Action type definitions
├── .env # API keys (create this)
└── pyproject.toml # Dependencies
| Action | Parameters | Description |
|---|---|---|
| `navigate` | `url` | Go to a URL |
| `click` | `x`, `y` | Click at coordinates |
| `type` | `text` | Type text on focused element |
| `press` | `key` | Press a key (Enter, Tab, Escape) |
| `scroll` | `direction` | Scroll up or down |
| `done` | `result` | Complete task with result |
MIT
Built with ❤️ using LangChain, Playwright, and Groq
