RaheesAhmed/browseragent

🌐 Browser Agent

AI-powered browser automation with vision capabilities.

An intelligent agent that can see, understand, and interact with web pages like a human. It uses a multimodal LLM to analyze screenshots of the page and decide which action to perform next.


✨ Features

| Feature | Description |
| --- | --- |
| 🔍 Vision-Based Navigation | Takes screenshots and uses AI to understand page layout |
| 🖱️ Coordinate Clicking | Clicks elements by x,y position, not fragile CSS selectors |
| ⌨️ Keyboard Input | Types text and presses keys (Enter, Tab, etc.) |
| 📜 Scroll Support | Scrolls up/down to reveal hidden content |
| 🔄 Multi-Step Tasks | Executes complex workflows with up to 100 steps |
| 💬 Interactive Chat | Clean terminal UI for real-time task execution |
| 📸 Smart Screenshot Management | Auto-trims history to stay within model limits |
| 🧠 Persistent Memory | Remembers context across tasks within a session |
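To illustrate the Smart Screenshot Management row, here is a minimal sketch of one way to trim screenshot history. It assumes messages are plain dicts whose content is a list of OpenAI-style parts; the function name `trim_screenshots` and the `max_images` parameter are hypothetical and may not match what src/agent.py actually does.

```python
from typing import Any

Message = dict[str, Any]


def trim_screenshots(messages: list[Message], max_images: int = 3) -> list[Message]:
    """Return a copy of `messages` that keeps only the newest `max_images` screenshots.

    Older image parts are replaced by a short text placeholder so the
    conversation keeps its shape while the payload stays small.
    """
    remaining = max_images
    trimmed: list[Message] = []
    # Walk newest-to-oldest so the most recent screenshots are the ones kept.
    for message in reversed(messages):
        content = message["content"]
        if isinstance(content, list):
            new_parts = []
            for part in reversed(content):
                if part.get("type") == "image_url" and remaining > 0:
                    remaining -= 1
                    new_parts.append(part)
                elif part.get("type") == "image_url":
                    new_parts.append({"type": "text", "text": "[screenshot omitted]"})
                else:
                    new_parts.append(part)
            new_parts.reverse()
            content = new_parts
        # Build a new dict so the caller's history is left untouched.
        trimmed.append({**message, "content": content})
    trimmed.reverse()
    return trimmed
```

Calling this before each model request keeps the number of images bounded regardless of how many steps the task has taken.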

🚀 Quick Start

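Prerequisites

You need Python with uv installed (the commands below use uv to install dependencies and run the agent) and a Groq API key.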

Installation

# Clone the repo
git clone https://github.com/RaheesAhmed/browseragent.git
cd browseragent

# Install dependencies
uv sync

# Install Playwright browsers
uv run playwright install

# Set your API key
echo "GROQ_API_KEY=your_key_here" > .env

Run

uv run python main.py

💡 Usage Examples

❯ go to google.com and search for "LangChain agents"
❯ navigate to github.com and find the trending repositories
❯ go to chatgpt.com and type "Hello, how are you?"
❯ visit hackernews and tell me the top 3 stories

🧠 How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   User      │ ──▶ │   LLM       │ ──▶ │  Playwright │
│   Task      │     │  (Vision)   │     │   Browser   │
└─────────────┘     └──────┬──────┘     └──────┬──────┘
                          │                    │
                   Analyzes screenshot    Executes action
                          │                    │
                          ◀────────────────────┘
                         Returns screenshot

1. Navigate → Agent goes to the URL
2. Screenshot → Captures the page
3. Analyze → Vision LLM understands the layout
4. Act → Clicks, types, scrolls based on visual understanding
5. Repeat → Until task is complete
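The steps above can be sketched as a small observe-act loop. This is an illustration under stated assumptions, not the code in src/agent.py: `Action`, `run_task`, `decide`, and `execute` are hypothetical names, with `decide` standing in for the vision LLM call and `execute` for the Playwright side.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str                      # "navigate", "click", "type", "press", "scroll", or "done"
    args: dict = field(default_factory=dict)


def run_task(
    task: str,
    decide: Callable[[str, bytes], Action],   # stand-in for the LLM: (task, screenshot) -> next action
    execute: Callable[[Action], bytes],       # stand-in for the browser: performs the action, returns a new screenshot
    max_steps: int = 100,
) -> Optional[str]:
    """Run the analyze/act loop until the model reports "done" or the step budget runs out."""
    screenshot = b""  # nothing has been captured before the first action
    for _ in range(max_steps):
        action = decide(task, screenshot)
        if action.kind == "done":
            return action.args.get("result")
        screenshot = execute(action)
    return None  # step budget exhausted without a "done" action
```

Passing the two callables in keeps the loop independent of any particular model or browser, which also makes it easy to exercise with scripted stubs.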

⚙️ Configuration

Change Model

Edit src/agent.py line 34:

# Default - Llama 4 Scout (vision)
model = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct")

# Alternatives
model = ChatGroq(model="llama-3.2-90b-vision-preview")    # Vision
model = ChatGroq(model="llama-3.3-70b-versatile")         # Text-only
model = ChatGroq(model="qwen/qwen3-32b")                  # Text-only

Note: Vision models see screenshots. Text-only models rely on extracted page text.
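As a rough sketch of the difference described in the note, the helper below builds the user message content for either kind of model. It assumes OpenAI-style content parts (a text part plus an `image_url` part carrying a base64 data URL); `build_user_content` and its parameters are hypothetical and not taken from src/agent.py.

```python
import base64
from typing import Optional


def build_user_content(
    task: str,
    page_text: str,
    screenshot_png: Optional[bytes] = None,
    vision: bool = True,
) -> list[dict]:
    """Build message content parts for a vision or a text-only model."""
    if vision and screenshot_png is not None:
        # Vision models receive the screenshot as a base64 data URL.
        encoded = base64.b64encode(screenshot_png).decode("ascii")
        return [
            {"type": "text", "text": f"Task: {task}"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ]
    # Text-only models cannot see images, so they get the extracted page text instead.
    return [{"type": "text", "text": f"Task: {task}\n\nPage text:\n{page_text}"}]
```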

Headless Mode

Edit src/browser_manager.py:

browser = BrowserManager(headless=True)  # Run without visible browser

Max Steps

Edit src/agent.py line 94:

max_steps: int = 100  # Increase for complex tasks

🎮 Commands

| Command | Action |
| --- | --- |
| exit / quit / q | Close the agent |
| clear | Reset the screen |
| help | Show available commands |
| memory | Show memory stats (messages/tasks) |
| forget | Clear conversation memory |
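A minimal sketch of how terminal input could be routed to these commands, with anything else treated as a task for the agent. The function name `classify_input` is hypothetical, and the case-insensitive matching is an assumption rather than documented behavior of main.py.

```python
EXIT_WORDS = {"exit", "quit", "q"}
BUILTIN_COMMANDS = {"clear", "help", "memory", "forget"}


def classify_input(line: str) -> str:
    """Map a line of terminal input to "exit", a built-in command name, "empty", or "task"."""
    word = line.strip().lower()  # assumption: commands are matched case-insensitively
    if not word:
        return "empty"
    if word in EXIT_WORDS:
        return "exit"
    if word in BUILTIN_COMMANDS:
        return word
    # Anything that is not a known command is handed to the agent as a task.
    return "task"
```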

📁 Project Structure

browseragent/
├── main.py              # Terminal UI entry point
├── src/
│   ├── agent.py         # LLM agent logic & prompts
│   ├── browser_manager.py  # Playwright automation
│   ├── memory_manager.py   # Persistent conversation memory
│   └── schemas.py       # Action type definitions
├── .env                 # API keys (create this)
└── pyproject.toml       # Dependencies

🛠️ Supported Actions

| Action | Parameters | Description |
| --- | --- | --- |
| navigate | url | Go to a URL |
| click | x, y | Click at coordinates |
| type | text | Type text on focused element |
| press | key | Press a key (Enter, Tab, Escape) |
| scroll | direction | Scroll up or down |
| done | result | Complete task with result |
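The table above maps naturally onto Playwright's page API. The sketch below shows one possible dispatcher; `execute_action` and `SCROLL_PIXELS` are hypothetical, and the scroll distance in particular is an assumed value. The `page` argument is expected to behave like a Playwright sync `Page` (`goto`, `mouse.click`, `mouse.wheel`, `keyboard.type`, `keyboard.press`).

```python
from typing import Any, Optional

SCROLL_PIXELS = 600  # assumed scroll distance per step, not taken from the project


def execute_action(page: Any, name: str, params: dict) -> Optional[str]:
    """Perform one action on a Playwright-like page; return the result string for "done"."""
    if name == "navigate":
        page.goto(params["url"])
    elif name == "click":
        page.mouse.click(params["x"], params["y"])
    elif name == "type":
        page.keyboard.type(params["text"])
    elif name == "press":
        page.keyboard.press(params["key"])
    elif name == "scroll":
        direction = params["direction"]
        if direction not in ("up", "down"):
            raise ValueError(f"unknown scroll direction: {direction!r}")
        # Positive vertical delta scrolls down, negative scrolls up.
        page.mouse.wheel(0, SCROLL_PIXELS if direction == "down" else -SCROLL_PIXELS)
    elif name == "done":
        return params["result"]
    else:
        raise ValueError(f"unknown action: {name!r}")
    return None
```

Because the function only relies on duck typing, it can be tested with a small fake page object before pointing it at a real browser.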

📝 License

MIT


Built with ❤️ using LangChain, Playwright, and Groq
