WebAgent is an automated web browsing tool powered by AI that can navigate websites, understand page content, and complete tasks based on user instructions.
- AI-Powered Visual Understanding: Uses Google's Gemini API to analyze screenshots and understand webpage content
- Multiple Operation Modes: Choose between HTML-only mode (faster) or screenshot+HTML modes (more accurate)
- Batch Processing: Capture multiple screenshots before analysis for better context awareness
- Automated Navigation: Intelligently clicks, types, selects, and navigates through websites
- Cookie Consent Handling: Automatically handles cookie consent popups
- Scrollable Content Analysis: Captures content by scrolling through long pages
- Python 3.7+
- Chrome/Chromium browser
- ChromeDriver (compatible with your browser version)
- Google Gemini API key
selenium
beautifulsoup4
google-genai
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
- Make sure ChromeDriver is installed and in your PATH
- Set your Google Gemini API key in the GEMINI_API_KEY variable used by the code, by creating a .env file that contains the line GEMINI_API_KEY=YOUR_KEY
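As a rough illustration of the last step, the sketch below reads GEMINI_API_KEY from a .env file using only the standard library. The helper names `load_env_file` and `get_gemini_api_key` are hypothetical and are not taken from WebAgent.py, which may load the key differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_gemini_api_key(path=".env"):
    """Return the API key, or raise a clear error if it is missing."""
    load_env_file(path)
    api_key = os.environ.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")
    return api_key
```

Keeping the key in .env rather than in the source file makes it easier to exclude from version control.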
Run the script with:
python WebAgent.py
Then follow the interactive prompts to enter your task instructions.
- -help: Display help information
- -url [new_url]: Change the target URL
- -set mode [option]: Set operation mode
  - html: HTML-only mode (no screenshots)
  - batch: Batch mode - capture all screenshots before processing (default)
  - standard: Process each screenshot immediately
- -set scrolls [number]: Set maximum number of page scrolls
- quit: Exit the program
-url https://www.imdb.com
-set mode html
-set scrolls 5
Tell me top3 on IMDb this week
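The session above mixes commands with a free-text task. A minimal sketch of how such input could be told apart is shown below. The `Settings` class, the `handle_input` function, and the default of 3 scrolls are assumptions made for illustration; the actual parsing in WebAgent.py may differ.

```python
VALID_MODES = {"html", "batch", "standard"}


class Settings:
    """Session settings (hypothetical container, not from WebAgent.py)."""

    def __init__(self):
        self.url = ""
        self.mode = "batch"   # documented default mode
        self.max_scrolls = 3  # assumed default; the README does not state one


def handle_input(line, settings):
    """Return ("command", message), ("quit", ""), or ("task", text)."""
    text = line.strip()
    if text == "quit":
        return ("quit", "")
    if text == "-help":
        return ("command", "Commands: -help, -url, -set mode, -set scrolls, quit")
    if text.startswith("-url "):
        settings.url = text[len("-url "):].strip()
        return ("command", f"URL set to {settings.url}")
    if text.startswith("-set mode "):
        mode = text[len("-set mode "):].strip()
        if mode not in VALID_MODES:
            return ("command", f"Unknown mode: {mode}")
        settings.mode = mode
        return ("command", f"Mode set to {mode}")
    if text.startswith("-set scrolls "):
        value = text[len("-set scrolls "):].strip()
        if not value.isdigit():
            return ("command", "Scroll count must be a non-negative integer")
        settings.max_scrolls = int(value)
        return ("command", f"Max scrolls set to {value}")
    # Anything else is treated as a task instruction for the agent.
    return ("task", text)
```

Fed the four lines of the example session in order, this sketch updates the URL, mode, and scroll count, then returns the last line as the task.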
HTML mode (html): Uses only HTML content for analysis, without capturing screenshots. Faster, but potentially less accurate on visually complex pages.
Batch mode (batch, the default): Captures multiple screenshots while scrolling through the page, then analyzes them together for better context understanding.
Standard mode (standard): Processes each screenshot immediately after capture. Useful for very long pages where batch processing might hit token limits.
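The difference between the three modes comes down to how many screenshots go into each analysis call. The sketch below illustrates that under stated assumptions: `take_screenshot` and `analyze` are hypothetical callables standing in for the browser capture and the Gemini request, and one screenshot is taken per scroll.

```python
def run_mode(mode, html, take_screenshot, analyze, max_scrolls):
    """Return the list of analysis results for one page (illustrative only)."""
    if mode == "html":
        # One call, HTML only, no screenshots captured.
        return [analyze(html, [])]
    if mode == "batch":
        # Capture everything first, then analyze all screenshots together.
        shots = [take_screenshot(i) for i in range(max_scrolls)]
        return [analyze(html, shots)]
    if mode == "standard":
        # One call per screenshot, issued right after each capture.
        return [analyze(html, [take_screenshot(i)]) for i in range(max_scrolls)]
    raise ValueError(f"Unknown mode: {mode}")
```

In this sketch, batch mode sends one larger request while standard mode sends several smaller ones, which is why standard mode is the safer choice when a single request might exceed token limits.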
- Opens the specified URL in a Chrome browser
- Handles any cookie consent dialogs
- Captures content through screenshots and/or HTML parsing
- Analyzes content using Google's Gemini AI models
- Determines the best actions to fulfill the user's goal
- Executes actions on the page
- Repeats the process until an answer is found or the maximum number of steps is reached
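The steps above form an observe, decide, act loop with a step limit. A minimal sketch of that control flow follows. The callables `observe`, `decide`, and `act`, the dictionary shape returned by `decide`, and the default of 10 steps are all assumptions for illustration, not the structure used in WebAgent.py.

```python
def run_agent(goal, observe, decide, act, max_steps=10):
    """Loop until decide() yields an answer or max_steps is exhausted.

    Returns (answer, steps_used); answer is None if no answer was found.
    """
    for step in range(1, max_steps + 1):
        observation = observe()               # screenshots and/or HTML
        decision = decide(goal, observation)  # model picks an answer or an action
        if decision.get("answer") is not None:
            return decision["answer"], step
        act(decision["action"])               # click, type, select, navigate
    return None, max_steps
```

The step limit keeps the loop from running indefinitely when the model never reaches an answer.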
- Screenshots are temporarily stored in the tmp/ folder and cleaned up after task completion
- The Gemini API key should be kept secure and not committed to public repositories
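One way to guarantee the screenshot cleanup described above, even when a task fails partway through, is a try/finally block. The sketch below is an assumption about how this could be done; `run_with_screenshot_cleanup` is a hypothetical helper and is not taken from WebAgent.py.

```python
import shutil
from pathlib import Path


def run_with_screenshot_cleanup(task, screenshot_dir="tmp"):
    """Run task(folder) and delete the screenshot folder afterwards."""
    folder = Path(screenshot_dir)
    folder.mkdir(parents=True, exist_ok=True)
    try:
        return task(folder)
    finally:
        # Remove the folder whether the task succeeded or raised.
        shutil.rmtree(folder, ignore_errors=True)
```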