This project is a browser automation agent that can understand natural language instructions and perform actions in a web browser, using Gemini LLMs and Playwright.
The agent consists of three main components:
- Planner: Breaks down a user's natural language query into high-level steps.
- Executor: Performs each step by interacting with the browser through Playwright.
- Verifier: Checks whether each step was completed successfully.
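The way the three components fit together can be sketched as a plan, execute, verify loop. This is a minimal illustration, not the project's actual code: `run_agent`, `StepResult`, and the `toy_*` stand-ins are hypothetical names, and the retry-then-stop behavior is an assumption. In the real agent the planner and verifier would call a Gemini model and the executor would drive Playwright.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    succeeded: bool
    attempts: int


def run_agent(
    query: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
    max_retries: int = 2,
) -> List[StepResult]:
    """Plan once, then execute and verify each step, retrying on failure."""
    results: List[StepResult] = []
    for step in plan(query):
        attempts = 0
        succeeded = False
        # One initial attempt plus up to max_retries retries (assumed policy).
        while attempts <= max_retries and not succeeded:
            attempts += 1
            execute(step)
            succeeded = verify(step)
        results.append(StepResult(step, succeeded, attempts))
        if not succeeded:
            break  # assumed: stop at the first step that cannot be completed
    return results


# Hypothetical stand-ins so the sketch runs without an LLM or a browser.
performed: List[str] = []


def toy_plan(query: str) -> List[str]:
    return [part.strip() for part in query.split(",")]


def toy_execute(step: str) -> None:
    performed.append(step)


def toy_verify(step: str) -> bool:
    return step in performed


results = run_agent("open reddit.com, search ghibli", toy_plan, toy_execute, toy_verify)
for r in results:
    print(f"{r.step}: ok={r.succeeded} attempts={r.attempts}")
```

Passing the three components in as plain callables keeps the loop testable with stubs like the ones above.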
- Clone the repository
- Install the requirements:
pip install -r requirements.txt
- Install Playwright browsers:
playwright install chromium
- Create a .env file with your Gemini API key:
GEMINI_API_KEY=your_api_key_here
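For reference, a script might pick up the key from the .env file roughly as follows. This is a standard-library sketch under stated assumptions: `load_env_file` and `get_gemini_api_key` are hypothetical helpers, and the project itself may load the file differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing early with a clear error when the key is missing is usually easier to diagnose than an authentication failure on the first model call.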
Run the main script:
python interact.py
Enter your natural language query when prompted. For example:
login to reddit.com, search ghibli, save the first three posts with metadata like date, author etc.
The agent will:
- Plan the steps needed to achieve your goal
- Execute each step in the browser
- Verify each step was completed successfully
- Report progress and results
Edit config.py to modify the agent's behavior:
- Change model names
- Enable/disable headless mode
- Adjust retry limits
- Enable/disable human-in-the-loop mode
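As an illustration of what these settings might look like, here is a hypothetical configuration object. None of the field names or defaults are taken from the actual config.py, and the model name strings are placeholders to replace with the models you use.

```python
from dataclasses import dataclass


@dataclass
class AgentConfig:
    # Every name and default here is an assumption, not read from config.py.
    planner_model: str = "your-gemini-model-name"
    verifier_model: str = "your-gemini-model-name"
    headless: bool = False           # False shows the browser window
    max_retries: int = 3             # retries per step before giving up
    human_in_the_loop: bool = False  # True would pause for user confirmation

    def __post_init__(self) -> None:
        if self.max_retries < 0:
            raise ValueError("max_retries must be non-negative")


# Example: run without a visible browser and allow a single retry per step.
config = AgentConfig(headless=True, max_retries=1)
print(config)
```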
- interact.py: Entry point script
- models.py: Data models
- planner.py: Step planning module
- executor.py: Browser interaction module
- verifier.py: Step verification module
- utils.py: Utility functions
- config.py: Configuration settings