
Commit d31fde0

feat(browser.py): add BrowserReplayStrategy; support browser modes record/replay (#872)
* add BrowserReplayStrategy; support browser modes record/replay
* minor refactor
* black/flake8
* update README
* improve README
* add BrowserReplayStrategy to README
* add strategies/visual_browser.py
* fix Action.from_dict and test_action_from_dict to support <cmd>-t
* calculate_tokens_and_cost; bugfix ActionEvent.fromdict; add ActionEvent.next_event; add TODOs; add visual_browser.py::SKIP_MOVE_BEFORE_CLICK
* handle mousemove/scroll; add_screen_tlbr forwards and backwards; RAW_PRECISE/IMPRECISE_MOUSE_EVENTS; openai.MAX_IMAGES = 90; fix merge_consecutive_mouse_scroll_events and tests; filter_invalid_window_events; dump_state timeout;
* add TODO
* noqa
1 parent f4bfc90 commit d31fde0

20 files changed: +1923 -274 lines changed

README.md

Lines changed: 19 additions & 41 deletions
@@ -35,9 +35,8 @@ with the power of Large Multimodal Modals (LMMs) by:
 - Recording screenshots and associated user input
 - Aggregating and visualizing user input and recordings for development
 - Converting screenshots and user input into tokenized format
-- Generating synthetic input via transformer model completions
-- Generating task trees by analyzing recordings (work-in-progress)
-- Replaying synthetic input to complete tasks (work-in-progress)
+- Generating and replaying synthetic input via transformer model completions
+- Generating process graphs by analyzing recording logs (work-in-progress)

 The goal is similar to that of
 [Robotic Process Automation](https://en.wikipedia.org/wiki/Robotic_process_automation),
@@ -165,37 +164,6 @@ pointing the cursor and left or right clicking, as described in this
 [open issue](https://github.com/OpenAdaptAI/OpenAdapt/issues/145)


-### Capturing Browser Events
-
-To capture (record) browser events in Chrome, follow these steps:
-
-1. Go to: [Chrome Extension Page](chrome://extensions/)
-
-2. Enable `Developer mode` (located at the top right):
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/c97eb9fb-05d6-465d-85b3-332694556272)
-
-3. Click `Load unpacked` (located at the top left).
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/00c8adf5-074a-4655-b132-fd87644007fc)
-
-4. Select the `chrome_extension` directory:
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/71610ed3-f8d4-431a-9a22-d901127b7b0c)
-
-5. You should see the following confirmation, indicating that the extension is loaded:
-
-![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/7ee19da9-37e0-448f-b9ab-08ef99110e85)
-
-6. Set the flag to `true` if it is currently `false`:
-
-![image](https://github.com/user-attachments/assets/8eba24a3-7c68-4deb-8fbe-9d03cece1482)
-
-7. Start recording. Once recording begins, navigate to the Chrome browser, browse some pages, and perform a few clicks. Then, stop the recording and let it complete successfully.
-
-8. After recording, check the `openadapt.db` table `browser_event`. It should contain all your browser activity logs. You can verify the data's correctness using the `sqlite3` CLI or an extension like `SQLite Viewer` in VS Code to open `data/openadapt.db`.
-
-
 ### Visualize

 Quickly visualize the latest recording you created by running the following command:
@@ -243,6 +211,7 @@ Other replay strategies include:
 - [`StatefulReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/stateful.py): Early proof-of-concept which uses the OpenAI GPT-4 API with prompts constructed via OS-level window data.
 - (*)[`VisualReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py): Uses [Fast Segment Anything Model (FastSAM)](https://github.com/CASIA-IVA-Lab/FastSAM) to segment active window.
 - (*)[`VanillaReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py): Assumes the model is capable of directly reasoning on states and actions accurately. With future frontier models, we hope that this script will suddenly work a lot better.
+- (*)[`VisualBrowserReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual_browser.py): Like VisualReplayStrategy but generates segments from the visible DOM read by the browser extension.


 The (*) prefix indicates strategies which accept an "instructions" parameter that is used to modify the recording, e.g.:
@@ -253,6 +222,22 @@ python -m openadapt.replay VanillaReplayStrategy --instructions "calculate 9-8"

 See https://github.com/OpenAdaptAI/OpenAdapt/tree/main/openadapt/strategies for a complete list. More ReplayStrategies coming soon! (see [Contributing](#Contributing)).

+### Browser integration
+
+To record browser events in Google Chrome (required by the `BrowserReplayStrategy`), follow these steps:
+
+1. Go to your Chrome extensions page by entering [chrome://extensions](chrome://extensions/) in your address bar.
+
+2. Enable `Developer mode` (located at the top right).
+
+3. Click `Load unpacked` (located at the top left).
+
+4. Select the `chrome_extension` directory in the OpenAdapt repo.
+
+5. Make sure the Chrome extension is enabled (the switch to the right of the OpenAdapt extension widget is turned on).
+
+6. Set the `RECORD_BROWSER_EVENTS` flag to `true` in `openadapt/data/config.json`.
+
 ## Features

 ### State-of-the-art GUI understanding via [Segment Anything in High Quality](https://github.com/SysCV/sam-hq):
@@ -306,13 +291,6 @@ We're looking forward to your contributions. Let's build the future 🚀

 ## Contributing

-### Notable Works-in-progress (incomplete, see https://github.com/OpenAdaptAI/OpenAdapt/pulls and https://github.com/OpenAdaptAI/OpenAdapt/issues/ for more)
-
-- [Video Recording Hardware Acceleration](https://github.com/OpenAdaptAI/OpenAdapt/issues/570) (help wanted)
-- [Audio Narration](https://github.com/OpenAdaptAI/OpenAdapt/pull/346) (help wanted)
-- [Chrome Extension](https://github.com/OpenAdaptAI/OpenAdapt/pull/364) (help wanted)
-- [Gemini Vision](https://github.com/OpenAdaptAI/OpenAdapt/issues/551) (help wanted)
-
 ### Replay Problem Statement

 Our goal is to automate the task described and demonstrated in a `Recording`.
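
Step 6 of the Browser integration section added to README.md asks for `RECORD_BROWSER_EVENTS` to be `true` in `openadapt/data/config.json`. A minimal sketch of how that setting could be checked from the text of a config file is shown here. The helper name `isBrowserRecordingEnabled` is hypothetical and not part of this commit; only the `RECORD_BROWSER_EVENTS` key comes from the README.

```javascript
// Hypothetical helper (not part of this commit): reports whether browser
// recording is enabled in the JSON text of a config file.
// Only the RECORD_BROWSER_EVENTS key is taken from the README.
function isBrowserRecordingEnabled(configText) {
  const config = JSON.parse(configText);
  // Require the JSON boolean `true`; a string such as "true" does not count.
  return config.RECORD_BROWSER_EVENTS === true;
}

// Illustrative config contents; a real config.json has many more keys.
const exampleConfig = JSON.stringify({ RECORD_BROWSER_EVENTS: true });
console.log(isBrowserRecordingEnabled(exampleConfig));
```

The strict comparison is a deliberate choice in this sketch: it flags a config where the value was written as a quoted string rather than a boolean.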

chrome_extension/background.js

Lines changed: 60 additions & 13 deletions
@@ -1,33 +1,28 @@
 /**
  * @file background.js
- * @description Creates a new background script that listens for messages from the content script
- * and sends them to a WebSocket server.
- */
+ * @description Background script that maintains the current mode and communicates with content scripts.
+ */

 let socket;
+let currentMode = null; // Maintain the current mode here
 let timeOffset = 0; // Global variable to store the time offset

-/*
- * TODO:
- * Ideally we read `WS_SERVER_PORT`, `WS_SERVER_ADDRESS` and
- * `RECONNECT_TIMEOUT_INTERVAL` from config.py,
- * or it gets passed in somehow.
- */
+/*
+ * Note: these need to match the corresponding values in config[.defaults].json
+ */
 let RECONNECT_TIMEOUT_INTERVAL = 1000; // ms
 let WS_SERVER_PORT = 8765;
 let WS_SERVER_ADDRESS = "localhost";
 let WS_SERVER_URL = "ws://" + WS_SERVER_ADDRESS + ":" + WS_SERVER_PORT;

-
 function socketSend(socket, message) {
   console.log({ message });
   socket.send(JSON.stringify(message));
 }

-
 /*
  * Function to connect to the WebSocket server.
- */
+ */
 function connectWebSocket() {
   socket = new WebSocket(WS_SERVER_URL);

@@ -38,11 +33,34 @@ function connectWebSocket() {
   socket.onmessage = function(event) {
     console.log("Message from server:", event.data);
     const message = JSON.parse(event.data);
+
+    // Handle mode messages
+    if (message.type === 'SET_MODE') {
+      currentMode = message.mode; // Update the current mode
+      console.log(`Mode set to: ${currentMode}`);
+
+      // Send the mode to all active tabs
+      chrome.tabs.query(
+        {
+          active: true,
+        },
+        function(tabs) {
+          tabs.forEach(function(tab) {
+            chrome.tabs.sendMessage(tab.id, message, function(response) {
+              if (chrome.runtime.lastError) {
+                console.error("Error sending message to content script in tab " + tab.id, chrome.runtime.lastError.message);
+              } else {
+                console.log("Message sent to content script in tab " + tab.id, response);
+              }
+            });
+          });
+        }
+      );
+    }
   };

   socket.onclose = function(event) {
     console.log("WebSocket connection closed", event);
-    // Reconnect after 5 seconds if the connection is lost
     setTimeout(connectWebSocket, RECONNECT_TIMEOUT_INTERVAL);
   };

@@ -66,3 +84,32 @@ chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
     sendResponse({ status: "WebSocket connection not open" });
   }
 });
+
+/* Listen for tab activation */
+chrome.tabs.onActivated.addListener((activeInfo) => {
+  // Send current mode to the newly active tab if it's not null
+  if (currentMode) {
+    const message = { type: 'SET_MODE', mode: currentMode };
+    chrome.tabs.sendMessage(activeInfo.tabId, message, function(response) {
+      if (chrome.runtime.lastError) {
+        console.error("Error sending message to content script in tab " + activeInfo.tabId, chrome.runtime.lastError.message);
+      } else {
+        console.log("Message sent to content script in tab " + activeInfo.tabId, response);
+      }
+    });
+  }
+});
+
+/* Listen for tab updates to handle new pages or reloading */
+chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
+  if (changeInfo.status === 'complete' && currentMode) {
+    const message = { type: 'SET_MODE', mode: currentMode };
+    chrome.tabs.sendMessage(tabId, message, function(response) {
+      if (chrome.runtime.lastError) {
+        console.error("Error sending message to content script in tab " + tabId, chrome.runtime.lastError.message);
+      } else {
+        console.log("Message sent to content script in tab " + tabId, response);
+      }
+    });
+  }
+});
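
The same `SET_MODE` delivery and `chrome.runtime.lastError` check appears three times in this diff: in `socket.onmessage`, in `tabs.onActivated`, and in `tabs.onUpdated`. One possible way to factor it out is sketched below. The helper `sendModeToTab`, its `chromeApi` and `log` parameters, the `makeFakeChrome` stand-in, and the `"record"` mode value are all assumptions made for illustration; none of them are part of this commit.

```javascript
// Sketch only: a single helper for the SET_MODE delivery that background.js
// repeats in three places. `chromeApi` is passed in (instead of using the
// global `chrome`) so the sketch can run outside the browser; that parameter
// and the `log` parameter are assumptions, not part of the commit.
function sendModeToTab(chromeApi, tabId, mode, log = console) {
  const message = { type: "SET_MODE", mode };
  chromeApi.tabs.sendMessage(tabId, message, function (response) {
    if (chromeApi.runtime.lastError) {
      log.error(
        "Error sending message to content script in tab " + tabId,
        chromeApi.runtime.lastError.message
      );
    } else {
      log.log("Message sent to content script in tab " + tabId, response);
    }
  });
}

// Minimal stand-in for the parts of the chrome API used above.
// It records each message and immediately invokes the callback.
function makeFakeChrome() {
  const sent = [];
  return {
    sent,
    runtime: { lastError: undefined },
    tabs: {
      sendMessage(tabId, message, callback) {
        sent.push({ tabId, message });
        callback({ status: "ok" });
      },
    },
  };
}

// Usage with the stand-in; "record" is an assumed mode value.
const fakeChrome = makeFakeChrome();
sendModeToTab(fakeChrome, 7, "record");
console.log(fakeChrome.sent);
```

Inside the extension, each of the three call sites could then pass the global `chrome` object, for example `sendModeToTab(chrome, activeInfo.tabId, currentMode)` in the `onActivated` listener, which would keep the error handling in one place.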
