
Commit 6bc20d0

Merge pull request #1 from MacPaw/upload_files
Upload files
2 parents 2f4bdf9 + 8c5346e commit 6bc20d0

File tree

23 files changed: +8273 −1 lines changed

LICENSE

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
MIT License

Copyright (c) 2025, Anonymous

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

---

This project includes and builds upon the BLIP model developed by Salesforce.com, Inc., which is licensed under the BSD 3-Clause License:

BSD 3-Clause License

Copyright (c) 2022, Salesforce.com, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 133 additions & 1 deletion
@@ -1,2 +1,134 @@
[![MacPaw Research](https://pbs.twimg.com/profile_banners/3993798502/1720615716/1500x500)](https://research.macpaw.com)

# Screen2AX

A research-driven project for generating accessibility trees for macOS applications using computer vision and deep learning. Read more about the project in our [paper]().

---

## 📁 Datasets

- [Screen2AX-Tree](https://huggingface.co/datasets/macpaw-research/Screen2AX-Tree)
- [Screen2AX-Element](https://huggingface.co/datasets/macpaw-research/Screen2AX-Element)
- [Screen2AX-Group](https://huggingface.co/datasets/macpaw-research/Screen2AX-Group)
- [Screen2AX-Task](https://huggingface.co/datasets/macpaw-research/Screen2AX-Task)

## 🤖 Models

- [YOLOv11l — UI Elements Detection](https://huggingface.co/macpaw-research/yolov11l-ui-elements-detection)
- [BLIP — UI Elements Captioning](https://huggingface.co/macpaw-research/blip-icon-captioning)
- [YOLOv11l — UI Groups Detection](https://huggingface.co/macpaw-research/yolov11l-ui-groups-detection)

---

## 🛠 Requirements

- macOS
- Python (recommended ≥ 3.11)
- Conda
- Pip

---

## ⚙️ Installation

Create and activate the project environment:

```bash
conda create -n screen2ax python=3.11
conda activate screen2ax
pip install -r requirements.txt
```

## 🚀 Usage

> ⚠️ The first run may take longer due to model downloads and initial setup.

### Accessibility Generation

Run the accessibility generation script:

```bash
python -m hierarchy_dl.hierarchy --help
```

#### Available Options

```
usage: hierarchy.py [-h] [--image IMAGE] [--save] [--filename FILENAME] [--save_dir SAVE_DIR] [--flat]

options:
  -h, --help           show this help message and exit
  --image IMAGE        Path to the image
  --save               Save the result
  --filename FILENAME  Filename to save the result
  --save_dir SAVE_DIR  Directory to save the result. Default is './results/'
  --flat               Generate flat hierarchy (no groups)
```

##### Example

Run the accessibility generation script on a screenshot of the Spotify app:

```bash
python -m hierarchy_dl.hierarchy --image ./screenshots/spotify.png --save --filename spotify.json
```

This will generate a JSON file with the accessibility tree of the app in the `results` folder.

### Screen Reader

Run the screen reader:

```bash
python -m screen_reader.screen_reader --help
```

#### Available Options

```
usage: screen_reader.py [-h] [-b BUNDLE_ID] [-n NAME] [-dw] [-dh] [-r RATE] [-v VOICE] [-sa] [-sk SKIP_GROUPS]

options:
  -h, --help                   show this help message and exit
  -b, --bundle_id BUNDLE_ID    Bundle ID of the target application
  -n, --name NAME              Name of the target application (alternative to bundle_id)
  -dw, --deactivate_welcome    Skip the "Welcome to the ScreenReader." message
  -dh, --deactivate_help       Skip reading the help message on startup
  -r, --rate RATE              Set speech rate for macOS `say` command (default: 190)
  -v, --voice VOICE            Set voice for macOS `say` command (see `say -v "?" | grep en`)
  -sa, --system_accessibility  Use macOS system accessibility data instead of vision-generated data
  -sk, --skip-groups N         Skip groups with fewer than N children (default: 5)
```

##### Example

Run the screen reader for the Spotify app:

```bash
python -m screen_reader.screen_reader --name Spotify
```

## 📜 License

### 🔍 YOLO Models
The YOLO models used for UI element and UI group detection are licensed under the GNU Affero General Public License (AGPL), inherited from the original YOLO model licensing.

### 🧠 BLIP Model
The BLIP model for captioning UI elements is provided under the MIT License.

### 📂 Datasets
All datasets (Screen2AX-Tree, Screen2AX-Element, Screen2AX-Group, Screen2AX-Task) are released under the Apache 2.0 license.

### 💻 Codebase
All source code in this repository is licensed under the MIT License. See the [LICENSE](LICENSE) file for full terms and conditions.

## 📚 Citation

If you use this code in your research, please cite our paper:

```bibtex
...
```

## 🙌 Acknowledgements

We would like to express our deepest gratitude to the Armed Forces of Ukraine. Your courage and unwavering defense of our country make it possible for us to live, work, and create in freedom. This work would not be possible without your sacrifice. Thank you.

## MacPaw Research

Visit our site to learn more 😉

https://research.macpaw.com
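The `--save` flag above writes the tree via `UIElement.to_dict()` (see `hierarchy_dl/hierarchy.py` in this commit). Assuming each serialized node carries `cls`, `value`, and `children` keys, as the `UIElement` usage in the code suggests (this key layout is inferred, not confirmed by the diff), a short script can pretty-print a saved result:

```python
import json


def walk(node: dict, depth: int = 0) -> list[str]:
    """Recursively flatten a saved hierarchy node into indented "cls: value" lines."""
    lines = ["  " * depth + f"{node.get('cls', '?')}: {node.get('value') or ''}"]
    for child in node.get("children", []):
        lines.extend(walk(child, depth + 1))
    return lines


# A real tree would come from the saved file, e.g.:
# with open("./results/spotify.json") as f:
#     tree = json.load(f)
tree = {  # hand-built stand-in for a real saved tree
    "cls": "Group", "value": "Screen",
    "children": [{"cls": "Button", "value": "Play", "children": []}],
}
print("\n".join(walk(tree)))
```

This prints one line per node, indented by depth, which is often enough to sanity-check a generated hierarchy without any extra tooling.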

hierarchy_dl/application.py

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
import time
import threading

import tkinter as tk

# Package-absolute import, matching the style used elsewhere in the repo
# (a bare `from hierarchy import ...` fails when run from the repo root).
from hierarchy_dl.hierarchy import generate_hierarchy
from screen_reader.screenshot import screenshot_app, open_app_in_foreground


run = True
thread = None


def start_action():
    bundle_id = entry.get()
    open_app_in_foreground(bundle_id, wait_time=2)

    global run
    run = True

    i = 0
    while run:
        try:
            start = time.time()
            open_app_in_foreground(bundle_id, wait_time=0.25)
            screen_path = screenshot_app(bundle_id, "./screenshots/")[0]

            tree = generate_hierarchy(screen_path, save=True, save_dir=f"./result/{bundle_id}/")

            end = time.time()

            i += 1
            print(f"Frame #{i}, time taken: {end - start:.2f}s")

        except Exception as e:
            print(f"Error: {e}")
            break


def stop_action():
    global run, thread
    run = False
    print("Stopping process")

    if thread:
        thread.join()

    print("Thread has stopped")


def start_thread():
    global thread
    thread = threading.Thread(target=start_action, daemon=True)
    thread.start()


if __name__ == "__main__":
    # Create main window
    root = tk.Tk()
    root.title("Bundle ID Manager")
    root.geometry("300x200")

    # Create input field
    label = tk.Label(root, text="bundle_id:")
    label.pack(pady=5)

    entry = tk.Entry(root)
    entry.pack(pady=5)

    # Copyable text with suggestion
    suggestion = tk.Label(root, text="osascript -e 'id of app \"Spotify\"' \n e.g. com.spotify.client")
    suggestion.pack(pady=5)

    # Create buttons
    start_button = tk.Button(root, text="Start", command=start_thread)
    start_button.pack(pady=5)

    stop_button = tk.Button(root, text="Stop", command=stop_action)
    stop_button.pack(pady=5)

    # Run application
    root.mainloop()
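The capture loop above is stopped cooperatively: the worker polls a flag and the GUI thread joins it. The same pattern can be sketched without Tkinter or the screenshot dependencies, here using `threading.Event` as a thread-safe variant of the module-level `run` flag (the counter and sleep intervals are illustrative stand-ins for the real screenshot/hierarchy work):

```python
import threading
import time

stop = threading.Event()
counter = {"frames": 0}


def capture_loop() -> None:
    # Stand-in for the screenshot + generate_hierarchy loop in application.py.
    while not stop.is_set():
        counter["frames"] += 1
        time.sleep(0.01)


t = threading.Thread(target=capture_loop, daemon=True)
t.start()
time.sleep(0.05)

stop.set()  # cooperative shutdown, like setting run = False
t.join()    # wait for the worker to exit, like stop_action() does
print(f"Captured {counter['frames']} frames")
```

Note that calling `thread.join()` from a Tk callback, as `stop_action` does, blocks the UI until the worker finishes its current iteration; that is acceptable here only because each iteration is short-lived.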

hierarchy_dl/blip.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

cache_dir = "./.models"

model_path = "macpaw-research/blip-icon-captioning"
processor = BlipProcessor.from_pretrained(model_path, cache_dir=cache_dir)
model = BlipForConditionalGeneration.from_pretrained(model_path, cache_dir=cache_dir).to(device)
model.eval()


@torch.no_grad()
def generate_captions(images: list[Image.Image]) -> list[str]:
    inputs = processor(images, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=25)
    return processor.batch_decode(outputs, skip_special_tokens=True)
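`generate_captions` runs the whole list through the model in one forward pass, so callers are responsible for keeping batches small; `hierarchy.py` calls `caption_buttons(..., batch_size=16)` for exactly this reason. A generic chunking helper (a sketch, not part of this commit) shows how a long list of crops could be split before captioning:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with the captioner above:
# captions = [c for batch in chunked(crops, 16) for c in generate_captions(batch)]
print([list(batch) for batch in chunked([1, 2, 3, 4, 5], 2)])  # → [[1, 2], [3, 4], [5]]
```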

hierarchy_dl/hierarchy.py

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
import os
import json
import time
from os import path
from typing import Optional

import numpy as np
from PIL import Image
from ocrmac import ocrmac
from ultralytics import YOLO

# Provides UIElement, group_texts, merge_text_and_elements,
# caption_buttons, build_tree, and clean_tree.
from hierarchy_dl.utils import *

from huggingface_hub import hf_hub_download

cache_dir = "./.models"

ui_elements_model_path = hf_hub_download(
    repo_id="macpaw-research/yolov11l-ui-elements-detection",
    filename="ui-elements-detection.pt",
    cache_dir=cache_dir
)

ui_groups_model_path = hf_hub_download(
    repo_id="macpaw-research/yolov11l-ui-groups-detection",
    filename="ui-groups-detection.pt",
    cache_dir=cache_dir
)

ui_elements_model = YOLO(ui_elements_model_path)
ui_groups_model = YOLO(ui_groups_model_path)


def generate_hierarchy(
    img: str | Image.Image | np.ndarray,
    save_dir: str = "./results/",
    save: bool = False,
    filename: Optional[str] = None,
    flat: bool = False
) -> UIElement:
    """
    Generate a UI hierarchy from an image.
    """
    # load image
    if isinstance(img, str):
        img_pil = Image.open(img)
    elif isinstance(img, np.ndarray):
        img_pil = Image.fromarray(img)
    elif isinstance(img, Image.Image):
        img_pil = img
    else:
        raise TypeError(f"Unsupported image type: {type(img)}")

    width, height = img_pil.size

    # detect UI elements
    ui_elements = ui_elements_model(img_pil, verbose=False)[0].boxes
    ui_elements = [UIElement(box, cls) for box, cls in zip(ui_elements.xyxy, ui_elements.cls)]

    # detect UI groups
    ui_groups = ui_groups_model(img_pil, conf=0.5, verbose=False)[0].boxes
    ui_groups = [UIElement(box, "Group") for box in ui_groups.xyxy]

    # OCR
    annotations = ocrmac.OCR(img_pil, language_preference=['en-US']).recognize(px=True)
    annotations = [UIElement(box, "Text", value=val) for val, _, box in annotations]

    # merge texts and elements
    annotations = group_texts(annotations)
    ui_elements = merge_text_and_elements(ui_elements, annotations, iou_threshold=0.2)

    # caption icon buttons
    ui_elements = caption_buttons(ui_elements, img_pil, batch_size=16)

    if not flat:
        # build tree
        tree = build_tree(ui_groups, ui_elements, (width, height), iou_threshold=0.0)
        clean_tree(tree)

        if len(tree.children) == 1:
            tree = tree.children[0]
    else:
        # flat hierarchy: order elements by distance from the top-left corner
        ui_elements.sort(key=lambda x: x.box[0] ** 2 + x.box[1] ** 2)
        tree = UIElement(
            box=[0, 0, width, height],
            cls="Group",
            value="Screen"
        )
        tree.children = ui_elements

    if save or filename:
        os.makedirs(save_dir, exist_ok=True)

        filename = f"{path.basename(img)}.json" if isinstance(img, str) and not filename else filename
        filename = filename or f"{time.time()}.json"

        full_path = path.join(save_dir, filename)

        with open(full_path, "w", encoding='utf-8') as f:
            json.dump(tree.to_dict(), f, indent=4)

    return tree


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--image", type=str, default="./screen.png", help="Path to the image")
    parser.add_argument("--save", action="store_true", help="Save the result")
    parser.add_argument("--filename", type=str, default=None, help="Filename to save the result")
    parser.add_argument("--save_dir", type=str, default="./results/", help="Directory to save the result. Default is './results/'")
    parser.add_argument("--flat", action="store_true", help="Generate flat hierarchy (no groups)")
    args = parser.parse_args()

    tree = generate_hierarchy(args.image, save_dir=args.save_dir, save=args.save, filename=args.filename, flat=args.flat)
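Both `merge_text_and_elements` and `build_tree` take an `iou_threshold`, but the IoU computation itself lives in `hierarchy_dl/utils.py`, which is not shown in this diff. As a plausible sketch only (the repo's actual implementation may differ), the standard intersection-over-union for two `[x1, y1, x2, y2]` boxes, matching the `xyxy` format YOLO returns, is:

```python
def iou(a: list[float], b: list[float]) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ≈ 0.143 for these overlapping boxes
```

With a metric like this, `iou_threshold=0.2` in the merge step means an OCR text box is attached to a detected element only when they overlap substantially, while `iou_threshold=0.0` in `build_tree` lets any overlap count.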
