Commit 9d1100c

documentation for ocr software done
1 parent d8488d4 commit 9d1100c

2 files changed

+194
-0
lines changed

@@ -0,0 +1,194 @@
# Webpage Screenshot and OCR Analysis

## Overview

This script captures screenshots of a specified webpage, uses OCR (Optical Character Recognition) to extract text from these screenshots, and saves both the screenshots and extracted text for further analysis. The extracted data is stored in a CSV file named `scraped_data.csv`.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Features](#features)
- [Screenshots](#screenshots)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Installation

### Prerequisites

- Python 3.x
- Google Chrome browser
- ChromeDriver
- Tesseract-OCR

### Step-by-Step Guide

1. **Clone the repository**
   ```sh
   git clone https://github.com/yourusername/yourrepo.git
   cd yourrepo
   ```

2. **Install Python dependencies**
   ```sh
   pip install pytesseract pillow selenium
   ```

3. **Download and install Tesseract-OCR**
   - [Download Tesseract-OCR](https://github.com/tesseract-ocr/tesseract)
   - Install Tesseract and note the installation path. Update the path in the script accordingly:
     ```python
     pytesseract.pytesseract.tesseract_cmd = r'C:\Path\To\Tesseract-OCR\tesseract.exe'
     ```

4. **Download ChromeDriver**
   - [Download ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads)
   - Ensure the ChromeDriver version matches your installed Chrome browser version. Selenium 4.6 and later can also fetch a matching driver automatically through Selenium Manager.
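
A hard-coded Tesseract path stops working as soon as the script runs on another machine. The following is a minimal sketch of a lookup helper, not part of the original script: the function name `find_tesseract`, its `name` parameter, and the fallback path in the usage lines are all assumptions made for illustration. It checks `PATH` first and then tries a list of explicit locations.

```python
import os
import shutil


def find_tesseract(candidates=(), name="tesseract"):
    """Return the path to the Tesseract executable, or None if it is not found.

    `candidates` is a hypothetical list of fallback locations that are tried
    only when `name` cannot be resolved through PATH.
    """
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


# Illustrative usage; the fallback location below is an assumption.
path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(path or "Tesseract not found; install it or pass its path explicitly")
```

If a path is returned, it could be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the hard-coded string.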

## Usage

### Running the Script

1. **Set the URL to analyze**
   - Modify the `url_to_analyze` variable in the script to the desired URL.
     ```python
     url_to_analyze = "https://www.myntra.com/"
     ```

2. **Run the script**
   ```sh
   python script_name.py
   ```

3. **Output**
   - Screenshots are saved in the `Screenshots` directory.
   - Extracted text and screenshot paths are saved in `scraped_data.csv`.
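
Editing the source for every new URL works for a single run but gets awkward with repeated use. One possible alternative, sketched here under the assumption that a command-line interface is wanted, is to read the URL with `argparse`. The helper `parse_args` and the options `--count` and `--out-dir` are hypothetical; the original script defines none of them.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; every option name here is illustrative."""
    parser = argparse.ArgumentParser(
        description="Capture screenshots of a webpage and run OCR on them."
    )
    parser.add_argument("url", help="page to capture, e.g. https://www.myntra.com/")
    parser.add_argument(
        "--count", type=int, default=4,
        help="number of screenshots (4 matches the script's default)",
    )
    parser.add_argument(
        "--out-dir", default="Screenshots",
        help="directory where screenshots are saved",
    )
    return parser.parse_args(argv)


# Parse an explicit argument list so the example runs without a real command line.
args = parse_args(["https://www.myntra.com/", "--count", "2"])
print(args.url, args.count, args.out_dir)
```

In a real entry point, `parse_args()` would be called without arguments so that it reads `sys.argv`, and `args.url` would replace the `url_to_analyze` variable.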

### Example Output

After running the script, you should see output similar to:
```text
Extracted Text from screenshot 1: [Extracted text]
Extracted Text from screenshot 2: [Extracted text]
...
Scraped data written to scraped_data.csv
```
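
For follow-up analysis, the CSV can be read back with the standard library. This is a minimal sketch that assumes the two column names the script writes, `Screenshot` and `Extracted Text`; the helper names `load_scraped_data` and `summarize` are hypothetical.

```python
import csv


def load_scraped_data(path="scraped_data.csv"):
    """Return the CSV rows as a list of dicts keyed by column name."""
    # newline="" lets the csv module handle line breaks embedded in OCR text.
    with open(path, newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))


def summarize(rows):
    """Map each screenshot path to the number of words OCR extracted from it."""
    return {row["Screenshot"]: len(row["Extracted Text"].split()) for row in rows}
```

Calling `summarize(load_scraped_data())` after a run gives a quick view of which screenshots produced little or no text, which is often a sign that the capture landed on an image-heavy region of the page.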

## Features

- **Headless Browser Operation**: Uses a headless Chrome browser to capture screenshots.
- **Random Scrolling**: Scrolls down by a random 500 to 1000 pixels before each capture, so successive screenshots cover different parts of the webpage.
- **OCR Extraction**: Uses Tesseract-OCR to extract text from screenshots.
- **CSV Output**: Writes each screenshot path and its extracted text as a row in `scraped_data.csv` for easy analysis.
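
Because the scroll distance is random, two runs over the same page generally capture different regions. The sketch below shows one way to make a run reproducible by drawing the distances from a seeded `random.Random`. The helper name `plan_scroll_steps` and the `seed` parameter are assumptions; only the 500 to 1000 pixel range comes from the script.

```python
import random


def plan_scroll_steps(num_screenshots, low=500, high=1000, seed=None):
    """Return one scroll distance in pixels per screenshot.

    With a fixed seed the same sequence is produced on every run;
    with seed=None the distances differ between runs.
    """
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_screenshots)]


# Illustrative usage: four distances drawn from a fixed seed.
steps = plan_scroll_steps(4, seed=42)
print(steps)
```

Each value could then be passed to `driver.execute_script(f"window.scrollBy(0, {step});")` in place of the per-iteration `random.randint` call.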

## Screenshots

Screenshots captured by the script are stored in the `Screenshots` directory. Below are examples of the screenshots taken:

![Screenshot 1](Screenshots/screenshot_1.png)
![Screenshot 2](Screenshots/screenshot_2.png)
![Screenshot 3](Screenshots/screenshot_3.png)
![Screenshot 4](Screenshots/screenshot_4.png)

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributing to this project.

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.

## Acknowledgements

- **Tesseract-OCR**: The OCR engine used to extract text from images.
- **Selenium**: The tool used for web browser automation.
- **Pillow**: The imaging library (a maintained fork of the Python Imaging Library) used to open and save screenshots.

---

## Script

```python
import csv
import os
import random
import time
from io import BytesIO

import pytesseract
from PIL import Image
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

# Update this line with your Tesseract installation path
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\kulitesh\Scrape-ML\Tesseract-OCR\tesseract.exe'

# URL to analyze
url_to_analyze = "https://www.myntra.com/"


def take_screenshot_and_analyze(url, num_screenshots=4):
    options = Options()
    # Options.headless was deprecated and later removed in Selenium 4,
    # so pass the flag directly (use "--headless" on Chrome older than 109).
    options.add_argument("--headless=new")

    driver = None
    try:
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        # Wait up to 20 seconds for the document to finish loading
        WebDriverWait(driver, 20).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )

        # Create a directory to store screenshots if it doesn't exist
        os.makedirs("Screenshots", exist_ok=True)

        data = []  # List to store scraped data

        for i in range(num_screenshots):
            # Scroll down a random amount
            scroll_amount = random.randint(500, 1000)  # Adjust as needed
            driver.execute_script(f"window.scrollBy(0, {scroll_amount});")
            # Give the page time to render after scrolling
            time.sleep(1)  # Adjust wait time as needed

            # Capture screenshot
            screenshot = driver.get_screenshot_as_png()
            image = Image.open(BytesIO(screenshot))

            # Save screenshot to file
            screenshot_path = f"Screenshots/screenshot_{i + 1}.png"
            image.save(screenshot_path)

            # Use Tesseract OCR to extract text
            extracted_text = pytesseract.image_to_string(image)
            print(f"Extracted Text from screenshot {i + 1}:", extracted_text)

            # Add the extracted text to the data list
            data.append({"Screenshot": screenshot_path, "Extracted Text": extracted_text})

        # Write the scraped data to a CSV file
        write_to_csv(data)

    except TimeoutException:
        print("Timed out waiting for page to load")

    finally:
        # Close the browser only if it was started successfully
        if driver is not None:
            driver.quit()


def write_to_csv(data):
    # Define CSV file path
    csv_file = "scraped_data.csv"

    # Write data to CSV file
    with open(csv_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=["Screenshot", "Extracted Text"])
        writer.writeheader()
        writer.writerows(data)

    print(f"Scraped data written to {csv_file}")


# Perform screenshot and analysis for multiple screenshots
if __name__ == "__main__":
    take_screenshot_and_analyze(url_to_analyze)
```

Scrap_ml_ocr.zip

1.31 MB
Binary file not shown.
