
Commit 54fc635

Author: SoulSniper1212
Update Web Scraper README with detailed documentation
Signed-off-by: SoulSniper1212 <[email protected]>
1 parent ce24464 commit 54fc635

File tree: 5 files changed (+378, -7 lines)


Web Scraper/README.md

Lines changed: 81 additions & 2 deletions
@@ -1,8 +1,87 @@

Removed:

- In this script, we use the `requests` library to send a GET request to the Python.org blogs page. We then use the `BeautifulSoup` library to parse the HTML content of the page.
- We find all the blog titles on the page by searching for `h2` elements with the class `blog-title`. We then print each title found and save them to a file named `blog_titles.txt`.

Added:

# Web Scraper

This repository contains two web scraping scripts:

## 1. Traditional Web Scraper (`Web_Scraper.py`)

This script uses the `requests` library to send a GET request to the Python.org blogs page. It then uses the `BeautifulSoup` library to parse the HTML content of the page.

It finds all the blog titles on the page by searching for `h2` elements with the class `blog-title`. It then prints each title found and saves them to a file named `blog_titles.txt`.
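
The full script lives in `Web_Scraper.py`; a minimal sketch of the flow described above (selectors and filenames taken from that description, not a verbatim copy of the repository file) looks roughly like this:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.python.org/blogs/"

# Fetch the blogs page and parse the returned HTML.
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Collect every <h2 class="blog-title"> element, print it, and save it to a file.
titles = soup.find_all("h2", class_="blog-title")
with open("blog_titles.txt", "w") as file:
    for title in titles:
        text = title.get_text(strip=True)
        print(text)
        file.write(text + "\n")

print("\nBlog titles saved to 'blog_titles.txt'.")
```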

### Usage
To run this script, first install the required libraries:

```bash
pip install requests beautifulsoup4
```

Then run:

```bash
python Web_Scraper.py
```

## 2. Google Custom Search Scraper (`google_web_scraper.py`)

This enhanced CLI web scraper uses the Google Custom Search API to extract URLs, titles, and snippets from search results. This approach is more robust than traditional web scraping because it:

- Bypasses CAPTCHA challenges that may occur during direct web scraping
- Retrieves structured data (title, URL, and snippet/description)
- Handles dynamic websites more reliably
- Is less prone to breaking when website structures change
- Allows searching by keyword to retrieve multiple metadata fields (a minimal request sketch follows this list)
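
Under the hood, that structured data comes from Google's Custom Search JSON API. A rough illustration of a single request against the public REST endpoint (not necessarily the exact calls `google_web_scraper.py` makes internally):

```python
import os
import requests

# One page of results from the Custom Search JSON API; each "item" carries
# the title, link, and snippet fields that the scraper reports.
params = {
    "key": os.environ["GOOGLE_API_KEY"],
    "cx": os.environ["SEARCH_ENGINE_ID"],
    "q": "Python tutorials",
    "num": 10,  # the API returns at most 10 items per request
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
response.raise_for_status()

for item in response.json().get("items", []):
    print(item.get("title"), item.get("link"))
    print(item.get("snippet"))
```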

### Prerequisites
Before using this script, you need:
1. A Google API Key from [Google Cloud Console](https://console.cloud.google.com/apis/credentials)
2. A Custom Search Engine ID from [Google Programmable Search Engine](https://programmablesearchengine.google.com/)

### Installation
```bash
pip install -r requirements.txt
```

### Setup
Set your API credentials as environment variables:
```bash
export GOOGLE_API_KEY='your_google_api_key'
export SEARCH_ENGINE_ID='your_search_engine_id'
```

Alternatively, you can pass them directly as command-line arguments.

### Usage
Basic usage:
```bash
python google_web_scraper.py --query "Python tutorials" --results 10
```

Save results in JSON format:
```bash
python google_web_scraper.py --query "machine learning blogs" --results 20 --format json
```

Specify the output file:
```bash
python google_web_scraper.py --query "web development news" --output my_search.json --format json
```

With API credentials passed as arguments:
```bash
python google_web_scraper.py --query "Python tutorials" --api-key YOUR_API_KEY --engine-id YOUR_ENGINE_ID
```

### Options
- `--query, -q`: Search query to use for web scraping (required)
- `--results, -r`: Number of search results to retrieve (default: 10)
- `--output, -o`: Output file name (default: `search_results.txt`)
- `--format, -f`: Output format, `txt` or `json` (default: `txt`)
- `--api-key, -k`: Google API Key (optional)
- `--engine-id, -e`: Google Custom Search Engine ID (optional)

### Features
- Command-line interface with configurable options
- Support for both TXT and JSON output formats
- Environment variable support for credentials
- Error handling and user-friendly messages
- Ability to retrieve multiple pages of results (sketched below)
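
On that last point: the Custom Search API returns at most 10 items per request and pages through results with a 1-indexed `start` offset, so retrieving more than 10 results is essentially a loop over that offset. A minimal sketch of the idea, independent of the repository's `search_multiple_pages` implementation:

```python
import os
import requests

def fetch_results(query, total_results=15):
    """Collect up to total_results items by advancing the 1-indexed 'start' offset."""
    items = []
    params = {
        "key": os.environ["GOOGLE_API_KEY"],
        "cx": os.environ["SEARCH_ENGINE_ID"],
        "q": query,
    }
    start = 1
    while len(items) < total_results:
        params["num"] = min(10, total_results - len(items))  # at most 10 per page
        params["start"] = start
        response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
        response.raise_for_status()
        page = response.json().get("items", [])
        if not page:
            break  # no further results available
        items.extend(page)
        start += len(page)
    return items[:total_results]
```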

Web Scraper/Web_Scraper.py

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,10 @@
 import requests
 from bs4 import BeautifulSoup
 
+print("This is the traditional web scraper using BeautifulSoup.")
+print("For a more robust solution using Google Custom Search API, see 'google_web_scraper.py'")
+print()
+
 # URL to scrape data from
 URL = "https://www.python.org/blogs/"
 
@@ -23,8 +27,4 @@
     for title in titles:
         file.write(title.get_text(strip=True) + "\n")
 
-print("\nBlog titles saved to 'blog_titles.txt'.")
-
-
-
-
+print("\nBlog titles saved to 'blog_titles.txt'.")

Web Scraper/example_usage.py

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

#!/usr/bin/env python3
"""
Example usage of the Google Custom Search Scraper
This demonstrates how to use the scraper programmatically
"""
import os
from google_web_scraper import GoogleSearchScraper


def example_basic_usage():
    """Example of basic usage"""
    # Initialize the scraper with API credentials
    # These can be set as environment variables: GOOGLE_API_KEY and SEARCH_ENGINE_ID
    api_key = os.getenv('GOOGLE_API_KEY')
    search_engine_id = os.getenv('SEARCH_ENGINE_ID')

    if not api_key or not search_engine_id:
        print("Please set GOOGLE_API_KEY and SEARCH_ENGINE_ID environment variables")
        return

    try:
        scraper = GoogleSearchScraper(api_key=api_key, search_engine_id=search_engine_id)

        # Search for Python tutorials
        results = scraper.search("Python tutorials", num_results=5)

        print(f"Found {len(results)} results:")
        for i, result in enumerate(results, 1):
            title = result.get('title', 'No title')
            link = result.get('link', 'No URL')
            snippet = result.get('snippet', 'No snippet')
            print(f"{i}. {title}")
            print(f" URL: {link}")
            print(f" Snippet: {snippet}")
            print()
    except Exception as e:
        print(f"Error during search: {e}")


def example_multiple_pages():
    """Example of searching multiple pages"""
    api_key = os.getenv('GOOGLE_API_KEY')
    search_engine_id = os.getenv('SEARCH_ENGINE_ID')

    if not api_key or not search_engine_id:
        print("Please set GOOGLE_API_KEY and SEARCH_ENGINE_ID environment variables")
        return

    try:
        scraper = GoogleSearchScraper(api_key=api_key, search_engine_id=search_engine_id)

        # Search for multiple pages of results
        results = scraper.search_multiple_pages("machine learning", total_results=15)

        print(f"Found {len(results)} results for 'machine learning':")
        for i, result in enumerate(results, 1):
            title = result.get('title', 'No title')
            link = result.get('link', 'No URL')
            print(f"{i:2d}. {title}")
            print(f" URL: {link}")
            print()
    except Exception as e:
        print(f"Error during search: {e}")


if __name__ == "__main__":
    print("=== Basic Usage Example ===")
    example_basic_usage()
    print("\n=== Multiple Pages Example ===")
    example_multiple_pages()
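
Assuming the credentials are exported as the README describes, these examples can be run directly:

```bash
export GOOGLE_API_KEY='your_google_api_key'
export SEARCH_ENGINE_ID='your_search_engine_id'
python example_usage.py
```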
