# Web Scraper

This repository contains two web scraping scripts:

## 1. Traditional Web Scraper (`Web_Scraper.py`)

This script uses the `requests` library to send a GET request to the Python.org blogs page. It then uses the `BeautifulSoup` library to parse the HTML content of the page.

It finds all the blog titles on the page by searching for `h2` elements with the class `blog-title`. It then prints each title found and saves them to a file named `blog_titles.txt`.
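
A minimal sketch of this approach, assuming the blogs page lives at `https://www.python.org/blogs/` (illustrative only; the actual `Web_Scraper.py` may differ in details):

```python
# Illustrative sketch, not necessarily the exact contents of Web_Scraper.py.
import requests
from bs4 import BeautifulSoup

# The URL is an assumption based on "the Python.org blogs page" described above.
response = requests.get("https://www.python.org/blogs/", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="blog-title")]

with open("blog_titles.txt", "w", encoding="utf-8") as f:
    for title in titles:
        print(title)           # print each title found
        f.write(title + "\n")  # and save it to blog_titles.txt
```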

### Usage
To run this script, first install the required libraries:

```bash
pip install requests beautifulsoup4
```

Then run:

```bash
python Web_Scraper.py
```

## 2. Google Custom Search Scraper (`google_web_scraper.py`)

This enhanced CLI web scraper uses the Google Custom Search API to extract URLs, titles, and snippets from search results. This approach is more robust than traditional web scraping because it:

- Bypasses CAPTCHA challenges that may occur during direct web scraping
- Retrieves structured data (title, URL, and snippet/description)
- Handles dynamic websites more reliably
- Is less prone to breaking when website structures change
- Allows searching by keyword to retrieve multiple metadata fields
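
Under the hood, searches go through the Custom Search JSON API. A rough sketch of such a request using the `requests` library (illustrative only; the actual `google_web_scraper.py` may structure this differently):

```python
# Illustrative sketch of querying the Google Custom Search JSON API.
import requests

def google_search(query, api_key, engine_id, num_results=10):
    """Fetch up to num_results items, paginating 10 results per request."""
    results = []
    start = 1  # the API uses a 1-based index for the first result of each page
    while len(results) < num_results:
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={
                "key": api_key,
                "cx": engine_id,
                "q": query,
                "num": min(10, num_results - len(results)),
                "start": start,
            },
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break  # no more results available
        for item in items:
            results.append({
                "title": item.get("title"),
                "url": item.get("link"),
                "snippet": item.get("snippet"),
            })
        start += len(items)
    return results
```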

### Prerequisites
Before using this script, you need:
1. A Google API Key from [Google Cloud Console](https://console.cloud.google.com/apis/credentials)
2. A Custom Search Engine ID from [Google Programmable Search Engine](https://programmablesearchengine.google.com/)

### Installation
```bash
pip install -r requirements.txt
```

### Setup
Set your API credentials as environment variables:
```bash
export GOOGLE_API_KEY='your_google_api_key'
export SEARCH_ENGINE_ID='your_search_engine_id'
```

Alternatively, you can pass them directly as command-line arguments.
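
One plausible way the script combines the two sources, with command-line values taking precedence over the environment (a sketch under that assumption, not necessarily the exact logic in `google_web_scraper.py`):

```python
# Illustrative credential resolution: CLI arguments win, environment variables
# are the fallback, and a missing value aborts with a clear message.
import os
import sys

def resolve_credentials(cli_api_key=None, cli_engine_id=None):
    api_key = cli_api_key or os.environ.get("GOOGLE_API_KEY")
    engine_id = cli_engine_id or os.environ.get("SEARCH_ENGINE_ID")
    if not api_key or not engine_id:
        sys.exit("Error: provide --api-key/--engine-id or set GOOGLE_API_KEY and SEARCH_ENGINE_ID.")
    return api_key, engine_id
```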

### Usage
Basic usage:
```bash
python google_web_scraper.py --query "Python tutorials" --results 10
```

Save results in JSON format:
```bash
python google_web_scraper.py --query "machine learning blogs" --results 20 --format json
```

Specify output file:
```bash
python google_web_scraper.py --query "web development news" --output my_search.json --format json
```

With API credentials as arguments:
```bash
python google_web_scraper.py --query "Python tutorials" --api-key YOUR_API_KEY --engine-id YOUR_ENGINE_ID
```

### Options
- `--query, -q`: Search query to use for web scraping (required)
- `--results, -r`: Number of search results to retrieve (default: 10)
- `--output, -o`: Output file name (default: `search_results.txt`)
- `--format, -f`: Output format, `txt` or `json` (default: `txt`)
- `--api-key, -k`: Google API Key (optional if `GOOGLE_API_KEY` is set)
- `--engine-id, -e`: Google Custom Search Engine ID (optional if `SEARCH_ENGINE_ID` is set)
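
For reference, these flags map naturally onto `argparse`; here is a sketch of how such an interface might be declared (illustrative, not a copy of the script's actual parser):

```python
# Illustrative argparse declaration mirroring the options listed above.
import argparse

parser = argparse.ArgumentParser(description="Scrape Google Custom Search results.")
parser.add_argument("--query", "-q", required=True, help="Search query to use")
parser.add_argument("--results", "-r", type=int, default=10, help="Number of results to retrieve")
parser.add_argument("--output", "-o", default="search_results.txt", help="Output file name")
parser.add_argument("--format", "-f", choices=["txt", "json"], default="txt", help="Output format")
parser.add_argument("--api-key", "-k", help="Google API Key (falls back to GOOGLE_API_KEY)")
parser.add_argument("--engine-id", "-e", help="Custom Search Engine ID (falls back to SEARCH_ENGINE_ID)")
args = parser.parse_args()
```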

### Features
- Command-line interface with configurable options
- Support for both TXT and JSON output formats
- Environment variable support for credentials
- Error handling and user-friendly messages
- Ability to retrieve multiple pages of results
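
The TXT/JSON output support could look roughly like the helper below (a sketch with an assumed `save_results` function; the field names follow the title/URL/snippet structure described earlier):

```python
# Illustrative writer for the two supported output formats.
import json

def save_results(results, path, fmt="txt"):
    with open(path, "w", encoding="utf-8") as f:
        if fmt == "json":
            json.dump(results, f, indent=2, ensure_ascii=False)
        else:  # plain text: one blank-line-separated block per result
            for r in results:
                f.write(f"{r['title']}\n{r['url']}\n{r['snippet']}\n\n")
```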