A web crawler designed to scrape Civilization V wiki articles from the Civilization Fandom wiki.
- Crawls the Civilization V wiki starting from the main page
- Saves only valid Civilization V articles (URLs ending with "(Civ5)")
- Excludes Civilopedia pages (URLs ending with "/Civilopedia")
- Respects the server by adding delays between requests
- Logs crawling progress and errors
- Limits the number of articles saved to prevent excessive downloads
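The URL filter described above can be sketched as a small predicate. The function name is illustrative, and the example URLs only assume the usual Fandom `/wiki/Title` shape (where parentheses may arrive percent-encoded):

```python
from urllib.parse import unquote

def is_civ5_article(url: str) -> bool:
    """Return True only for Civilization V article URLs:
    must end with "(Civ5)" and must not be a Civilopedia page."""
    path = unquote(url)  # "%28Civ5%29" -> "(Civ5)"
    if path.endswith("/Civilopedia"):
        return False
    return path.endswith("(Civ5)")
```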
- Python 3.6+
- Required packages: requests, beautifulsoup4
- Clone this repository
- Install the required packages: `pip install -r requirements.txt`

Run the crawler with `python scrape.py`. The crawler will:
- Start at the Civilization V main page
- Follow links to find Civilization V articles
- Save HTML content to the `./data` directory
- Create a mapping file (`url_mapping.txt`) to track which files correspond to which URLs
- Create a log file (`crawl_log.txt`) with details about the crawling process
You can modify the following variables in `scrape.py` to customize the crawler:

- `START_URL`: the starting URL for the crawler
- `OUTPUT_DIR`: the directory where HTML files will be saved
- `MAX_ARTICLES`: the maximum number of articles to save
- `DELAY`: the delay between requests, in seconds
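For reference, the constants might look like this at the top of `scrape.py`. The specific values here are illustrative defaults, not the script's actual settings:

```python
# Illustrative configuration values; adjust before running.
START_URL = "https://civilization.fandom.com/wiki/Civilization_V"  # assumed main page
OUTPUT_DIR = "./data"   # where fetched HTML files are written
MAX_ARTICLES = 500      # stop after saving this many articles
DELAY = 1.0             # seconds to sleep between requests
```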
- The crawler uses a breadth-first search approach to find articles
- It avoids visiting the same URL twice
- It handles errors gracefully and logs them
- You can interrupt the crawler at any time with Ctrl+C
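The notes above (breadth-first order, a visited set, graceful error handling) can be sketched as follows. The fetching and filtering logic are passed in as functions, which is an assumption made here so the traversal can be exercised without network access; the real `scrape.py` presumably inlines `requests` and BeautifulSoup calls:

```python
import time
from collections import deque

def crawl(start_url, fetch_links, is_article, max_articles=10, delay=0.0):
    """Breadth-first crawl. fetch_links(url) returns linked URLs (or raises
    on error); is_article(url) decides whether a page counts as saved."""
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}           # never visit the same URL twice
    saved = []
    while queue and len(saved) < max_articles:
        url = queue.popleft()
        try:
            links = fetch_links(url)
        except Exception as exc:
            print(f"error fetching {url}: {exc}")  # the real script logs this
            continue                               # keep crawling on errors
        if is_article(url):
            saved.append(url)
        for nxt in links:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
        if delay:
            time.sleep(delay)    # be polite to the server
    return saved
```

Because the fetcher is injected, the loop can be tested against an in-memory link graph before pointing it at the live wiki.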