A pair of Python utilities for website link discovery and redirect validation. Perfect for website audits, domain migrations, and SEO validation.
This toolkit contains two complementary scripts:
- `crawler.py` – Crawls a website, extracts all internal links, and saves them to a CSV file
- `redirect_validator.py` – Validates that URLs have been successfully redirected to a new subdomain and logs any failures
Both scripts automatically organize output files in a `data/` folder and are designed for ease of use with interactive input prompts.
- 🔗 Discovers all internal links on a website recursively
- 🛡️ Respects depth limits to prevent infinite crawling
- 📊 Exports results to CSV with URL normalization
- ⏱️ Built-in rate limiting (0.3s delay between requests)
- 🔄 Automatic duplicate detection
- ✅ Tests whether old URLs redirect successfully to new subdomains
- 📝 Logs HTTP errors and request failures with detailed error messages
- 🚀 Real-time console feedback during validation
- 📈 Supports large CSV inputs
- 📂 Organized output in `data/` folder
- Python 3.7+
- `requests` library
- `beautifulsoup4` library
Install dependencies:
```bash
pip install requests beautifulsoup4
```

Run the crawler script:

```bash
python crawler.py
```

You'll be prompted to enter:
- Base URL: The starting URL to crawl (e.g., `https://example.com`)
- Output CSV filename: Name of the output file (e.g., `links.csv`)
The script will:
- Crawl all internal links up to 8 levels deep
- Normalize URLs (removing trailing slashes, fragments)
- Prevent duplicate entries
- Save results to `data/<your_filename>.csv`
Output Format:

```csv
URL
https://example.com
https://example.com/about
https://example.com/contact
...
```
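The URL normalization step can be sketched with the standard library (the function name is illustrative; the actual code in `crawler.py` may differ):

```python
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    """Drop '#fragment' parts and trailing slashes so duplicate
    pages collapse to a single CSV entry."""
    url, _fragment = urldefrag(url)  # strip '#section' fragments
    return url.rstrip("/")           # strip trailing slashes

normalize_url("https://example.com/about/#team")  # "https://example.com/about"
```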
Run the validator script:
```bash
python redirect_validator.py
```

You'll be prompted to enter:

- Input CSV filename: The CSV file containing URLs (e.g., `links.csv`)
- OLD subdomain: The subdomain being migrated from (e.g., `docs.example.org`)
- NEW subdomain: The subdomain being migrated to (e.g., `hub.example.org`)
The script will:
- Replace the subdomain in each URL
- Test if the new URL responds successfully (status < 400)
- Log any HTTP errors or connection failures
- Save failures to `data/redirect_errors.csv`
Output Format:

```csv
Old URL,New URL,Status Code,Error
https://docs.example.org/api,https://hub.example.org/api,404,HTTP Error
https://docs.example.org/guide,https://hub.example.org/guide,Connection timeout,Request Failed
...
```
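The subdomain swap and status check can be sketched as follows (helper names are illustrative, not the script's actual code):

```python
import requests
from urllib.parse import urlsplit, urlunsplit

def swap_subdomain(url: str, old_sub: str, new_sub: str) -> str:
    """Replace the host portion of a URL, keeping scheme, path, and query intact."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.replace(old_sub, new_sub)))

def check_redirect(url: str, old_sub: str, new_sub: str, timeout: int = 10):
    """Return (new_url, status_or_error, ok) where ok means the new URL
    responded with a status below 400."""
    new_url = swap_subdomain(url, old_sub, new_sub)
    try:
        resp = requests.get(new_url, timeout=timeout)
        return new_url, resp.status_code, resp.status_code < 400
    except requests.RequestException as exc:
        return new_url, str(exc), False
```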
```
website-crawler/
├── crawler.py               # Main crawling script
├── redirect_validator.py    # URL validation script
├── README.md                # This file
├── .gitignore               # Git ignore rules (data/ folder)
└── data/                    # Output directory (created automatically)
    ├── links.csv
    └── redirect_errors.csv
```
```console
$ python crawler.py
Enter the base URL to crawl (e.g., https://ciroh.com): https://docs.mycompany.com
Enter the output CSV file name (e.g., urls.csv): docs_links.csv
Crawling: https://docs.mycompany.com
Crawling: https://docs.mycompany.com/getting-started
Crawling: https://docs.mycompany.com/api-reference
...
Done crawling safely.
```

```console
$ python redirect_validator.py
Enter input CSV filename (e.g., ciroh_links.csv): docs_links.csv
Enter OLD subdomain (e.g., docs.ciroh.org): docs.mycompany.com
Enter NEW subdomain (e.g., hub.ciroh.org): hub.mycompany.com
Starting redirect validation...
✅ OK 200: https://hub.mycompany.com/getting-started
✅ OK 301: https://hub.mycompany.com/api-reference
❌ ERROR 404: https://hub.mycompany.com/deprecated-page
...
Validation complete.
Errors saved to data/redirect_errors.csv
```

Both scripts use sensible defaults:
- Max crawl depth: 8 levels (configurable in `crawler.py`)
- Request timeout: 10 seconds (configurable in both scripts)
- Rate limit: 0.3 seconds between requests (configurable in `crawler.py`)
To modify these, edit the constants at the top of each script.
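For reference, those constants might look like this (the names are illustrative; check the top of each script for the actual identifiers):

```python
MAX_DEPTH = 8         # crawl depth limit (crawler.py)
REQUEST_TIMEOUT = 10  # seconds before a request is abandoned (both scripts)
CRAWL_DELAY = 0.3     # pause between requests, in seconds (crawler.py)
```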
All output files are automatically saved to the `data/` folder. The folder is created on first run and is excluded from git (via `.gitignore`).
- Invalid URLs are skipped gracefully
- Network timeouts don't crash the script
- Partial results are preserved on error
- CSV files are written incrementally
- You can stop and resume crawling by running the crawler again (it appends to existing files)
| Issue | Solution |
|---|---|
| `ImportError: No module named 'requests'` | Run `pip install requests beautifulsoup4` |
| Connection timeout | Check your internet connection or increase the timeout value |
| 404 errors in validation | Ensure the new subdomain is properly configured and accessible |
| Empty output CSV | Check that the base URL is correct and accessible |
MIT License - Feel free to use and modify for your projects.
Contributions are welcome! Please feel free to submit issues or pull requests to improve these tools.
Q: Can I crawl external links?
A: No, the crawler is designed to find only internal links. Modify the `is_internal()` function if you need external links.
Q: How long does crawling take?
A: Depends on site size and rate limit. The default 0.3s delay is respectful; adjust if needed.
Q: Can I use this for SEO?
A: Yes! The crawler output is perfect for sitemap generation and link audits.
Q: Does this respect robots.txt?
A: No, the current implementation doesn't check robots.txt. Ensure you have permission to crawl the site.
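If you want to add robots.txt compliance, the standard library's `urllib.robotparser` is enough. A sketch (fetching the site's `robots.txt` text is left to the caller, so the helper stays network-free):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
allowed_by_robots("https://example.com/about", rules)      # True
allowed_by_robots("https://example.com/private/x", rules)  # False
```

In the crawler you would fetch `<scheme>://<host>/robots.txt` once per site and call this check before each request.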