# 🌐 Website Downloader CLI

**Website Downloader CLI** is a tiny, pure-Python site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:

- Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
- Downloads all assets (images, CSS, JS, …)
- Rewrites internal links so pages open flawlessly from your local disk
- Streams files concurrently with automatic retry/back-off
- Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)

Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
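
At its core the mirroring process is a breadth-first crawl of same-origin links. Here is a minimal sketch of that loop, assuming only `requests` and `beautifulsoup4` (illustrative only, not the script's actual code; it omits the asset downloads, retries, threading, and link rewriting listed above):

```python
# Minimal same-origin, breadth-first crawl (illustrative sketch).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """Fetch up to max_pages same-origin pages, returning {url: html}."""
    origin = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        resp = requests.get(url, timeout=10)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # skip non-HTML responses in this simplified sketch
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # resolve relative links, drop fragments
            if urlparse(link).netloc == origin and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```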


## 🚀 Quick Start

```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader

# 2. Install deps (only two runtime libs!)
pip install -r requirements.txt

# 3. Mirror a site – no prompts needed
python website_downloader.py \
    --url https://harsim.ca \
    --destination harsim_ca_backup \
    --max-pages 100 \
    --threads 8
```

## 🛠️ Libraries Used

| Library | Emoji | Purpose in this project |
| --- | --- | --- |
| `requests` + `urllib3.Retry` | 🌐 | High-level HTTP client with automatic retry/back-off for flaky hosts |
| `BeautifulSoup` (`bs4`) | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| `argparse` | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| `logging` | 📝 | Dual console/file logging with colour + crawl-time stats |
| `threading` & `queue` | ⚙️ | Lightweight thread pool that streams images/CSS/JS concurrently |
| `pathlib` & `os` | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| `time` | ⏱️ | Measures per-page latency and total crawl duration |
| `urllib.parse` | 🔗 | Safely joins/analyses URLs and rewrites them to local relative paths |
| `sys` | 🖥️ | Directs log output to stdout and handles graceful interrupts (Ctrl-C) |
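
The retry/back-off behaviour credited to `requests` + `urllib3.Retry` above is the standard pattern of mounting a `Retry` policy on a `requests.Session`. A minimal sketch follows; the retry count, delays, and status codes here are assumptions, not necessarily the script's exact settings:

```python
# A requests.Session with automatic retry/back-off (illustrative settings).
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=3,                                     # up to 3 retries per request
        backoff_factor=0.5,                          # roughly 0.5s, 1s, 2s between attempts
        status_forcelist=(429, 500, 502, 503, 504),  # retry on these HTTP statuses
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)                # apply the policy to both schemes
    session.mount("https://", adapter)
    return session

# Every request made through the session now retries transient failures:
# make_session().get("https://example.com", timeout=10)
```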

## 🗂️ Project Structure

| Path | What it is | Key features |
| --- | --- | --- |
| `website_downloader.py` | Single-entry CLI that performs the entire crawl and link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` ➜ `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. | Only `requests` and `beautifulsoup4` are third-party; everything else is Python ≥ 3.10 std-lib. |
| `web_scraper.log` | Auto-generated run log (rotates/overwrites on each invocation). | Useful for troubleshooting or audit trails. |
| `README.md` | The document you’re reading. | Quick-start, flags, and architecture notes. |
| (output folder) | Created at runtime (`example_com/` …). | Mirrors the remote directory tree with `index.html` stubs and all static assets. |
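
The link-rewriting rules in the table map every internal URL onto a local file. Here is a hypothetical helper sketching that mapping; the name `local_path` and its edge-case handling are assumptions derived from the rules above, not the script's actual code:

```python
# URL -> local file mapping implied by the table above (hypothetical helper).
from pathlib import Path
from urllib.parse import urlparse

def local_path(url: str) -> Path:
    parts = urlparse(url)
    root = parts.netloc.replace(".", "_")   # example.com -> example_com
    raw = parts.path
    path = raw.strip("/")
    if not path:                            # site root -> index.html
        return Path(root, "index.html")
    if raw.endswith("/"):                   # pretty-URL folder -> folder/index.html
        return Path(root, path, "index.html")
    if "." not in path.rsplit("/", 1)[-1]:  # plain extensionless path -> .html
        return Path(root, path + ".html")
    return Path(root, path)                 # asset paths kept verbatim

# local_path("https://example.com/about/")    -> example_com/about/index.html
# local_path("https://example.com/contact")   -> example_com/contact.html
# local_path("https://example.com/css/a.css") -> example_com/css/a.css
```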

**Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
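
A post-crawl check of the kind described could look like the sketch below; `find_broken_links` is a hypothetical helper, not part of the script, and it assumes internal links have already been rewritten to relative paths:

```python
# Scan a finished mirror for internal links whose target file is missing.
from pathlib import Path
from bs4 import BeautifulSoup

def find_broken_links(mirror_root: Path) -> list[tuple[Path, str]]:
    """Return (page, ref) pairs whose local target does not exist on disk."""
    broken = []
    for page in mirror_root.rglob("*.html"):
        soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
        for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
            for node in soup.find_all(tag, **{attr: True}):
                ref = node[attr]
                if ref.startswith(("http://", "https://", "//", "mailto:", "data:", "#")):
                    continue  # only local, relative references are checked
                target = (page.parent / ref.split("#")[0]).resolve()
                if not target.exists():
                    broken.append((page, ref))
    return broken
```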

## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## 📜 License

This project is licensed under the MIT License.
