Image Crawler

A Rust-based web crawler that downloads JPG and GIF images from a website and its subpages within the same domain.

Features

  • Recursively crawls websites while staying within the same domain
  • Downloads JPG and GIF images
  • Handles both relative and absolute URLs
  • Concurrent processing for better performance
  • Rate limiting to be respectful to servers
  • Deduplicates URLs and images
  • Shows progress and summary statistics
  • Organizes downloads by format, domain, and size categories
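
A minimal sketch of how the same-domain filter, image detection, and URL deduplication listed above might fit together, using only the standard library. The function names (host_of, image_format, should_visit) are hypothetical and are not taken from this repository; treating .jpeg as JPG and comparing hosts by exact match are assumptions.

```rust
use std::collections::HashSet;

/// Returns the lowercased host of an absolute http(s) URL.
/// Simplified on purpose: IPv6 literals such as [::1] are not handled.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop any "user:pass@" prefix and any ":port" suffix.
    let host = authority.rsplit('@').next()?;
    let host = host.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Classifies a URL as "jpg" or "gif" by its path extension.
/// Assumption: .jpeg is treated the same as .jpg.
fn image_format(url: &str) -> Option<&'static str> {
    // Ignore the query string and fragment when checking the extension.
    let path = url.split(|c: char| c == '?' || c == '#').next()?;
    let lower = path.to_ascii_lowercase();
    if lower.ends_with(".jpg") || lower.ends_with(".jpeg") {
        Some("jpg")
    } else if lower.ends_with(".gif") {
        Some("gif")
    } else {
        None
    }
}

/// Returns true only the first time a URL on the starting host is seen.
fn should_visit(url: &str, start_host: &str, seen: &mut HashSet<String>) -> bool {
    match host_of(url) {
        // HashSet::insert returns false if the value was already present.
        Some(host) if host == start_host => seen.insert(url.to_string()),
        _ => false,
    }
}
```

With a fresh HashSet, should_visit("https://example.com/a", "example.com", &mut seen) accepts the URL on the first call and rejects it on the second, while a URL on any other host is always rejected. A real crawler would more likely rely on a dedicated URL parser to resolve relative links; that part is left out here.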

Installation

  1. Clone the repository:
git clone https://github.com/csmb/rusty_spider.git
cd rusty_spider
  2. Build the project:
cargo build --release

Usage

Run the crawler with a URL as an argument:

cargo run -- https://example.com

Images will be downloaded to the downloads directory with the following organization:

downloads/
├── jpg/                    # All JPG images
│   ├── example.com/        # Grouped by domain
│   │   ├── small/          # < 100KB
│   │   ├── medium/         # 100KB - 1MB
│   │   └── large/          # > 1MB
│   └── another-site.com/
└── gif/                    # All GIF images
    └── example.com/
        ├── small/
        ├── medium/
        └── large/

The crawler will:

  1. Create all necessary directories automatically
  2. Save only the highest quality version of each image
  3. Organize images by format (jpg/gif), domain, and size category
  4. Show progress as it downloads and organizes images
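
The directory layout and steps above can be sketched as a small path-building helper. This is an illustration under stated assumptions, not the crawler's actual code: the names size_category, destination, and ensure_parent_dir are hypothetical, 100KB and 1MB are assumed to mean 100 * 1024 and 1024 * 1024 bytes, and a file of exactly 100KB or exactly 1MB is assumed to count as medium, since the tree does not say which side the boundaries fall on.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Assumed thresholds; the README only states "100KB" and "1MB".
const SMALL_LIMIT: u64 = 100 * 1024; // 102,400 bytes
const LARGE_LIMIT: u64 = 1024 * 1024; // 1,048,576 bytes

/// Maps a file size in bytes to the "small", "medium", or "large" folder.
fn size_category(bytes: u64) -> &'static str {
    if bytes < SMALL_LIMIT {
        "small"
    } else if bytes <= LARGE_LIMIT {
        "medium"
    } else {
        "large"
    }
}

/// Builds downloads/<format>/<domain>/<size>/<file_name>.
fn destination(format: &str, domain: &str, bytes: u64, file_name: &str) -> PathBuf {
    ["downloads", format, domain, size_category(bytes), file_name]
        .iter()
        .collect()
}

/// Creates every missing directory above the destination file.
fn ensure_parent_dir(dest: &Path) -> io::Result<()> {
    match dest.parent() {
        Some(dir) => fs::create_dir_all(dir),
        None => Ok(()),
    }
}
```

Under these assumptions, a 2 KB JPG named a.jpg from example.com maps to downloads/jpg/example.com/small/a.jpg, and calling ensure_parent_dir on that path before writing creates the intermediate folders.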

License

MIT License
