Commit 635cedf

chore: add documentation
1 parent 451402f commit 635cedf

1 file changed, +158 -0 lines changed

# Web Scraper Architecture

## Project Definition

### What is it?
The web scraper is a command-line application written in Rust that recursively downloads and processes web pages starting from a given URL. It crawls websites by following links, downloads pages concurrently using multiple async workers, and stores them locally in a depth-organized directory structure that maintains the domain hierarchy while tracking the crawling depth of each page.

### Goals
- **Concurrent web crawling**: Download multiple pages simultaneously using async/await and tokio
- **Recursive link following**: Discover and follow links up to a specified depth with same-domain filtering
- **Depth-organized storage**: Organize downloaded content in folders that track crawling depth (depth_0, depth_1, etc.)
- **Command-line interface**: Provide an intuitive CLI with configurable output directory, depth, and worker count
- **Robust error handling**: Gracefully handle network errors, invalid URLs, and file system issues with detailed logging

## Components and Modules

### 1. CLI Module (`cli.rs`)
**Purpose**: Handle command-line argument parsing and validation.
- Parse command-line arguments (URL, output directory, depth, concurrency)
- Validate input parameters
- Display help information
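
A minimal sketch of how these options might be defined with clap's derive API; the struct and field names are illustrative rather than the project's actual code, but the defaults mirror the options documented under Command-line Options below.

```rust
// Hypothetical cli.rs sketch; defaults mirror the documented CLI options.
use clap::Parser;

#[derive(Parser, Debug)]
#[command(name = "webcrawl", about = "Recursively download and store web pages")]
pub struct Args {
    /// Starting URL to crawl (required)
    pub url: String,

    /// Output directory for downloaded pages
    #[arg(short, long, default_value = "./crawled")]
    pub output: String,

    /// Maximum crawling depth
    #[arg(short, long, default_value_t = 2)]
    pub depth: usize,

    /// Number of concurrent workers
    #[arg(short, long, default_value_t = 4)]
    pub workers: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```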

### 2. Crawler Engine (`crawler.rs`)
**Purpose**: Core crawling logic and coordination using `SimpleCrawler`.
- Manage the crawling queue and a `HashSet` of visited URLs for deduplication
- Coordinate multiple async worker tasks via mpsc channels
- Implement depth-limited crawling with round-robin work distribution
- Handle graceful worker shutdown and result processing
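
A condensed sketch of that coordination logic: a visited-URL `HashSet`, a depth cap, and round-robin hand-off over per-worker mpsc channels. The `WorkItem` shape and channel sizes are assumptions; the real `SimpleCrawler` also collects results from workers and feeds discovered links back into the queue.

```rust
use std::collections::{HashSet, VecDeque};
use tokio::sync::mpsc;

#[derive(Debug, Clone)]
struct WorkItem {
    url: String,
    depth: usize,
}

#[tokio::main]
async fn main() {
    let max_depth = 2;
    let num_workers = 4;

    // One channel per worker; the crawler hands out work round-robin.
    let mut senders = Vec::new();
    for id in 0..num_workers {
        let (tx, mut rx) = mpsc::channel::<WorkItem>(32);
        senders.push(tx);
        tokio::spawn(async move {
            while let Some(item) = rx.recv().await {
                // A real worker would download, parse and store the page here.
                println!("worker {id}: {} (depth {})", item.url, item.depth);
            }
        });
    }

    // The visited set prevents the same URL from being crawled twice.
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::from([WorkItem { url: "https://example.com".into(), depth: 0 }]);

    let mut next = 0;
    while let Some(item) = queue.pop_front() {
        if item.depth > max_depth || !visited.insert(item.url.clone()) {
            continue;
        }
        senders[next].send(item).await.expect("worker still running");
        next = (next + 1) % num_workers;
        // In the real crawler, workers send results (containing new links) back
        // over a result channel, and those links are queued at depth + 1.
    }

    // Give the spawned workers a moment to drain before the runtime shuts down;
    // the real crawler instead awaits their results explicitly.
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
}
```

Giving each worker its own channel keeps the round-robin distribution lock-free; the trade-off is that an idle worker cannot steal work queued for a busy one.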

### 3. Downloader Module (`downloader.rs`)
**Purpose**: Handle HTTP requests and page downloading with async support.
- Make asynchronous HTTP requests using reqwest with a 30-second timeout
- Set a custom user-agent and handle request errors properly
- Return page content and metadata for processing
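
A sketch of the reqwest setup these bullets imply: one shared `Client` built with a 30-second timeout and a custom user-agent. The wrapper type, the user-agent string, and the returned metadata are assumptions.

```rust
use std::time::Duration;

use anyhow::Result;
use reqwest::Client;

/// Hypothetical downloader wrapper; the real downloader.rs may also return
/// status codes, content types, or the final URL after redirects.
pub struct Downloader {
    client: Client,
}

impl Downloader {
    pub fn new() -> Result<Self> {
        let client = Client::builder()
            .timeout(Duration::from_secs(30)) // 30-second request timeout
            .user_agent("webcrawl/0.1")       // assumed user-agent string
            .build()?;
        Ok(Self { client })
    }

    /// Fetch a page and return its body as text, treating HTTP error statuses as failures.
    pub async fn fetch(&self, url: &str) -> Result<String> {
        let response = self.client.get(url).send().await?.error_for_status()?;
        Ok(response.text().await?)
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let downloader = Downloader::new()?;
    let body = downloader.fetch("https://example.com").await?;
    println!("downloaded {} bytes", body.len());
    Ok(())
}
```

Building the `Client` once and sharing it also lets reqwest reuse connections across requests instead of opening a new one per page.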

### 4. Parser Module (`parser.rs`)
**Purpose**: Extract links from downloaded HTML pages with filtering.
- Parse HTML content using the scraper crate with CSS selectors
- Extract and normalize URLs from anchor tags (`<a href="...">`)
- Filter to same-domain links only (excludes external sites)
- Remove URL fragments and handle duplicates
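
A sketch of that extraction logic using `scraper` for the CSS selector plus the `url` crate (an assumed dependency) for resolving relative hrefs, dropping fragments, and same-domain filtering; the function name is illustrative.

```rust
use std::collections::HashSet;

use scraper::{Html, Selector};
use url::Url;

/// Extract unique, same-domain links from an HTML page.
pub fn extract_links(base: &Url, html: &str) -> Vec<Url> {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a[href]").expect("valid selector");
    let mut seen = HashSet::new();
    let mut links = Vec::new();

    for element in document.select(&selector) {
        let Some(href) = element.value().attr("href") else { continue };
        // Resolve relative links against the page URL.
        let Ok(mut link) = base.join(href) else { continue };
        link.set_fragment(None); // drop #fragments
        // Keep only links on the same domain as the starting page.
        if link.domain() == base.domain() && seen.insert(link.to_string()) {
            links.push(link);
        }
    }
    links
}

fn main() {
    let base = Url::parse("https://example.com/").unwrap();
    let html = r#"<a href="/about">About</a> <a href="https://other.site/">x</a>"#;
    for link in extract_links(&base, html) {
        println!("{link}"); // prints https://example.com/about only
    }
}
```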

### 5. Storage Module (`storage.rs`)
**Purpose**: Manage file system operations with depth-based organization.
- Create hierarchical directory structures organized by crawling depth
- Save downloaded pages to depth-specific folders (depth_0, depth_1, etc.)
- Handle file naming conflicts and path sanitization
- Convert URLs to appropriate file paths maintaining domain structure
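
A sketch of the URL-to-path mapping that would produce the layout shown under Output Structure below (depth folder, then domain, then the URL path, with `index.html` as the leaf). Path sanitization and conflict handling are left out, and the function name is illustrative.

```rust
use std::path::{Path, PathBuf};
use url::Url;

/// Map a crawled URL and its depth onto a file path such as
/// `output/depth_1/example.com/about/index.html`.
pub fn page_path(output_dir: &Path, url: &Url, depth: usize) -> PathBuf {
    let mut path = output_dir.join(format!("depth_{depth}"));
    path.push(url.domain().unwrap_or("unknown"));
    // Append each non-empty URL path segment as a directory.
    for segment in url.path_segments().into_iter().flatten() {
        if !segment.is_empty() {
            path.push(segment);
        }
    }
    path.push("index.html");
    path
}

fn main() {
    let url = Url::parse("https://example.com/about/team").unwrap();
    let path = page_path(Path::new("./output"), &url, 2);
    assert_eq!(path, PathBuf::from("./output/depth_2/example.com/about/team/index.html"));
    println!("{}", path.display());
}
```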

### 6. Worker Module (`worker.rs`)
**Purpose**: Handle concurrent downloading tasks with message-passing coordination.
- Define `WorkItem` and `WorkResult` message types for communication
- Implement async workers that process URLs from a shared channel
- Coordinate downloader, parser, and storage operations
- Handle round-robin work distribution through mpsc channels
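
A sketch of the message types and worker loop implied here; the exact fields of `WorkItem` and `WorkResult` are assumptions, and the download, parse, and store steps are stubbed out.

```rust
use tokio::sync::mpsc;

/// A URL handed to a worker, together with the depth it was discovered at.
#[derive(Debug, Clone)]
pub struct WorkItem {
    pub url: String,
    pub depth: usize,
}

/// What a worker reports back: the page it processed and the links it found.
#[derive(Debug)]
pub struct WorkResult {
    pub url: String,
    pub depth: usize,
    pub links: Vec<String>,
}

/// Worker loop: receive items, process them, send results back to the crawler.
pub async fn run_worker(mut work: mpsc::Receiver<WorkItem>, results: mpsc::Sender<WorkResult>) {
    while let Some(item) = work.recv().await {
        // In the real worker this is downloader::fetch -> parser::extract_links
        // -> storage::save; here the pipeline is stubbed with an empty link list.
        let links = Vec::new();
        let result = WorkResult { url: item.url, depth: item.depth, links };
        if results.send(result).await.is_err() {
            break; // the crawler has shut down
        }
    }
    // The work channel was closed by the crawler: exit gracefully.
}

#[tokio::main]
async fn main() {
    let (work_tx, work_rx) = mpsc::channel(8);
    let (result_tx, mut result_rx) = mpsc::channel(8);
    tokio::spawn(run_worker(work_rx, result_tx));

    work_tx.send(WorkItem { url: "https://example.com".into(), depth: 0 }).await.unwrap();
    drop(work_tx); // closing the channel lets the worker finish

    while let Some(result) = result_rx.recv().await {
        println!("processed {} at depth {}", result.url, result.depth);
    }
}
```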

## Module Interactions

```
   CLI
    |
    v
 Crawler ←→ Worker Pool
    |            |
    v            v
  Parser ←→ Downloader
    |            |
    v            v
        Storage
```

1. **CLI** parses arguments and initializes the **Crawler**
2. **Crawler** creates a pool of **Workers** and manages the crawling queue
3. **Workers** use the **Downloader** to fetch pages
4. Downloaded content is processed by the **Parser** to extract links
5. **Storage** saves pages and creates directory structure
6. New links are fed back to the **Crawler** queue (a compressed sketch of this flow follows the list)
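
To make the flow concrete, here is a single-file sketch of one crawl (fetch, extract links, save, enqueue) that collapses the modules above into one loop and skips the worker pool. It is an illustration of the data flow only, not the project's actual code: errors abort the run instead of being logged, and the simplified storage step writes one `index.html` per domain and depth.

```rust
use std::{collections::{HashSet, VecDeque}, fs, path::PathBuf, time::Duration};

use anyhow::Result;
use scraper::{Html, Selector};
use url::Url;

#[tokio::main]
async fn main() -> Result<()> {
    let start = Url::parse("https://example.com/")?;
    let max_depth = 1;
    let client = reqwest::Client::builder().timeout(Duration::from_secs(30)).build()?;
    let selector = Selector::parse("a[href]").expect("valid selector");

    let mut visited = HashSet::new();
    let mut queue = VecDeque::from([(start.clone(), 0usize)]);

    while let Some((url, depth)) = queue.pop_front() {
        if depth > max_depth || !visited.insert(url.to_string()) {
            continue;
        }
        // Step 3: download the page (a real crawler would log failures and continue).
        let body = client.get(url.as_str()).send().await?.text().await?;

        // Step 5: save it under output/depth_N/<domain>/ (path handling simplified).
        let dir = PathBuf::from(format!("output/depth_{depth}/{}", url.domain().unwrap_or("unknown")));
        fs::create_dir_all(&dir)?;
        fs::write(dir.join("index.html"), &body)?;

        // Steps 4 and 6: extract same-domain links and queue them one level deeper.
        for anchor in Html::parse_document(&body).select(&selector) {
            if let Some(href) = anchor.value().attr("href") {
                if let Ok(mut link) = url.join(href) {
                    link.set_fragment(None);
                    if link.domain() == start.domain() {
                        queue.push_back((link, depth + 1));
                    }
                }
            }
        }
    }
    Ok(())
}
```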

### Architecture Justification

This modular design provides:
- **Separation of concerns**: Each module has a single responsibility
- **Testability**: Modules can be unit tested independently (29 comprehensive unit tests included)
- **Concurrency**: Async worker-based design enables efficient parallel processing
- **Extensibility**: Easy to add features like robots.txt support or different output formats
- **Error isolation**: Failures in one component don't crash the entire application

### Key Technologies
- **Rust 2024 Edition**: Memory-safe systems programming with excellent async support
- **Tokio**: Async runtime for concurrent operations and channels
- **Reqwest**: HTTP client for reliable web requests with timeout handling
- **Scraper**: HTML parsing with CSS selector support
- **Clap**: Command-line argument parsing with derive macros
- **Anyhow**: Unified error handling across all modules
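
The stack above corresponds to a `Cargo.toml` along these lines; the version numbers are illustrative placeholders rather than the project's pinned versions, and the `url` crate is an assumed extra dependency for link normalization.

```toml
[package]
name = "webcrawl"
version = "0.1.0"
edition = "2024"

[dependencies]
tokio = { version = "1", features = ["full"] }    # async runtime, tasks, mpsc channels
reqwest = "0.12"                                  # HTTP client with timeout support
scraper = "0.19"                                  # HTML parsing with CSS selectors
clap = { version = "4", features = ["derive"] }   # CLI argument parsing (derive macros)
anyhow = "1"                                      # unified error handling
url = "2"                                         # URL joining/normalization (assumed)
```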

## Usage

### Installation
To use the `webcrawl` command directly from anywhere in your system:

```bash
# Install to ~/.cargo/bin (make sure it's in your PATH)
cargo install --path .

# Then you can use webcrawl directly
webcrawl --output ./crawled_url --depth 10 https://example.com
```

### Basic Usage

```bash
# Crawl a website with default settings
webcrawl https://example.com

# Specify output directory and depth
webcrawl --output ./crawled_data --depth 3 https://example.com

# Control concurrency
webcrawl --output ./output --depth 2 --workers 5 https://example.com
```

### Command-line Options
- `<URL>`: Starting URL to crawl (required)
- `--output, -o`: Output directory for downloaded pages (default: "./crawled")
- `--depth, -d`: Maximum crawling depth (default: 2)
- `--workers, -w`: Number of concurrent workers (default: 4)
- `--help, -h`: Display help information

### Output Structure
The downloaded pages are organized hierarchically by crawling depth and URL path:

```
output/
├── depth_0/
│   └── example.com/
│       └── index.html          # Root page (depth 0)
├── depth_1/
│   └── example.com/
│       ├── about/
│       │   └── index.html      # /about page (depth 1)
│       └── products/
│           └── index.html      # /products page (depth 1)
└── depth_2/
    └── example.com/
        ├── about/
        │   └── team/
        │       └── index.html  # /about/team page (depth 2)
        └── products/
            └── software/
                └── index.html  # /products/software page (depth 2)
```

This depth-based organization allows easy tracking of how deep each page was discovered in the crawling process and provides clear separation between different crawling levels.

### Example Usage Scenarios

1. **Website backup**: `webcrawl --depth 5 --output ./backup https://mysite.com`
2. **Content analysis**: `webcrawl --depth 2 --workers 8 https://news.site.com`
3. **Link validation**: `webcrawl --depth 1 https://example.com` (shallow crawl)
