A simple web crawler written in Go that fetches, parses, and archives web pages into MongoDB.
Built as a learning project to explore Go's concurrency primitives, HTTP handling, and HTML parsing.
Requires Go 1.24+ and a running MongoDB instance.
git clone https://github.com/sahitya-chandra/web-crawler.git
cd web-crawler
go mod downloadCreate a .env file:
MONGODB_URI=mongodb://localhost:27017
go run main.go # defaults: example.com, 500 pages
go run main.go -url https://go.dev/ -max 100 # custom URL and limit
go run main.go -delay 500ms # faster crawling| Flag | Default | Description |
|---|---|---|
-url |
https://example.com/ |
Starting URL to crawl |
-max |
500 |
Maximum pages to crawl |
-delay |
1s |
Delay between requests |
-db |
crawlerArchive |
MongoDB database name |
-collection |
webpages |
MongoDB collection name |
- Dequeues a URL from a FIFO queue
- Fetches the page via HTTP (with timeout and
User-Agent) - Parses HTML to extract the title, body text (first 500 words), and links
- Stores the page in MongoDB
- Enqueues discovered links that haven't been seen yet
- Waits for the configured delay, then repeats
Stops when the queue is empty, the page limit is reached, or Ctrl+C is pressed.
main.go Entry point — CLI flags, crawl loop, graceful shutdown
crawler/crawler.go HTTP fetching, HTML parsing, link normalization
queue/queue.go Thread-safe FIFO URL queue
db/db.go MongoDB connection and document insertion
make build # compile binary
make run # build and run
make test # run tests with race detector (go test -race ./...)
make fmt # format code
make check # fmt + vet + test
make clean # remove binary