This is a simple but well-built web crawler, implemented entirely in Go.
The crawler starts from a seed file containing a URL, fetches that page, parses it, extracts the configured web elements (e.g. links with HTML suffixes), and saves the results to a local file. The crawl is bounded by the crawl depth parameter in the configuration file, which limits how many links deep the crawler follows from the seed.
This crawler is concurrent rather than distributed: multiple goroutines share a single work queue within one process. Its main features:
- configurable maximum crawl depth
- configurable crawl interval and request timeout
- concurrent execution with multiple goroutines
- multiple configurable crawl sources (starting points) and a configurable output path for crawl results
You'll need to install Go locally first.
```shell
# clone the repository
git clone https://github.com/vinsec/go-spider.git
cd go-spider

# create the required directories
mkdir -p {output,bin}

# build the project
cd src && go build -o ../bin/go-spider .

# edit the seed file (the starting site for crawling) to point at the site you want to crawl
cd - && vim data/seed

# run the spider
cd bin && ./go-spider
```
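The seed file is the crawl's starting point. Its exact format isn't shown in this README; assuming the common one-URL-per-line convention, it might look like:

```
https://example.com/
https://example.org/index.html
```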
The default configuration file is conf/spider.conf; you can adjust its parameters before running.
When you see the message "request queue nil, all sub spiders are idle, Spider exit.", crawling has finished and the results are stored in the output directory.
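The parameter names below are illustrative, not copied from the repository; they simply mirror the configurable features listed above (depth, interval, timeout, goroutine count, output path):

```
[spider]
urlListFile     = ../data/seed   ; seed file with start URLs
outputDirectory = ../output      ; where crawl results are written
maxDepth        = 1              ; maximum crawl depth
crawlInterval   = 1              ; seconds between requests
crawlTimeout    = 1              ; per-request timeout in seconds
threadCount     = 8              ; number of crawler goroutines
```

Check conf/spider.conf in the repository for the actual keys and defaults.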
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
- vinsec - Go-Spider
Project development supported by JetBrains.
This project is licensed under the Apache License 2.0.