Sparkler Usage
Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command-line flags to help you do this.
```shell
./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test
```
This example says: crawl bbc.co.uk and label the job with the id test. The id is optional; if you don't supply one, a job id is generated and returned to you.
A crawl always has two steps. The inject phase pre-seeds the database with your starting URLs; the crawl phase then iterates through the seeded URLs and populates the database with the crawl results.
The default configuration file, sparkler-default.yaml, lives in the conf directory. It contains sensible defaults for most settings, and lets you configure plugins, headers, Kafka, and more.
The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between successive crawl requests; this stops Sparkler spamming servers with undue load, and also makes it look a little less like a robot.
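As a sketch, the delay setting might look like this in sparkler-default.yaml. The key name `fetcher.server.delay` and the millisecond unit are assumptions based on typical Sparkler setups, so verify them against your own copy of the file:

```yaml
# Hypothetical excerpt from conf/sparkler-default.yaml.
# fetcher.server.delay is assumed to be in milliseconds.
fetcher.server.delay: 1000   # wait roughly one second between requests to the same server
```

Raising this value makes the crawl slower but gentler on the target servers.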
You can also set the fetcher headers: these are the standard HTTP headers sent with each request to make Sparkler look like a regular browser.
Finally, you can enable the fetcher.user.agents property, which cycles through the user-agent strings listed in the referenced file.
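A minimal sketch of what these two settings could look like follows; the key names (`fetcher.headers`, `fetcher.user.agents`), the header values, and the `user-agents.txt` filename are all illustrative assumptions, not the shipped defaults, so check the actual file before relying on them:

```yaml
# Hypothetical excerpt from conf/sparkler-default.yaml; key names and
# values are assumptions -- check the shipped file for the exact spelling.
fetcher.headers:
  User-Agent: "Mozilla/5.0 (compatible; Sparkler)"   # illustrative value
  Accept: "text/html,application/xhtml+xml"
  Accept-Language: "en-US,en"
fetcher.user.agents: user-agents.txt   # assumed: a file of user-agent strings to rotate through
```

When user-agent cycling is enabled, each request picks a different string from the file, which spreads requests across several apparent browser identities.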