
# Sparkler Usage


## Basics

### A Simple Crawl

Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command line flags to help you do this.

```bash
./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test
```

This example tells Sparkler to crawl bbc.co.uk and to label the job with the id test. The id is optional; if you don't supply one, Sparkler generates a job id for you.

Crawls always happen in two steps: the inject phase pre-seeds the database with your start URLs, then the crawl phase iterates over the seeded URLs and writes the crawl results back to the database.
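As a sketch of a slightly fuller session, assuming your build supports an `-i` (iterations) flag on the crawl command (run `./sparkler.sh crawl` with no arguments to list the exact options your version accepts):

```bash
# Pre-seed the crawl database with a start URL and label the job "news"
./sparkler.sh inject -su news.bbc.co.uk -id news

# Iterate over the seeded (and newly discovered) URLs;
# -i is assumed here to cap the number of crawl iterations
./sparkler.sh crawl -id news -i 2
```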

## Configuration

The default configuration file, sparkler-default.yaml, lives in the conf directory. It ships with sensible defaults for most settings, and it is where you configure plugins, request headers, Kafka integration, and more.

### Fetcher Properties

The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between successive crawl requests to the same server. This stops Sparkler from spamming servers with undue load, and it also makes the crawler look a little less like a robot.

You can also set the fetcher headers: the standard HTTP headers sent with each request to make the crawler look like a regular browser.

You can also enable the fetcher.user.agents property, which cycles through the user agent strings listed in the file it points to, so the User-Agent header rotates between requests.
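As a hedged illustration, the fetcher section of sparkler-default.yaml looks roughly like the excerpt below. The key names follow the properties described above, but the values (and the exact header set) are placeholders, so treat your copy of the file as the source of truth:

```yaml
# Pause, in milliseconds, between successive requests to the same server
fetcher.server.delay: 1000

# Standard headers sent with every request to mimic a regular browser
# (the User-Agent string here is illustrative, not Sparkler's default)
fetcher.headers:
  User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  Accept-Language: "en-US,en;q=0.8"

# Optional: point at a file of user agent strings to rotate through
fetcher.user.agents: user-agents.txt
```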

## Enabling Plugins

### Basic Plugins

Sparkler ships with a handful of basic plugins (a sketch of enabling them in the conf file follows this list):

- Fetcher HTMLUnit
- Regex (URL filter)
- Samehost (URL filter)
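Enabling a plugin generally means listing its id in the active plugins section of sparkler-default.yaml and then supplying any per-plugin settings. The ids and keys below are a sketch inferred from the plugin names above; check sparkler-default.yaml and each plugin's own wiki page for the exact ids and options:

```yaml
# Activate plugins by id (ids are illustrative; confirm against your conf file)
plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  - fetcher-htmlunit

# Per-plugin settings, e.g. the rules file for the regex URL filter
plugins:
  urlfilter.regex:
    urlfilter.regex.file: regex-urlfilter.txt
```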

## Advanced Usage

### Plugins

- Fetcher Chrome

### URL Injector

- POST/PUT Commands
- Config Override
- Additional Fields
