
# Sparkler Usage


## Basics

### A Simple Crawl

Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command line flags to help you do this.

```bash
./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test
```

This example tells Sparkler to crawl bbc.co.uk and to label the job with the id test. The id is optional; if you don't supply one, Sparkler generates a job id for you.

Crawls always happen in two steps: the inject phase pre-seeds the database with your start URLs, then the crawl phase iterates over the seeded URLs and writes the crawl results back to the database.
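As a sketch of a slightly fuller session, assuming your build supports an `-i` (iterations) flag on the crawl command (run `./sparkler.sh crawl` with no arguments to list the exact options your version accepts):

```bash
# Pre-seed the crawl database with a start URL and label the job "news"
./sparkler.sh inject -su news.bbc.co.uk -id news

# Iterate over the seeded (and newly discovered) URLs;
# -i is assumed here to cap the number of crawl iterations
./sparkler.sh crawl -id news -i 2
```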

## Configuration

The default configuration file, sparkler-default.yaml, lives in the conf directory. It ships with sensible defaults for most settings, and it is where you configure plugins, request headers, Kafka integration, and more.

### Fetcher Properties

The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between successive crawl requests to the same server. This stops Sparkler from spamming servers with undue load, and it also makes the crawler look a little less like a robot.

You can also set the fetcher headers: the standard HTTP headers sent with each request to make the crawler look like a regular browser.

You can also enable the fetcher.user.agents property, which cycles through the user agent strings listed in the file it points to, so the User-Agent header rotates between requests.
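As a hedged illustration, the fetcher section of sparkler-default.yaml looks roughly like the excerpt below. The key names follow the properties described above, but the values (and the exact header set) are placeholders, so treat your copy of the file as the source of truth:

```yaml
# Pause, in milliseconds, between successive requests to the same server
fetcher.server.delay: 1000

# Standard headers sent with every request to mimic a regular browser
# (the User-Agent string here is illustrative, not Sparkler's default)
fetcher.headers:
  User-Agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  Accept-Language: "en-US,en;q=0.8"

# Optional: point at a file of user agent strings to rotate through
fetcher.user.agents: user-agents.txt
```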

## Enabling Plugins

### Basic Plugins

Sparkler ships with a handful of basic plugins (a sketch of enabling them in the conf file follows this list):

- Fetcher HTMLUnit
- Regex (URL filter)
- Samehost (URL filter)
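Enabling a plugin generally means listing its id in the active plugins section of sparkler-default.yaml and then supplying any per-plugin settings. The ids and keys below are a sketch inferred from the plugin names above; check sparkler-default.yaml and each plugin's own wiki page for the exact ids and options:

```yaml
# Activate plugins by id (ids are illustrative; confirm against your conf file)
plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  - fetcher-htmlunit

# Per-plugin settings, e.g. the rules file for the regex URL filter
plugins:
  urlfilter.regex:
    urlfilter.regex.file: regex-urlfilter.txt
```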

## Advanced Usage

### Plugins

- Fetcher Chrome

### URL Injector

- POST/PUT Commands
- Config Override
- Additional Fields
