Sparkler Usage
Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command line flags to help you do this.
./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test
This example tells Sparkler to crawl bbc.co.uk and label the job with the id test. The id is optional; if you don't supply one, Sparkler generates a job id and returns it to you.
Crawls always happen in two steps: the inject phase seeds the database with your starting urls, then the crawl phase iterates over the seeded urls and populates the database with the crawl results.
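For example, if you let the inject step generate the job id, you pass that id back to the crawl step. The job id shown below is made up, and the -tn (top N urls per iteration) and -i (number of iterations) flags are an assumption based on recent releases, so check the help output of sparkler.sh crawl for your version.
./sparkler.sh inject -su bbc.co.uk
./sparkler.sh crawl -id sjob-1465352569649 -tn 100 -i 2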
The default configuration file, sparkler-default.yaml, lives in the conf directory. It contains sensible defaults for most things and is where you set plugins, headers, Kafka config and more.
The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between crawl requests; this stops Sparkler spamming servers and causing undue load, and also makes us look a little less like a robot.
You can also set the fetcher headers, the standard headers sent with each request to make you look like a browser.
You can also enable the fetcher.user.agents property, which will cycle through the user agent strings in the referenced file.
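As a rough sketch of that part of the file (the key names mirror the settings mentioned above, but the exact names, values and header list here are placeholders, so compare against your own sparkler-default.yaml):
fetcher.server.delay: 1000
fetcher.headers:
  User-Agent: Mozilla/5.0 (compatible; Sparkler)
  Accept: text/html,application/xhtml+xml,application/xml
  Accept-Language: en-US,en
#fetcher.user.agents: user-agents.txt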
You enable plugins by editing the plugins.active block. This block lists the plugins shipped with Sparkler, and you can enable or disable any of them by removing or adding the # comment symbol.
Enabled by default are the urlfilter-regex and urlfilter-samehost plugins.
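A rough sketch of that block is below; the two default plugins are active, and the commented-out entries are just examples of other shipped plugins you might switch on.
plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  #- fetcher-htmlunit
  #- fetcher-chrome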
These plugins provide a couple of sensible functions that let Sparkler crawl without downloading the world. Regex filters out some of the urls and links it picks up so it doesn't download loads of useless stuff, and samehost, by default, keeps your crawl limited to the same domain.
The samehost plugin does what it says on the tin: it ensures the crawl stays on the same host, so you don't end up in a completely different domain crawling completely different stuff. Of course, you may want exactly that, in which case disable this plugin.
The regex plugin provides more flexibility than the samehost plugin. Out of the box it prevents a number of file urls from being picked up, so, for example, you don't crawl PDFs, videos, images and so on. It also filters out ftp sites, mailto addresses, infinite loops and local files.
To adjust the filtering, you can simply edit the regex-urlfilter.txt file, which holds all the regular expressions used for matching.
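The file follows the familiar one-pattern-per-line convention, with - rejecting a match and + accepting it. The entries below are a sketch of the kind of rules it contains, not a verbatim copy of the shipped file.
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# skip images, media and other binary suffixes
-\.(gif|jpg|jpeg|png|ico|css|js|pdf|mp4|zip|gz)$
# break out of infinite loops caused by repeating path segments
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+.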
Also supplied with Sparkler is the fetcher-htmlunit plugin. This is a slightly different browser backend that lets you crawl sites with a different engine. If you find the basic default (fastest) fetcher doesn't work, have a look at this one, and if that doesn't work either, check out the fetcher-chrome plugin below for more support.
Fetcher Chrome is a plugin that lets you run crawls using a full headless Chrome browser. It uses the Selenium engine to drive the browser, so the easiest way to get going is to run the browserless/chrome docker image and point Sparkler at that. You can of course run your own Chrome, but the docker image is tried and tested.
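For instance, something like the following starts a browser that Sparkler can drive; the port mapping assumes the image's default port of 3000, so adjust it to however you expose the container.
docker run -d -p 3000:3000 browserless/chrome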
There are various settings you can configure to alter the default behaviour.
Firstly there is the chrome.dns setting. This allows users to set the IP address of the Chrome instance.
Then there is chrome.proxy.address. Because Selenium is a user interaction emulator, it doesn't return http status codes, header information and so on; to fix that problem we run the BrowserUp proxy. If you don't provide an address, Sparkler will launch a local proxy, configure it appropriately and run it inside the Sparkler instance. If you require more flexibility, you can launch your own BrowserUp proxy and point this configuration variable at it.
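A sketch of those two settings, assuming the browserless container above and a separately launched proxy; the endpoint path and ports are assumptions, and the exact key prefixes depend on your plugin version, so check the shipped sparkler-default.yaml.
chrome.dns: http://localhost:3000/webdriver   # address of the Chrome instance (assumed endpoint)
chrome.proxy.address: localhost:8080          # omit this to let Sparkler launch its own BrowserUp proxy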
Lastly there are a number of selenium configuration options. These let you hit a site, let it render, and then interact with it before grabbing the output: you might enter something in a search box, click a button, filter a list, whatever. The syntax is reasonably strict; it understands click and keys, and within those, id, class and the pointer names.
You can also set chrome.wait.element. This is for when you need a specific element to be rendered before grabbing the text: if your site loads asynchronously, Sparkler will not otherwise wait for those elements and you will end up with a semi-rendered state. To resolve this, set the wait element and it will wait for that element to render before running the scrape. You can also set the wait type and the timeout when configuring this option.
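Pulling that together, here is a sketch of what the interaction and wait settings could look like. The chrome.wait.* names follow the setting named above, but chrome.selenium.script and the exact syntax of the click and keys steps are hypothetical illustrations, so treat every value here as an assumption and confirm against the shipped sparkler-default.yaml.
chrome.selenium.script:               # hypothetical key holding the interaction steps
  - keys:id:search-box:sparkler       # type "sparkler" into the element with id search-box
  - click:class:submit-button         # then click the element with class submit-button
chrome.wait.element: results          # wait for this element to render before scraping
chrome.wait.type: id                  # how the wait element is located (assumed)
chrome.wait.timeout: 30               # seconds to wait before giving up (assumed)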