sparkler 0.1
Quick Start Guide
Apache Solr (Tested on 6.0.1)
# A place to keep all the files organized
mkdir ~/work/sparkler/ -p
cd ~/work/sparkler/
# Download Solr Binary
wget "http://archive.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz"  # pick your version and mirror
# Extract Solr
tar xvzf solr-6.0.1.tgz
# Add crawldb config sets
cd solr-6.0.1/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/
There are many ways to do this. Here is a relatively easy way to start Solr with the crawldb core:
# from the solr extracted directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start
Wait a while for Solr to start, then open http://localhost:8983/solr/#/~cores/ in your browser. Click Add Core, fill in 'crawldb' for both the name and instanceDir form fields, and click Add Core.
After the above steps you should have a core named "crawldb" in Solr. You can verify it by opening http://localhost:8983/solr/crawldb/select?q=* in your browser. This link should return a valid Solr response with 0 documents.
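The same check can be done from the shell instead of a browser; a quick sketch, assuming Solr is running locally on the default port 8983 (`wt=json` asks Solr for a JSON response):

```shell
# Query the empty crawldb core; a healthy core returns a JSON response
# with "numFound":0 before any URLs are injected.
curl -s "http://localhost:8983/solr/crawldb/select?q=*:*&wt=json" \
  || echo "Solr is not reachable on localhost:8983"
```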
Now that the crawldb core is ready, proceed to the Inject Seed URLs phase.
Create a file called seed.txt and enter your seed URLs, one per line. Example:
http://nutch.apache.org/
http://tika.apache.org/
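The seed file can also be created directly from the shell; a quick sketch using the two example URLs above (the later inject command assumes seed.txt is in the current directory):

```shell
# Write the seed URLs, one per line, to seed.txt in the current directory
printf '%s\n' \
  "http://nutch.apache.org/" \
  "http://tika.apache.org/" > seed.txt
cat seed.txt
```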
If you have not already done so, build the `sparkler-app` jar by following the Build and Deploy instructions.
To inject URLs, run the following command.
$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649
This step injected 2 URLs and returned a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb later, pass this job id to the inject command. Usage:
$ java -jar sparkler-app-0.1.jar inject 
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)
For example:
   java -jar sparkler-app-0.1.jar inject -id sparkler-job-1465352569649 \
      -su http://www.bbc.com/news -su http://espn.go.com/
To see these URLs in the crawldb, open: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group
NOTE: the Solr URL can be updated in the `sparkler-[default|site].properties` file.
To run a crawl:
$ java -jar sparkler-app-0.1.jar crawl
 -i (--iterations) N  : Number of iterations to run
 -id (--id) VAL       : Job id. When not sure, get the job id from injector
                        command
 -m (--master) VAL    : Spark Master URI. Ignore this if job is started by
                        spark-submit
 -o (--out) VAL       : Output path, default is job id
 -tg (--top-groups) N : Max groups to be selected for fetch.
 -tn (--top-n) N      : Top urls per domain to be selected for a round
Example:
java -jar sparkler-app-0.1.jar crawl -id sparkler-job-1465352569649 -m local[*] -i 1