
sparkler 0.1

Madhav Sharan edited this page Oct 14, 2016 · 17 revisions

Table of Contents

Sparkler v0.1

Quick Start Guide

Requirements

Apache Solr (Tested on 6.0.1)

Steps

Download Apache Solr

# A place to keep all the files organized
mkdir -p ~/work/sparkler/
cd ~/work/sparkler/
# Download Solr Binary
wget "http://archive.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz"  # pick your version and mirror
# Extract Solr
tar xvzf solr-6.0.1.tgz
# Add crawldb config sets
cd solr-6.0.1/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/

Start Solr

Local Mode

There are many ways to do this; here is a relatively easy way to start Solr with the crawldb core:

# from the solr extracted directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start
Verify Solr

After the above steps you should have a core named "crawldb" in Solr. You can verify it by opening http://localhost:8983/solr/crawldb/select?q=* in your browser. This link should return a valid Solr response with 0 documents.
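The same check can be scripted; a minimal sketch using curl (assumes Solr is running locally on its default port 8983):

```shell
# Build the verification query against the crawldb core
# (port 8983 is Solr's default; adjust if yours differs)
SOLR_CORE_URL="http://localhost:8983/solr/crawldb"
VERIFY_URL="${SOLR_CORE_URL}/select?q=*:*&rows=0"
echo "Checking: ${VERIFY_URL}"
# Fetch the response; prints a hint if Solr is not up yet
curl -s "${VERIFY_URL}" || echo "Solr does not appear to be running"
```

A healthy response reports `"numFound":0` right after the core is created.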

Now that the crawldb core is ready, proceed to the Inject Seed URLs phase.

Cloud mode

// Coming soon

Inject Seed URLs

Create a file called seed.txt and enter your seed URLs, one per line. Example:

http://nutch.apache.org/
http://tika.apache.org/
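The seed file can also be created directly from the shell; a quick sketch using the two example URLs above:

```shell
# Write the two example seed URLs, one per line
printf '%s\n' "http://nutch.apache.org/" "http://tika.apache.org/" > seed.txt
# Sanity check: count the seeds
SEED_COUNT=$(wc -l < seed.txt)
echo "seed.txt contains ${SEED_COUNT} URLs"
```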

If not already, build the `sparkler-app` jar referring to Build and Deploy instructions.

To inject URLs, run the following command.

$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO  Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649

This step injected 2 URLs and also returned a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb for the same job in a later phase, pass this job id. Usage:

$ java -jar sparkler-app-0.1.jar inject 
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)

For example:

   java -jar sparkler-app-0.1.jar inject -id sparkler-job-1465352569649 \
      -su http://www.bbc.com/news -su http://espn.go.com/

To see these URLs in crawldb, open: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group
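On the command line, quote the URL so the shell does not treat each `&` as a background operator; a sketch (assumes the default local Solr):

```shell
# The facet query from above; single quotes keep the shell from
# splitting the URL at each '&'
FACET_URL='http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group'
# Fetch it if Solr is up; print a hint otherwise
curl -s "$FACET_URL" || echo "Is Solr running on port 8983?"
```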

// NOTE: the Solr URL can be changed in the `sparkler-[default|site].properties` file

Run Crawl

To run a crawl:

$ java -jar sparkler-app-0.1.jar crawl
 -i (--iterations) N  : Number of iterations to run
 -id (--id) VAL       : Job id. When not sure, get the job id from injector
                        command
 -m (--master) VAL    : Spark Master URI. Ignore this if job is started by
                        spark-submit
 -o (--out) VAL       : Output path, default is job id
 -tg (--top-groups) N : Max groups to be selected for fetch
 -tn (--top-n) N      : Top urls per domain to be selected for a round

Example :

    java -jar sparkler-app-0.1.jar crawl -id sparkler-job-1465352569649  -m local[*] -i 1