sparkler 0.1
Quick Start Guide
Apache Solr (Tested on 6.0.1)
# A place to keep all the files organized
mkdir ~/work/sparkler/ -p
cd ~/work/sparkler/
# Download Solr Binary
wget "http://archive.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz"  # pick your version and mirror
# Extract Solr
tar xvzf solr-6.0.1.tgz
# Add crawldb config sets
cd solr-6.0.1/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/
There are many ways to do this. Here is a relatively easy way to start Solr with the crawldb core:
# from the solr extracted directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start
Wait a while for Solr to start, then open http://localhost:8983/solr/#/~cores/ in your browser. Click Add Core, fill in 'crawldb' for both the name and instanceDir form fields, and click Add Core.
After the above steps you should have a core named "crawldb" in Solr. You can verify it by opening http://localhost:8983/solr/crawldb/select?q=* in your browser. This link should return a valid Solr response with 0 documents.
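The same check can be done from the shell instead of a browser; a quick sketch, assuming Solr is running locally on the default port 8983 (`wt=json` asks Solr for a JSON response):

```shell
# Query the empty crawldb core; a healthy core returns a JSON response
# with "numFound":0 before any URLs are injected.
curl -s "http://localhost:8983/solr/crawldb/select?q=*:*&wt=json" \
  || echo "Solr is not reachable on localhost:8983"
```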
Now that the crawldb core is ready, proceed to the Inject Seed URLs phase.
Create a file called seed.txt and enter your seed URLs, one per line. Example:
http://nutch.apache.org/
http://tika.apache.org/
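The seed file can also be created directly from the shell; a quick sketch using the two example URLs above (the later inject command assumes seed.txt is in the current directory):

```shell
# Write the seed URLs, one per line, to seed.txt in the current directory
printf '%s\n' \
  "http://nutch.apache.org/" \
  "http://tika.apache.org/" > seed.txt
cat seed.txt
```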
If you have not already done so, build the `sparkler-app` jar by following the Build and Deploy instructions.
To inject URLs, run the following command.
$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649
This step injected 2 URLs and returned a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb later, pass this job id to the inject command. Usage:
$ java -jar sparkler-app-0.1.jar inject 
 -id (--job-id) VAL        : Id of an existing Job to which the urls are to be
                             injected. No argument will create a new job
 -sf (--seed-file) FILE    : path to seed file
 -su (--seed-url) STRING[] : Seed Url(s)
For example:
   java -jar sparkler-app-0.1.jar inject -id sparkler-job-1465352569649 \
      -su http://www.bbc.com/news -su http://espn.go.com/
To see these URLs in the crawldb, open: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group
NOTE: the Solr URL can be updated in the `sparkler-[default|site].properties` file.
To run a crawl:
$ java -jar sparkler-app-0.1.jar crawl
 -i (--iterations) N  : Number of iterations to run
 -id (--id) VAL       : Job id. When not sure, get the job id from injector
                        command
 -m (--master) VAL    : Spark Master URI. Ignore this if job is started by
                        spark-submit
 -o (--out) VAL       : Output path, default is job id
 -tg (--top-groups) N : Max groups to be selected for fetch.
 -tn (--top-n) N      : Top urls per domain to be selected for a round
Example:
java -jar sparkler-app-0.1.jar crawl -id sparkler-job-1465352569649 -m local[*] -i 1