
Clustering Tutorial

Thamme Gowda edited this page Apr 6, 2016 · 8 revisions

See [Build](build-instructions) for how to build the sources and obtain the executable jar.


Step 0 : Requirements

  • Have crawl segments from Apache Nutch in a consistent format.
  • Have the latest version of autoext-spark-xx-SNAPSHOT.jar

NOTE: This tutorial runs Spark in local mode, so -master local is specified in the steps below. To run in cluster mode, start the job with the spark-submit command instead of java -jar.

Step 1 : Partition data into parts

Clustering is a computationally expensive job!

So it is better to partition the dataset first and cluster only the documents of interest. For instance, there is no point in trying to cluster images and other non-HTML web pages by structure and style. This step partitions
mkdir test
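
The filtering idea described above (keep only HTML-like pages, drop images and other non-HTML content) can be illustrated with a minimal shell sketch. This is not the tool's partition command and does not read Nutch segments; the tab-separated "URL, content type" listing, the example URLs, and the variable names are all assumptions made for illustration.

```shell
# Hypothetical input: one "URL<TAB>content-type" pair per line.
# This listing format is an assumption for illustration, not a Nutch segment format.
listing=$(printf '%s\t%s\n' \
  'http://example.com/index.html' 'text/html' \
  'http://example.com/logo.png' 'image/png' \
  'http://example.com/about' 'application/xhtml+xml')

# Keep only the URLs whose content type is HTML or XHTML;
# images and other non-HTML documents are dropped before clustering.
html_only=$(printf '%s\n' "$listing" |
  awk -F '\t' '$2 == "text/html" || $2 == "application/xhtml+xml" { print $1 }')

printf '%s\n' "$html_only"
```

Only the entries that survive this kind of filter would be worth passing on to the clustering steps.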