Clustering Tutorial

Visit [Build](build-instructions) to build the sources and obtain the executable jar.

Table of Contents

Step 0 : Requirements

Step 1 : Partition data into parts

Step 0 : Requirements

Have crawl segments from Apache Nutch in a consistent format.
Have the latest version of autoext-spark-xx-SNAPSHOT.jar.

NOTE: This tutorial runs Spark in local mode, so -master local is specified in the steps below. To run in cluster mode, start the job with the spark-submit command instead of java -jar.

Step 1 : Partition data into parts

Clustering is a computationally expensive job!

So it is better to partition the dataset and select only the interesting documents to cluster. For instance, there is no point in trying to cluster images and other non-HTML web pages for structure and style. This step partitions

mkdir test
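
The idea behind this step, keeping only HTML-like documents for clustering, can be illustrated with a minimal Python sketch. It is not how the autoext-spark jar works internally: the sample records, the `HTML_TYPES` set, and the helper names `partition_by_type` and `html_only` are all hypothetical and exist only to show grouping documents by content type before clustering.

```python
from collections import defaultdict

# Hypothetical records: (url, content_type) pairs standing in for crawled documents.
# Real Nutch segments are read by the jar itself; this is only an illustration.
records = [
    ("http://example.com/index.html", "text/html; charset=UTF-8"),
    ("http://example.com/logo.png", "image/png"),
    ("http://example.com/about", "application/xhtml+xml"),
    ("http://example.com/data.pdf", "application/pdf"),
]

# Assumption: these MIME types mark pages worth clustering by structure and style.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def partition_by_type(docs):
    """Group URLs by bare MIME type (parameters such as charset are dropped)."""
    parts = defaultdict(list)
    for url, content_type in docs:
        mime = content_type.split(";", 1)[0].strip().lower()
        parts[mime].append(url)
    return dict(parts)


def html_only(parts):
    """Return the URLs from the partitions that hold HTML-like documents."""
    return [url for mime in sorted(parts) if mime in HTML_TYPES for url in parts[mime]]


parts = partition_by_type(records)
candidates = html_only(parts)
print(candidates)
```

In this sketch the image and PDF entries land in their own partitions and are never passed on, so the expensive clustering stage would only see the HTML-like pages.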
