forked from thammegowda/autoextractor
Clustering Tutorial
Thamme Gowda edited this page Apr 6, 2016 · 8 revisions
      
Visit [Build](build-instructions) for instructions on building the sources to obtain an executable jar.
- Have crawl segments from Apache Nutch in a consistent format.
- Have the latest version of autoext-spark-xx-SNAPSHOT.jar.
`-master local` is specified in the steps below. To run in cluster mode, start the job with the `spark-submit` command instead of `java -jar`.
Clustering is a computationally expensive job!
So it is better to partition the dataset first and cluster only the interesting documents. For instance, there is no point in trying to cluster images and other non-HTML web pages by structure and style. This step partitions
mkdir test