forked from thammegowda/autoextractor
    
        
Home
Thamme Gowda edited this page Feb 2, 2016 · 7 revisions
Welcome to the Auto-Extractor wiki! Here you will find information related to Auto-Extractor.
- Clustering the web pages based on style and structure
- Scalable on Apache Spark
- Auto extraction of content
- Port to MapReduce and thus plug into Apache Tika
Not documented yet. Look for FileCluster in the code.
This functionality is provided by the autoextractor-spark module.
- Build the autoextractor-spark module: mvn clean package
- Run: java -jar target/autoextractor-spark-*.jar
 Usage:
 -list VAL            : List of Nutch Segment(s) Part(s)
 -master VAL          : Spark master url (default: local[2])
 -sw (--sim-weight) N : weight used for aggregating structural and style
                        similarity measures.
                        Range : [0.0, 1.0] inclusive
                        Notes :
                        0.0 disables structural similarity and only style
                        similarity will be used (it is faster)
                        1.0 disables style similarity and thus only structural
                        similarity will be used
                        (default: 0.0)
 -workdir VAL         : Work directory.
- Put the segment content part paths in a file. For example, list.txt contains:
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00000/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003201017/content/part-00001/data
/ext/tg-ext/sites/old/batch1/batch1/segments/20151003221050/content/part-00000/data
- Run the job :
java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar -list list.txt -workdir out-4 -master local[4]
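The list file can be generated rather than typed by hand. The following is a minimal sketch, not part of the project: the function name make_segment_list is hypothetical, /path/to/segments is a placeholder, and it assumes the segments/<timestamp>/content/part-NNNNN/data layout seen in the example paths above.

```shell
#!/bin/sh
# Hypothetical helper (not part of autoextractor): collect the "data" file of
# every content part under a Nutch segments directory into a list file.
# Assumes the layout <segments_dir>/<timestamp>/content/part-NNNNN/data.
make_segment_list() {
    segments_dir="$1"   # directory that holds the timestamped segments
    list_file="$2"      # output file, one path per line
    # Match only content parts; other segment directories such as
    # crawl_fetch or parse_text are skipped, and so are "index" files.
    find "$segments_dir" -type f -path '*/content/part-*/data' | sort > "$list_file"
}

# Example usage (placeholder path; pass an absolute directory to get
# absolute paths in the list, as in the example above):
#   make_segment_list /path/to/segments list.txt
#   java -jar autoextractor-spark/target/autoextractor-spark-0.1-SNAPSHOT.jar \
#        -list list.txt -workdir out-4 -master 'local[4]'
```

Quoting 'local[4]' keeps the shell from treating the brackets as a glob pattern. To take structural similarity into account as well, add -sw with a value in [0.0, 1.0]; per the usage text, the default of 0.0 uses style similarity only.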