Moved to https://github.com/uscdataScience/autoextractor

Auto Extractor

An intelligent extractor library that learns the structure of the input web pages and then works out a strategy for scraping their structured content.

NOTE: The project is under active development; as a result, this README is out of sync with the codebase.

TODO: update this file with the description of all new features.

Example Usage:

1. Structural Similarity Between HTML/XML documents

$ mvn clean compile package
$ java -cp target/autoextractor-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.autoext.tree.ZSTEDComputer \
        -dir src/test/resources/html/simple/

#Index  File Path
0       /home/tg/work/projects/oss/autoextractor/src/test/resources/html/simple/3.html
1       /home/tg/work/projects/oss/autoextractor/src/test/resources/html/simple/2.html
2       /home/tg/work/projects/oss/autoextractor/src/test/resources/html/simple/1.html

#Similarity Matrix
0.000000        13.000000       10.000000       
13.000000       0.000000        3.000000        
10.000000       3.000000        0.000000
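
The diagonal of the matrix above is zero, which suggests the values are edit distances (lower means more similar) rather than similarity scores. To make the idea concrete, here is a minimal sketch of an ordered tree edit distance. It is not the code behind ZSTEDComputer: it uses the plain recursive forest formulation with unit costs for insert, delete and relabel (an assumption), and it runs in exponential time, whereas the Zhang and Shasha algorithm listed in the references reaches the same kind of result with dynamic programming. The class TedSketch and its Node type are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TedSketch {

    /** A labeled, ordered tree node; a stand-in for an HTML/XML element. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    static Node node(String label, Node... children) {
        return new Node(label, children);
    }

    /** Total number of nodes in a forest. */
    static int size(List<Node> forest) {
        int n = 0;
        for (Node t : forest) {
            n += 1 + size(t.children);
        }
        return n;
    }

    /** Edit distance between two ordered forests, unit costs (assumed). */
    static int distance(List<Node> f, List<Node> g) {
        if (f.isEmpty()) return size(g); // insert everything in g
        if (g.isEmpty()) return size(f); // delete everything in f

        Node v = f.get(f.size() - 1);    // rightmost root of f
        Node w = g.get(g.size() - 1);    // rightmost root of g
        List<Node> fRest = f.subList(0, f.size() - 1);
        List<Node> gRest = g.subList(0, g.size() - 1);

        // Removing a root promotes its children to roots in its place.
        List<Node> fWithoutV = new ArrayList<>(fRest);
        fWithoutV.addAll(v.children);
        List<Node> gWithoutW = new ArrayList<>(gRest);
        gWithoutW.addAll(w.children);

        int delete = distance(fWithoutV, g) + 1;
        int insert = distance(f, gWithoutW) + 1;
        int match = distance(v.children, w.children)
                + distance(fRest, gRest)
                + (v.label.equals(w.label) ? 0 : 1); // relabel cost

        return Math.min(delete, Math.min(insert, match));
    }

    static int distance(Node a, Node b) {
        return distance(Arrays.asList(a), Arrays.asList(b));
    }

    public static void main(String[] args) {
        Node twoParagraphs = node("html", node("body", node("p"), node("p")));
        Node oneParagraph = node("html", node("body", node("p")));
        System.out.println(distance(twoParagraphs, oneParagraph)); // prints 1
    }
}
```

The two toy pages differ by a single `p` element, so one delete is enough. Real pages need the dynamic-programming version, since this recursion revisits the same sub-forests many times.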

2. Clustering based on style and structure

# Build
$ mvn clean package

# Print usage (running without arguments lists the options)
$ java -cp target/autoextractor-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.autoext.cluster.FileClusterer
    Option "-list" is required
     -list FILE    : path to a file containing paths to html files that requires
                     clustering
     -workdir FILE : Path to directory to create intermediate files and reports

# Create the input list of HTML files
$ find src/test/resources/html/simple/ -type f > list.txt

# Cluster
$ java -cp target/autoextractor-0.1-SNAPSHOT-jar-with-dependencies.jar edu.usc.irds.autoext.cluster.FileClusterer \
        -list list.txt  -workdir out

# Report 
$ cat out/report.txt

# Similarity Matrix
$ cat out/gross-sim.csv

# Clusters
$ cat out/clusters.txt 
    ##Total Clusters:2
    
    #Cluster:0
    src/test/resources/html/simple/3.html
    
    #Cluster:1
    src/test/resources/html/simple/2.html
    src/test/resources/html/simple/1.html
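
The references list Jarvis and Patrick's shared near neighbor clustering, so the sketch below illustrates that idea on the 3x3 matrix from example 1. It is an illustration under assumptions, not the FileClusterer implementation: the class name SnnSketch, the parameters k (neighbor list size) and minShared (required overlap), and the choice to count each point as its own nearest neighbor are all hypothetical here, and the project may combine style and structure scores differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SnnSketch {

    /** The k nearest points to i by distance; i counts as its own nearest neighbor. */
    static Set<Integer> nearest(double[][] dist, int i, int k) {
        Integer[] order = new Integer[dist.length];
        for (int j = 0; j < order.length; j++) {
            order[j] = j;
        }
        Arrays.sort(order, (a, b) -> Double.compare(dist[i][a], dist[i][b]));
        return new HashSet<>(Arrays.asList(order).subList(0, Math.min(k, order.length)));
    }

    /** Root of x in a simple union-find forest. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            x = parent[x];
        }
        return x;
    }

    /**
     * Merges i and j when each is in the other's neighbor list and the two
     * lists share at least minShared entries. Returns one label per point.
     */
    static int[] cluster(double[][] dist, int k, int minShared) {
        int n = dist.length;
        List<Set<Integer>> nn = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            nn.add(nearest(dist, i, k));
        }

        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }

        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!nn.get(i).contains(j) || !nn.get(j).contains(i)) {
                    continue; // not mutual neighbors
                }
                Set<Integer> shared = new HashSet<>(nn.get(i));
                shared.retainAll(nn.get(j));
                if (shared.size() >= minShared) {
                    parent[find(parent, j)] = find(parent, i);
                }
            }
        }

        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            labels[i] = find(parent, i);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Distances copied from the matrix in example 1 (index 0 = 3.html, 1 = 2.html, 2 = 1.html).
        double[][] dist = {{0, 13, 10}, {13, 0, 3}, {10, 3, 0}};
        System.out.println(Arrays.toString(cluster(dist, 2, 2))); // prints [0, 1, 1]
    }
}
```

With k = 2 and minShared = 2, files 2.html and 1.html land in one cluster and 3.html stays alone, which agrees with the clusters.txt output above. Other parameter values can give a different grouping.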

Developers:

References:

K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing, vol. 18, no. 6, pp. 1245-1262, Dec. 1989.
R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. C-22, no. 11, pp. 1025-1034, Nov. 1973.

