webcrawler

Initial version of a web crawler

up and running you must have both maven ( i used 3.3.9 ) and java ( i used 1.7 ) installed and configured for the path

once the source is downloaded navigate the the source folder (you should see a pom.xml file in that directory)

build it by executing the following mvn clean package

This will have created a directory

To execute

cd target
java -jar webcrawler-0.1-jar-with-dependencies.jar http://www.theregister.co.uk

this will start processing www.theregister.co.uk

usage and limitations The current version requires that you have a fully url - including the http:// portion This version is hard coded for a maximum of 50 pages - mainly because i tested against the bbc - which is a vast web site and debugging takes longer...

The code does NOT handle network errors - further work to have an "Error Page" is required.

Implementation details I wanted to avoid recusion - recusion is ok in scala / clojure - but Java does not have any tail call optimisation.

To achieve this a there is one i used the "Processor" object which has a mutating collection of completed pages and owns a UrlTracker object The "UrlTracker" class owns two mutating Sets which track progress along with a Queue where further requests are added and consumed. a good todo is to move the "pages" variable into the urlTracker - so ALL mutating variables are in a single object.

Yes - I used RegEx to parse html - I wanted to keep the number of external packages to a minimum the regex implemetation of Page_Parser is reasonably performant - a "proper" html parser would probably be less so.

I have not handled Proxy settings - anyone running this behind a corporate firewill will suffer. not a big job but time limited

ToDo currently static url's are not parsed - this is a simple of implementing in the PageParser ( SimplePageParser implementation ) vie an additional regex

Unit Tests - I started this in a TDD manner - this was extremely usefull when ensuring the more complex parts worked ( the page parser and Page classes ) I then broke most of the other class tests when during a major refactor ( removing and add new classes ) !! I made the code work quickly ( for a demo ) and removed those class tests ( along with the classes ) Additional unit tests need adding before any more work.

output should be configurable - currently hard coded as string buffer system.out the Processor class should know nothing about formatting output - currently it does !!

currently a stupid file name for Fat Jar - needs creating with a sensible name

needs a throttling mechanism - some web site will stop crawlers

robots.txt - no support for robots.txt- currently crawls all web links.

Make end use friendly - add HTTP:// to the command line

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
src		src
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

webcrawler

About

Uh oh!

Releases

Packages

Languages

csennitt/webcrawler

Folders and files

Latest commit

History

Repository files navigation

webcrawler

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages