sreenidhithallam edited this page Dec 14, 2016 · 1 revision
Crawler-Service
Input: URL, Term, Intent, Domain
(This JSON is the output of the function generator)
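As an illustration of the input, here is a minimal Python sketch that parses one such JSON message. The lowercase key names (`url`, `term`, `intent`, `domain`) and the `CrawlRequest` and `parse_request` names are assumptions made for this example; the real message format is whatever the function generator emits.

```python
import json
from dataclasses import dataclass


@dataclass
class CrawlRequest:
    # Hypothetical container for the four input fields named above.
    url: str
    term: str
    intent: str
    domain: str


def parse_request(raw: str) -> CrawlRequest:
    """Parse one crawl request; the key names are assumed, not confirmed."""
    data = json.loads(raw)
    required = ("url", "term", "intent", "domain")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return CrawlRequest(*(data[key] for key in required))
```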
Send a GET request to the URL to fetch the data (this may be done by a library)
Note: (internet access is required; use a timeout; check and update the status of the URL)
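The fetch step could look roughly like the sketch below, which uses only the Python standard library. The `fetch` function, its status strings, and the injectable `opener` parameter are assumptions for illustration; the service may well use a different HTTP library.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch a page and report a status for the URL (status names are made up)."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return {"url": url, "status": "fetched", "body": body}
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 500.
        return {"url": url, "status": f"http_error_{err.code}", "body": ""}
    except (urllib.error.URLError, socket.timeout):
        # No internet, DNS failure, refused connection, or timeout.
        return {"url": url, "status": "unreachable", "body": ""}
```

The returned status is what would be written back when updating the status of the URL. Passing a fake `opener` makes the function testable without network access.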
Filter out unwanted content from the fetched data.
Stop words (we should be able to customise the stop-word list)
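A minimal sketch of the filtering step, assuming the fetched data is HTML and that "unwanted content" means markup, scripts, styles, and stop words. The tiny default stop-word list and the names `TextExtractor` and `clean_tokens` are placeholders; the `extra` argument shows one way the list could be customised.

```python
import re
from html.parser import HTMLParser

# Placeholder list; a real deployment would load a configurable list.
DEFAULT_STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_tokens(html, stopwords=DEFAULT_STOPWORDS, extra=()):
    """Return lowercase word tokens with markup and stop words removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    active = set(stopwords) | set(extra)
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in active]
```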
Search for the terms we are interested in and compute the term density of each
(The search can be configured to ignore case)
Search for synonyms as well (this improves the accuracy of crawling)
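The search step can be sketched as follows. The page does not define term density, so this example assumes it is the number of matching tokens divided by the total number of tokens. It also assumes single-word terms and a caller-supplied synonym map; `term_density` is a hypothetical name.

```python
def term_density(tokens, terms, synonyms=None, ignore_case=True):
    """Return {term: matches / total tokens}, counting synonyms as matches."""
    synonyms = synonyms or {}

    def norm(word):
        return word.lower() if ignore_case else word

    words = [norm(token) for token in tokens]
    total = len(words)
    result = {}
    for term in terms:
        # A token matches if it equals the term or one of its synonyms.
        variants = {norm(term)} | {norm(s) for s in synonyms.get(term, ())}
        hits = sum(1 for word in words if word in variants)
        result[term] = hits / total if total else 0.0
    return result
```

For example, with the eight tokens `["Java", "is", "a", "language", "java", "JVM", "code", "runs"]`, the term `java`, and the synonym `jvm`, the case-insensitive density is 3/8.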
Index the URL in Neo4j
Create the node for the web document
Create the relationships with the concept graph terms
(ensure that the concept term is related to the domain, then update the relationship property with the term density)
If any terms are found that are not related to the domain, create a relationship from them to the URL
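The indexing steps above can be sketched as a function that prepares parameterised Cypher statements. Every label and relationship type here (`WebDocument`, `Concept`, `Domain`, `Term`, `BELONGS_TO`, `DESCRIBED_BY`, `FOUND_IN`) and the `density` property name are assumptions, since the page does not describe the concept graph schema.

```python
# Hypothetical schema: labels, relationship types and property names are guesses.
MERGE_DOC = "MERGE (d:WebDocument {url: $url})"

LINK_CONCEPT = (
    "MATCH (d:WebDocument {url: $url}) "
    "MATCH (c:Concept {name: $term})-[:BELONGS_TO]->(:Domain {name: $domain}) "
    "MERGE (c)-[r:DESCRIBED_BY]->(d) "
    "SET r.density = $density"
)

LINK_UNRELATED = (
    "MATCH (d:WebDocument {url: $url}) "
    "MERGE (t:Term {name: $term}) "
    "MERGE (t)-[r:FOUND_IN]->(d) "
    "SET r.density = $density"
)


def build_index_statements(url, domain, densities, domain_terms):
    """Return (query, params) pairs for one crawled URL."""
    statements = [(MERGE_DOC, {"url": url})]
    for term, density in densities.items():
        if density <= 0:
            continue  # the term was not found on the page
        params = {"url": url, "term": term, "density": density}
        if term in domain_terms:
            # The MATCH on the domain ensures the concept belongs to it.
            statements.append((LINK_CONCEPT, {**params, "domain": domain}))
        else:
            statements.append((LINK_UNRELATED, params))
    return statements
```

Each pair could then be executed through a Neo4j driver session inside a single transaction, so that a document node is never left without its relationships.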
Go to MongoDB and update the document with the terms found, the terms not found, and the terms found that are not related to the domain
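A sketch of how the MongoDB update could be prepared. The field names (`termsFound`, `termsNotFound`, `termsFoundUnrelated`) and the choice to key the document by `url` are assumptions. So is the reading that "terms found" means domain terms with a non-zero density.

```python
def build_mongo_update(url, densities, domain_terms):
    """Return (filter, update) documents describing the crawl result."""
    found = sorted(t for t, d in densities.items() if d > 0 and t in domain_terms)
    not_found = sorted(t for t, d in densities.items() if d <= 0)
    unrelated = sorted(t for t, d in densities.items() if d > 0 and t not in domain_terms)
    update = {
        "$set": {
            "termsFound": found,               # hypothetical field name
            "termsNotFound": not_found,        # hypothetical field name
            "termsFoundUnrelated": unrelated,  # hypothetical field name
        }
    }
    return {"url": url}, update
```

With pymongo, the two returned documents could be passed as the filter and update arguments of `update_one` on whichever collection holds the crawl documents.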