comp479-project/report.txt at master · DrunkDutch/comp479-project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
Final Report COMP-479

Devin Mens 26290515
Caio Paiva 27339887
Eric Gagnon 27387474

Scrapping: Caio Paiva
Indexing/Query: Devin Mens
Report: Eric Gagnon

Web Crawler Section

Running the Crawler:
python crawler.py

Technologies Used:
- Scrapy: Web Crawling Framework for Python (https://scrapy.org/)
- Beautiful Soup 4: HTML text extraction

The Web crawling can be done by running the crawler.py program. We use the 'crawlerProperties.json'
to define the arguments to the spider crawling the web, argument include:
    1. spiderStartUrl: list<string>: A list with the starting urls for the crawl.
    2. spiderResult: string: The name of the folder were the corpus will be downloaded.
    3. spiderDocumentPrefix: string: The prefix for the incremental name of documents downloaded, files will be placed inside the folder specified in the argument above.
    4. spiderDocumentFetchLimit: Integer: The upper limit of documents that will be fetched when crawling.
    5. spiderTimeOut: Integer: A Timeout in seconds for the crawling.
The spider will stop crawling if 1. The document limit is reached, 2. The timeout is reached, or 3. The spider runs out of URLs to crawl.
Robot.txt expecification are followed when using the API by running the Scrapy CrawlerProcess with the 'ROBOTSTXT_OBEY' settings flags set
to True as specified in the Libraries Documentation (https://doc.scrapy.org/en/1.1/topics/settings.html#robotstxt-obey).
The spider 'sctrach.py' is responsable for the crawling of webpages, extract their text and links, and generate the corpus, composed of
the content of the page and the url so we can reaccess that page after a query.
A black list is also defined as to avoid crawling Websites that will not give us much information such as Social medias.


Indexing Section

To run the indexer against a given corpus, the corpus must be located in a directory named "Corpus" located directly adjacent
to the comp479-core folder that contains the code. The code is run by calling the spimi.py script with the desired parameters
as specified by its help command.

Example: python spimi.py -S 2000 -s -d -c -m

This would run the indexer with a memory blocksize of 2000KB, with stemming, case-folding, and stopword and digit removal.

Brief Information
1000 Documents
5745KB index using stemming, case folding, stopword and digit removal
12KB Meta-Data file
18KB sentiment hash-map
1321488 Tokens
1321 Average Document Length
63600 Unique Terms in Index

The indexing section of this project was taken from Devin Mens' submission for assignment 2 for this class with his permission (Team member).
The following modifications were made to meet the requirements as stipulated for this project:
    - The Document class in the "core" file was pared down due to the new simplicity of the corpus documents
        * New corpus documents consist of JSON objects containing only the url and body of the page from scrapper (See above)
    - Addition of a streamlined meta-data class to be serialized and stored as a file
        * Contains:
            ** Number of documents
            ** Token count
            ** Average document length
            ** Total corpus sentiment score
            ** hashmap/dictionary containing the following information for each document, keyed to the document id
                *** document id, document length, document sentiment score

        This was done to allow for easy sentiment analysis of the document when performing the query analysis later on. It was determined that
        to not do this would require recreating these values from the inverted index would be a huge performance impact and the effort of traversing
        the entire index would remove the benefit of using the index for query term matching later on. The other option was to include this information
        within the inverted index, however, including it in such a way that does not alter the original layout and design of the index (ie: including these values
        with the docId for each posting) would lead to an almost exponential increase in storage requirements and increased memory usage when processing queries.
        As seen from the information provided at the top of this section, the meta data as a seperate file takes 1/10 of a percent of the storage space compared to the index.
        This also allows us to store the metadata using a hashmap to allow for highly performant lookups to be made against the documentId.

        A similar serialized hashmap was also used to store the aFINN sentiment dictionary, this reduces storage space requirements by roughly half and allows for sentiment mapping to terms to
        be down much faster than searching the dictionary on a term by term basis at each time. This also allows for reduced overhead when performing sentiment
        analysis on the user queries at later dates.

        In regards to the enhanced inverted index, we created a new Term class that would allow for easy storage of sentiment value for a given term in the inverted index. This value is stored
        as the first value immediatly after the term in the index to remove even the need for lookups at later dates.

Query Section

To run a query against the index, it is done by simply calling the query.py script with the desired parameters as shown below:
python query.py -o -q "student strike" -s -d -c -m

This runs the script using the OR flag against the query "student strike" using stemming, case-folding and stopword and digit removal.
This script will then output to console the resulting documents with their score and ranking and outputs the results as text files
in the ./output folder located in the comp479-core directory

Similar to the indexing section, only a few minor changes were required to get the existing Query Handler provided by Devin Mens
from his second assignment to work for this project.
First change:
    Implementation of loading the serialized sentiment hashmap from the indexing section as an attribute of the QueryHandler class
        Used to calculate sentiment score of entire query
Second Change:
    Scoring of a given document relevant to a query has been modified to take the BM25 score for each term used in assignment 2 and multiplying it
       by the total sentiment score of the document
            This is done to ensure that while sentiment value of a document is of primary importance for the ranking of documents
            it still requires the document to have an appropriate term relevancy related to the query
    The order of the returned documents is also done based on query sentiment, with neutral or positive queries being returned in descending order
    while negative queries are returned in ascending order.


Observations

While adding basic sentiment analysis does allow for more relevent results to be presented to the user based on similarity of sentiment,
we feel as though there must be some better way to incorporate sentiment scores that would allow for a more complete evaluation of results for the user

We believe that it would also have been interesting to compare the results from this index-based approach with more vector based approaches in regards
to information retrieval, say using cosine similarity of document/query vectors or even cluster based systems using the same overall scoring mechanism.


Sample Queries and Information

All queries done using OR option as well as using stemming, stopword/digit removal and case folding

1)
funny youtube
score =4
202 results
result range: [-361, 1577]

2)
cold winter
score = 0
94 results
result range: [-410,566]

3)
Information Retrieval
score = 0
584 results
result range = [-659,459]