Index Creation for Information Retrieval
Index construction: Create an inverted index for the given corpus, using data structures of your own design.
- Tokens: all alphanumeric sequences in the dataset.
- Stop words: do not remove stop words, i.e. use all words, even the frequently occurring ones.
- Stemming: use stemming for better textual matches. Suggestion: Porter stemming.
- Important words: Words in bold, in headings (h1, h2, h3), and in titles should be treated as more important than the other words.
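The token rules above can be sketched roughly as follows. This is only one possible approach, not a required design. It assumes tokens are lowercased (the rules above do not say so), that "bold" means the `b` and `strong` tags, and that a stemmer is passed in as a plain function. The names `tokenize`, `TokenCollector`, and `extract_tokens` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Alphanumeric sequences only; \w is avoided because it also matches "_".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

# Assumption: "bold" is read as <b> and <strong>.
IMPORTANT_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}


def tokenize(text, stem=lambda t: t):
    """Split text into lowercased alphanumeric tokens and stem each one.

    `stem` defaults to the identity function. A Porter stemmer can be
    plugged in instead, e.g. nltk.stem.PorterStemmer().stem if NLTK is
    installed. No stop words are removed.
    """
    return [stem(tok.lower()) for tok in TOKEN_RE.findall(text)]


class TokenCollector(HTMLParser):
    """Collects (token, is_important) pairs from an HTML document."""

    def __init__(self, stem=lambda t: t):
        super().__init__()
        self.stem = stem
        self.depth = 0    # number of important tags currently open
        self.tokens = []  # list of (token, is_important)

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        important = self.depth > 0
        for tok in tokenize(data, self.stem):
            self.tokens.append((tok, important))


def extract_tokens(html, stem=lambda t: t):
    """Parse one HTML string and return its (token, is_important) pairs."""
    parser = TokenCollector(stem)
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Note that this sketch also tokenizes the contents of `script` and `style` elements; whether to skip them is a design decision left open here.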
Building the inverted index: Now that you have been provided the HTML files to index, you can build your inverted index from them. The inverted index is simply a map with the token as the key and a list of its corresponding postings as the value. A posting is the representation of the token’s occurrence in a document. A posting typically contains (but is not limited to) the following information; you are encouraged to think of other attributes that you could add to the index:
- The name/ID of the document the token was found in.
- Its tf-idf score for that document (for MS1, add only the term frequency).
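A minimal sketch of such a map is shown below, assuming each document has already been reduced to a list of (token, is_important) pairs and is identified by an integer ID. The `Posting` fields, `build_index`, and `tf_idf` are hypothetical names, and `important_freq` is just one example of an extra attribute. The tf-idf formula used is one common log-weighted variant, chosen here as an assumption because the text above does not fix a particular formula.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: int              # document the token was found in
    term_freq: int           # occurrences of the token in that document
    important_freq: int = 0  # hypothetical extra: occurrences in bold/heading/title


def build_index(docs):
    """Build token -> list of Posting from {doc_id: [(token, is_important), ...]}.

    Documents are visited in ascending doc_id order, so every postings
    list ends up sorted by doc_id.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tf = Counter()
        imp = Counter()
        for token, important in docs[doc_id]:
            tf[token] += 1
            if important:
                imp[token] += 1
        for token, freq in tf.items():
            index[token].append(Posting(doc_id, freq, imp[token]))
    return dict(index)


def tf_idf(term_freq, doc_freq, num_docs):
    """Assumed variant: (1 + log10(tf)) * log10(N / df)."""
    return (1 + math.log10(term_freq)) * math.log10(num_docs / doc_freq)
```

Here the document frequency of a token is simply the length of its postings list, so the tf-idf score can be computed in a second pass once the whole corpus has been indexed.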