arrrnav/IR-Index-Creation

IR-Index-Construction

Index Creation for Information Retrieval

Index construction: Create an inverted index for the given corpus, using data structures of your own design.

  • Tokens: all alphanumeric sequences in the dataset.
  • Stop words: do not remove stop words, i.e. index all words, even the frequently occurring ones.
  • Stemming: use stemming for better textual matches. Suggestion: Porter stemming.
  • Important words: words in bold, in headings (h1, h2, h3), and in titles should be treated as more important than other words.
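
The rules above can be sketched in Python with only the standard library. This is a minimal illustration, not the repository's actual implementation. The names `TokenExtractor` and `extract_tokens` are hypothetical, lowercasing tokens is an assumption, and treating `strong` as bold is an assumption as well. The `stem` parameter defaults to the identity function; a Porter stemmer (for example `PorterStemmer().stem` from NLTK) could be passed in instead.

```python
import re
from html.parser import HTMLParser

# Tokens are all alphanumeric sequences; nothing is dropped as a stop word.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

# Tags whose text counts as "important". Including "strong" is an assumption.
IMPORTANT_TAGS = {"b", "strong", "h1", "h2", "h3", "title"}


class TokenExtractor(HTMLParser):
    """Collects (token, is_important) pairs from an HTML document."""

    def __init__(self, stem=lambda word: word):
        super().__init__()
        self.stem = stem    # hypothetical hook for a stemmer such as Porter
        self.depth = 0      # number of currently open important tags
        self.tokens = []    # list of (token, is_important)

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        for match in TOKEN_RE.finditer(data):
            token = self.stem(match.group().lower())  # lowercasing is an assumption
            self.tokens.append((token, self.depth > 0))


def extract_tokens(html, stem=lambda word: word):
    """Return the (token, is_important) pairs found in an HTML string."""
    parser = TokenExtractor(stem)
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Note that this sketch also tokenizes the contents of `script` and `style` elements, which a real indexer would probably want to skip.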

Building the inverted index: Now that you have been provided with the HTML files to index, you can build your inverted index from them. The inverted index is simply a map from each token (the key) to a list of its postings. A posting represents the token's occurrence in one document. A posting typically contains, but is not limited to, the following information (you are encouraged to think of other attributes that you could add to the index):

  • The name or ID of the document the token was found in.
  • Its tf-idf score for that document (for MS1, add only the term frequency).
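
One possible shape for this structure is sketched below, assuming each document has already been reduced to a list of tokens. The names `Posting`, `build_index`, and `tf_idf` are hypothetical. The posting stores only the document ID and the term frequency, as MS1 requires. The tf-idf weighting shown, (1 + log10 tf) × log10(N / df), is one common variant and an assumption here, since the text does not specify a formula.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: str  # name or ID of the document the token was found in
    tf: int      # term frequency: occurrences of the token in that document


def build_index(docs):
    """Build a token -> list of Posting map from {doc_id: [token, ...]}."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token, tf in Counter(tokens).items():
            index[token].append(Posting(doc_id, tf))
    return dict(index)


def tf_idf(posting, postings, n_docs):
    """Assumed weighting: (1 + log10 tf) * log10(N / df)."""
    df = len(postings)  # document frequency: documents containing the token
    return (1 + math.log10(posting.tf)) * math.log10(n_docs / df)


# Usage: two tiny documents, already tokenized.
docs = {"a.html": ["index", "index", "token"], "b.html": ["token"]}
index = build_index(docs)
```

Keeping the raw term frequency in the posting means the tf-idf score can be computed later, once the total number of documents and each token's document frequency are known.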
