IR-Index-Construction

Index Creation for Information Retrieval

Index construction: Create an inverted index for the given corpus, using data structures of your own design.

Tokens: all alphanumeric sequences in the dataset.
Stop words: do not remove stop words, i.e. index all words, even the frequently occurring ones.
Stemming: use stemming for better textual matches. Suggestion: Porter stemming.
Important words: Words in bold, in headings (h1, h2, h3), and in titles should be treated as more important than the other words.
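
The token rules above can be sketched as follows. This is a minimal illustration, not the repository's actual indexer: the names `tokenize`, `ImportanceParser`, and `parse_html` are hypothetical, lowercasing and ASCII-only matching are assumptions, and `strong` is treated as bold alongside `b`. The `stem` parameter defaults to the identity function so the sketch runs with the standard library alone; a Porter stemmer (for instance the `stem` method of NLTK's `PorterStemmer`) could be passed in its place.

```python
import re
from html.parser import HTMLParser

# Assumption: "alphanumeric" means ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")
# Tags whose text counts as important (bold, headings, title).
IMPORTANT_TAGS = {"b", "strong", "h1", "h2", "h3", "title"}


def tokenize(text, stem=lambda t: t):
    """Split text into lowercased alphanumeric tokens and stem each one."""
    return [stem(tok.lower()) for tok in TOKEN_RE.findall(text)]


class ImportanceParser(HTMLParser):
    """Collect (token, is_important) pairs from an HTML document."""

    def __init__(self, stem=lambda t: t):
        super().__init__()
        self.stem = stem
        self.depth = 0    # number of currently open important tags
        self.tokens = []  # list of (token, is_important)

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        important = self.depth > 0
        for tok in tokenize(data, self.stem):
            self.tokens.append((tok, important))


def parse_html(html, stem=lambda t: t):
    """Return every token in the document with its importance flag."""
    parser = ImportanceParser(stem)
    parser.feed(html)
    parser.close()
    return parser.tokens
```

No stop-word filtering is applied, so every token is kept. Text inside `script` or `style` elements would also be tokenized by this sketch; a real indexer may want to skip it.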

Building the inverted index: Now that you have been provided the HTML files to index, you can build your inverted index from them. The inverted index is simply a map with the token as the key and a list of its postings as the value. A posting represents the token’s occurrence in a document. It typically contains, but is not limited to, the following information (you are encouraged to think of other attributes that you could add to the index):

The document name/id the token was found in.
Its tf-idf score for that document (for MS1, add only the term frequency).
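
The map-of-postings structure described above could look like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: `Posting`, `build_index`, and `tf_idf` are hypothetical names, the `important` counter is one possible extra attribute, and the input is assumed to be already tokenized as `(token, is_important)` pairs per document. The `tf_idf` helper uses one common log-weighted variant; other weightings are equally valid.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: str         # document name/id the token was found in
    tf: int             # raw term frequency in that document
    important: int = 0  # occurrences inside bold/heading/title text


def build_index(docs):
    """Build token -> list of postings.

    docs maps doc_id -> list of (token, is_important) pairs.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):  # sorted so postings are ordered by doc_id
        tf = Counter()
        imp = Counter()
        for token, is_important in docs[doc_id]:
            tf[token] += 1
            if is_important:
                imp[token] += 1
        for token, count in tf.items():
            index[token].append(Posting(doc_id, count, imp[token]))
    return dict(index)


def tf_idf(posting, df, n_docs):
    """Log-weighted tf-idf: (1 + log10(tf)) * log10(N / df)."""
    return (1 + math.log10(posting.tf)) * math.log10(n_docs / df)
```

For a first milestone that stores only term frequency, `build_index` alone is enough. The document frequency `df` needed by `tf_idf` is the length of a token's postings list, so the score can be computed in a later pass once the whole corpus has been indexed.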


arrrnav/IR-Index-Creation
