billstark/ranked-retrieval
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
This is the README file for A0135817B and A0147995H's submission
Email address:
e0011848@u.nus.edu
e0012667@u.nus.edu
== Python Version ==
We're using Python Version 2.7.14 for this assignment.
== General Notes about this assignment ==
## Indexing:
Procedure:
1. For every training file, load the whole content into the system and tokenize into words
2. Remove invalid characters of the tokenized word
3. Count the term in a dictionary mapping document id to a term frequency counter
4. Write postings into the posting file
5. Write terms into dictionary file with offset to the corresponding postings.
Notes:
- We choose not to use sentence tokenizer because we are handle invalid punctuation by keeping only
digits, alphabets, white spaces and dashes
Searching:
- The entire dictionary is read into memory
- Each query is tokenized using the same method as for indexing
- For each query term, we calculate their weights using 1 + log(tf)
- For every document that any of the query terms appear in, we calculate the total score using the query weight
and the document tf-idf
- We collect the top 10 documents using a priority queue and return the result
== Files included with this submission ==
common.py - includes the tokenization function which is used for both index and search
index.py - indexing program that will be run to index all the training files.
search.py - searching program that will be used to execute queries in a specific file and give output.
dictionary.txt - a dictionary mapping terms to their location in the postings file
postings.txt - includes document IDs for each term
== Statement of individual work ==
Please initial one of the following statements.
[X] We, A0135817B and A0147995H, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments. In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.
[ ] I, A0000000X, did not follow the class rules regarding homework
assignment, because of the following reason:
<Please fill in>
I suggest that I should be graded as follows:
<Please fill in>
== References ==
- tf-idf