# Microblog Information Retrieval System
## Setting Up
### 1. Create and activate a virtual environment (recommended)
From the project root:
```bash
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# On Windows (PowerShell):
# .venv\Scripts\Activate.ps1
```

### 2. Install dependencies

With the virtual environment activated:

```bash
pip install nltk prettytable
```

### 3. Download NLTK data

You can let `main.py` download what it needs at runtime, or do it once manually:

```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('punkt_tab')"
```

### 4. Run the system

Once this is done, you can run:

```bash
python main.py
```

and it will generate `dist/Results.txt`.
The assets/ directory contains all the data required for this assignment:
- `tweet_list.txt` – Collection of tweets (documents).
- `stop_words.txt` – Additional stopword list.
- `test_queries.txt` – Set of 49 test queries.
- `Trec_microblog11-qrels.txt` – Relevance judgments for evaluation.
The dist/ directory is used for outputs:
- `Results.txt` – Ranked retrieval results for all 49 queries in TREC format.
- `trec_eval.txt` – Evaluation summary produced by `trec_eval` comparing `Results.txt` to `Trec_microblog11-qrels.txt`.
Example snippet from `Results.txt`:

```
Topic_id Q0 docno rank score tag
1 Q0 30198105513140224 1 0.588467208018523 myRun
1 Q0 30260724248870912 2 0.5870127971399565 myRun
1 Q0 32229379287289857 3 0.5311552466369023 myRun
```
Once the prerequisites are satisfied, run:

```bash
python3 main.py
```

This generates `dist/Results.txt` in the format shown above.
To evaluate the retrieval effectiveness:
- Run `eval.sh`. Produces `trec_eval.txt`, which reports overall performance metrics across all queries.
- Run `full-eval.sh`. Produces `trec_eval_all_query.txt`, which reports detailed `trec_eval` metrics for each individual query.
The assignment is to implement an Information Retrieval (IR) system for a collection of microblog documents (tweets). At a high level, the system:
1. **Loads and preprocesses data**
   - Imports tweets and test queries from the `assets/` directory.
   - Converts text into a Python-friendly representation (primarily dictionaries).
   - Applies tokenization, stopword removal, and stemming (Porter stemmer) to normalize terms.

2. **Builds an inverted index**
   - Constructs a term → document posting list.
   - Computes `idf` for each term and `tf-idf` for each term in each tweet.
   - Stores these values in the inverted index structure.

3. **Represents queries and ranks documents**
   - Computes query term weights (including tf-idf).
   - Uses document and query vectors to compute cosine similarity (`CosSim`) between each query and tweet.
   - Produces a per-query ranking of tweets in descending order of similarity.

4. **Writes results to disk**
   - Formats the ranked results according to the assignment's required TREC-style layout.
   - Writes them to `dist/Results.txt`.
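Concretely, the similarity in step 3 is the standard vector-space cosine measure over tf-idf weights (the base-2 logarithm shown here is an assumption; the code may use a different base):

```math
w_{t,d} = \mathrm{tf}_{t,d}\cdot\log_2\frac{N}{\mathrm{df}_t},
\qquad
\mathrm{CosSim}(q,d)=\frac{\sum_{t} w_{t,q}\,w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}}\,\sqrt{\sum_{t} w_{t,d}^{2}}}
```

where `N` is the number of tweets in the collection and `df_t` is the number of tweets containing term `t`.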
The implementation follows a standard vector space model with cosine similarity over tf-idf–weighted vectors. The core logic is split across several Python modules.
- Orchestrates the entire pipeline:
  - Imports and preprocesses tweets and queries.
  - Builds the inverted index.
  - Computes document vector lengths.
  - Calls the retrieval function to rank documents.
  - Triggers result file creation.
- Prints progress messages to indicate major stages (preprocessing, retrieval, result writing).
Implements Step 1: Preprocessing and Step 2: Indexing.
Key functions:
- `isNumeric(subject)` – Checks whether a string represents a numeric value.
- `importTweets()` – Reads `tweet_list.txt`, applies `filterSentence()` to each tweet, and returns a dictionary mapping tweet IDs to token lists.
- `importQuery()` – Reads `test_queries.txt`, extracts `<title>` entries, applies `filterSentence()`, and returns a dictionary mapping query IDs to token lists.
- `filterSentence(sentence)`
  - Builds a combined stopword set from:
    - NLTK's English stopwords,
    - custom stopwords (e.g., URLs and abbreviations),
    - the provided `stop_words.txt`.
  - Tokenizes the input sentence.
  - Removes stopwords, punctuation, and numeric tokens.
  - Applies Porter stemming to each remaining token.
  - Returns the cleaned, stemmed token list.
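A minimal sketch of this preprocessing chain (the stopword set below is a tiny stand-in for the combined NLTK + `stop_words.txt` set, the tokenizer is a plain regex over alphanumeric runs, and the snake_case names are illustrative, not the actual function names):

```python
import re
from nltk.stem import PorterStemmer

# Stand-in stopword set; the real system merges NLTK's English stopwords
# with assets/stop_words.txt and custom entries (URLs, abbreviations).
STOPWORDS = {"the", "a", "of", "rt", "http", "https"}

stemmer = PorterStemmer()

def is_numeric(subject):
    """Rough equivalent of isNumeric(): True if the string parses as a number."""
    try:
        float(subject)
        return True
    except ValueError:
        return False

def filter_sentence(sentence):
    """Sketch of filterSentence(): tokenize on alphanumeric runs,
    drop stopwords and numeric tokens, then Porter-stem what remains."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return [stemmer.stem(t) for t in tokens
            if t not in STOPWORDS and not is_numeric(t)]

print(filter_sentence("BBC World Service cutting 650 staff"))
```

Tokenizing with `[A-Za-z0-9]+` discards punctuation as a side effect, which is why no separate punctuation-removal pass appears in the sketch.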
- `buildIndex(documents)`
  - Iterates over all preprocessed tweets.
  - For each term, builds a posting list: `term -> {doc_id: tf}`.
  - Computes `idf` for each term.
  - Converts term frequencies to term weights (tf-idf) and stores them back into the inverted index.
- `lengthOfDocument(index, tweets)`
  - Computes the L2 norm (vector length) for each document based on its term weights.
  - Returns `doc_id -> length`.
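The two indexing functions can be sketched as follows (a minimal sketch: the snake_case names are illustrative, and the base-2 idf is an assumption about the actual implementation):

```python
import math
from collections import Counter

def build_index(documents):
    """Sketch of buildIndex(): build term -> {doc_id: tf}, then convert
    raw counts to tf-idf weights in place."""
    index = {}
    for doc_id, tokens in documents.items():
        for term, tf in Counter(tokens).items():
            index.setdefault(term, {})[doc_id] = tf
    n_docs = len(documents)
    for postings in index.values():
        idf = math.log2(n_docs / len(postings))  # log base 2 is an assumption
        for doc_id in postings:
            postings[doc_id] *= idf
    return index

def length_of_document(index, documents):
    """Sketch of lengthOfDocument(): L2 norm of each document's weight vector."""
    lengths = dict.fromkeys(documents, 0.0)
    for postings in index.values():
        for doc_id, w in postings.items():
            lengths[doc_id] += w * w
    return {d: math.sqrt(s) for d, s in lengths.items()}

docs = {"t1": ["bbc", "staff", "cut"], "t2": ["bbc", "budget"], "t3": ["super", "bowl"]}
index = build_index(docs)
print(length_of_document(index, docs))
```

Note that a term appearing in every document gets idf 0 and therefore contributes nothing to any score, which is the intended behavior of tf-idf weighting.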
Implements the retrieval and ranking logic:

- Computes query term statistics and weights.
- Uses the inverted index and precomputed document lengths to compute cosine similarity between each query and each candidate document.
- Produces a ranked dictionary for each query: `query_id -> {doc_id: score}`, sorted by score in descending order.
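A sketch of the ranking step, assuming an index of tf-idf weights (`term -> {doc_id: weight}`) and precomputed document lengths; the function name and the tiny hand-built weights are illustrative, not taken from the actual code:

```python
import math
from collections import Counter

def rank_documents(query_tokens, index, doc_lengths, n_docs):
    """Score every candidate document by cosine similarity to the query."""
    # Query term weights: tf * idf, reusing the document frequencies in the index.
    q_weights = {t: tf * math.log2(n_docs / len(index[t]))
                 for t, tf in Counter(query_tokens).items() if t in index}
    q_len = math.sqrt(sum(w * w for w in q_weights.values())) or 1.0
    # Accumulate dot products term by term via the posting lists.
    scores = {}
    for term, w_q in q_weights.items():
        for doc_id, w_d in index[term].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w_q * w_d
    # Normalize and sort descending by score.
    return dict(sorted(((d, s / (doc_lengths[d] * q_len)) for d, s in scores.items()),
                       key=lambda item: item[1], reverse=True))

# Tiny hand-built example (weights are illustrative, not real tf-idf values).
index = {"bbc": {"t1": 0.6, "t2": 0.6}, "cut": {"t1": 1.6}}
lengths = {"t1": 2.0, "t2": 1.0}
print(rank_documents(["bbc", "cut"], index, lengths, n_docs=3))
```

Only documents sharing at least one term with the query ever enter `scores`, so documents with zero overlap are never ranked; this matches the posting-list-driven design described above.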
Implements Step 4: Output Generation:

- Takes the ranked results.
- Formats them using PrettyTable into the TREC-like layout.
- Writes the final table to `dist/Results.txt`.

PrettyTable is:

- Used to format `Results.txt` into aligned, human-readable columns.
- Helps ensure the output matches the expected text-based table format.
- PorterStemmer – Used in `filterSentence()` to normalize tokens by reducing words to their stems, improving matching between queries and documents.
- Stopwords – Used to remove high-frequency, low-information terms that usually do not contribute to relevance.
- Tokenizer – Used in `filterSentence()` to split sentences into tokens (alphanumeric sequences separated by non-alphanumeric characters), as required by the assignment's preprocessing step.
The system was evaluated with `trec_eval` by comparing `dist/Results.txt` to the provided relevance file `Trec_microblog11-qrels.txt`. The following is an excerpt of the overall evaluation:
```
runid all myRun
num_q all 49
num_ret all 39091
num_rel all 2640
num_rel_ret all 2054
map all 0.1634
gm_map all 0.0919
Rprec all 0.1856
bpref all 0.1465
recip_rank all 0.3484
iprec_at_recall_0.00 all 0.4229
iprec_at_recall_0.10 all 0.3001
iprec_at_recall_0.20 all 0.2653
iprec_at_recall_0.30 all 0.2195
iprec_at_recall_0.40 all 0.2025
iprec_at_recall_0.50 all 0.1770
iprec_at_recall_0.60 all 0.1436
iprec_at_recall_0.70 all 0.1230
iprec_at_recall_0.80 all 0.1027
iprec_at_recall_0.90 all 0.0685
iprec_at_recall_1.00 all 0.0115
P_5 all 0.1714
P_10 all 0.1796
P_15 all 0.1796
P_20 all 0.1776
P_30 all 0.1714
P_100 all 0.1406
P_200 all 0.1133
P_500 all 0.0713
P_1000 all 0.0419
```
Overall, the system achieves:
- MAP ≈ 0.1634
- P@10 ≈ 0.1796
After correcting and optimizing the retrieval logic (especially around query weighting and cosine similarity calculations), performance improved relative to earlier buggy versions. Manual inspection of several queries indicates that the ranked results are generally relevant and make sense qualitatively.
For example, the top-10 results for queries 3 and 20:

```
3 Q0 32333726654398464 1 0.69484735460699 myRun
3 Q0 32910196598636545 2 0.6734426036041226 myRun
3 Q0 35040428893937664 3 0.5424091725376433 myRun
3 Q0 35039337598947328 4 0.5424091725376433 myRun
3 Q0 29613127372898304 5 0.5233927588038552 myRun
3 Q0 29615296666931200 6 0.5054085301107222 myRun
3 Q0 32204788955357184 7 0.48949945859699995 myRun
3 Q0 33711164877701120 8 0.47740062368197117 myRun
3 Q0 33995136060882945 9 0.47209559331399364 myRun
3 Q0 31167954573852672 10 0.47209559331399364 myRun
20 Q0 33356942797701120 1 0.8821317020383918 myRun
20 Q0 34082003779330048 2 0.7311611336720092 myRun
20 Q0 34066620821282816 3 0.7311611336720092 myRun
20 Q0 33752688764125184 4 0.7311611336720092 myRun
20 Q0 33695252271480832 5 0.7311611336720092 myRun
20 Q0 33580510970126337 6 0.7311611336720092 myRun
20 Q0 32866366780342272 7 0.7311611336720092 myRun
20 Q0 32269178773708800 8 0.7311611336720092 myRun
20 Q0 32179898437218304 9 0.7311611336720092 myRun
20 Q0 31752644565409792 10 0.7311611336720092 myRun
```
The final vocabulary size is 88,422 tokens.
Below is a sample of 100 tokens from the vocabulary:
```
['bbc', 'world', 'servic', 'staff', 'cut', 'fifa', 'soccer', 'haiti', 'aristid', 'return', 'mexico', 'drug', 'war',
'diplomat', 'arrest', 'murder', 'phone', 'hack', 'british', 'politician', 'toyota', 'reca', 'egyptian', 'protest',
'attack', 'museumkubica', 'crash', 'assang', 'nobel', 'peac', 'nomin', 'oprah', 'winfrey', 'half-sist', 'known',
'unknown', 'white', 'stripe', 'breakup', 'william', 'kate', 'fax', 'save-the-da', 'cuomo', 'budget', 'super', 'bowl',
'seat', 'tsa', 'airport', 'screen', 'unemploymen', 'reduc', 'energi', 'consumpt', 'detroit', 'auto', 'global', 'warm',
'weather', 'keith', 'olbermann', 'job', 'special', 'athlet', 'state', 'union', 'dog', 'whisper', 'cesar', 'millan',
"'s", 'techniqu', 'msnbc', 'rachel', 'maddow', 'sargent', 'shriver', 'tribut', 'moscow', 'bomb', 'gifford', 'recoveri',
'jordan', 'curfew', 'beck', 'piven', 'obama', 'birth', 'certifica', 'campaign', 'social', 'media', 'veneta', 'organ',
'farm', 'requir', 'evacu', 'carbon', 'monoxid']
```