# Microblog Information Retrieval System
## Setting Up
### 1. Create and activate a virtual environment (recommended)
From the project root:
```bash
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# On Windows (PowerShell):
# .venv\Scripts\Activate.ps1
```

### 2. Install dependencies

With the virtual environment activated:

```bash
pip install nltk prettytable
```

### 3. Download NLTK data

You can let `main.py` download what it needs at runtime, or do it once manually:

```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt'); nltk.download('punkt_tab')"
```

### 4. Run the system

Once this is done, you can run:

```bash
python main.py
```

and it will generate `dist/Results.txt`.
The assets/ directory contains all the data required for this assignment:
- `tweet_list.txt` – Collection of tweets (documents).
- `stop_words.txt` – Additional stopword list.
- `test_queries.txt` – Set of 49 test queries.
- `Trec_microblog11-qrels.txt` – Relevance judgments for evaluation.
The dist/ directory is used for outputs:
- `Results.txt` – Ranked retrieval results for all 49 queries in TREC format.
- `trec_eval.txt` – Evaluation summary produced by `trec_eval` comparing `Results.txt` to `Trec_microblog11-qrels.txt`.
Example snippet from `Results.txt`:

```
Topic_id Q0 docno rank score tag
1 Q0 30198105513140224 1 0.588467208018523 myRun
1 Q0 30260724248870912 2 0.5870127971399565 myRun
1 Q0 32229379287289857 3 0.5311552466369023 myRun
```
Once the prerequisites are satisfied, run:

```bash
python3 main.py
```

This generates `dist/Results.txt` in the format shown above.
To evaluate the retrieval effectiveness:
- Run `eval.sh`. Produces `trec_eval.txt`, which reports overall performance metrics across all queries.
- Run `full-eval.sh`. Produces `trec_eval_all_query.txt`, which reports detailed `trec_eval` metrics for each individual query.
The assignment is to implement an Information Retrieval (IR) system for a collection of microblog documents (tweets). At a high level, the system:
1. **Loads and preprocesses data**
   - Imports tweets and test queries from the `assets/` directory.
   - Converts text into a Python-friendly representation (primarily dictionaries).
   - Applies tokenization, stopword removal, and stemming (Porter stemmer) to normalize terms.

2. **Builds an inverted index**
   - Constructs a term → document posting list.
   - Computes `idf` for each term and `tf-idf` for each term in each tweet.
   - Stores these values in the inverted index structure.

3. **Represents queries and ranks documents**
   - Computes query term weights (including tf-idf).
   - Uses document and query vectors to compute cosine similarity (`CosSim`) between each query and tweet.
   - Produces a per-query ranking of tweets in descending order of similarity.

4. **Writes results to disk**
   - Formats the ranked results according to the assignment's required TREC-style layout.
   - Writes them to `dist/Results.txt`.
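Concretely, the similarity in step 3 is the standard vector-space cosine measure over tf-idf weights (the base-2 logarithm shown here is an assumption; the code may use a different base):

```math
w_{t,d} = \mathrm{tf}_{t,d}\cdot\log_2\frac{N}{\mathrm{df}_t},
\qquad
\mathrm{CosSim}(q,d)=\frac{\sum_{t} w_{t,q}\,w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}}\,\sqrt{\sum_{t} w_{t,d}^{2}}}
```

where `N` is the number of tweets in the collection and `df_t` is the number of tweets containing term `t`.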
The implementation follows a standard vector space model with cosine similarity over tf-idf–weighted vectors. The core logic is split across several Python modules.
- Orchestrates the entire pipeline:
  - Imports and preprocesses tweets and queries.
  - Builds the inverted index.
  - Computes document vector lengths.
  - Calls the retrieval function to rank documents.
  - Triggers result file creation.
- Prints progress messages to indicate major stages (preprocessing, retrieval, result writing).
Implements Step 1: Preprocessing and Step 2: Indexing.
Key functions:
- `isNumeric(subject)` – Checks whether a string represents a numeric value.
- `importTweets()` – Reads `tweet_list.txt`, applies `filterSentence()` to each tweet, and returns a dictionary mapping tweet IDs to token lists.
- `importQuery()` – Reads `test_queries.txt`, extracts `<title>` entries, applies `filterSentence()`, and returns a dictionary mapping query IDs to token lists.
- `filterSentence(sentence)`
  - Builds a combined stopword set from:
    - NLTK's English stopwords,
    - custom stopwords (e.g., URLs and abbreviations),
    - the provided `stop_words.txt`.
  - Tokenizes the input sentence.
  - Removes stopwords, punctuation, and numeric tokens.
  - Applies Porter stemming to each remaining token.
  - Returns the cleaned, stemmed token list.
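A minimal sketch of this preprocessing chain (the stopword set below is a tiny stand-in for the combined NLTK + `stop_words.txt` set, the tokenizer is a plain regex over alphanumeric runs, and the snake_case names are illustrative, not the actual function names):

```python
import re
from nltk.stem import PorterStemmer

# Stand-in stopword set; the real system merges NLTK's English stopwords
# with assets/stop_words.txt and custom entries (URLs, abbreviations).
STOPWORDS = {"the", "a", "of", "rt", "http", "https"}

stemmer = PorterStemmer()

def is_numeric(subject):
    """Rough equivalent of isNumeric(): True if the string parses as a number."""
    try:
        float(subject)
        return True
    except ValueError:
        return False

def filter_sentence(sentence):
    """Sketch of filterSentence(): tokenize on alphanumeric runs,
    drop stopwords and numeric tokens, then Porter-stem what remains."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return [stemmer.stem(t) for t in tokens
            if t not in STOPWORDS and not is_numeric(t)]

print(filter_sentence("BBC World Service cutting 650 staff"))
```

Tokenizing with `[A-Za-z0-9]+` discards punctuation as a side effect, which is why no separate punctuation-removal pass appears in the sketch.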
- `buildIndex(documents)`
  - Iterates over all preprocessed tweets.
  - For each term, builds a posting list: `term -> {doc_id: tf}`.
  - Computes `idf` for each term.
  - Converts term frequencies to term weights (tf-idf) and stores them back into the inverted index.
- `lengthOfDocument(index, tweets)`
  - Computes the L2 norm (vector length) for each document based on its term weights.
  - Returns `doc_id -> length`.
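The two indexing functions can be sketched as follows (a minimal sketch: the snake_case names are illustrative, and the base-2 idf is an assumption about the actual implementation):

```python
import math
from collections import Counter

def build_index(documents):
    """Sketch of buildIndex(): build term -> {doc_id: tf}, then convert
    raw counts to tf-idf weights in place."""
    index = {}
    for doc_id, tokens in documents.items():
        for term, tf in Counter(tokens).items():
            index.setdefault(term, {})[doc_id] = tf
    n_docs = len(documents)
    for postings in index.values():
        idf = math.log2(n_docs / len(postings))  # log base 2 is an assumption
        for doc_id in postings:
            postings[doc_id] *= idf
    return index

def length_of_document(index, documents):
    """Sketch of lengthOfDocument(): L2 norm of each document's weight vector."""
    lengths = dict.fromkeys(documents, 0.0)
    for postings in index.values():
        for doc_id, w in postings.items():
            lengths[doc_id] += w * w
    return {d: math.sqrt(s) for d, s in lengths.items()}

docs = {"t1": ["bbc", "staff", "cut"], "t2": ["bbc", "budget"], "t3": ["super", "bowl"]}
index = build_index(docs)
print(length_of_document(index, docs))
```

Note that a term appearing in every document gets idf 0 and therefore contributes nothing to any score, which is the intended behavior of tf-idf weighting.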
Implements the retrieval and ranking logic:

- Computes query term statistics and weights.
- Uses the inverted index and precomputed document lengths to compute cosine similarity between each query and each candidate document.
- Produces a ranked dictionary for each query: `query_id -> {doc_id: score}`, sorted by score in descending order.
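A sketch of the ranking step, assuming an index of tf-idf weights (`term -> {doc_id: weight}`) and precomputed document lengths; the function name and the tiny hand-built weights are illustrative, not taken from the actual code:

```python
import math
from collections import Counter

def rank_documents(query_tokens, index, doc_lengths, n_docs):
    """Score every candidate document by cosine similarity to the query."""
    # Query term weights: tf * idf, reusing the document frequencies in the index.
    q_weights = {t: tf * math.log2(n_docs / len(index[t]))
                 for t, tf in Counter(query_tokens).items() if t in index}
    q_len = math.sqrt(sum(w * w for w in q_weights.values())) or 1.0
    # Accumulate dot products term by term via the posting lists.
    scores = {}
    for term, w_q in q_weights.items():
        for doc_id, w_d in index[term].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w_q * w_d
    # Normalize and sort descending by score.
    return dict(sorted(((d, s / (doc_lengths[d] * q_len)) for d, s in scores.items()),
                       key=lambda item: item[1], reverse=True))

# Tiny hand-built example (weights are illustrative, not real tf-idf values).
index = {"bbc": {"t1": 0.6, "t2": 0.6}, "cut": {"t1": 1.6}}
lengths = {"t1": 2.0, "t2": 1.0}
print(rank_documents(["bbc", "cut"], index, lengths, n_docs=3))
```

Only documents sharing at least one term with the query ever enter `scores`, so documents with zero overlap are never ranked; this matches the posting-list-driven design described above.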
Implements Step 4: Output Generation:

- Takes the ranked results.
- Formats them using PrettyTable into the TREC-like layout.
- Writes the final table to `dist/Results.txt`.

PrettyTable is:

- Used to format `Results.txt` into aligned, human-readable columns.
- Helps ensure the output matches the expected text-based table format.
- PorterStemmer – Used in `filterSentence()` to normalize tokens by reducing words to their stems, improving matching between queries and documents.
- Stopwords – Used to remove high-frequency, low-information terms that usually do not contribute to relevance.
- Tokenizer – Used in `filterSentence()` to split sentences into tokens (alphanumeric sequences separated by non-alphanumeric characters), as required by the assignment's preprocessing step.
The system was evaluated with `trec_eval` by comparing `dist/Results.txt` to the provided relevance file `Trec_microblog11-qrels.txt`. The following is an excerpt of the overall evaluation:
```
runid all myRun
num_q all 49
num_ret all 39091
num_rel all 2640
num_rel_ret all 2054
map all 0.1634
gm_map all 0.0919
Rprec all 0.1856
bpref all 0.1465
recip_rank all 0.3484
iprec_at_recall_0.00 all 0.4229
iprec_at_recall_0.10 all 0.3001
iprec_at_recall_0.20 all 0.2653
iprec_at_recall_0.30 all 0.2195
iprec_at_recall_0.40 all 0.2025
iprec_at_recall_0.50 all 0.1770
iprec_at_recall_0.60 all 0.1436
iprec_at_recall_0.70 all 0.1230
iprec_at_recall_0.80 all 0.1027
iprec_at_recall_0.90 all 0.0685
iprec_at_recall_1.00 all 0.0115
P_5 all 0.1714
P_10 all 0.1796
P_15 all 0.1796
P_20 all 0.1776
P_30 all 0.1714
P_100 all 0.1406
P_200 all 0.1133
P_500 all 0.0713
P_1000 all 0.0419
```
Overall, the system achieves:
- MAP ≈ 0.1634
- P@10 ≈ 0.1796
After correcting and optimizing the retrieval logic (especially around query weighting and cosine similarity calculations), performance improved relative to earlier buggy versions. Manual inspection of several queries indicates that the ranked results are generally relevant and make sense qualitatively.
For example, the top-10 results for queries 3 and 20:

```
3 Q0 32333726654398464 1 0.69484735460699 myRun
3 Q0 32910196598636545 2 0.6734426036041226 myRun
3 Q0 35040428893937664 3 0.5424091725376433 myRun
3 Q0 35039337598947328 4 0.5424091725376433 myRun
3 Q0 29613127372898304 5 0.5233927588038552 myRun
3 Q0 29615296666931200 6 0.5054085301107222 myRun
3 Q0 32204788955357184 7 0.48949945859699995 myRun
3 Q0 33711164877701120 8 0.47740062368197117 myRun
3 Q0 33995136060882945 9 0.47209559331399364 myRun
3 Q0 31167954573852672 10 0.47209559331399364 myRun
20 Q0 33356942797701120 1 0.8821317020383918 myRun
20 Q0 34082003779330048 2 0.7311611336720092 myRun
20 Q0 34066620821282816 3 0.7311611336720092 myRun
20 Q0 33752688764125184 4 0.7311611336720092 myRun
20 Q0 33695252271480832 5 0.7311611336720092 myRun
20 Q0 33580510970126337 6 0.7311611336720092 myRun
20 Q0 32866366780342272 7 0.7311611336720092 myRun
20 Q0 32269178773708800 8 0.7311611336720092 myRun
20 Q0 32179898437218304 9 0.7311611336720092 myRun
20 Q0 31752644565409792 10 0.7311611336720092 myRun
```
The final vocabulary size is 88,422 tokens.
Below is a sample of 100 tokens from the vocabulary:
```
['bbc', 'world', 'servic', 'staff', 'cut', 'fifa', 'soccer', 'haiti', 'aristid', 'return', 'mexico', 'drug', 'war',
'diplomat', 'arrest', 'murder', 'phone', 'hack', 'british', 'politician', 'toyota', 'reca', 'egyptian', 'protest',
'attack', 'museumkubica', 'crash', 'assang', 'nobel', 'peac', 'nomin', 'oprah', 'winfrey', 'half-sist', 'known',
'unknown', 'white', 'stripe', 'breakup', 'william', 'kate', 'fax', 'save-the-da', 'cuomo', 'budget', 'super', 'bowl',
'seat', 'tsa', 'airport', 'screen', 'unemploymen', 'reduc', 'energi', 'consumpt', 'detroit', 'auto', 'global', 'warm',
'weather', 'keith', 'olbermann', 'job', 'special', 'athlet', 'state', 'union', 'dog', 'whisper', 'cesar', 'millan',
"'s", 'techniqu', 'msnbc', 'rachel', 'maddow', 'sargent', 'shriver', 'tribut', 'moscow', 'bomb', 'gifford', 'recoveri',
'jordan', 'curfew', 'beck', 'piven', 'obama', 'birth', 'certifica', 'campaign', 'social', 'media', 'veneta', 'organ',
'farm', 'requir', 'evacu', 'carbon', 'monoxid']
```