TF-IDF Schemes - Detailed Methodology

Mathematical Formulations

Term Frequency (TF) Schemes

1. Raw TF

tf_raw(t, d) = f(t, d)

Where:

  • f(t, d) = frequency of term t in document d

Characteristics:

  • Simple counting of term occurrences
  • No normalization
  • Biased toward longer documents
  • Best for: Short documents of similar length

Example:

  • Document: "cat sat on the mat with the cat"
  • "cat" appears 2 times → tf_raw = 2

2. Double Normalization (K-normalization)

tf_double_norm(t, d) = K + (1 - K) × (f(t, d) / max_f(d))

Where:

  • K = normalization constant (typically 0.5)
  • max_f(d) = frequency of most common term in document d

Characteristics:

  • Prevents bias towards longer documents
  • K=0.5 gives half weight to normalization
  • Values range from K to 1.0
  • Best for: Mixed-length documents

Example:

  • Document: "cat sat on the mat with the cat"
  • "cat" appears 2 times, max frequency = 2
  • tf_double_norm = 0.5 + (1-0.5) × (2/2) = 1.0
  • "sat" appears 1 time, max frequency = 2
  • tf_double_norm = 0.5 + (1-0.5) × (1/2) = 0.75
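The worked numbers above can be reproduced with a short sketch (`tf_double_norm` is our name for it):

```python
from collections import Counter

def tf_double_norm(term, doc_tokens, k=0.5):
    """Double-normalized TF: K + (1 - K) * f(t, d) / max_f(d)."""
    counts = Counter(doc_tokens)
    max_f = max(counts.values())
    return k + (1 - k) * counts[term] / max_f

doc = "cat sat on the mat with the cat".split()
print(tf_double_norm("cat", doc))  # 1.0
print(tf_double_norm("sat", doc))  # 0.75
```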

3. Log Scaled TF

tf_log(t, d) = 1 + log(f(t, d))  if f(t, d) > 0
             = 0                  otherwise

Characteristics:

  • Dampens effect of high frequencies
  • Reduces impact of repetition
  • Logarithmic growth instead of linear
  • Best for: Documents with high term repetition

Example:

  • "cat" appears 10 times
  • tf_log = 1 + log(10) = 1 + 2.303 = 3.303
  • "dog" appears 100 times
  • tf_log = 1 + log(100) = 1 + 4.605 = 5.605
  • Note: 10x increase in frequency → only 1.7x increase in weight
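The dampening effect can be checked directly (natural log, matching the worked numbers above):

```python
import math

def tf_log(freq):
    """Log-scaled TF: 1 + ln(f) for f > 0, else 0."""
    return 1 + math.log(freq) if freq > 0 else 0.0

print(round(tf_log(10), 3))   # 3.303
print(round(tf_log(100), 3))  # 5.605
```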

4. Normalized TF

tf_norm(t, d) = f(t, d) / |d|

Where:

  • |d| = total number of terms in document d

Characteristics:

  • Relative term importance
  • Values are proportions (0 to 1)
  • Document length independent
  • Best for: Comparing across varied document lengths

Example:

  • Document with 100 words
  • "cat" appears 5 times
  • tf_norm = 5/100 = 0.05
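A one-line sketch of length normalization, using a synthetic 100-word document:

```python
def tf_norm(term, doc_tokens):
    """Length-normalized TF: f(t, d) divided by the document's token count."""
    return doc_tokens.count(term) / len(doc_tokens)

doc = ["cat"] * 5 + ["dog"] * 95  # a 100-word document
print(tf_norm("cat", doc))  # 0.05
```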

Inverse Document Frequency (IDF) Schemes

1. Standard IDF

idf_standard(t) = log(N / df(t))

Where:

  • N = total number of documents
  • df(t) = number of documents containing term t

Characteristics:

  • Classic formulation
  • Higher weight for rare terms
  • Lower weight for common terms
  • Can be undefined if df(t) = 0

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1000/10) = log(100) = 4.605
  • "the" appears in 999 documents
  • idf = log(1000/999) = log(1.001) = 0.001
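A direct translation of the formula (natural log, as in the examples):

```python
import math

def idf_standard(n_docs, df):
    """Standard IDF: log(N / df(t))."""
    return math.log(n_docs / df)

print(round(idf_standard(1000, 10), 3))   # 4.605
print(round(idf_standard(1000, 999), 3))  # 0.001
```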

2. Smooth IDF

idf_smooth(t) = log(1 + N / df(t))

Characteristics:

  • Smoothing inside the logarithm keeps IDF strictly positive
  • A term appearing in every document still gets log(2) ≈ 0.693 rather than 0
  • Slightly higher weights than standard IDF (log(101) vs. log(100) for the same term)
  • More stable numerically

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1 + 1000/10) = log(101) = 4.615
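A sketch of the smoothed variant, including the every-document edge case where standard IDF would be zero:

```python
import math

def idf_smooth(n_docs, df):
    """Smooth IDF: log(1 + N / df(t)); stays strictly positive."""
    return math.log(1 + n_docs / df)

print(round(idf_smooth(1000, 10), 3))    # 4.615
print(round(idf_smooth(1000, 1000), 3))  # 0.693 (term in every document)
```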

3. Max IDF (Normalized IDF)

idf_max(t) = log(N / df(t)) / max_t' log(N / df(t'))

Where:

  • max_t' log(N / df(t')) = maximum IDF in collection (term appearing in 1 doc)

Characteristics:

  • Scales IDF values between 0 and 1
  • Relative to collection's rarest term
  • Helps when combining with other normalized features
  • Useful for machine learning pipelines

Example:

  • N = 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1000/10) / log(1000/1) = log(100) / log(1000) = 4.605 / 6.908 = 0.667
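A sketch under the assumption stated above, that the rarest term appears in exactly one document, so the denominator simplifies to log(N):

```python
import math

def idf_max(n_docs, df):
    """Max-normalized IDF, assuming the rarest term has df = 1,
    so the denominator is log(N / 1) = log(N)."""
    return math.log(n_docs / df) / math.log(n_docs)

print(round(idf_max(1000, 10), 3))  # 0.667
```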

4. Probabilistic IDF

idf_prob(t) = log((N - df(t)) / df(t))

Characteristics:

  • Based on the Robertson and Sparck Jones relevance weighting model
  • Probability of relevance model
  • Can be negative for very common terms
  • Theoretical foundation in probability

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log((1000-10)/10) = log(99) = 4.595
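A sketch that also demonstrates the negative-weight behavior for very common terms:

```python
import math

def idf_prob(n_docs, df):
    """Probabilistic IDF: log((N - df) / df); negative once df > N / 2."""
    return math.log((n_docs - df) / df)

print(round(idf_prob(1000, 10), 3))  # 4.595
print(idf_prob(1000, 900) < 0)       # True (very common term)
```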

5. Entropy-based IDF

H(t) = -Σ(p_i × log(p_i))  where p_i = f(t,d_i) / Σf(t,d_j)
idf_entropy(t) = 1 - (H(t) / log(df(t)))

Characteristics:

  • Measures information content
  • Lower entropy = more discriminative
  • Considers distribution across documents
  • Computationally intensive

Example:

  • Term appears in 4 documents with frequencies [2, 2, 2, 2]
  • Distribution is uniform → high entropy → low IDF
  • Term appears in 4 documents with frequencies [8, 1, 1, 1]
  • Distribution is skewed → low entropy → high IDF
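The two distributions above can be compared with a short sketch (`idf_entropy` takes the term's per-document frequency list; natural log assumed):

```python
import math

def idf_entropy(freqs):
    """Entropy-based IDF from per-document frequencies:
    H = -sum(p_i * ln p_i) with p_i = f_i / sum(f); idf = 1 - H / ln(df)."""
    total = sum(freqs)
    h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return 1 - h / math.log(len(freqs))

print(idf_entropy([2, 2, 2, 2]))  # ≈ 0.0   (uniform distribution)
print(idf_entropy([8, 1, 1, 1]))  # ≈ 0.361 (skewed distribution)
```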

TF-IDF Combination

Final score for term t in document d:

tfidf(t, d) = tf(t, d) × idf(t)

Document-query similarity (cosine):

sim(q, d) = Σ(tfidf(t, q) × tfidf(t, d)) / (||q|| × ||d||)
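The cosine formula can be sketched over sparse vectors; here the {term: weight} dict representation is our choice, and the weights stand in for precomputed tf × idf values:

```python
import math

def cosine_sim(vec_q, vec_d):
    """Cosine similarity between sparse TF-IDF vectors stored as {term: weight} dicts."""
    dot = sum(w * vec_d.get(t, 0.0) for t, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"cat": 1.0, "mat": 0.5}
d = {"cat": 2.0, "dog": 1.0}
print(cosine_sim(q, d))  # ≈ 0.8
```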

Selection Guidelines

When to use each TF scheme:

  1. Raw TF

    • Short, uniform-length documents
    • When term frequency directly indicates importance
    • Fast computation required
  2. Double Normalization

    • Mixed-length documents
    • When you want to balance raw frequency and normalization
    • Most versatile option
  3. Log Scaled TF

    • Documents with high term repetition
    • When diminishing returns apply (2nd occurrence less important than 1st)
    • Scientific/technical documents
  4. Normalized TF

    • Highly varied document lengths
    • When relative frequency matters more than absolute
    • Short queries vs. long documents

When to use each IDF scheme:

  1. Standard IDF

    • General purpose
    • Well-understood and interpretable
    • Good baseline
  2. Smooth IDF

    • Noisy collections
    • When numerical stability is important
    • Small document collections
  3. Max IDF

    • When you want IDF relative to corpus
    • Experimental/comparative analysis
  4. Probabilistic IDF

    • Theoretical soundness required
    • When you can handle negative values
    • Research settings
  5. Entropy-based IDF

    • When term distribution matters
    • Willing to pay computational cost
    • Need fine-grained discrimination

Expected Performance on Cranfield

Based on IR literature:

Best Combinations (typically):

  1. Log Scaled TF + Smooth IDF
  2. Double Normalization + Standard IDF
  3. Log Scaled TF + Standard IDF

Why these work well:

  • Balance between raw frequency and normalization
  • Logarithmic dampening prevents over-emphasis
  • Standard/smooth IDF are well-calibrated

Common Pitfalls:

  • Raw TF without normalization (document length bias)
  • Probabilistic IDF (can give negative values)
  • Over-normalized schemes (lose important signal)

Evaluation Metrics Explained

MAP (Mean Average Precision)

  • For each query, precision is averaged at each relevant document's rank (Average Precision); MAP is the mean of these averages over all queries
  • Emphasizes ranking quality
  • Range: 0 to 1 (higher is better)
  • Most important metric for overall system quality
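A sketch of Average Precision for a single query (averaging this over queries gives MAP):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: precision accumulated at each relevant document's rank."""
    hits, prec_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant_ids) if relevant_ids else 0.0

# relevant docs retrieved at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}), 3))  # 0.833
```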

P@10 (Precision at 10)

  • Proportion of relevant docs in top 10
  • User-facing metric
  • Range: 0 to 1
  • Important for web search

R@10 (Recall at 10)

  • Proportion of relevant docs found in top 10
  • Complements precision
  • Important when finding all relevant docs matters

MRR (Mean Reciprocal Rank)

  • Mean over all queries of 1 / (rank of first relevant document)
  • Emphasizes first result
  • Important for question answering
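A sketch of MRR over a batch of queries (a query with no relevant document retrieved contributes 0, a common convention):

```python
def mean_reciprocal_rank(rankings, relevant_per_query):
    """MRR: mean over queries of 1 / rank of the first relevant document."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [["d1", "d2"], ["d3", "d4"]]
relevant = [{"d1"}, {"d4"}]
print(mean_reciprocal_rank(rankings, relevant))  # 0.75
```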

NDCG@10

  • Considers graded relevance
  • Position-discounted
  • More sophisticated than P@10

Experimental Protocol

1. Preprocessing

  • Tokenization (split on whitespace)
  • Lowercasing
  • Stopword removal
  • Stemming (suffix stripping)
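The four steps can be sketched in a few lines; the stopword list and the plural "-s" stripper below are crude illustrative stand-ins for a real stopword list and a proper stemmer (e.g., Porter):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "on", "in", "and", "with", "to"}

def preprocess(text):
    """Lowercase, tokenize on non-letter runs, drop stopwords, strip a plural -s."""
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # crude suffix stripping; a real system would use a proper stemmer
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The cats sat on the mats."))  # ['cat', 'sat', 'mat']
```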

2. Index Building

  • Inverted index: term → [(doc, freq), ...]
  • Statistics: doc lengths, max frequencies
  • Vocabulary extraction
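A minimal sketch of the index structure described above, collecting exactly the statistics the TF schemes need (document lengths for normalized TF, max frequencies for double normalization):

```python
from collections import defaultdict, Counter

def build_index(docs):
    """Inverted index plus per-document statistics from {doc_id: token_list}."""
    index = defaultdict(list)  # term -> [(doc_id, freq), ...]
    doc_len, max_freq = {}, {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        doc_len[doc_id] = len(tokens)
        max_freq[doc_id] = max(counts.values(), default=0)
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index, doc_len, max_freq

docs = {"d1": "cat sat".split(), "d2": "cat cat dog".split()}
index, doc_len, max_freq = build_index(docs)
print(index["cat"])  # [('d1', 1), ('d2', 2)]
```

The vocabulary falls out for free as `index.keys()`.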

3. Retrieval

  • Query processing (same as documents)
  • TF-IDF calculation for query terms
  • Cosine similarity scoring
  • Ranking by score

4. Evaluation

  • For each query:
    • Retrieve ranked list
    • Compare with relevance judgments
    • Calculate metrics
  • Average across all queries

Interpreting Results

Statistical Significance

  • Small differences (<0.01) may not be meaningful
  • Consider variance across queries
  • Look for consistent improvements

Choosing Best Scheme

  1. Identify top 3-5 by MAP
  2. Check consistency across metrics
  3. Consider computational cost
  4. Evaluate on held-out queries if possible

Trade-offs

  • Precision vs. Recall
  • Computational cost vs. accuracy
  • Complexity vs. interpretability
  • Generalization vs. dataset-specific tuning