TF-IDF Schemes - Detailed Methodology

Mathematical Formulations

Term Frequency (TF) Schemes

1. Raw TF

tf_raw(t, d) = f(t, d)

Where:

  • f(t, d) = frequency of term t in document d

Characteristics:

  • Simple counting of term occurrences
  • No normalization
  • Biased toward longer documents
  • Best for: Short documents of similar length

Example:

  • Document: "cat sat on the mat with the cat"
  • "cat" appears 2 times → tf_raw = 2

2. Double Normalization (K-normalization)

tf_double_norm(t, d) = K + (1 - K) × (f(t, d) / max_f(d))

Where:

  • K = normalization constant (typically 0.5)
  • max_f(d) = frequency of most common term in document d

Characteristics:

  • Prevents bias towards longer documents
  • K=0.5 gives half weight to normalization
  • Values range from K to 1.0
  • Best for: Mixed-length documents

Example:

  • Document: "cat sat on the mat with the cat"
  • "cat" appears 2 times, max frequency = 2
  • tf_double_norm = 0.5 + (1-0.5) × (2/2) = 1.0
  • "sat" appears 1 time, max frequency = 2
  • tf_double_norm = 0.5 + (1-0.5) × (1/2) = 0.75
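The worked numbers above can be reproduced with a short sketch (`tf_double_norm` is our name for it):

```python
from collections import Counter

def tf_double_norm(term, doc_tokens, k=0.5):
    """Double-normalized TF: K + (1 - K) * f(t, d) / max_f(d)."""
    counts = Counter(doc_tokens)
    max_f = max(counts.values())
    return k + (1 - k) * counts[term] / max_f

doc = "cat sat on the mat with the cat".split()
print(tf_double_norm("cat", doc))  # 1.0
print(tf_double_norm("sat", doc))  # 0.75
```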

3. Log Scaled TF

tf_log(t, d) = 1 + log(f(t, d))  if f(t, d) > 0
             = 0                  otherwise

Characteristics:

  • Dampens effect of high frequencies
  • Reduces impact of repetition
  • Logarithmic growth instead of linear
  • Best for: Documents with high term repetition

Example:

  • "cat" appears 10 times
  • tf_log = 1 + log(10) = 1 + 2.303 = 3.303
  • "dog" appears 100 times
  • tf_log = 1 + log(100) = 1 + 4.605 = 5.605
  • Note: 10x increase in frequency → only 1.7x increase in weight
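The dampening effect can be checked directly (natural log, matching the worked numbers above):

```python
import math

def tf_log(freq):
    """Log-scaled TF: 1 + ln(f) for f > 0, else 0."""
    return 1 + math.log(freq) if freq > 0 else 0.0

print(round(tf_log(10), 3))   # 3.303
print(round(tf_log(100), 3))  # 5.605
```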

4. Normalized TF

tf_norm(t, d) = f(t, d) / |d|

Where:

  • |d| = total number of terms in document d

Characteristics:

  • Relative term importance
  • Values are proportions (0 to 1)
  • Document length independent
  • Best for: Comparing across varied document lengths

Example:

  • Document with 100 words
  • "cat" appears 5 times
  • tf_norm = 5/100 = 0.05
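A one-line sketch of length normalization, using a synthetic 100-word document:

```python
def tf_norm(term, doc_tokens):
    """Length-normalized TF: f(t, d) divided by the document's token count."""
    return doc_tokens.count(term) / len(doc_tokens)

doc = ["cat"] * 5 + ["dog"] * 95  # a 100-word document
print(tf_norm("cat", doc))  # 0.05
```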

Inverse Document Frequency (IDF) Schemes

1. Standard IDF

idf_standard(t) = log(N / df(t))

Where:

  • N = total number of documents
  • df(t) = number of documents containing term t

Characteristics:

  • Classic formulation
  • Higher weight for rare terms
  • Lower weight for common terms
  • Can be undefined if df(t) = 0

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1000/10) = log(100) = 4.605
  • "the" appears in 999 documents
  • idf = log(1000/999) = log(1.001) = 0.001
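A direct translation of the formula (natural log, as in the examples):

```python
import math

def idf_standard(n_docs, df):
    """Standard IDF: log(N / df(t))."""
    return math.log(n_docs / df)

print(round(idf_standard(1000, 10), 3))   # 4.605
print(round(idf_standard(1000, 999), 3))  # 0.001
```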

2. Smooth IDF

idf_smooth(t) = log(1 + N / df(t))

Characteristics:

  • Smoothing inside the logarithm keeps IDF strictly positive
  • A term appearing in every document still gets log(2) ≈ 0.693 rather than 0
  • Slightly higher weights than standard IDF (log(101) vs. log(100) for the same term)
  • More stable numerically

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1 + 1000/10) = log(101) = 4.615
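A sketch of the smoothed variant, including the every-document edge case where standard IDF would be zero:

```python
import math

def idf_smooth(n_docs, df):
    """Smooth IDF: log(1 + N / df(t)); stays strictly positive."""
    return math.log(1 + n_docs / df)

print(round(idf_smooth(1000, 10), 3))    # 4.615
print(round(idf_smooth(1000, 1000), 3))  # 0.693 (term in every document)
```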

3. Max IDF (Normalized IDF)

idf_max(t) = log(N / df(t)) / max_t' log(N / df(t'))

Where:

  • max_t' log(N / df(t')) = maximum IDF in collection (term appearing in 1 doc)

Characteristics:

  • Scales IDF values between 0 and 1
  • Relative to collection's rarest term
  • Helps when combining with other normalized features
  • Useful for machine learning pipelines

Example:

  • N = 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log(1000/10) / log(1000/1) = log(100) / log(1000) = 4.605 / 6.908 = 0.667
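A sketch under the assumption stated above, that the rarest term appears in exactly one document, so the denominator simplifies to log(N):

```python
import math

def idf_max(n_docs, df):
    """Max-normalized IDF, assuming the rarest term has df = 1,
    so the denominator is log(N / 1) = log(N)."""
    return math.log(n_docs / df) / math.log(n_docs)

print(round(idf_max(1000, 10), 3))  # 0.667
```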

4. Probabilistic IDF

idf_prob(t) = log((N - df(t)) / df(t))

Characteristics:

  • Based on the Robertson and Sparck Jones relevance weighting model
  • Probability of relevance model
  • Can be negative for very common terms
  • Theoretical foundation in probability

Example:

  • Collection: 1000 documents
  • "algorithm" appears in 10 documents
  • idf = log((1000-10)/10) = log(99) = 4.595
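A sketch that also demonstrates the negative-weight behavior for very common terms:

```python
import math

def idf_prob(n_docs, df):
    """Probabilistic IDF: log((N - df) / df); negative once df > N / 2."""
    return math.log((n_docs - df) / df)

print(round(idf_prob(1000, 10), 3))  # 4.595
print(idf_prob(1000, 900) < 0)       # True (very common term)
```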

5. Entropy-based IDF

H(t) = -Σ(p_i × log(p_i))  where p_i = f(t,d_i) / Σf(t,d_j)
idf_entropy(t) = 1 - (H(t) / log(df(t)))

Characteristics:

  • Measures information content
  • Lower entropy = more discriminative
  • Considers distribution across documents
  • Computationally intensive

Example:

  • Term appears in 4 documents with frequencies [2, 2, 2, 2]
  • Distribution is uniform → high entropy → low IDF
  • Term appears in 4 documents with frequencies [8, 1, 1, 1]
  • Distribution is skewed → low entropy → high IDF
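The two distributions above can be compared with a short sketch (`idf_entropy` takes the term's per-document frequency list; natural log assumed):

```python
import math

def idf_entropy(freqs):
    """Entropy-based IDF from per-document frequencies:
    H = -sum(p_i * ln p_i) with p_i = f_i / sum(f); idf = 1 - H / ln(df)."""
    total = sum(freqs)
    h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    return 1 - h / math.log(len(freqs))

print(idf_entropy([2, 2, 2, 2]))  # ≈ 0.0   (uniform distribution)
print(idf_entropy([8, 1, 1, 1]))  # ≈ 0.361 (skewed distribution)
```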

TF-IDF Combination

Final score for term t in document d:

tfidf(t, d) = tf(t, d) × idf(t)

Document-query similarity (cosine):

sim(q, d) = Σ(tfidf(t, q) × tfidf(t, d)) / (||q|| × ||d||)
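The cosine formula can be sketched over sparse vectors; here the {term: weight} dict representation is our choice, and the weights stand in for precomputed tf × idf values:

```python
import math

def cosine_sim(vec_q, vec_d):
    """Cosine similarity between sparse TF-IDF vectors stored as {term: weight} dicts."""
    dot = sum(w * vec_d.get(t, 0.0) for t, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"cat": 1.0, "mat": 0.5}
d = {"cat": 2.0, "dog": 1.0}
print(cosine_sim(q, d))  # ≈ 0.8
```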

Selection Guidelines

When to use each TF scheme:

  1. Raw TF

    • Short, uniform-length documents
    • When term frequency directly indicates importance
    • Fast computation required
  2. Double Normalization

    • Mixed-length documents
    • When you want to balance raw frequency and normalization
    • Most versatile option
  3. Log Scaled TF

    • Documents with high term repetition
    • When diminishing returns apply (2nd occurrence less important than 1st)
    • Scientific/technical documents
  4. Normalized TF

    • Highly varied document lengths
    • When relative frequency matters more than absolute
    • Short queries vs. long documents

When to use each IDF scheme:

  1. Standard IDF

    • General purpose
    • Well-understood and interpretable
    • Good baseline
  2. Smooth IDF

    • Noisy collections
    • When numerical stability is important
    • Small document collections
  3. Max IDF

    • When you want IDF relative to corpus
    • Experimental/comparative analysis
  4. Probabilistic IDF

    • Theoretical soundness required
    • When you can handle negative values
    • Research settings
  5. Entropy-based IDF

    • When term distribution matters
    • Willing to pay computational cost
    • Need fine-grained discrimination

Expected Performance on Cranfield

Based on IR literature:

Best Combinations (typically):

  1. Log Scaled TF + Smooth IDF
  2. Double Normalization + Standard IDF
  3. Log Scaled TF + Standard IDF

Why these work well:

  • Balance between raw frequency and normalization
  • Logarithmic dampening prevents over-emphasis
  • Standard/smooth IDF are well-calibrated

Common Pitfalls:

  • Raw TF without normalization (document length bias)
  • Probabilistic IDF (can give negative values)
  • Over-normalized schemes (lose important signal)

Evaluation Metrics Explained

MAP (Mean Average Precision)

  • For each query, precision is averaged at each relevant document's rank (Average Precision); MAP is the mean of these averages over all queries
  • Emphasizes ranking quality
  • Range: 0 to 1 (higher is better)
  • Most important metric for overall system quality
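A sketch of Average Precision for a single query (averaging this over queries gives MAP):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: precision accumulated at each relevant document's rank."""
    hits, prec_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant_ids) if relevant_ids else 0.0

# relevant docs retrieved at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}), 3))  # 0.833
```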

P@10 (Precision at 10)

  • Proportion of relevant docs in top 10
  • User-facing metric
  • Range: 0 to 1
  • Important for web search

R@10 (Recall at 10)

  • Proportion of relevant docs found in top 10
  • Complements precision
  • Important when finding all relevant docs matters

MRR (Mean Reciprocal Rank)

  • Mean over all queries of 1 / (rank of first relevant document)
  • Emphasizes first result
  • Important for question answering
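A sketch of MRR over a batch of queries (a query with no relevant document retrieved contributes 0, a common convention):

```python
def mean_reciprocal_rank(rankings, relevant_per_query):
    """MRR: mean over queries of 1 / rank of the first relevant document."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_per_query):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [["d1", "d2"], ["d3", "d4"]]
relevant = [{"d1"}, {"d4"}]
print(mean_reciprocal_rank(rankings, relevant))  # 0.75
```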

NDCG@10

  • Considers graded relevance
  • Position-discounted
  • More sophisticated than P@10

Experimental Protocol

1. Preprocessing

  • Tokenization (split on whitespace)
  • Lowercasing
  • Stopword removal
  • Stemming (suffix stripping)
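The four steps can be sketched in a few lines; the stopword list and the plural "-s" stripper below are crude illustrative stand-ins for a real stopword list and a proper stemmer (e.g., Porter):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "on", "in", "and", "with", "to"}

def preprocess(text):
    """Lowercase, tokenize on non-letter runs, drop stopwords, strip a plural -s."""
    tokens = [t for t in re.split(r"[^a-z]+", text.lower()) if t]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # crude suffix stripping; a real system would use a proper stemmer
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The cats sat on the mats."))  # ['cat', 'sat', 'mat']
```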

2. Index Building

  • Inverted index: term → [(doc, freq), ...]
  • Statistics: doc lengths, max frequencies
  • Vocabulary extraction
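A minimal sketch of the index structure described above, collecting exactly the statistics the TF schemes need (document lengths for normalized TF, max frequencies for double normalization):

```python
from collections import defaultdict, Counter

def build_index(docs):
    """Inverted index plus per-document statistics from {doc_id: token_list}."""
    index = defaultdict(list)  # term -> [(doc_id, freq), ...]
    doc_len, max_freq = {}, {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        doc_len[doc_id] = len(tokens)
        max_freq[doc_id] = max(counts.values(), default=0)
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index, doc_len, max_freq

docs = {"d1": "cat sat".split(), "d2": "cat cat dog".split()}
index, doc_len, max_freq = build_index(docs)
print(index["cat"])  # [('d1', 1), ('d2', 2)]
```

The vocabulary falls out for free as `index.keys()`.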

3. Retrieval

  • Query processing (same as documents)
  • TF-IDF calculation for query terms
  • Cosine similarity scoring
  • Ranking by score

4. Evaluation

  • For each query:
    • Retrieve ranked list
    • Compare with relevance judgments
    • Calculate metrics
  • Average across all queries

Interpreting Results

Statistical Significance

  • Small differences (<0.01) may not be meaningful
  • Consider variance across queries
  • Look for consistent improvements

Choosing Best Scheme

  1. Identify top 3-5 by MAP
  2. Check consistency across metrics
  3. Consider computational cost
  4. Evaluate on held-out queries if possible

Trade-offs

  • Precision vs. Recall
  • Computational cost vs. accuracy
  • Complexity vs. interpretability
  • Generalization vs. dataset-specific tuning