tf_raw(t, d) = f(t, d)
Where:
- f(t, d) = frequency of term t in document d
Characteristics:
- Simple counting of term occurrences
- No normalization
- Biased toward longer documents
- Best for: Short documents of similar length
Example:
- Document: "cat sat on the mat with the cat"
- "cat" appears 2 times → tf_raw = 2
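The raw count can be sketched in a couple of lines of Python (the function name and tokenized document are illustrative, not from a particular library):

```python
from collections import Counter

def tf_raw(term, doc_tokens):
    # Raw TF is simply the occurrence count of the term in the document.
    return Counter(doc_tokens)[term]

doc = "cat sat on the mat with the cat".split()
print(tf_raw("cat", doc))  # 2
```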
tf_double_norm(t, d) = K + (1 - K) × (f(t, d) / max_f(d))
Where:
- K = normalization constant (typically 0.5)
- max_f(d) = frequency of most common term in document d
Characteristics:
- Prevents bias towards longer documents
- K=0.5 gives half weight to normalization
- Values range from K to 1.0
- Best for: Mixed-length documents
Example:
- Document: "cat sat on the mat with the cat"
- "cat" appears 2 times, max frequency = 2
- tf_double_norm = 0.5 + (1-0.5) × (2/2) = 1.0
- "sat" appears 1 time, max frequency = 2
- tf_double_norm = 0.5 + (1-0.5) × (1/2) = 0.75
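A minimal sketch of double normalization, reproducing the worked example above (function name and default K are illustrative):

```python
from collections import Counter

def tf_double_norm(term, doc_tokens, k=0.5):
    counts = Counter(doc_tokens)
    max_f = max(counts.values())               # frequency of the most common term
    return k + (1 - k) * counts[term] / max_f

doc = "cat sat on the mat with the cat".split()
print(tf_double_norm("cat", doc))  # 1.0
print(tf_double_norm("sat", doc))  # 0.75
```

Note that a term absent from the document still scores K rather than 0; practical implementations usually special-case f(t, d) = 0.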
tf_log(t, d) = 1 + log(f(t, d)) if f(t, d) > 0
= 0 otherwise
Characteristics:
- Dampens effect of high frequencies
- Reduces impact of repetition
- Logarithmic growth instead of linear
- Best for: Documents with high term repetition
Example:
- "cat" appears 10 times
- tf_log = 1 + log(10) = 1 + 2.303 = 3.303
- "dog" appears 100 times
- tf_log = 1 + log(100) = 1 + 4.605 = 5.605
- Note: 10x increase in frequency → only 1.7x increase in weight
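In code, using the natural log to match the numbers above:

```python
import math

def tf_log(freq):
    # Natural log, matching the worked example (log(10) ≈ 2.303).
    return 1 + math.log(freq) if freq > 0 else 0.0
```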
tf_norm(t, d) = f(t, d) / |d|
Where:
- |d| = total number of terms in document d
Characteristics:
- Relative term importance
- Values are proportions (0 to 1)
- Document length independent
- Best for: Comparing across varied document lengths
Example:
- Document with 100 words
- "cat" appears 5 times
- tf_norm = 5/100 = 0.05
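A sketch of length normalization; the 100-word document is constructed to match the example (the "filler" token is a placeholder):

```python
def tf_norm(term, doc_tokens):
    # Proportion of the document occupied by the term.
    return doc_tokens.count(term) / len(doc_tokens)

doc = ["cat"] * 5 + ["filler"] * 95   # 100-word document from the example
print(tf_norm("cat", doc))  # 0.05
```

Because the values are proportions, they sum to 1 over the document's vocabulary.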
idf_standard(t) = log(N / df(t))
Where:
- N = total number of documents
- df(t) = number of documents containing term t
Characteristics:
- Classic formulation
- Higher weight for rare terms
- Lower weight for common terms
- Can be undefined if df(t) = 0
Example:
- Collection: 1000 documents
- "algorithm" appears in 10 documents
- idf = log(1000/10) = log(100) = 4.605
- "the" appears in 999 documents
- idf = log(1000/999) = log(1.001) = 0.001
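The classic formulation in Python (natural log, as in the examples):

```python
import math

def idf_standard(n_docs, df):
    # Undefined for df == 0; callers must ensure the term occurs somewhere.
    return math.log(n_docs / df)
```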
idf_smooth(t) = log(1 + N / df(t))
Characteristics:
- Adds smoothing to prevent extreme values
- Log argument is always greater than 1, so weights stay strictly positive (even for a term in every document)
- Slightly higher weights than standard IDF (log(101) ≈ 4.615 vs. log(100) ≈ 4.605)
- More stable numerically
Example:
- Collection: 1000 documents
- "algorithm" appears in 10 documents
- idf = log(1 + 1000/10) = log(101) = 4.615
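The smoothed variant differs from the standard one only in the log argument:

```python
import math

def idf_smooth(n_docs, df):
    # 1 + N/df > 1, so the weight is always strictly positive.
    return math.log(1 + n_docs / df)
```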
idf_max(t) = log(N / df(t)) / max_t' log(N / df(t'))
Where:
- max_t' log(N / df(t')) = maximum IDF in collection (term appearing in 1 doc)
Characteristics:
- Scales IDF values between 0 and 1
- Relative to collection's rarest term
- Helps when combining with other normalized features
- Useful for machine learning pipelines
Example:
- N = 1000 documents
- "algorithm" appears in 10 documents
- idf = log(1000/10) / log(1000/1) = log(100) / log(1000) = 4.605 / 6.908 = 0.667
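A sketch, assuming the rarest term in the collection appears in `min_df` documents (1 by default, as in the example):

```python
import math

def idf_max(n_docs, df, min_df=1):
    # Normalize by the largest IDF in the collection: a term in min_df docs.
    return math.log(n_docs / df) / math.log(n_docs / min_df)
```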
idf_prob(t) = log((N - df(t)) / df(t))
Characteristics:
- Based on Robertson & Sparck Jones
- Probability of relevance model
- Can be negative for very common terms
- Theoretical foundation in probability
Example:
- Collection: 1000 documents
- "algorithm" appears in 10 documents
- idf = log((1000-10)/10) = log(99) = 4.595
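In code, with the sign behavior made explicit:

```python
import math

def idf_prob(n_docs, df):
    # Negative whenever the term appears in more than half the collection.
    return math.log((n_docs - df) / df)
```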
H(t) = -Σ_i (p_i × log(p_i)), where p_i = f(t, d_i) / Σ_j f(t, d_j), summed over the df(t) documents containing t
idf_entropy(t) = 1 - (H(t) / log(df(t)))
Characteristics:
- Measures information content
- Lower entropy = more discriminative
- Considers distribution across documents
- Computationally intensive
Example:
- Term appears in 4 documents with frequencies [2, 2, 2, 2]
- Distribution is uniform → high entropy → low IDF
- Term appears in 4 documents with frequencies [8, 1, 1, 1]
- Distribution is skewed → low entropy → high IDF
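A sketch of the entropy computation over the per-document frequencies of a term (assumes df(t) > 1, since log(1) = 0 would divide by zero):

```python
import math

def idf_entropy(doc_freqs):
    """doc_freqs: f(t, d_i) for each of the df(t) > 1 documents containing t."""
    total = sum(doc_freqs)
    probs = [f / total for f in doc_freqs]
    h = -sum(p * math.log(p) for p in probs)   # entropy of the distribution
    return 1 - h / math.log(len(doc_freqs))    # 0 for uniform, higher when skewed
```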
Final score for term t in document d:
tfidf(t, d) = tf(t, d) × idf(t)
Document-query similarity (cosine):
sim(q, d) = Σ(tfidf(t, q) × tfidf(t, d)) / (||q|| × ||d||)
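The combination and the cosine score can be sketched with sparse term → weight dictionaries (any of the TF and IDF variants above can feed the weights):

```python
import math

def tfidf_vector(tf_weights, idf_weights):
    # Multiply each term's TF by its IDF; terms missing an IDF get weight 0.
    return {t: tf * idf_weights.get(t, 0.0) for t, tf in tf_weights.items()}

def cosine_sim(vq, vd):
    # Dot product over shared terms, divided by the vector norms.
    dot = sum(w * vd.get(t, 0.0) for t, w in vq.items())
    nq = math.sqrt(sum(w * w for w in vq.values()))
    nd = math.sqrt(sum(w * w for w in vd.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```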
Raw TF:
- Short, uniform-length documents
- When term frequency directly indicates importance
- Fast computation required
Double Normalization:
- Mixed-length documents
- When you want to balance raw frequency and normalization
- Most versatile option
Log Scaled TF:
- Documents with high term repetition
- When diminishing returns apply (2nd occurrence less important than 1st)
- Scientific/technical documents
Normalized TF:
- Highly varied document lengths
- When relative frequency matters more than absolute
- Short queries vs. long documents
Standard IDF:
- General purpose
- Well-understood and interpretable
- Good baseline
Smooth IDF:
- Noisy collections
- When numerical stability is important
- Small document collections
Max IDF:
- When you want IDF relative to corpus
- Experimental/comparative analysis
Probabilistic IDF:
- Theoretical soundness required
- When you can handle negative values
- Research settings
Entropy-based IDF:
- When term distribution matters
- Willing to pay computational cost
- Need fine-grained discrimination
Based on IR literature, recommended combinations:
- Log Scaled TF + Smooth IDF
- Double Normalization + Standard IDF
- Log Scaled TF + Standard IDF
Why these work:
- Balance between raw frequency and normalization
- Logarithmic dampening prevents over-emphasis on repeated terms
- Standard/smooth IDF are well-calibrated
Schemes to avoid:
- Raw TF without normalization (document length bias)
- Probabilistic IDF (can give negative values)
- Over-normalized schemes (lose important signal)
MAP (Mean Average Precision):
- Average of precision values at each relevant document
- Emphasizes ranking quality
- Range: 0 to 1 (higher is better)
- Most important metric for overall system quality
Precision@10 (P@10):
- Proportion of relevant docs in top 10
- User-facing metric
- Range: 0 to 1
- Important for web search
Recall@10:
- Proportion of relevant docs found in top 10
- Complements precision
- Important when finding all relevant docs matters
MRR (Mean Reciprocal Rank):
- 1 / (rank of first relevant document)
- Emphasizes first result
- Important for question answering
NDCG (Normalized Discounted Cumulative Gain):
- Considers graded relevance
- Position-discounted
- More sophisticated than P@10
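The two rank-sensitive metrics above can be sketched as follows; `ranked` is the system's ordered result list and `relevant` the set of judged-relevant doc IDs (both names are illustrative):

```python
def average_precision(ranked, relevant):
    # Precision at each rank where a relevant doc appears, averaged over all
    # relevant docs. MAP is the mean of this value across queries.
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / k
    return 0.0
```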
Preprocessing:
- Tokenization (split on whitespace)
- Lowercasing
- Stopword removal
- Stemming (suffix stripping)
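A minimal sketch of the preprocessing steps (the stopword list is illustrative; stemming is omitted here, though a real pipeline would add e.g. a Porter stemmer):

```python
STOPWORDS = {"the", "a", "an", "on", "in", "with", "of"}   # illustrative list

def preprocess(text):
    # Tokenize on whitespace, lowercase, drop stopwords.
    return [t for t in text.lower().split() if t not in STOPWORDS]
```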
Indexing:
- Inverted index: term → [(doc, freq), ...]
- Statistics: doc lengths, max frequencies
- Vocabulary extraction
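A sketch of the index build, returning the inverted index together with the per-document statistics (lengths and max frequencies) that the TF variants need:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: {doc_id: token list}. Returns (inverted index, doc lengths,
    per-document max term frequencies)."""
    index = defaultdict(list)
    doc_len, max_freq = {}, {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        doc_len[doc_id] = len(tokens)
        max_freq[doc_id] = max(counts.values()) if counts else 0
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return index, doc_len, max_freq
```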
Retrieval:
- Query processing (same as documents)
- TF-IDF calculation for query terms
- Cosine similarity scoring
- Ranking by score
Evaluation:
- For each query:
  - Retrieve ranked list
  - Compare with relevance judgments
  - Calculate metrics
- Average across all queries
Interpreting results:
- Small differences (<0.01) may not be meaningful
- Consider variance across queries
- Look for consistent improvements
Selecting a scheme:
- Identify top 3-5 by MAP
- Check consistency across metrics
- Consider computational cost
- Evaluate on held-out queries if possible
Key trade-offs:
- Precision vs. Recall
- Computational cost vs. accuracy
- Complexity vs. interpretability
- Generalization vs. dataset-specific tuning