-
Notifications
You must be signed in to change notification settings - Fork 20.5k
Description
What would you like to Propose?
This PR introduces an InvertedIndex data structure that uses the BM25 scoring algorithm for ranking movie search results. The InvertedIndex class indexes movie documents based on their content (e.g., descriptions or scripts), allowing users to perform full-text searches on the movie database.
Issue details
Overview
The problem this PR addresses is the need for an efficient text search mechanism that ranks documents (in this case, movies) based on their relevance to a search term. The solution uses BM25 scoring, a popular information retrieval algorithm, to rank the search results based on term frequency, inverse document frequency (IDF), and document length.
Objectives
- Implement an
InvertedIndex
for movie search. - Use BM25 scoring to rank search results based on term frequency, IDF, and document length.
- Provide test coverage for functionality and edge cases.
- Resolves the need for efficient document retrieval and ranking.
Key Features
- Inverted Index Structure:
- The index maps terms (words) to document IDs and term frequencies in those documents.
- Supports fast retrieval of documents that contain the search terms.
-
BM25 Scoring Algorithm:
- Ranks search results based on relevance.
- Considers term frequency, document length, and average document length.
- Two tuning parameters: k (controls term frequency saturation) and b (controls document length normalization).
Key Components of BM25:
1. Term Frequency (TF): TF measures how frequently a term appears in a document. Higher frequency terms are often more relevant to the document.
2. Inverse Document Frequency (IDF): IDF is used to scale down the importance of common terms (e.g., "the", "is") and boost the importance of rare terms. The more documents a term appears in, the lower its IDF score, as it is less discriminative for determining relevance.
3. Document Length Normalization: BM25 normalizes the term frequency by the document length to prevent long documents from being unfairly penalized (since they contain more terms). The normalization is controlled by a parameter b, where 0β€πβ€1. If π=0, length normalization is not applied. If π=1 full length normalization is applied.
4. Saturation of Term Frequency: BM25 applies a saturation function to the term frequency, meaning that after a certain point, the frequency of a term does not contribute linearly to the relevance score. This is controlled by the parameter π1
, which adjusts the sensitivity to term frequency.
-
Movies:
- Each movie has a unique docId, name, IMDb rating, release year, and content (description or script).
- The search results are based on the content of these movies.
-
Search Functionality:
- Allows searching for a term, and returns a list of relevant documents (movies) with their BM25 scores.
- The results are sorted by relevance score in descending order..
Implementation Details
- InvertedIndex Class: Handles indexing and BM25 score calculation.
- SearchResult Class: Stores
docId
andrelevanceScore
. - Movie Class: Represents a movie with relevant details and content.
Testing
- Comprehensive test suite using JUnit 5:
- Test for adding movies, search results, BM25 score calculation, case sensitivity, and edge cases.
- Validate print functions for manual inspection
Additional Information
Role of Inverse Document Frequency (IDF):
The IDF component in BM25 plays a critical role in ranking because it adjusts the importance of terms based on their overall presence in the document collection. If a term appears in many documents, its IDF is low, as it is less useful for distinguishing between relevant and non-relevant documents. Conversely, rare terms get a higher IDF score, boosting the document's relevance score for those terms.
Why BM25 is Effective:
- Non-linear term frequency: Unlike plain TF-IDF, which increases the score linearly with term frequency, BM25 dampens this effect. This is more realistic, as repeating a word many times in a document does not infinitely increase its relevance.
- Document length normalization: BM25 adjusts for document length, preventing long documents from being unfairly penalized or short documents from being unfairly boosted.
- IDF component: It improves on raw term frequency by accounting for how informative a term is across the entire corpus, which improves the relevance of returned results.
Time and Space Complexity Analysis
1. Adding a Movie to the Index (addMovie)
When a movie is added to the index, the method processes the content of the movie and updates the inverted index.
-
Time Complexity:
- Let N be the number of words in the movie content.
- For each word in the movie's content:
- Tokenization and normalization: O(N).
- For each word, it either updates or inserts into the inverted index, which is a HashMap. Each update in the HashMap has an average time complexity of O(1) (since HashMap operations like put() are constant time on average).
- Updating the average document length takes constant time: O(1).
- Overall Time Complexity: O(N) per movie, where N is the number of words in the movie's content.
-
Space Complexity:
- Each word in the movie's content is stored in the inverted index.
- If there are M movies and each movie has an average of N words, the space required for the inverted index will be proportional to O(M * N) for storing the terms and their frequencies.
- Each movie is also stored in a separate movies map, which takes O(M) space.
- Overall Space Complexity: O(M * N).
2. Searching for a Term (search)
When searching for a term, the algorithm retrieves all documents that contain the term and computes the BM25 relevance score for each document.
-
Time Complexity:
- Let D be the number of documents that contain the search term (i.e., the document frequency for the term).
- For each document containing the term:
- Compute the BM25 score, which involves calculating term frequency and document length (both are retrieved in O(1) time).
- Sorting the results by relevance score takes O(D log D).
- Overall Time Complexity: O(D log D), where D is the number of documents containing the term.
-
Space Complexity:
- The space required to store the results is proportional to the number of documents containing the term.
- Overall Space Complexity: O(D), where D is the number of documents that contain the search term.
3. BM25 Score Calculation (computeBM25Score)
The BM25 score calculation for a single document is a constant-time operation:
- Time Complexity: O(1) per document.
- Space Complexity: O(1) per document.
Overall Complexity Summary
-
Adding a Movie:
- Time: O(N) (where N is the number of words in the movie's content).
- Space: O(M * N) (where M is the number of movies and N is the average number of words in each movie).
-
Searching for a Term:
- Time: O(D log D) (where D is the number of documents containing the term).
- Space: O(D).
The performance of this implementation is efficient given that the HashMap operations for indexing and retrieving terms have average O(1) time complexity. The search is also optimized by focusing only on documents that contain the searched term.