Cosine Similarity Algorithm | Machine Learning #11539
Closed
35 commits
0af293b Cosine Similarity Algorithm | Machine Learning (Arko-Sengupta)
3a62339 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
e8ec6df Input Fixes (Arko-Sengupta)
1458803 Input Fixes (Arko-Sengupta)
030ced3 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
768015c Lower Case Fixes (Arko-Sengupta)
d8deb03 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d597f45 Case Fixes (Arko-Sengupta)
2479eef Case Fixes (Arko-Sengupta)
fa91225 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
1b87ff9 spaCy Fixes (Arko-Sengupta)
2fe680f Fixed Model Dependency (Arko-Sengupta)
0336893 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
522edab Fixed Model Dependency (Arko-Sengupta)
135d9ea Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit… (Arko-Sengupta)
4cbeb62 Resolved All Doctests (Arko-Sengupta)
70a6de4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c892be4 Resolved all DocTests (Arko-Sengupta)
89aef3c Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit… (Arko-Sengupta)
26c7117 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
547e538 Resolved All Dependencies (Arko-Sengupta)
7158e47 Merge branch 'master' of https://github.com/Arko-Sengupta/The-Algorit… (Arko-Sengupta)
b1738d9 Resolved Dependency in DocTest (Arko-Sengupta)
4d94aaf Resolved Dependency from All Methods (Arko-Sengupta)
e0f24f2 Loaded Package at a Time (Arko-Sengupta)
3a3f30c Cleared All Dependencies (Arko-Sengupta)
147bcb2 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d320b99 Cleared All Dependencies (Arko-Sengupta)
2aa3608 Resolved Package OS Error (Arko-Sengupta)
8c15055 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
90c4446 Merge branch 'TheAlgorithms:master' into master (Arko-Sengupta)
cc4258d Jaccard Similarity | Machine Learning (Arko-Sengupta)
6ebe310 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3851df0 Correct Seperate Algo Conflict (Arko-Sengupta)
3435d80 Resolved Rename Issue (Arko-Sengupta)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,242 @@ | ||
import logging | ||
|
||
import numpy as np | ||
import spacy | ||
import spacy.cli | ||
import spacy.cli.download | ||
|
||
""" | ||
Cosine Similarity Algorithm - Natural Language Processing (NLP) Algorithm | ||
|
||
Use Case: | ||
- The Cosine Similarity Algorithm measures the Cosine of the Angle between two | ||
Non-Zero Vectors in a Multi-Dimensional Space. | ||
- It is used to determine how similar two texts are based on their Vector | ||
representations. | ||
- The similarity score ranges from -1 (Opposite Direction) to 1 (Same Direction), | ||
with 0 indicating Orthogonal Vectors, i.e. no Similarity. | ||
|
||
Dependencies: | ||
- spacy: A Natural Language Processing library for Python, used here for Tokenization | ||
and Vectorization. | ||
- numpy: A Library for Numerical Operations in Python, used for Mathematical | ||
Computations. | ||
""" | ||
try: | ||
nlp = spacy.load("en_core_web_md") | ||
except OSError:  # Model not installed yet, so download it once | ||
spacy.cli.download("en_core_web_md") | ||
nlp = spacy.load("en_core_web_md") | ||
|
||
|
||
class CosineSimilarity: | ||
def __init__(self) -> None: | ||
""" | ||
Initializes the Cosine Similarity class with the module-level spaCy model. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> isinstance(cs.nlp, spacy.lang.en.English) | ||
True | ||
""" | ||
self.nlp = nlp | ||
|
||
def tokenize(self, text: str) -> list: | ||
""" | ||
Tokenizes the input text into a list of lowercased tokens. | ||
|
||
Parameters: | ||
- text (str): The input text to be tokenized. | ||
|
||
Returns: | ||
- list: A list of lowercased tokens. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> cs.tokenize("Hello World!") | ||
['hello', 'world'] | ||
""" | ||
try: | ||
doc = self.nlp(text) | ||
tokens = [token.text.lower() for token in doc if not token.is_punct] | ||
return tokens | ||
except Exception as e: | ||
logging.error("An error occurred during Tokenization: ", exc_info=e) | ||
raise e | ||
|
||
def vectorize(self, tokens: list) -> list: | ||
""" | ||
Converts tokens into their corresponding vector representations. | ||
|
||
Parameters: | ||
- tokens (list): A list of tokens to be vectorized. | ||
|
||
Returns: | ||
- list: A list of vectors corresponding to the tokens. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> tokens = ['hello', 'world'] | ||
>>> len(cs.vectorize(tokens)) > 0 | ||
True | ||
""" | ||
try: | ||
vectors = [ | ||
vector | ||
for token in tokens | ||
if (vector := self.nlp(token).vector).any() | ||
] | ||
return vectors | ||
except Exception as e: | ||
logging.error("An error occurred during Vectorization: ", exc_info=e) | ||
raise e | ||
|
||
def mean_vector(self, vectors: list) -> np.ndarray: | ||
""" | ||
Computes the mean vector of a list of vectors. | ||
|
||
Parameters: | ||
- vectors (list): A list of vectors to be averaged. | ||
|
||
Returns: | ||
- np.ndarray: The mean vector. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> vectors = [np.array([1, 2, 3]), np.array([4, 5, 6])] | ||
>>> np.allclose(cs.mean_vector(vectors), np.array([2.5, 3.5, 4.5])) | ||
True | ||
""" | ||
try: | ||
if not vectors: | ||
return np.zeros(self.nlp.vocab.vectors_length) | ||
return np.mean(vectors, axis=0) | ||
except Exception as e: | ||
logging.error( | ||
"An error occurred while computing the Mean Vector: ", exc_info=e | ||
) | ||
raise e | ||
|
||
def dot_product(self, vector1: np.ndarray, vector2: np.ndarray) -> float: | ||
""" | ||
Computes the dot product between two vectors. | ||
|
||
Parameters: | ||
- vector1 (np.ndarray): The first vector. | ||
- vector2 (np.ndarray): The second vector. | ||
|
||
Returns: | ||
- float: The dot product of the two vectors. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> v1 = np.array([1, 2, 3]) | ||
>>> v2 = np.array([4, 5, 6]) | ||
>>> cs.dot_product(v1, v2) | ||
32.0 | ||
""" | ||
try: | ||
return float(np.dot(vector1, vector2)) | ||
except Exception as e: | ||
logging.error( | ||
"An error occurred during the dot Product Calculation: ", exc_info=e | ||
) | ||
raise e | ||
|
||
def magnitude(self, vector: np.ndarray) -> float: | ||
""" | ||
Computes the magnitude (norm) of a vector. | ||
|
||
Parameters: | ||
- vector (np.ndarray): The vector whose magnitude is to be calculated. | ||
|
||
Returns: | ||
- float: The magnitude of the vector. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> v = np.array([1, 2, 2]) | ||
>>> cs.magnitude(v) | ||
3.0 | ||
""" | ||
try: | ||
return float(np.sqrt(np.sum(vector**2))) | ||
except Exception as e: | ||
logging.error( | ||
"An error occurred while computing the Magnitude: ", exc_info=e | ||
) | ||
raise e | ||
|
||
def cosine_text_similarity(self, vector1: np.ndarray, vector2: np.ndarray) -> float: | ||
""" | ||
Computes the cosine similarity between two vectors. | ||
|
||
Parameters: | ||
- vector1 (np.ndarray): The first vector. | ||
- vector2 (np.ndarray): The second vector. | ||
|
||
Returns: | ||
- float: The cosine similarity between the two vectors. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> v1 = np.array([1, 2, 3]) | ||
>>> v2 = np.array([1, 2, 3]) | ||
>>> cs.cosine_text_similarity(v1, v2) | ||
1.0 | ||
""" | ||
try: | ||
dot = self.dot_product(vector1, vector2) | ||
magnitude1, magnitude2 = (self.magnitude(vector1), self.magnitude(vector2)) | ||
if magnitude1 == 0 or magnitude2 == 0: | ||
return 0.0 | ||
return float(dot / (magnitude1 * magnitude2)) | ||
except Exception as e: | ||
logging.error( | ||
"An error occurred during Cosine Similarity Calculation: ", exc_info=e | ||
) | ||
raise e | ||
|
||
def cosine_text_similarity_percentage(self, text1: str, text2: str) -> float: | ||
""" | ||
Computes the cosine similarity percentage between two texts. | ||
|
||
Parameters: | ||
- text1 (str): The first text. | ||
- text2 (str): The second text. | ||
|
||
Returns: | ||
- float: The cosine similarity percentage between the two texts. | ||
|
||
Example: | ||
>>> cs = CosineSimilarity() | ||
>>> text1 = "The biggest Infrastructure in the World is Burj Khalifa" | ||
>>> text2 = "The name of the tallest Tower in the world is Burj Khalifa" | ||
>>> cs.cosine_text_similarity_percentage(text1, text2) > 0 | ||
True | ||
""" | ||
try: | ||
tokens1 = self.tokenize(text1) | ||
tokens2 = self.tokenize(text2) | ||
|
||
vectors1 = self.vectorize(tokens1) | ||
vectors2 = self.vectorize(tokens2) | ||
|
||
mean_vec1 = self.mean_vector(vectors1) | ||
mean_vec2 = self.mean_vector(vectors2) | ||
|
||
similarity = self.cosine_text_similarity(mean_vec1, mean_vec2) | ||
return similarity * 100 | ||
except Exception as e: | ||
logging.error( | ||
"An error occurred while computing the Cosine Similarity Percentage: ", | ||
exc_info=e, | ||
) | ||
raise e | ||
|
||
|
||
if __name__ == "__main__": | ||
# Test the Cosine Similarity helpers by running the module's doctests. | ||
import doctest | ||
|
||
doctest.testmod() # Run the Doctests |
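
For reference while reviewing the math in `dot_product`, `magnitude` and `cosine_text_similarity`, here is a minimal NumPy-only sketch of the same calculation. It is not part of the PR: the function name `cosine_similarity` is hypothetical, and the clipping step is an added assumption to absorb floating-point rounding. Like the PR, it returns 0.0 when either vector has zero magnitude.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Return the cosine of the angle between a and b (hypothetical helper).

    Returns 0.0 if either vector has zero magnitude, as the PR code does.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0
    # Assumption: clip to [-1, 1] so rounding error cannot push the
    # result slightly outside the valid range of a cosine.
    return float(np.clip(np.dot(a, b) / norm_product, -1.0, 1.0))


if __name__ == "__main__":
    print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
    print(cosine_similarity([1, 0], [-1, 0]))  # opposite vectors
    print(cosine_similarity([0, 0], [1, 2]))   # zero vector
```

Converting the result with `float(...)` keeps the return value a plain Python float, so doctest output does not depend on how the installed NumPy version prints its scalar types.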