Commit 5bdcd4f

Merge pull request #38 from imohitmayank/mohit-23042025-updates
Enhance text similarity documentation with semantic search image and explanations
2 parents 05fa5fd + 825f6bf commit 5bdcd4f

File tree

2 files changed: +17 additions, -2 deletions
docs/natural_language_processing/text_similarity.md

Lines changed: 17 additions & 2 deletions
@@ -147,18 +147,33 @@ $$
  0.88
  ```

- - In first example, it found ‘ home’ as the longest substring, then considered ‘i am going’ and ‘gone’ for further processing (left of common substring), where again it found ‘go’ as longest substring. Later on right of ‘go’ it also found ’n’ as the only common and longest substring. Overall the score was 2 * (5 + 2 + 1) / 24 ~ 0.66. In second case, it found ‘hello’ as the longest longest substring and nothing common on the left and right, hence score is 0.5. The rest of the examples showcase the advantage of using sequence algorithms for cases missed by edit distance based algorithms.
+ - In the first example, it found ‘ home’ as the longest substring, then considered ‘i am going’ and ‘gone’ (the text to the left of the common substring) for further processing, where it again found ‘go’ as the longest substring. Later, on the right of ‘go’, it also found ‘n’ as the only common substring. Overall, the score was 2 * (5 + 2 + 1) / 24 ~ 0.66. In the second example, it found ‘hello’ as the longest substring and nothing common on the left or right, hence the score is 0.5. The rest of the examples showcase the advantage of sequence algorithms for cases missed by edit-distance-based algorithms.

  ## Semantic based approaches

  - In semantic search, strings are embedded using a neural network (NN) model. Think of it as a function that takes an input string and returns a vector of numbers; the vector is then used to compare the similarity between two strings.
  - Usually the NN models work at either token or word level, so to get the embedding of a string, we first find embeddings for each token in the string and then aggregate them using mean or a similar function.
  - The expectation is that the embeddings will represent the string in a way that captures different aspects of the language. Because of this, embeddings provide us with many more features for comparing strings.
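To make the aggregation step concrete, here is a minimal sketch (an illustration, not any specific library's API): random vectors stand in for a real model's per-token embeddings, which are mean-pooled into a single string embedding and then compared with cosine similarity.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-token vectors (n_tokens x dim) into one string embedding."""
    return token_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-d vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a real model: two strings with 3 and 4 "tokens",
# each token embedded as a 5-dim vector.
rng = np.random.default_rng(0)
string_a = mean_pool(rng.normal(size=(3, 5)))
string_b = mean_pool(rng.normal(size=(4, 5)))
score = cosine_similarity(string_a, string_b)
```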
- - Let's try a couple of ways to compute semantic similarity between strings. Different models can be picked or even fine tuned based on domain and requirement, but we will use the same model for simiplicity's sake. Instead we will just use different packages.
+ <figure markdown>
+ ![](../imgs/nlp_semanticsearch_intro.png){ width="100%"}
+ <figcaption>Semantic search embeds corpus entries and queries into the same vector space. Finding the closest vector to the query vector is equivalent to finding the most similar entry in the corpus. Source: [SBert.net](https://www.sbert.net/examples/sentence_transformer/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search)</figcaption>
+ </figure>
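The retrieval idea in the figure's caption can be sketched with plain NumPy (a toy illustration with made-up 2-d vectors in place of real model embeddings): normalize everything, then return the corpus entry whose vector is closest to the query vector by cosine similarity.

```python
import numpy as np

def semantic_search(query_vec: np.ndarray, corpus_vecs: np.ndarray) -> int:
    """Return the index of the corpus vector closest to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy 2-d "embeddings": entry 1 points in the same direction as the query.
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.3, 0.4])
best = semantic_search(query, corpus)  # entry 1 is the most similar
```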

  !!! Hint
      As embedding is an integral part of semantic search, it is important to check the quality of an embedding method before using it. [MTEB](https://github.com/embeddings-benchmark/mteb) is the "Massive Text Embedding Benchmark" Python package that lets you test any embedding function on more than 30 tasks. The process is quite simple - the text is embedded using the provided function or neural network, and the performance of the embedding is computed on downstream tasks like classification, clustering, and more.
161165

166+
!!! Hint
167+
168+
To select the right model for your task, identify whether your task is symmetric or asymmetric. ([refer](https://www.sbert.net/examples/sentence_transformer/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search))
169+
170+
- **Symmetric Search**: If your task involves matching similar-length queries and corpus entries (e.g., "How to learn Python online?" vs. "How to learn Python on the web?"), use models trained on datasets like Quora Duplicate Questions.
171+
172+
- **Asymmetric Search**: If your task involves short queries and longer, detailed answers (e.g., "What is Python?" vs. a paragraph explaining Python), use models optimized for query-document mismatches, such as those pre-trained on MS MARCO.
173+
174+
175+
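As a hypothetical illustration of that guideline, the sketch below maps the two task types to example checkpoints from the sentence-transformers model hub; the helper and the specific model names are illustrative choices, not part of the original document.

```python
# Hypothetical helper: map the task type above to an example checkpoint
# from the sentence-transformers model hub (names are illustrative).
SUGGESTED_MODELS = {
    "symmetric": "quora-distilbert-multilingual",  # trained on Quora duplicate questions
    "asymmetric": "msmarco-distilbert-base-v4",    # trained on MS MARCO query-passage pairs
}

def suggest_model(task_type: str) -> str:
    """Return an example model name for a 'symmetric' or 'asymmetric' task."""
    if task_type not in SUGGESTED_MODELS:
        raise ValueError(f"task_type must be one of {sorted(SUGGESTED_MODELS)}")
    return SUGGESTED_MODELS[task_type]
```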
+
+ Let's try a couple of ways to compute semantic similarity between strings. Different models can be picked or even fine-tuned based on domain and requirement, but we will use the same model (with different packages) for simplicity's sake.

  ### txtai

  - [txtai](https://github.com/neuml/txtai) is a Python package for performing semantic tasks on textual data, including search, question answering, information extraction, etc.
