
Sentence Similarity Identifier (Natural Language Processing)

Final project for the Natural Language Processing course at Cracow University of Technology.

  • The purpose of the project is to explore different statistical and machine learning NLP methods and techniques: preprocessing, stemming, and tokenization; concepts like data scraping, bag-of-words (BoW), and n-grams; word-embedding algorithms like Word2Vec, CBOW, GloVe, and FastText (via Gensim); and basic text generation.
  • To perform semantic analysis and identify similarity between sentences using statistical and machine learning methods.

For a more detailed description, open the report.

Steps:

Preprocessing

  • Remove Stopwords
  • Expand Contractions
  • Remove possessive endings
  • Lemmatization
  • Remove special characters
  • Preserve capitalization
  • Capitalize Named Entities
  • Tokenization
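
A minimal sketch of these steps using spaCy's en_core_web_sm model; the contractions map is an illustrative excerpt and the helper names are our own, not necessarily those in the repository.

```python
# Minimal preprocessing sketch using spaCy (helper names are illustrative).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Excerpt of a contractions map; a full implementation would cover many more.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we've": "we have"}

def expand_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def preprocess(text):
    text = expand_contractions(text)
    doc = nlp(text)
    tokens = []
    for tok in doc:
        if tok.is_stop or tok.is_punct:   # remove stopwords and special characters
            continue
        if tok.tag_ == "POS":             # drop possessive endings ('s)
            continue
        lemma = tok.lemma_
        # keep capitalization for named entities, lowercase everything else
        tokens.append(lemma if tok.ent_type_ else lemma.lower())
    return tokens

print(preprocess("We've seen John's car in Cracow."))
# -> ['see', 'John', 'car', 'Cracow']
```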

Applying Statistical Methods

Word Embedding methods

Because the Word2Vec family of models is unsupervised, we can just give it a corpus without additional labels or information and it can construct dense word embeddings from the corpus.

Dipanjan Sarkar - Text Analytics with Python

These models are based on the principle of distributional hypothesis in the field of distributional semantics, which tells us that words that occur and are used in the same context are semantically similar to one another and have similar meanings (“a word is characterized by the company it keeps”).

Dipanjan Sarkar - Text Analytics with Python

Gensim Word2Vec

After training Gensim's Word2Vec model, we may visualize specific words on a graph of their semantic similarities based on context.

[figure: Word2Vec semantic-similarity plot]
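
As a sketch of how such a plot can be produced, assuming Gensim's 4.x API (the toy corpus and hyperparameters below are illustrative, not those used in the project):

```python
# Sketch: train Word2Vec on a toy corpus and plot word vectors in 2D via PCA.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

corpus = [
    ["ice", "is", "a", "solid"],
    ["steam", "is", "a", "gas"],
    ["water", "can", "be", "ice", "or", "steam"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=200)

words = ["ice", "steam", "water", "solid", "gas"]
points = PCA(n_components=2).fit_transform([model.wv[w] for w in words])

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```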

We may calculate the semantic difference between sentences word by word, as well as with the cosine similarity measure; see the sketch after the figures below.

[figures: word-by-word and cosine-similarity comparison results]
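
A minimal sketch of the cosine-similarity comparison, reusing the `model` trained in the sketch above. Averaging word vectors is one common way to obtain a sentence vector; Gensim's built-in `n_similarity` does the equivalent in one call.

```python
# Sketch: compare two sentences via cosine similarity of their averaged word vectors.
# `model` is the Word2Vec model trained in the previous sketch.
import numpy as np

def sentence_vector(model, tokens):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector(model, ["ice", "is", "solid"])
s2 = sentence_vector(model, ["steam", "is", "gas"])
print(cosine_similarity(s1, s2))

# Gensim's built-in equivalent:
print(model.wv.n_similarity(["ice", "is", "solid"], ["steam", "is", "gas"]))
```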

GloVe

GloVe is both an unsupervised learning method and a model. The GloVe method looks for co-occurrences and relationships between words across the entire corpus.

For example, consider how frequently words like "ice" and "steam" co-occur with probe words like "solid" or "gas".

The probe word "solid" is expected to co-occur more often with the word "ice" than with the word "steam".

If the probe word is related to both or neither, e.g. "water" or "fashion", the ratio of co-occurrence probabilities is expected to be close to 1.

The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words.

GloVe: Global Vectors for Word Representation. Jeffrey Pennington et al.

[figures: GloVe results]
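
A sketch of this probe-word intuition with pre-trained GloVe vectors loaded through Gensim's downloader. Note that cosine similarity in the learned vector space is used here only as a rough proxy for the co-occurrence ratios the paper studies, and the model name assumes the standard gensim-data distribution.

```python
# Sketch: probe ice/steam against various words using pre-trained GloVe vectors.
import gensim.downloader as api

# Downloads ~130 MB on first use and returns a KeyedVectors instance.
glove = api.load("glove-wiki-gigaword-100")

for probe in ["solid", "gas", "water", "fashion"]:
    print(probe,
          "ice:",   round(glove.similarity("ice", probe), 3),
          "steam:", round(glove.similarity("steam", probe), 3))
```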

FastText

Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish... Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.

...we propose to learn representations for character n-grams, and to represent words as the sum of the n-gram vectors.

Enriching Word Vectors with Subword Information. Piotr Bojanowski et al.

Facebook researchers provide open-source pre-trained models in 157 languages, one of which is used in this project.

[figure: FastText results]

We concluded that, although the model comes pre-trained, it may require additional training for our specific problem. On our small task, Facebook's FastText models did not exhibit superiority.
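
A sketch of loading one of those pre-trained models with Gensim; the file path is illustrative (e.g., the English cc.en.300.bin from fasttext.cc). Because each word vector is the sum of its character n-gram vectors, even out-of-vocabulary words receive a usable representation.

```python
# Sketch: subword n-grams let FastText build vectors for out-of-vocabulary words.
from gensim.models.fasttext import load_facebook_vectors

# Illustrative path to one of Facebook's 157 pre-trained language models.
wv = load_facebook_vectors("cc.en.300.bin")

print("misspelling" in wv.key_to_index)            # True: in-vocabulary word
print(wv.similarity("misspelling", "mispelling"))  # OOV word still gets a vector
```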

Applying a Machine Learning Seq2Seq Model

Natural Language Generation

We experimented with different neural network configurations. Our results largely depended on the data used for training, the number of layers in the network and their configuration, and the number of training epochs. A minimal sketch of the architecture follows the figures below.

[figures: Seq2Seq natural language generation results]
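
As a hedged illustration of the encoder-decoder (Seq2Seq) architecture we experimented with, here is a minimal Keras sketch; the layer sizes and token counts are placeholders, not the configurations from our runs.

```python
# Minimal encoder-decoder (seq2seq) sketch in Keras; dimensions are illustrative.
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

num_tokens, latent_dim = 64, 256  # vocabulary size and LSTM state size (assumed)

# Encoder: consume the input sequence, keep only the final states.
encoder_inputs = Input(shape=(None, num_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the output sequence, initialized with the encoder states.
decoder_inputs = Input(shape=(None, num_tokens))
decoder_outputs = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.summary()
```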
