Text Analytics
Text Analytics empowers businesses with ‘Social Listening’ capabilities. It allows businesses to tune into structured and unstructured data across emails, text messages, and customer reviews to narrow down on positive and negative topics. The past decade has seen a data boom, and the manual approach to text analytics has proven to be ineffective and unproductive. Our framework covers the following key characteristics of Text Analytics:
0. Text Featurisation/ Transformation:
Basic feature extraction from text adds information about the text, the type of text, patterns in the text, and any special characteristics, such as the following (see the sketch after this list):
- Get word count
- Character (alphabet/number/special character) count
- Average word length
- Special characters count
- Upper case words count
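A minimal sketch of these basic features in plain Python is shown below; the function name and sample sentence are illustrative, not part of the AutoBrewML codebase.

```python
# Minimal sketch of basic text featurisation (illustrative only).
import string

def basic_text_features(text):
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),  # alphabets, numbers and special characters
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "special_char_count": sum(1 for c in text if c in string.punctuation),
        "upper_case_word_count": sum(1 for w in words if w.isupper()),
    }

print(basic_text_features("NLP is GREAT for Social Listening!!"))
```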
We also need to perform some standardization operations on the raw text so that it becomes machine readable and uniform across the corpus, such as the following (see the sketch after this list):
- Converting text to uniform lower case
- Removing punctuations
- Stop words removal (commonly occurring English words like is/a/am/are/the that add no extra information to the text data)
- Removal of the most frequent words appearing throughout the corpus (their presence will not be of any use in classifying our text data)
- Rare Words removal (Because they’re so rare, the association between them and other words is dominated by noise)
- Spelling correction (this also helps us avoid multiple copies of the same word being treated as different words)
- Stemming (removal of suffixes like “ing”, “ly”, “s”, etc. to get the base word out of different forms of the same word) or Lemmatization (a more effective option than stemming because it converts the word into its root word rather than just stripping the suffixes)
- Most frequent N-Grams search (contiguous sequence of n items from a given sample of text)
- Sentiment Analysis to identify the polarity and subjectivity of each piece of text
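The standardization steps above can be chained roughly as shown below. This is only a sketch: it assumes NLTK (with the stopwords and wordnet corpora downloaded) and TextBlob are installed, and the two-document corpus and frequency cut-offs are made up for illustration.

```python
# Minimal text standardization + sentiment sketch (illustrative corpus and thresholds).
import re
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# one-time setup: import nltk; nltk.download("stopwords"); nltk.download("wordnet")

corpus = [
    "The delivery was GREAT and the packaging was great too!!",
    "Delivery was late and the product arrived damaged.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text, frequent_words=(), rare_words=()):
    text = text.lower()                                     # uniform lower case
    text = re.sub(r"[^\w\s]", "", text)                     # remove punctuation
    tokens = [w for w in text.split() if w not in stop_words]                       # stop word removal
    tokens = [w for w in tokens if w not in frequent_words and w not in rare_words]  # frequent/rare word removal
    tokens = [str(TextBlob(w).correct()) for w in tokens]   # spelling correction
    tokens = [lemmatizer.lemmatize(w) for w in tokens]      # lemmatization
    return " ".join(tokens)

# corpus-level most frequent and rare (single-occurrence) words
all_words = Counter(w for doc in corpus for w in re.sub(r"[^\w\s]", "", doc.lower()).split())
frequent = {w for w, _ in all_words.most_common(2)}
rare = {w for w, c in all_words.items() if c == 1}

print([clean(doc, frequent, rare) for doc in corpus])

# sentiment: polarity in [-1, 1] and subjectivity in [0, 1] for each raw text
for doc in corpus:
    print(TextBlob(doc).sentiment)
```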
1. Text Similarity:
Text similarity determines how 'close' two pieces of text are, both in surface form (lexical similarity) and in meaning (semantic similarity).
- Term Frequency-Inverse Document Frequency (TF-IDF): this looks at words that appear in both pieces of text and scores them based on how often they appear. It is a useful tool if you expect the same words to appear in both pieces of text, but some words are more important than others.
To get the cosine similarity between two texts (a sketch follows the steps below):
Step 1. Convert each text into a vector of numbers (using TF-IDF scores)
a) TF (Term Frequency) = frequency of a word in the given sentence
b) IDF (Inverse Document Frequency) = 1 / the number of times a word appears across all documents. This is important because words like is/am/are/the are present throughout the text and add no value/variability when present in a sentence, so taking the inverse allots these words a lower score. We can ignore the IDF score here as we have already removed the stop words and the most frequent words across the corpus.
c) TF-IDF score = TF score * IDF score
d) The text_to_vector function returns a mapping of { word: frequency }, i.e. the TF score, and thus converts a text into a vector.
Step 2. Calculate the cosine similarity of the two vectors
a) cos_sim(vectA, vectB) = (xa*xb + ya*yb + za*zb) / [sqrt(xa*xa + ya*ya + za*za) * sqrt(xb*xb + yb*yb + zb*zb)]
where vectA = (xa, ya, za) and vectB = (xb, yb, zb); the numerator is the dot product of the two vectors and the denominator is the product of their magnitudes.
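A minimal sketch of both steps is shown below. The text_to_vector name mirrors the function referred to above, but this particular implementation and the sample sentences are illustrative; the IDF term is skipped, as noted in (b).

```python
# Sketch of Step 1 (text -> TF vector) and Step 2 (cosine similarity).
import math
import re
from collections import Counter

def text_to_vector(text):
    # TF vector as a {word: frequency} mapping (IDF skipped per the note above)
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(vec_a, vec_b):
    common = set(vec_a) & set(vec_b)
    numerator = sum(vec_a[w] * vec_b[w] for w in common)           # dot product
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0

v1 = text_to_vector("delivery was quick and the product quality is great")
v2 = text_to_vector("great product quality but the delivery was slow")
print(cosine_similarity(v1, v2))
```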

- Semantic similarity: this scores words based on how similar they are, even if they are not exact matches. It borrows techniques from Natural Language Processing (NLP), such as word embeddings. This is useful if the word overlap between texts is limited, such as if you need ‘fruit and vegetables’ to relate to ‘tomatoes’. GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.
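As a rough sketch of this idea, pre-trained GloVe vectors can be loaded through gensim's downloader; the model name and word pairs below are illustrative choices, not necessarily the ones used in the AutoBrewML samples.

```python
# Semantic similarity with pre-trained GloVe word embeddings (illustrative sketch).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use

# similarity between single words, even though they are not exact matches
print(glove.similarity("tomato", "vegetable"))

# similarity between two short phrases, via the average of their word vectors
print(glove.n_similarity(["fruit", "and", "vegetables"], ["tomatoes"]))
```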
2. Text Summarisation:
Text summarization is the process of creating a short, coherent, and fluent summary of a longer text document; it involves outlining the text's major points. Auto Summarise Text: extract the most important sentences from the whole text and combine them to form the abstract.
The Word importance and Sentence significance are measured as shown below-
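The exact formulas from the working samples are not reproduced here; the sketch below uses a common frequency-based scoring in which a word's importance is its frequency normalised by the most frequent word, and a sentence's significance is the sum of the importance of its words.

```python
# Frequency-based extractive summarisation sketch (illustrative scoring).
import re
from collections import Counter
from heapq import nlargest

def summarise(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    max_freq = max(freq.values())
    word_importance = {w: c / max_freq for w, c in freq.items()}      # word importance
    sentence_score = {                                                # sentence significance
        s: sum(word_importance.get(w, 0) for w in re.findall(r"\w+", s.lower()))
        for s in sentences
    }
    top = set(nlargest(n_sentences, sentence_score, key=sentence_score.get))
    return " ".join(s for s in sentences if s in top)                 # keep original order

doc = ("Text analytics helps businesses listen to customers at scale. "
       "Reviews, emails and social posts contain both positive and negative topics. "
       "Summarisation extracts the most significant sentences from a long document.")
print(summarise(doc, n_sentences=2))
```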

3. Text Classification:
In text categorization, the system is fed a pre-built set of text examples and their relevant categories. The machine learning algorithm learns how each text is categorized and creates rules for itself. When new text is presented, it applies these rules to assign the new text to the right category.
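A minimal sketch of this supervised flow using scikit-learn is shown below; the training sentences, labels, and choice of TF-IDF features with Naive Bayes are illustrative, not the exact setup used in the AutoBrewML samples.

```python
# Text classification sketch: TF-IDF features + Naive Bayes (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the delivery was fast and the packaging was neat",
    "great product quality, very happy with the purchase",
    "the parcel arrived late and the box was damaged",
    "terrible quality, the item broke after one day",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# the learned "rules" are applied to categorise unseen text
print(model.predict(["the product quality is great but delivery was late"]))
```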
To refer to the working samples please visit-
Twitter & Blog Post Analysis using AutoBrewML Text Analytics Tools
Datasets used-