Text Analytics
Text Analytics empowers businesses with ‘Social Listening’ capabilities. It allows businesses to tune into structured and unstructured data across emails, text messages, and customer reviews to narrow down on positive and negative topics. The past decade has seen a data boom, and the manual approach to text analytics has proven ineffective and unproductive. Our framework covers the following characteristics of Text Analytics:
Text Featurisation/Transformation:
Basic feature extraction from text, such as word count, lexicon count, average word length, special character count, and upper-case word count. These features add information about the text, its type, the patterns in it, and any special characteristics it has.
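A minimal sketch of this kind of feature extraction, using only the Python standard library. The function name `basic_text_features` and the choice to treat punctuation as "special characters" are assumptions made for illustration, not part of the framework.

```python
import string

def basic_text_features(text):
    """Return simple descriptive features for one piece of text."""
    words = text.split()
    word_count = len(words)
    # Average length of whitespace-separated tokens (0.0 for empty text).
    avg_word_length = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "char_count": len(text),
        "avg_word_length": avg_word_length,
        # Assumption: punctuation characters stand in for "special characters".
        "special_char_count": sum(ch in string.punctuation for ch in text),
        # Tokens whose letters are all upper case, e.g. "GREAT!".
        "upper_case_word_count": sum(w.isupper() for w in words),
    }

features = basic_text_features("The delivery was LATE, but support was GREAT!")
print(features)
```

Each feature is a plain number, so the resulting dictionary can be added as extra columns alongside the text in a data set.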
The raw text also needs to be standardised so that it is machine readable and uniform across the corpus, with operations such as:
a) Converting text to uniform lower case.
b) Removing punctuation.
c) Removing stop words, i.e. commonly occurring English words such as is/a/am/are/the, which add no extra information to the text data.
d) Removing the most frequent words across the corpus, as their presence is of no use in classifying the text data.
e) Removing rare words: because they are so rare, the association between them and other words is dominated by noise.
f) Spelling correction, which stops misspelt copies of the same word from being treated as different words.
g) Stemming (removing suffixes such as “ing”, “ly”, “s” to get the base word out of different forms of the same word) or lemmatization (usually a more effective option than stemming, because it converts the word into its root word rather than just stripping the suffixes).
Sentiment analysis can also be applied to identify the polarity and subjectivity of each text.
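The first few standardisation steps can be sketched with the standard library alone. The stop-word list, the function names `normalise` and `drop_frequent_and_rare`, and the thresholds `n_frequent` and `min_count` are all hypothetical, chosen only to keep the example small. Spelling correction, stemming, and lemmatization are left out because they normally rely on an external library.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"is", "a", "am", "are", "the", "and", "was", "to", "of"}

def normalise(text):
    """Lower-case, strip punctuation, and drop stop words; returns a token list."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

def drop_frequent_and_rare(docs, n_frequent=1, min_count=2):
    """Remove the n_frequent most common words and words seen fewer than min_count times."""
    counts = Counter(w for doc in docs for w in doc)
    frequent = {w for w, _ in counts.most_common(n_frequent)}
    return [[w for w in doc if w not in frequent and counts[w] >= min_count]
            for doc in docs]

corpus = ["The phone is great!", "Great phone, great battery.", "The battery is weak."]
tokens = [normalise(doc) for doc in corpus]
cleaned = drop_frequent_and_rare(tokens)
print(tokens)
print(cleaned)
```

On a corpus this small the thresholds remove most of the vocabulary; on real data they would be tuned so that only the extremes of the frequency distribution are dropped.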
Text Summarisation:
Text summarisation is the process of creating a short, coherent, and fluent summary of a longer text document by outlining its major points. Auto Summarise Text: extract some important sentences from the whole text and combine them to form the abstract.
Word importance and sentence significance are measured as shown below:
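As one illustration of extractive summarisation (an assumption, not necessarily this framework's exact scoring), word importance can be taken as word frequency normalised by the most frequent word, and sentence significance as the mean importance of the words in the sentence. The function name `summarise` and the regular-expression sentence splitter are hypothetical simplifications.

```python
import re
from collections import Counter

def summarise(text, n_sentences=1):
    """Pick the n_sentences highest-scoring sentences, kept in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenised = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenised for w in words)
    top = max(counts.values()) if counts else 1
    # Word importance: frequency normalised by the most frequent word.
    importance = {w: c / top for w, c in counts.items()}
    # Sentence significance: mean importance of the words in the sentence.
    scores = [sum(importance[w] for w in words) / len(words) if words else 0.0
              for words in tokenised]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

text = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarise(text, n_sentences=1))
```

In practice the text would be standardised first (stop words removed, words lower-cased) so that common function words do not dominate the importance scores.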

Text Similarity:
Text similarity determines how 'close' two pieces of text are, both in surface closeness (lexical similarity) and in meaning (semantic similarity).
Get cosine similarity between two texts:
Step 1. Convert the text into a vector of numbers (using TF-IDF scores)
a) TF = Term Frequency, the frequency of a word in the given sentence.
b) IDF = Inverse Document Frequency, in its simplest form 1 / (number of times a word appears across all documents); the commonly used form is log(N / n), where N is the total number of documents and n is the number of documents containing the word. This matters because words such as is/am/are/the are present throughout the text and add no value or variability to a sentence, so taking the inverse gives them a lower score. The IDF score can be ignored here, as the stop words and the most frequent words across the corpus have already been removed.
c) TF-IDF score = TF score * IDF score
d) The text_to_vector function returns a { word: frequency } mapping, i.e. the TF scores, and so converts a text to a vector.
Step 2. Calculate the cosine similarity of the two vectors
a) cos_sim(vectA, vectB) = (xa*xb + ya*yb + za*zb) / (sqrt(xa*xa + ya*ya + za*za) * sqrt(xb*xb + yb*yb + zb*zb)), i.e. the dot product of the two vectors divided by the product of their lengths,
where vectA = (xa, ya, za); vectB = (xb, yb, zb)
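The two steps above can be sketched as follows. The names `text_to_vector` and `cos_sim` come from the description, but the bodies shown here are assumptions about how they might be written. As in the text, only TF scores are used and IDF is ignored.

```python
import math
import re
from collections import Counter

def text_to_vector(text):
    """Map a text to {word: frequency}, i.e. its raw TF scores."""
    return Counter(re.findall(r"\w+", text.lower()))

def cos_sim(vect_a, vect_b):
    """Cosine similarity of two sparse {word: frequency} vectors."""
    # Only words present in both texts contribute to the dot product.
    shared = set(vect_a) & set(vect_b)
    dot = sum(vect_a[w] * vect_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vect_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vect_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction, so treat it as dissimilar.
    return dot / (norm_a * norm_b)

score = cos_sim(text_to_vector("good phone good battery"),
                text_to_vector("good phone poor screen"))
print(round(score, 4))
```

Because word frequencies are never negative, the score lies between 0 (no words in common) and 1 (identical word distribution).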

Text Classification:
In text categorization, the system is fed a pre-built set of text examples and their relevant categories. The machine learning algorithm learns how each text is categorized and creates rules for itself. When new text is presented, it applies these rules to assign the new text to one of the learned categories.
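A minimal sketch of this train-then-predict workflow. The description does not name an algorithm, so multinomial Naive Bayes with add-one smoothing is assumed here purely for illustration. The class name `NaiveBayesTextClassifier` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Count documents per category and word occurrences per category.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = ["great phone love it", "excellent battery great screen",
               "terrible phone broke quickly", "poor battery awful screen"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("great screen"))
print(model.predict("awful phone"))
```

The "rules" the classifier learns are the per-category word counts; words never seen in training are skipped at prediction time.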