Skip to content

ExceptionalHandler/NLP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 

Repository files navigation

Group similar log-lines together using TF-IDF

Python script to assist log analysis by grouping together similar lines. Uses TF-IDF weight to compute similarity.

Usage

python3 .\declutter.py -i .\sylog.log

This will produce a sorted log file next to input file.

Blog and explanation here.

Example syslog.log and syslog.log_sorted.txt in repo.

The script uses a CountVectorizer with a word stemmer behind it and also ignores non-English numeric tokens like timestamps by considering only alphabetic tokens.

This is what the word-vectorizer looks like:

stemmer = PorterStemmer()
def get_stemmed_tokens(tokens, stemmer):
    stemmed_tokens = []
    for token in tokens:
        if token.isalpha():
            stemmed_tokens.append(stemmer.stem(token))
    return stemmed_tokens

def get_tokens(string):
    tokens = word_tokenize(string)
    stemmed_tokens = get_stemmed_tokens(tokens, stemmer)
    return stemmed_tokens
vectorizer = ext.CountVectorizer(tokenizer=get_tokens, stop_words='english')

The lines from a document are fed into it as if each line is a document and the whole document is the corpus:

with open(inputFile) as file:
    lines = [line.rstrip() for line in file]
lineNos = dict(zip(range(1, len(lines)), lines))

The tfidf transform of the resulting sparse matrix:

tf_idf_transformer = ext.TfidfTransformer().fit(doc_matrix)
sparse = tf_idf_transformer.transform(doc_matrix).toarray()

Divide total weight of each line with number of words in that line. This will assign densest lines the highest score:

perLineScore  = []
for row in sparse:
    perLineScore.append(row.sum()/len(row.nonzero()[0]))

lineScores = dict(zip(range(1, len(lines)), perLineScore))

lineScores is the dict of line-number and score. Sorting it by the score makes 'densest' lines appear at the top and similar lines appear together.

About

Log Line Clustering using NLTK

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages