
Commit a90854d

Add ULP from ICSME'22
1 parent fcb019d commit a90854d

File tree

13 files changed: +460 -7 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ Logparser provides a machine learning toolkit and benchmarks for automated log parsing
 | ICWS'17 | [Drain](https://github.com/logpai/logparser/tree/main/logparser/Drain#drain) | [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. |
 | ICPC'18 | [MoLFI](https://github.com/logpai/logparser/tree/main/logparser/MoLFI#molfi) | [A Search-based Approach for Accurate Identification of Log Message Formats](http://publications.uni.lu/bitstream/10993/35286/1/ICPC-2018.pdf), by Salma Messaoudi, Annibale Panichella, Domenico Bianculli, Lionel Briand, Raimondas Sasnauskas. |
 | TSE'20 | [Logram](https://github.com/logpai/logparser/tree/main/logparser/Logram#logram) | [Logram: Efficient Log Parsing Using n-Gram Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), by Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. |
+| ICSME'22 | [ULP](https://github.com/logpai/logparser/tree/main/logparser/ULP#ULP) | [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), by Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. |


 :bulb: Welcome to submit a PR to push your parser code to logparser and add your paper to the table.

THIRD_PARTIES.md

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ The logparser package is built on top of the following third-party libraries:
 | MoLFI | https://github.com/SalmaMessaoudi/MoLFI | Apache-2.0 |
 | alignment (LogMine) | https://gist.github.com/aziele/6192a38862ce569fe1b9cbe377339fbe | GPL |
 | Logram | https://github.com/BlueLionLogram/Logram | NA |
+| ULP | https://github.com/SRT-Lab/ULP | MIT |

docs/tools/Drain.md

Lines changed: 0 additions & 2 deletions
@@ -7,8 +7,6 @@ Drain is one of the representative algorithms for log parsing. It can parse logs

 Drain first preprocesses logs according to user-defined domain knowledge, i.e., regexes. Second, Drain starts from the root node of the parse tree with the preprocessed log message. The first-layer nodes in the parse tree represent log groups whose log messages are of different lengths. Third, Drain traverses from a first-layer node to a leaf node, selecting the next internal node by the tokens in the beginning positions of the log message. Then Drain calculates the similarity between the log message and the log event of each log group to decide whether to put the log message into an existing log group. Finally, Drain updates the parse tree by scanning the tokens in the same positions of the log message and the log event.

-
-
 Read more information about Drain from the following paper:

 + Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), *IEEE International Conference on Web Services (ICWS)*, 2017.
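The traversal described above can be condensed into a toy sketch: group by token count, then by the leading token, then merge by token-position similarity. This is an illustrative simplification written for this note, not the logparser Drain implementation; the function names and the flat two-level grouping are invented for brevity.

```python
def seq_similarity(template, tokens):
    # Fraction of positions where template and message agree ("<*>" never matches).
    return sum(t == s for t, s in zip(template, tokens)) / len(tokens)

def drain_like_parse(messages, threshold=0.5):
    # Two-level grouping: token count, then first token (a stand-in for
    # Drain's fixed-depth prefix-token tree).
    groups = {}
    for msg in messages:
        tokens = msg.split()
        key = (len(tokens), tokens[0])
        candidates = groups.setdefault(key, [])
        best = max(candidates, key=lambda t: seq_similarity(t, tokens), default=None)
        if best is not None and seq_similarity(best, tokens) >= threshold:
            # Update the template: positions that disagree become wildcards.
            merged = [t if t == s else "<*>" for t, s in zip(best, tokens)]
            candidates[candidates.index(best)] = merged
        else:
            candidates.append(tokens)
    return groups

templates = drain_like_parse([
    "Receiving block blk_1 src 10.0.0.1",
    "Receiving block blk_2 src 10.0.0.2",
])
print(templates)  # {(5, 'Receiving'): [['Receiving', 'block', '<*>', 'src', '<*>']]}
```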

docs/tools/LKE.md

Lines changed: 2 additions & 2 deletions
@@ -1,13 +1,13 @@
 LKE
 ===

-LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
+LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine-tuning, log keys are generated from the resulting clusters.

 **Step 1**: Log clustering. Weighted edit distance is designed to evaluate the similarity between two logs: WED = \sum_{i=1}^{n} \frac{1}{1+e^{x_i - v}}, where n is the number of edit operations needed to make the two logs the same, x_i is the column index of the word edited by the i-th operation, and v is a parameter that controls the weight. LKE links two logs if the WED between them is less than a threshold \sigma. After going through all pairs of logs, each connected component is regarded as a cluster. The threshold \sigma is calculated automatically by applying K-means clustering to the WEDs of all pairs of logs to separate them into two groups; the largest distance in the group containing the smaller WEDs is selected as the value of \sigma.

 **Step 2**: Cluster splitting. In this step, some clusters are further partitioned. LKE first finds the longest common sequence (LCS) of all the logs in the same cluster. The remaining parts of the logs are dynamic parts separated by common words, such as “/10.251.43.210:55700” or “blk_904791815409399662”. The number of unique words in each dynamic-part column, denoted |DP|, is counted. For example, |DP| = 2 for the dynamic-part column between “src:” and “dest:” in log 2 and log 3. If the smallest |DP| is less than a threshold \phi, LKE uses this dynamic-part column to partition the cluster.

-**Step 3**: Log template extraction. This step is similar to the step 4 of IPLoM. The only difference is that LKE removes all variables when they generate log templates, instead of representing them by wildcards.
+**Step 3**: Log template extraction. This step is similar to step 4 of IPLoM. The only difference is that LKE removes all variables when it generates log templates, instead of representing them by wildcards.

 Read more information about LKE from the following paper:
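As a minimal sketch of the Step-1 metric, assume substitution-only edits between two equal-length token sequences (a simplification of the general edit distance in the paper; the function below is illustrative, not part of logparser):

```python
import math

def wed(log_a, log_b, v=5.0):
    """Weighted edit distance for equal-length token lists: each differing
    column i contributes 1 / (1 + e^(i - v)), so edits near the front of the
    message (likely constant text) weigh more than edits near the end
    (likely variables)."""
    a, b = log_a.split(), log_b.split()
    assert len(a) == len(b), "sketch handles substitutions only"
    return sum(1.0 / (1.0 + math.exp(i - v))
               for i, (x, y) in enumerate(zip(a, b), start=1) if x != y)

print(wed("Receiving block blk_1 src: /10.0.0.1",
          "Receiving block blk_2 src: /10.0.0.2"))  # ~1.38 with v=5
```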

logparser/LKE/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # LKE

-LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
+LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.

 Read more information about LKE from the following paper:

logparser/ULP/README.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# ULP

ULP (Universal Log Parsing) is a highly accurate log parsing tool with the ability to extract templates from unstructured log data. ULP learns from sample log data to recognize future log events. It combines pattern matching and frequency analysis techniques. First, log events are organized into groups using a text processing method. Frequency analysis is then applied locally to instances of the same group to identify the static and dynamic content of log events. When applied to 10 log datasets of the Loghub benchmark, ULP achieves an average accuracy of 89.2%, which outperforms the accuracy of four leading log parsing tools, namely Drain, Logram, Spell and AEL. Additionally, ULP can parse up to four million log events in less than 3 minutes. ULP can be readily used by practitioners and researchers to parse large log files effectively and efficiently so as to support log analysis tasks.
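The local frequency analysis can be illustrated with a small sketch (a simplified restatement of the idea, mirroring `getDynamicVars2` in the `ULP.py` added by this commit; the helper name and example logs are invented):

```python
from collections import Counter

def dynamic_tokens(group):
    """Tokens occurring less often than the most frequent token in a group
    of same-shape messages are treated as dynamic content."""
    counts = Counter(tok for msg in group for tok in set(msg.split()))
    top = max(counts.values())
    return {tok for tok, c in counts.items() if c < top}

group = ["Connected to node-1 port 80",
         "Connected to node-2 port 443"]
print(dynamic_tokens(group))  # {'node-1', 'node-2', '80', '443'} (order may vary)
```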
Read more information about ULP from the following paper:

+ Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), *Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME)*, 2022.
### Running

The code has been tested in the following environment:
+ python 3.7.6
+ regex 2022.3.2
+ pandas 1.0.1
+ numpy 1.18.1
+ scipy 1.4.1

Run the following script to start the demo:

```
python demo.py
```
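For orientation, here is a minimal sketch of what such a demo invocation looks like with the `LogParser` class added in this commit; the HDFS-style `log_format` string and the paths are illustrative assumptions, not taken from `demo.py`:

```python
from logparser.ULP import LogParser

# Assumed HDFS-style format; each <Field> becomes a column, <Content> is parsed.
log_format = "<Date> <Time> <Pid> <Level> <Component>: <Content>"

parser = LogParser(log_format, indir="../logs/HDFS/", outdir="demo_result/")
# Writes demo_result/HDFS_2k.log_structured.csv with EventId/EventTemplate columns.
parser.parse("HDFS_2k.log")
```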
Run the following script to execute the benchmark:

```
python benchmark.py
```
### Benchmark

Running the benchmark script on the Loghub_2k datasets, you can obtain the following results.

| Dataset     | F1_measure | Accuracy |
|:-----------:|:-----------|:---------|
| HDFS        | 0.999984   | 0.9975   |
| Hadoop      | 0.999923   | 0.9895   |
| Spark       | 0.994593   | 0.922    |
| Zookeeper   | 0.999876   | 0.9925   |
| BGL         | 0.999453   | 0.93     |
| HPC         | 0.994433   | 0.9505   |
| Thunderbird | 0.998665   | 0.6755   |
| Windows     | 0.989051   | 0.41     |
| Linux       | 0.476099   | 0.3635   |
| Android     | 0.971417   | 0.838    |
| HealthApp   | 0.993431   | 0.9015   |
| Apache      | 1          | 1        |
| Proxifier   | 0.739766   | 0.024    |
| OpenSSH     | 0.939796   | 0.434    |
| OpenStack   | 0.834337   | 0.4915   |
| Mac         | 0.981294   | 0.814    |
### Citation

:telescope: If you use our logparser tools or benchmarking results in your publication, please kindly cite the following papers.

+ [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019.
+ [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016.

logparser/ULP/ULP.py

Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@

```python
# =========================================================================
# This file is modified from https://github.com/SRT-Lab/ULP
#
# MIT License
# Copyright (c) 2022 Universal Log Parser
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# =========================================================================

import os
import time
import warnings
from collections import Counter
from string import punctuation

import pandas as pd
import regex as re

warnings.filterwarnings("ignore")


class LogParser:
    def __init__(self, log_format, indir="./", outdir="./result/", rex=[]):
        """
        Attributes
        ----------
        rex : regular expressions used in preprocessing (step 1)
        path : the input path that stores the input log file name
        logname : the name of the input file containing raw log messages
        savePath : the output path that stores the file containing structured logs
        """
        self.path = indir
        self.indir = indir
        self.outdir = outdir
        self.logname = None
        self.savePath = outdir
        self.df_log = None
        self.log_format = log_format
        self.rex = rex

    def tokenize(self):
        """Replace obvious dynamic variables (dates, times, MAC/hex values,
        domain-like strings) with <*> and normalize separators."""
        event_label = []
        for idx, log in self.df_log["Content"].iteritems():
            # The token list is rendered as a string and cleaned of quotes,
            # backslashes and special characters.
            tokens = log.split()
            tokens = re.sub(r"\\", "", str(tokens))
            tokens = re.sub(r"\'", "", str(tokens))
            tokens = tokens.translate({ord(c): "" for c in "!@#$%^&*{}<>?\\|`~"})

            re_list = [
                r"([\da-fA-F]{2}:){5}[\da-fA-F]{2}",  # MAC address
                r"\d{4}-\d{2}-\d{2}",  # date, dash-separated
                r"\d{4}\/\d{2}\/\d{2}",  # date, slash-separated
                r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:[.,][0-9]{3})?",  # time with millis
                r"[0-9]{2}:[0-9]{2}:[0-9]{2}",
                r"[0-9]{2}:[0-9]{2}",
                r"0[xX][0-9a-fA-F]+",  # hex literal
                r"([\(]?[0-9a-fA-F]*:){8,}[\)]?",  # long colon-separated IDs
                r"^(?:[0-9]{4}-[0-9]{2}-[0-9]{2})(?:[ ][0-9]{2}:[0-9]{2}:[0-9]{2})?(?:[.,][0-9]{3})?",
                r"(\/|)([a-zA-Z0-9-]+\.){2,}([a-zA-Z0-9-]+)?(:[a-zA-Z0-9-]+|)(:|)",  # domain-like
            ]

            pat = r"\b(?:{})\b".format("|".join(str(v) for v in re_list))
            tokens = re.sub(pat, "<*>", str(tokens))
            # Pad separators with spaces so they become standalone tokens.
            tokens = tokens.replace("=", " = ")
            tokens = tokens.replace(")", " ) ")
            tokens = tokens.replace("(", " ( ")
            tokens = tokens.replace("]", " ] ")
            tokens = tokens.replace("[", " [ ")
            event_label.append(str(tokens).lstrip().replace(",", " "))

        self.df_log["event_label"] = event_label
        return 0

    def getDynamicVars2(self, petit_group):
        """Local frequency analysis: within one group, tokens occurring less
        often than the most frequent token are treated as dynamic variables."""
        petit_group["event_label"] = petit_group["event_label"].map(
            lambda x: " ".join(dict.fromkeys(x.split()))  # drop duplicate tokens
        )
        petit_group["event_label"] = petit_group["event_label"].map(
            lambda x: " ".join(
                filter(None, (word.strip(punctuation) for word in x.split()))
            )
        )

        lst = petit_group["event_label"].values.tolist()

        vec = []
        big_lst = " ".join(v for v in lst)
        this_count = Counter(big_lst.split())

        if this_count:
            max_val = max(this_count, key=this_count.get)
            for word in this_count:
                if this_count[word] < this_count[max_val]:
                    vec.append(word)

        return vec

    def remove_word_with_special(self, sentence):
        """Build a group key: concatenate the purely alphabetic tokens and
        append the token count, so that messages sharing static words and
        length fall into the same group."""
        sentence = sentence.translate(
            {ord(c): "" for c in "!@#$%^&*()[]{};:,/<>?\\|`~-=+"}
        )
        length = len(sentence.split())

        finale = ""
        for word in sentence.split():
            if (
                not any(ch.isdigit() for ch in word)
                and not any(not c.isalnum() for c in word)
                and len(word) > 1
            ):
                finale += word

        finale = finale + str(length)
        return finale

    def outputResult(self):
        self.df_log.to_csv(
            os.path.join(self.savePath, self.logname + "_structured.csv"), index=False
        )

    def load_data(self):
        headers, regex = self.generate_logformat_regex(self.log_format)
        self.df_log = self.log_to_dataframe(
            os.path.join(self.path, self.logname), regex, headers, self.log_format
        )

    def generate_logformat_regex(self, logformat):
        """Generate a regular expression to split log messages into the
        headers declared in the log format, e.g. <Date> <Time> <Content>."""
        headers = []
        splitters = re.split(r"(<[^<>]+>)", logformat)
        regex = ""
        for k in range(len(splitters)):
            if k % 2 == 0:
                splitter = re.sub(" +", r"\\s+", splitters[k])
                regex += splitter
            else:
                header = splitters[k].strip("<").strip(">")
                regex += "(?P<%s>.*?)" % header
                headers.append(header)
        regex = re.compile("^" + regex + "$")
        return headers, regex

    def log_to_dataframe(self, log_file, regex, headers, logformat):
        """Transform a raw log file into a dataframe, one row per message."""
        log_messages = []
        linecount = 0
        with open(log_file, "r") as fin:
            for line in fin.readlines():
                try:
                    match = regex.search(line.strip())
                    message = [match.group(header) for header in headers]
                    log_messages.append(message)
                    linecount += 1
                except Exception:
                    print("[Warning] Skip line: " + line)
        logdf = pd.DataFrame(log_messages, columns=headers)
        logdf.insert(0, "LineId", None)
        logdf["LineId"] = [i + 1 for i in range(linecount)]
        return logdf

    def parse(self, logname):
        start_timeBig = time.time()
        print("Parsing file: " + os.path.join(self.path, logname))

        self.logname = logname

        # Dataset-specific preprocessing regexes (unused in this version).
        regex = [r"blk_-?\d+", r"(\d+\.){3}\d+(:\d+)?"]

        self.load_data()
        # Note: parsing operates on a random sample of 2,000 log messages.
        self.df_log = self.df_log.sample(n=2000)
        self.tokenize()
        # Group messages that share the same static words and token count.
        self.df_log["EventId"] = self.df_log["event_label"].map(
            lambda x: self.remove_word_with_special(str(x))
        )
        groups = self.df_log.groupby("EventId")
        keys = groups.groups.keys()
        stock = pd.DataFrame()
        count = 0

        # Catch leftover standalone and quoted integers.
        re_list2 = [r"[ ]{1,}[-]*[0-9]+[ ]{1,}", r' "\d+" ']
        generic_re = re.compile("|".join(re_list2))

        for i in keys:
            l = []
            slc = groups.get_group(i)

            # The first message of the group is the template candidate.
            template = slc["event_label"][0:1].to_list()[0]
            count += 1
            if slc.size > 1:
                # Frequency analysis over the first (up to) 10 group members.
                l = self.getDynamicVars2(slc.head(10))
                pat = r"\b(?:{})\b".format("|".join(str(v) for v in l))
                if len(l) > 0:
                    template = template.lower()
                    template = re.sub(pat, "<*>", template)

            template = re.sub(generic_re, " <*> ", template)
            slc["event_label"] = [template] * len(slc["event_label"].to_list())

            stock = stock.append(slc)  # pandas 1.x API
        stock = stock.sort_index()

        self.df_log = stock

        self.df_log["EventTemplate"] = self.df_log["event_label"]
        if not os.path.exists(self.savePath):
            os.makedirs(self.savePath)
        self.df_log.to_csv(
            os.path.join(self.savePath, logname + "_structured.csv"), index=False
        )
        elapsed_timeBig = time.time() - start_timeBig
        print(f"Parsing done in {elapsed_timeBig} sec")
        return 0
```

logparser/ULP/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
from .ULP import *
