
Commit fcb019d

Add Logram from TSE'20

1 parent 93d94a6 · commit fcb019d

14 files changed: +559 −5 lines changed

README.md

Lines changed: 4 additions & 3 deletions

@@ -5,7 +5,7 @@
 <div>
 <a href="https://pypi.org/project/logparser3"><img src="https://img.shields.io/badge/python-3.6+-blue" style="max-width: 100%;" alt="Python version"></a>
 <a href="https://pypi.org/project/logparser3"><img src="https://img.shields.io/pypi/v/logparser3.svg" style="max-width: 100%;" alt="Pypi version"></a>
-<a href="https://github.com/logpai/logparser/actions/workflows/ci.yml"><img src="https://github.com/logpai/logparser/workflows/CI/badge.svg" style="max-width: 100%;" alt="Pypi version"></a>
+<a href="https://github.com/logpai/logparser/actions/workflows/ci.yml"><img src="https://github.com/logpai/logparser/workflows/CI/badge.svg?event=push" style="max-width: 100%;" alt="Pypi version"></a>
 <a href="https://pepy.tech/project/logparser3"><img src="https://static.pepy.tech/badge/logparser3" style="max-width: 100%;" alt="Downloads"></a>
 <a href="https://github.com/logpai/logparser/blob/main/LICENSE"><img src="https://img.shields.io/github/license/logpai/logparser.svg" style="max-width: 100%;" alt="License"></a>
 <a href="https://github.com/logpai/logparser#discussion"><img src="https://img.shields.io/badge/chat-wechat-brightgreen?style=flat" style="max-width: 100%;" alt="License"></a>

@@ -22,7 +22,7 @@ Logparser provides a machine learning toolkit and benchmarks for automated log p
 
 ### 🌈 New updates
 
-+ Since the first release of logparser, many PRs and issues have been submitted due to incompatibility with Python 3. Finally, we update logparser v1.0.0 with support for Python 3. Thanks for all the contributions! ([#PR86](https://github.com/logpai/logparser/pull/86), [#PR85](https://github.com/logpai/logparser/pull/85), [#PR83](https://github.com/logpai/logparser/pull/83), [#PR80](https://github.com/logpai/logparser/pull/80), [#PR65](https://github.com/logpai/logparser/pull/65), [#PR57](https://github.com/logpai/logparser/pull/57), [#PR53](https://github.com/logpai/logparser/pull/53), [#PR52](https://github.com/logpai/logparser/pull/52), [#PR51](https://github.com/logpai/logparser/pull/51), [#PR49](https://github.com/logpai/logparser/pull/49), [#PR18](https://github.com/logpai/logparser/pull/18), [#PR22](https://github.com/logpai/logparser/pull/22))
++ Since the first release of logparser, many PRs and issues have been submitted due to incompatibility with Python 3. Finally, we update logparser v1.0.0 with support for Python 3. Thanks for all the contributions ([#PR86](https://github.com/logpai/logparser/pull/86), [#PR85](https://github.com/logpai/logparser/pull/85), [#PR83](https://github.com/logpai/logparser/pull/83), [#PR80](https://github.com/logpai/logparser/pull/80), [#PR65](https://github.com/logpai/logparser/pull/65), [#PR57](https://github.com/logpai/logparser/pull/57), [#PR53](https://github.com/logpai/logparser/pull/53), [#PR52](https://github.com/logpai/logparser/pull/52), [#PR51](https://github.com/logpai/logparser/pull/51), [#PR49](https://github.com/logpai/logparser/pull/49), [#PR18](https://github.com/logpai/logparser/pull/18), [#PR22](https://github.com/logpai/logparser/pull/22))!
 + We build the package wheel logparser3 and release it on pypi. Please install via `pip install logparser3`.
 + We refactor the code structure and beautify the code via the Python code formatter black.
 

@@ -43,6 +43,7 @@ Logparser provides a machine learning toolkit and benchmarks for automated log p
 | ICDM'16 | [Spell](https://github.com/logpai/logparser/tree/main/logparser/Spell#spell) | [Spell: Streaming Parsing of System Event Logs](https://www.cs.utah.edu/~lifeifei/papers/spell.pdf), by Min Du, Feifei Li. |
 | ICWS'17 | [Drain](https://github.com/logpai/logparser/tree/main/logparser/Drain#drain) | [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.|
 | ICPC'18 | [MoLFI](https://github.com/logpai/logparser/tree/main/logparser/MoLFI#molfi) | [A Search-based Approach for Accurate Identification of Log Message Formats](http://publications.uni.lu/bitstream/10993/35286/1/ICPC-2018.pdf), by Salma Messaoudi, Annibale Panichella, Domenico Bianculli, Lionel Briand, Raimondas Sasnauskas. |
+| TSE'20 | [Logram](https://github.com/logpai/logparser/tree/main/logparser/Logram#logram) | [Logram: Efficient Log Parsing Using n-Gram Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), by Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. |
 
 :bulb: Welcome to submit a PR to push your parser code to logparser and add your paper to the table.
 

@@ -121,7 +122,7 @@ The main goal of logparser is used for research and benchmark purpose. Researche
 
 + Please be aware of the licenses of [third-party libraries](https://github.com/logpai/logparser/blob/main/THIRD_PARTIES.md) used in logparser. We suggest to keep one parser and delete the others and then re-build the package wheel. This would not break the use of logparser.
 + Please enhance logparser with efficiency and scalability with multi-processing, add failure recovery, add persistence to disk or message queue Kafka.
-+ [Drain3](https://github.com/logpai/Drain3) provides a good example for your reference that is built with [practical enhancements] for production scenarios.
++ [Drain3](https://github.com/logpai/Drain3) provides a good example for your reference that is built with [practical enhancements](https://github.com/logpai/Drain3#new-features) for production scenarios.
 
 ### Citation
 👋 If you use our logparser tools or benchmarking results in your publication, please cite the following papers.
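
The update note above points readers at the published wheel (`pip install logparser3`). A quick, purely illustrative smoke test of that package might look like the sketch below; Drain is used because it already ships with logparser3, while the Logram module introduced in this commit is only importable from a source checkout or a later release that includes it.

```python
# Hypothetical smoke test after `pip install logparser3` (not part of this commit).
# Drain is one of the parsers already bundled in the package; from a checkout of
# this commit you could instead do: from logparser.Logram import LogParser
from logparser.Drain import LogParser

print(LogParser)  # confirms the package installed and the parser module imports
```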

THIRD_PARTIES.md

Lines changed: 1 addition & 0 deletions

@@ -7,3 +7,4 @@ The logparser package is built on top of the following third-party libraries:
 | LenMa | https://github.com/keiichishima/templateminer | BSD |
 | MoLFI | https://github.com/SalmaMessaoudi/MoLFI | Apache-2.0 |
 | alignment (LogMine) | https://gist.github.com/aziele/6192a38862ce569fe1b9cbe377339fbe | GPL |
+| Logram | https://github.com/BlueLionLogram/Logram | NA |

logparser/Logram/README.md

Lines changed: 60 additions & 0 deletions

# Logram

Logram is an automated log parsing technique that leverages n-gram dictionaries to achieve efficient log parsing.

Read more about Logram in the following paper:

+ Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. [Logram: Efficient Log Parsing Using n-Gram Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), *IEEE Transactions on Software Engineering (TSE)*, 2020.

### Running

The code has been tested in the following environment:
+ python 3.7.6
+ regex 2022.3.2
+ pandas 1.0.1
+ numpy 1.18.1
+ scipy 1.4.1

Run the following script to start the demo:

```
python demo.py
```

Run the following script to execute the benchmark:

```
python benchmark.py
```

### Benchmark

Running the benchmark script on the Loghub_2k datasets yields the following results.

| Dataset     | F1_measure | Accuracy |
|:-----------:|:-----------|:---------|
| HDFS        | 0.990518   | 0.93     |
| Hadoop      | 0.78249    | 0.451    |
| Spark       | 0.479691   | 0.282    |
| Zookeeper   | 0.923936   | 0.7235   |
| BGL         | 0.956032   | 0.587    |
| HPC         | 0.993748   | 0.9105   |
| Thunderbird | 0.993876   | 0.554    |
| Windows     | 0.913735   | 0.694    |
| Linux       | 0.541378   | 0.361    |
| Android     | 0.975017   | 0.7945   |
| HealthApp   | 0.587935   | 0.2665   |
| Apache      | 0.637665   | 0.3125   |
| Proxifier   | 0.750476   | 0.5035   |
| OpenSSH     | 0.979348   | 0.6115   |
| OpenStack   | 0.742866   | 0.3255   |
| Mac         | 0.892896   | 0.568    |

### Citation

:telescope: If you use our logparser tools or benchmarking results in your publication, please cite the following papers.

+ [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019.
+ [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016.
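
The Logram README above names the technique without explaining it, so here is a small, hedged sketch of the n-gram dictionary idea from the paper (an assumed simplification for illustration, not the implementation shipped under `logparser/Logram/src/`): tokens covered by frequent 2-grams or 3-grams are treated as static template text, while tokens covered only by rare n-grams are treated as dynamic parameters, with the cut-offs playing the role of `doubleThreshold`/`triThreshold` in `benchmark.py`.

```python
# Illustrative sketch of Logram's n-gram dictionary idea (assumed simplification).
from collections import Counter


def build_ngram_dicts(token_lists):
    """Count 2-grams and 3-grams of tokens across all log lines."""
    two_grams, three_grams = Counter(), Counter()
    for tokens in token_lists:
        two_grams.update(zip(tokens, tokens[1:]))
        three_grams.update(zip(tokens, tokens[1:], tokens[2:]))
    return two_grams, three_grams


def looks_dynamic(tokens, i, two_grams, three_grams, double_th, tri_th):
    """Flag a token as dynamic if every n-gram covering it is rare."""
    tri_hits = [tuple(tokens[j:j + 3]) for j in (i - 2, i - 1, i) if j >= 0]
    two_hits = [tuple(tokens[j:j + 2]) for j in (i - 1, i) if j >= 0]
    frequent = any(three_grams[g] >= tri_th for g in tri_hits if len(g) == 3) or \
               any(two_grams[g] >= double_th for g in two_hits if len(g) == 2)
    return not frequent


lines = [
    "Received block blk_1 of size 67108864 from 10.0.0.1".split(),
    "Received block blk_2 of size 67108864 from 10.0.0.2".split(),
]
two_g, three_g = build_ngram_dicts(lines)
tokens = lines[0]
template = ["<*>" if looks_dynamic(tokens, i, two_g, three_g, 2, 2) else t
            for i, t in enumerate(tokens)]
print(" ".join(template))  # Received block <*> of size 67108864 from <*>
```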

logparser/Logram/__init__.py

Lines changed: 1 addition & 0 deletions

from .src.Logram import *

logparser/Logram/benchmark.py

Lines changed: 183 additions & 0 deletions

# =========================================================================
# Copyright (C) 2016-2023 LOGPAI (https://github.com/logpai).
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =========================================================================


import sys

sys.path.append("../../")
from logparser.Logram import LogParser
from logparser.utils import evaluator
import os
import pandas as pd


input_dir = "../../data/loghub_2k/"  # The input directory of log file
output_dir = "Logram_result/"  # The output directory of parsing results

benchmark_settings = {
    "HDFS": {
        "log_file": "HDFS/HDFS_2k.log",
        "log_format": "<Date> <Time> <Pid> <Level> <Component>: <Content>",
        "regex": [
            r"blk_(|-)[0-9]+",  # block id
            r"(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)",  # IP
            r"(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$",
        ],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Hadoop": {
        "log_file": "Hadoop/Hadoop_2k.log",
        "log_format": "<Date> <Time> <Level> \[<Process>\] <Component>: <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 9,
        "triThreshold": 10,
    },
    "Spark": {
        "log_file": "Spark/Spark_2k.log",
        "log_format": "<Date> <Time> <Level> <Component>: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"\b[KGTM]?B\b", r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Zookeeper": {
        "log_file": "Zookeeper/Zookeeper_2k.log",
        "log_format": "<Date> <Time> - <Level> \[<Node>:<Component>@<Id>\] - <Content>",
        "regex": [r"(/|)(\d+\.){3}\d+(:\d+)?"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "BGL": {
        "log_file": "BGL/BGL_2k.log",
        "log_format": "<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> <Type> <Component> <Level> <Content>",
        "regex": [r"core\.\d+"],
        "doubleThreshold": 92,
        "triThreshold": 4,
    },
    "HPC": {
        "log_file": "HPC/HPC_2k.log",
        "log_format": "<LogId> <Node> <Component> <State> <Time> <Flag> <Content>",
        "regex": [r"=\d+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Thunderbird": {
        "log_file": "Thunderbird/Thunderbird_2k.log",
        "log_format": "<Label> <Timestamp> <Date> <User> <Month> <Day> <Time> <Location> <Component>(\[<PID>\])?: <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 35,
        "triThreshold": 32,
    },
    "Windows": {
        "log_file": "Windows/Windows_2k.log",
        "log_format": "<Date> <Time>, <Level> <Component> <Content>",
        "regex": [r"0x.*?\s"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Linux": {
        "log_file": "Linux/Linux_2k.log",
        "log_format": "<Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"\d{2}:\d{2}:\d{2}"],
        "doubleThreshold": 120,
        "triThreshold": 100,
    },
    "Android": {
        "log_file": "Android/Android_2k.log",
        "log_format": "<Date> <Time> <Pid> <Tid> <Level> <Component>: <Content>",
        "regex": [
            r"(/[\w-]+)+",
            r"([\w-]+\.){2,}[\w-]+",
            r"\b(\-?\+?\d+)\b|\b0[Xx][a-fA-F\d]+\b|\b[a-fA-F\d]{4,}\b",
        ],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "HealthApp": {
        "log_file": "HealthApp/HealthApp_2k.log",
        "log_format": "<Time>\|<Component>\|<Pid>\|<Content>",
        "regex": [],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Apache": {
        "log_file": "Apache/Apache_2k.log",
        "log_format": "\[<Time>\] \[<Level>\] <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Proxifier": {
        "log_file": "Proxifier/Proxifier_2k.log",
        "log_format": "\[<Time>\] <Program> - <Content>",
        "regex": [
            r"<\d+\ssec",
            r"([\w-]+\.)+[\w-]+(:\d+)?",
            r"\d{2}:\d{2}(:\d{2})*",
            r"[KGTM]B",
        ],
        "doubleThreshold": 500,
        "triThreshold": 470,
    },
    "OpenSSH": {
        "log_file": "OpenSSH/OpenSSH_2k.log",
        "log_format": "<Date> <Day> <Time> <Component> sshd\[<Pid>\]: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 88,
        "triThreshold": 81,
    },
    "OpenStack": {
        "log_file": "OpenStack/OpenStack_2k.log",
        "log_format": "<Logrecord> <Date> <Time> <Pid> <Level> <Component> \[<ADDR>\] <Content>",
        "regex": [r"((\d+\.){3}\d+,?)+", r"/.+?\s", r"\d+"],
        "doubleThreshold": 30,
        "triThreshold": 25,
    },
    "Mac": {
        "log_file": "Mac/Mac_2k.log",
        "log_format": "<Month> <Date> <Time> <User> <Component>\[<PID>\]( \(<Address>\))?: <Content>",
        "regex": [r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 2,
        "triThreshold": 2,
    },
}

benchmark_result = []
for dataset, setting in benchmark_settings.items():
    print("\n=== Evaluation on %s ===" % dataset)
    indir = os.path.join(input_dir, os.path.dirname(setting["log_file"]))
    log_file = os.path.basename(setting["log_file"])

    parser = LogParser(
        log_format=setting["log_format"],
        indir=indir,
        outdir=output_dir,
        rex=setting["regex"],
        doubleThreshold=setting["doubleThreshold"],
        triThreshold=setting["triThreshold"],
    )
    parser.parse(log_file)

    F1_measure, accuracy = evaluator.evaluate(
        groundtruth=os.path.join(indir, log_file + "_structured.csv"),
        parsedresult=os.path.join(output_dir, log_file + "_structured.csv"),
    )
    benchmark_result.append([dataset, F1_measure, accuracy])

print("\n=== Overall evaluation results ===")
df_result = pd.DataFrame(benchmark_result, columns=["Dataset", "F1_measure", "Accuracy"])
df_result.set_index("Dataset", inplace=True)
print(df_result)
df_result.to_csv("Logram_benchmark_result.csv", float_format="%.6f")
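
When tuning a dataset's `doubleThreshold`/`triThreshold` values above, it is often handy to run a single dataset rather than the full benchmark. A small assumed tweak (not part of this commit) is to filter the settings dict before the evaluation loop:

```python
# Hypothetical convenience tweak for benchmark.py: evaluate only one dataset,
# e.g. while experimenting with the HDFS thresholds.
only = "HDFS"
benchmark_settings = {k: v for k, v in benchmark_settings.items() if k == only}
```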

logparser/Logram/demo.py

Lines changed: 22 additions & 0 deletions

#!/usr/bin/env python

import sys

sys.path.append('../../')
from logparser.Logram import LogParser

input_dir = '../../data/loghub_2k/HDFS/'  # The input directory of log file
output_dir = 'demo_result/'  # The output directory of parsing results
log_file = 'HDFS_2k.log'  # The input log file name
log_format = '<Date> <Time> <Pid> <Level> <Component>: <Content>'  # HDFS log format
# Regular expression list for optional preprocessing (default: [])
regex = [
    r'blk_(|-)[0-9]+',  # block id
    r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)',  # IP
    r'(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$',  # Numbers
]
doubleThreshold = 15
triThreshold = 10

parser = LogParser(log_format, indir=input_dir, outdir=output_dir, rex=regex,
                   doubleThreshold=doubleThreshold, triThreshold=triThreshold)
parser.parse(log_file)
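
Once `demo.py` finishes, the parsed output lands in `demo_result/`. A quick way to inspect it is sketched below; the file name follows `benchmark.py`, which reads `<log_file>_structured.csv` from the output directory, and the exact column set is an assumption based on the usual logparser structured-output schema.

```python
# Hypothetical follow-up to demo.py: peek at the structured parsing result.
import pandas as pd

df = pd.read_csv("demo_result/HDFS_2k.log_structured.csv")
print(df.head())  # one row per log line, with the extracted template fields
```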

logparser/Logram/requirements.txt

Lines changed: 4 additions & 0 deletions

pandas
regex==2022.3.2
numpy
scipy

logparser/Logram/src/Common.py

Lines changed: 50 additions & 0 deletions

"""
This file is modified from:
https://github.com/BlueLionLogram/Logram/tree/master/Evaluation
"""

import regex as re

MyRegex = [
    r"blk_(|-)[0-9]+",  # block id
    r"(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)",  # IP
    r"(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$",  # Numbers
]


def preprocess(logLine, specialRegex):
    # Mask dynamic fields (block ids, IPs, numbers, ...) with the <*> wildcard.
    # Note: each pattern is applied to the original logLine, so the returned
    # line reflects only the last pattern in specialRegex.
    line = logLine
    for regex in specialRegex:
        line = re.sub(regex, "<*>", " " + logLine)
    return line


def tokenSpliter(logLine, regex, specialRegex):
    # Extract the free-text <Content> field using the header regex, mask its
    # dynamic fields, and split the result into whitespace tokens.
    match = regex.search(logLine.strip())
    if match is None:
        tokens = None
        message = None  # keep the return statement valid when no match is found
    else:
        message = match.group("Content")
        line = preprocess(message, specialRegex)
        tokens = line.strip().split()
    return tokens, message


def regexGenerator(logformat):
    # Turn a log format string such as "<Date> <Time> <Content>" into a compiled
    # regex with one named group per header field; spaces match any whitespace.
    headers = []
    splitters = re.split(r"(<[^<>]+>)", logformat)
    regex = ""
    for k in range(len(splitters)):
        if k % 2 == 0:
            splitter = re.sub(" +", "\\\s+", splitters[k])
            regex += splitter
        else:
            header = splitters[k].strip("<").strip(">")
            regex += "(?P<%s>.*?)" % header
            headers.append(header)
    regex = re.compile("^" + regex + "$")
    return regex
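
To make the helpers above concrete, here is a small usage sketch with a made-up HDFS-style line; the format string and masking regexes mirror `demo.py`, and the snippet is assumed to run from the `logparser/Logram/src/` directory so that `Common` imports directly.

```python
# Illustrative use of regexGenerator and tokenSpliter from Common.py
# (hypothetical log line; the masked output depends on the regexes supplied).
from Common import MyRegex, regexGenerator, tokenSpliter

line_regex = regexGenerator("<Date> <Time> <Pid> <Level> <Component>: <Content>")
log_line = ("081109 203615 148 INFO dfs.DataNode$PacketResponder: "
            "Received block blk_38865049064139660 of size 67108864 from /10.251.42.84")

tokens, content = tokenSpliter(log_line, line_regex, MyRegex)
print(content)  # the raw <Content> field
print(tokens)   # whitespace tokens with matched dynamic parts replaced by <*>
```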
