---
title: Intelligent Logging Pipeline
project: Intelligent Log Analysis for the HSF Conditions Database
author: Osama Ahmed Tahir
date: 06.09.2025
year: 2025
layout: blog_post
intro: |
  An intelligent logging pipeline for NopayloadDB that integrates log
  aggregation, anomaly detection, and monitoring. It improves reliability
  and maintainability in large-scale HEP experiments. It is deployed on a
  Minikube cluster and uses a deep learning model for real-time anomaly
  detection.
---

##### Mentors: Ruslan Mashinistov, Michel Hernandez Villanueva, and John S. De Stefano Jr

##### Proposal: [Google Summer of Code Proposal](https://drive.google.com/file/d/1Gg9qLOUrT1eHkReHZSS9fJUuP3-znHuW/view?usp=sharing)

### Introduction
<p align="justify">
As experiments grow more complex, the demand for efficient access to
conditions data has increased. To address this, the HEP Software
Foundation (HSF) proposed a reference architecture, NopayloadDB, which
stores metadata and file URLs instead of payloads. However, NopayloadDB
lacks a centralized logging subsystem. To address this limitation, this
project proposes an intelligent logging pipeline integrated with
NopayloadDB. The pipeline combines advanced log aggregation, scalable
storage, and deep learning-based anomaly detection to reduce downtime
and improve operations. The result is enhanced reliability,
maintainability, and scalability of conditions database services in
modern HEP experiments.
</p>

### Objectives
<p align="justify">
The project extended NopayloadDB, the HSF reference conditions
database [1], by introducing a centralized and intelligent logging system.
The main goals were centralized log aggregation, structured parsing and
storage for easier querying, and sequence-based anomaly detection with
DeepLog. The system also aimed to support real-time monitoring and
diagnostics for different stakeholders, detecting issues before they
escalate and providing insights for tuning system parameters. The design
emphasized scalability, modularity, and compatibility with OpenShift
deployments.
</p>

### Motivation
<p align="justify">
I am passionate about contributing to systems that are accessible and
exist in both open-source and open-science settings. My background in
distributed systems, cloud computing, and data-intensive applications
aligns closely with this project. I was excited to contribute my skills
to building a scalable and intelligent system and to share the results
with the broader community. This opportunity also allowed me to deepen
my expertise in log analysis and machine learning while advancing my
commitment to making technology and scientific knowledge openly
available and accessible.
</p>

### Technical Walkthrough
<p align="justify">
The logging pipeline is deployed on Minikube and is built around three
containerized modules: Process, Monitor, and Predict. Each component,
shown in Figure 1, is described below.
</p>

<div style="text-align: center;">
<img width="1023" height="213" alt="intelligent_logging_pipeline" src="https://gist.github.com/user-attachments/assets/b833ae51-6a7c-4e3c-924a-f7717ef8432b" /></div>
<p align="center"><i>Figure 1: Intelligent Logging Pipeline</i></p>

#### Process
<p align="justify">
In the Process module, NopayloadDB generates log data from both the
Django application and the PostgreSQL database. This data is collected,
filtered, and parsed by Fluent Bit, which publishes the processed logs
to a Kafka topic. Alloy acts as a forwarding agent that reads the
structured logs from the Kafka topic and forwards them to Loki.
</p>
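
<p align="justify">
As an illustration of this data flow (not part of the deployed Alloy
agent), the sketch below shows how the structured records that Fluent
Bit publishes to Kafka could be inspected from Python. The topic name,
broker address, and record fields are assumptions for the example.
</p>

```python
# Hypothetical inspection of the Fluent Bit records on the Kafka topic.
# Topic name, broker address, and field names are assumptions.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "nopayloaddb-logs",                    # assumed topic name
    bootstrap_servers=["kafka:9092"],      # assumed broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    record = message.value                 # one parsed log record
    print(record.get("log"), record.get("kubernetes", {}).get("pod_name"))
```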

#### Monitor
<p align="justify">
The Monitor module provides scalable storage and visualization. Loki is
a distributed, horizontally scalable log aggregation system composed of
several key components [2]. The Distributor receives logs from Alloy,
validates them, and routes them to Ingesters while balancing the load.
The Ingester temporarily stores logs in memory, compresses them, and
forwards them to long-term storage in MinIO. The Querier retrieves the
required logs from MinIO and forwards them to Drain3 for prediction and
to Grafana for visualization. Figure 2 shows the flow of data within the
customized Loki architecture.
</p>

<div style="text-align: center;">
<img width="462" height="418" alt="custom_loki" src="https://gist.github.com/user-attachments/assets/8f29e922-ece8-425d-aca4-8f981c52e3a9" /></div>
<p align="center"><i>Figure 2: Customized Loki</i></p>
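
<p align="justify">
To give a concrete feel for the Querier's role, the following sketch
pulls logs back out of Loki over its HTTP API
(<code>/loki/api/v1/query_range</code>). The Loki service address and the
LogQL label selector are assumptions; in the pipeline this retrieval
serves Grafana and Drain3.
</p>

```python
# Hypothetical query against Loki's query_range HTTP endpoint.
# The service address and label selector are assumptions.
import time

import requests

LOKI_URL = "http://loki-querier:3100"          # assumed service address

params = {
    "query": '{app="nopayloaddb"}',            # assumed LogQL selector
    "start": int((time.time() - 3600) * 1e9),  # last hour, in nanoseconds
    "end": int(time.time() * 1e9),
    "limit": 100,
}

resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
resp.raise_for_status()

# Each stream carries its label set plus (timestamp, line) pairs.
for stream in resp.json()["data"]["result"]:
    for timestamp, line in stream["values"]:
        print(timestamp, line)
```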

#### Predict
<p align="justify">
The Predict module adds intelligence to the pipeline. Drain3 parses raw
logs into structured template IDs, which are pushed to Redis as ordered
sequences. These template ID sequences are then processed by DeepLog for
sequence-based anomaly detection, and any detected anomalies are
visualized in Grafana for debugging and monitoring.
</p>
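
<p align="justify">
A minimal sketch of this parsing step is shown below: Drain3 maps each
raw line to a template (cluster) ID, and the IDs are appended to a Redis
list that DeepLog can later consume as a sequence. The Redis host, the
list key, and the sample log lines are assumptions.
</p>

```python
# Hypothetical Drain3 -> Redis step of the Predict module.
# Host name, Redis key, and sample log lines are assumptions.
import redis
from drain3 import TemplateMiner  # pip install drain3

template_miner = TemplateMiner()                 # in-memory template state
r = redis.Redis(host="redis", port=6379, db=0)   # assumed Redis service

sample_lines = [
    "GET /healthcheck 200",
    "GET /healthcheck 500",
]

for line in sample_lines:
    result = template_miner.add_log_message(line)   # mine or match a template
    cluster_id = result["cluster_id"]
    r.rpush("deeplog:template-ids", cluster_id)     # extend the sequence
    print(cluster_id, result["template_mined"])
```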

### DeepLog
<p align="justify">
DeepLog is an LSTM-based model that learns log patterns from normal
execution [3]. It flags an anomaly when incoming log patterns deviate
from the model trained on those normal-execution logs. Over time, the
model adapts to new log patterns and constructs workflows from the
underlying system log, so that once an anomaly is detected, users can
diagnose it and perform root cause analysis effectively. The main
DeepLog tuning parameter is the top k: the number of most likely
predictions that the model considers "normal". If k is set to 2, the two
events with the highest predicted probabilities form the top-k set of
probable next events.
</p>

<p align="justify">
Suppose a log system has several events, each represented by a unique
ID. The model takes a sequence of past events and predicts what the next
event is likely to be. For example, Table 1 shows unique IDs for the set
of events of a user trying to upload a file. From the past events,
DeepLog predicts the upcoming event by assigning a probability to each
unique event. Note that these sets of events are consumed by DeepLog in
a random sequence.
</p>

| Unique ID | Event       | Probability |
|-----------|-------------|-------------|
| 0         | Login       | 0.7         |
| 1         | Upload File | 0.4         |
| 2         | Select File | 0.6         |
| 3         | Logout      | 0.25        |
| 4         | Submit File | 0.3         |

*Table 1: Set of Events*

<p align="justify">
Here the model considers "Login" the most likely next event, then
"Select File", then "Upload File", and so on. Ranked by probability, the
sequence is \[Login, Select File, Upload File, Submit File, Logout\],
or \[0, 2, 1, 4, 3\] in terms of unique IDs. With k=2, the model
predicts the top 2 event IDs as \[Login, Select File\], while the true
event is Upload File. Since the true event does not appear in the top 2
predictions, this case is flagged as an anomaly. With k=3, the top 3
event IDs are \[Login, Select File, Upload File\], and the true event
Upload File is included, so it is considered normal. In practice, the
model checks whether the true event ID appears within the top-k
predicted IDs: if the true event is not present, the sequence is
labelled as an anomaly; otherwise, it is treated as normal.
</p>
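
<p align="justify">
The same top-k check can be written in a few lines. The sketch below
uses the illustrative probabilities from Table 1 rather than real model
output.
</p>

```python
# Top-k anomaly check as described above, using the Table 1 probabilities
# (illustrative values, not real DeepLog output).
probabilities = {
    0: 0.70,  # Login
    1: 0.40,  # Upload File
    2: 0.60,  # Select File
    3: 0.25,  # Logout
    4: 0.30,  # Submit File
}

def is_anomaly(probs, true_event, k):
    """Flag the observed event as anomalous if it is not among the
    k most probable next events predicted by the model."""
    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
    return true_event not in top_k

true_event = 1  # Upload File
print(is_anomaly(probabilities, true_event, k=2))  # True  -> anomaly
print(is_anomaly(probabilities, true_event, k=3))  # False -> normal
```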

### Results
<p align="justify">
The intelligent logging pipeline demonstrated log collection and
aggregation for Kubernetes-based clusters. The heterogeneous logs were
parsed and formatted into structured sequences. DeepLog was integrated
into the pipeline, showing the feasibility of automated real-time
monitoring and anomaly detection, and the Grafana dashboards provided
tailored access for different user roles.
</p>

### Future Work
<p align="justify">
This research will establish a baseline for how system observability
and diagnostics can benefit from artificial intelligence, and it will be
useful to the open-source community, scientific research, and enterprise
applications. From the experiment's point of view, it will support more
reliable and reproducible physics experiments and enable HEP to allocate
resources more efficiently based on insights gained from the system. It
will also pave the way for applying these cutting-edge techniques beyond
HEP, for example in large-scale cloud applications and enterprise
systems.
</p>

### Final remarks
<p align="justify">
I enjoyed my time working with my mentors Ruslan, Michel, and John. This
project was the first time I contributed to CERN and BNL to such an
extent, and it gave me a real sense of accomplishment in my professional
career. The consistent feedback on both the project and the publication
helped me a lot in shaping the work. My mentors also opened a path for
me to present the project on a larger scale, which gives it greater
potential to be used by other experiments. I am happy to have been
mentored by such experienced and knowledgeable professionals.
</p>

### References
[1] R. Mashinistov, L. Gerlach, P. Laycock, A. Formica, G. Govi, and C. Pinkenburg, “The HSF Conditions Database Reference Implementation,” EPJ Web of Conferences, vol. 295, p. 01051, Jan. 2024, doi: https://doi.org/10.1051/epjconf/202429501051.

[2] “Grafana Loki | Grafana Loki documentation,” Grafana Labs, 2025. https://grafana.com/docs/loki/latest (accessed Jul. 30, 2025).

[3] M. Du, F. Li, G. Zheng, and V. Srikumar, “DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, doi: https://doi.org/10.1145/3133956.3134015.