---
title: Intelligent Logging Pipeline
project: Intelligent Log Analysis for the HSF Conditions Database
author: Osama Ahmed Tahir
date: 06.09.2025
year: 2025
layout: blog_post
intro: |
  An intelligent logging pipeline for NopayloadDB that integrates log
  aggregation, anomaly detection, and monitoring. It improves the
  reliability and maintainability of large-scale HEP experiments. The
  pipeline is deployed on a Minikube cluster and uses a deep learning
  model for real-time anomaly detection.
---

##### Mentors: Ruslan Mashinistov, Michel Hernandez Villanueva, and John S. De Stefano Jr

### Introduction
<p align="justify">
As experiments grow more complex, the demand for efficient access to
conditions data has increased. To address this, the HEP Software
Foundation (HSF) proposed a reference architecture, NopayloadDB, which
stores metadata and file URLs instead of payloads. However, NopayloadDB
lacks a centralized logging subsystem. To address this limitation, this
project proposes an intelligent logging pipeline integrated with
NopayloadDB. The pipeline combines advanced log aggregation, scalable
storage, and deep learning-based anomaly detection to reduce downtime
and improve operations. The result is enhanced reliability,
maintainability, and scalability of conditions database services in
modern HEP experiments.
</p>

### Objectives
<p align="justify">
The project extended NopayloadDB, the HSF reference conditions
database implementation [1], by introducing a centralized and
intelligent logging pipeline. The main goal was to provide centralized
log aggregation, structured parsing and storage for easier querying,
and sequence-based anomaly detection with DeepLog. The pipeline also
aimed to support real-time monitoring and diagnostics for different
stakeholders, detecting issues before they escalate and providing
insights for tuning system parameters. The design emphasized
scalability, modularity, and
compatibility with OpenShift deployments.
</p>

### Motivation
<p align="justify">
I am passionate about contributing to systems that are accessible and
exist in both open-source and open-science settings. My background in
distributed systems, cloud computing, and data-intensive applications
aligns closely with this project. I was excited to contribute my skills
to help build a scalable and intelligent system and to share the
results with the broader community. This opportunity also allowed me to
deepen my expertise in log analysis and machine learning while advancing
my commitment to making technology and scientific knowledge openly
available and accessible.
</p>

### Technical Walkthrough
<p align="justify">
The logging pipeline is deployed on Minikube and is implemented around
three containerized modules: Process, Monitor, and Predict. Each
component, shown in Figure 1, is described below.
</p>

<div style="text-align: center;">
<img width="1023" height="213" alt="intelligent_logging_pipeline" src="https://gist.github.com/user-attachments/assets/b833ae51-6a7c-4e3c-924a-f7717ef8432b" /></div>
<p align="center"><i>Figure 1: Intelligent Logging Pipeline</i></p>

#### Process
<p align="justify">
In the Process module, NopayloadDB generates log data from both the
Django application and the PostgreSQL database. This data is collected,
filtered, and parsed by Fluent Bit, and the processed logs are published
to a Kafka topic. Grafana Alloy acts as a forwarding agent, consuming
the structured logs from the Kafka topic and shipping them to Loki.
</p>
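
<p align="justify">
As an illustration, the sketch below shows how the structured records
published to the Kafka topic could be inspected with the kafka-python
client. The topic name and broker address are placeholders, not the
pipeline's actual configuration.
</p>

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; the real deployment values may differ.
consumer = KafkaConsumer(
    "nopayloaddb-logs",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for record in consumer:
    event = record.value  # one log line parsed and structured by Fluent Bit
    print(event.get("time"), event.get("log"))
```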

#### Monitor
<p align="justify">
The Monitor module focuses on scalable storage and visualization. Loki
is a distributed, horizontally scalable log aggregation system composed
of several key components [2]. The Distributor receives logs from Alloy,
validates them, and routes them to Ingesters while balancing the load.
The Ingester temporarily stores logs in memory, compresses them, and
forwards them to long-term storage in MinIO. The Querier retrieves the
required logs from MinIO, forwarding them to Drain3 for prediction and
Grafana for visualization. Figure 2 shows the flow of data within the
customized Loki architecture.
</p>

<div style="text-align: center;">
<img width="462" height="418" alt="custom_loki" src="https://gist.github.com/user-attachments/assets/8f29e922-ece8-425d-aca4-8f981c52e3a9" /></div>
<p align="center"><i>Figure 2: Customized Loki</i></p>

#### Predict
<p align="justify">
The Predict module adds intelligence to the pipeline. Drain3 parses raw
logs into structured template IDs, which are buffered in sequence in
Redis. The template ID sequences are then processed by DeepLog for
sequence-based anomaly detection, and any detected anomalies are
visualized in Grafana for debugging and monitoring.
</p>
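
<p align="justify">
A minimal sketch of this step is shown below, assuming the drain3 and
redis Python packages; the Redis host and list name are hypothetical.
</p>

```python
import redis
from drain3 import TemplateMiner  # pip install drain3

template_miner = TemplateMiner()          # in-memory template store
r = redis.Redis(host="redis", port=6379)  # hypothetical Redis service

def ingest(log_line: str) -> None:
    """Mine a template from one raw log line and append its ID to the
    sequence that DeepLog will consume."""
    result = template_miner.add_log_message(log_line)
    r.rpush("deeplog:template_ids", result["cluster_id"])

ingest("connection from 10.0.0.5 accepted")
ingest("connection from 10.0.0.7 accepted")  # same template, same ID
```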

### DeepLog
<p align="justify">
DeepLog is an LSTM-based model that learns log patterns from normal
execution [3]. It flags an anomaly when incoming log patterns deviate
from the model trained on those normal-execution logs. Over time, the
model adapts to new log patterns and constructs workflows from the
underlying system log, so that once an anomaly is detected, users can
diagnose it and perform root cause analysis effectively. A key
configuration parameter is the top k: the number of most likely
predictions the model considers "normal". If k is set to 2, the two
events with the highest predicted probabilities form the top-k set of
most probable next events.
</p>
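
<p align="justify">
The top-k check itself is simple. The sketch below assumes a trained
DeepLog-style model that maps a window of recent template IDs to scores
over all templates; the model and tensor shapes here are illustrative,
not the project's exact implementation.
</p>

```python
import torch

def is_anomalous(model: torch.nn.Module,
                 window: torch.Tensor,
                 true_id: int,
                 k: int = 2) -> bool:
    """Flag an anomaly if the observed next template ID falls outside
    the model's top-k predictions for the given window."""
    with torch.no_grad():
        scores = model(window.unsqueeze(0))        # shape: (1, num_templates)
        topk_ids = torch.topk(scores, k, dim=-1).indices.squeeze(0)
    return true_id not in topk_ids.tolist()
```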

<p align="justify">
Suppose a log system has several events, each represented by a unique
ID. The model takes a sequence of past events and predicts which event
is likely to come next. For example, Table 1 shows unique IDs for the
set of events of a user trying to upload a file. From the past events,
DeepLog predicts the upcoming event by assigning a probability to each
unique event. Note that these events may reach DeepLog in an arbitrary
order.
</p>

| Unique ID | Event | Probability |
|-----------|--------------|-------------|
| 0 | Login | 0.7 |
| 1 | Upload File | 0.4 |
| 2 | Select File | 0.6 |
| 3 | Logout | 0.25 |
| 4 | Submit File | 0.3 |
<i>Table 1: Set of Events</i>

<p align="justify">
Here the model ranks "Login" as the most likely next event, followed by
"Select File", then "Upload File", and so on. Ordered by probability,
the ranking is [Login, Select File, Upload File, Submit File, Logout],
or [0, 2, 1, 4, 3] in unique IDs. With k=2, the model's top 2 predicted
event IDs are [Login, Select File], while the true event is Upload
File. Since the true event does not appear in the top 2 predictions,
this case is flagged as an anomaly. With k=3, the top 3 event IDs are
[Login, Select File, Upload File], and the true event Upload File is
included, so it is considered normal. In practice, the model checks
whether the true event ID appears within the top-k predicted IDs: if
the true event is not present, the sequence is labelled as an anomaly;
otherwise, it is treated as normal.
</p>
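
<p align="justify">
The worked example above can be reproduced directly from the
probabilities in Table 1:
</p>

```python
# Probabilities from Table 1, indexed by unique event ID.
probs = {0: 0.7, 1: 0.4, 2: 0.6, 3: 0.25, 4: 0.3}
true_id = 1  # Upload File

def flag_anomaly(probs: dict, true_id: int, k: int) -> bool:
    """Anomalous if the true event ID is outside the k most probable IDs."""
    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
    return true_id not in top_k

print(flag_anomaly(probs, true_id, k=2))  # True  -> anomaly
print(flag_anomaly(probs, true_id, k=3))  # False -> normal
```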

### Results
<p align="justify">
The intelligent logging pipeline demonstrated log collection and
aggregation for Kubernetes-based clusters. The heterogeneous logs were
parsed and formatted into structured sequences. DeepLog was
integrated into the pipeline, showing the feasibility of automated
real-time monitoring and anomaly detection. The Grafana dashboards
provided tailored access for different user roles.
</p>

### Future Work
<p align="justify">
This research will establish a baseline for how the observability and
diagnostics of a system can benefit from artificial intelligence. It
will also be beneficial to the open-source community, scientific
research, and enterprise applications. From the experiments' point of
view, it will support more reliable and reproducible physics results,
and the insights gained from the system will enable HEP experiments to
allocate resources more efficiently. It will also pave the way for
applying these techniques beyond HEP, for example in large-scale cloud
applications and enterprise systems.
</p>

### Final remarks
<p align="justify">
I enjoyed my time working with my mentors Ruslan, Michel, and John. This
project was the first time I contributed to CERN and BNL to such an
extent, and it gave me a real sense of accomplishment in my professional
career. The consistent feedback on both the project and the publication
helped me a lot in shaping the work. My mentors encouraged me to present
the project on a larger stage, which increased its potential to be used
by other experiments. I am happy to have been mentored by such
experienced and knowledgeable professionals.
</p>

### References
[1] R. Mashinistov, L. Gerlach, P. Laycock, A. Formica, G. Govi, and C. Pinkenburg, “The HSF Conditions Database Reference Implementation,” EPJ Web of Conferences, vol. 295, p. 01051, Jan. 2024, doi: https://doi.org/10.1051/epjconf/202429501051.

[2] “Grafana Loki | Grafana Loki documentation,” Grafana Labs, 2025. https://grafana.com/docs/loki/latest (accessed Jul. 30, 2025).

[3] M. Du, F. Li, G. Zheng, and V. Srikumar, “DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, doi: https://doi.org/10.1145/3133956.3134015.