diff --git a/_gsocblogs/2025/blog_Intelligent_Logging_Pipeline_OsamaTahir.md b/_gsocblogs/2025/blog_Intelligent_Logging_Pipeline_OsamaTahir.md new file mode 100644 index 000000000..ee8064412 --- /dev/null +++ b/_gsocblogs/2025/blog_Intelligent_Logging_Pipeline_OsamaTahir.md @@ -0,0 +1,196 @@
+---
+title: Intelligent Logging Pipeline
+project: Intelligent Log Analysis for the HSF Conditions Database
+author: Osama Ahmed Tahir
+date: 06.09.2025
+year: 2025
+layout: blog_post
+intro: |
+  An intelligent logging pipeline for NopayloadDB that integrates log
+  aggregation, anomaly detection, and monitoring. It improves reliability
+  and maintainability in large-scale HEP experiments. It is deployed on a
+  Minikube cluster and uses a deep learning model for real-time anomaly
+  detection.
+---
+
+##### Mentors: Ruslan Mashinistov, Michel Hernandez Villanueva, and John S. De Stefano Jr
+
+### Introduction
+As experiments grow more complex, the demand for efficient access to
+conditions data has increased. To address this, the HEP Software
+Foundation (HSF) proposed a reference architecture, NopayloadDB, which
+stores metadata and file URLs instead of payloads. However, NopayloadDB
+lacks a centralized logging subsystem. To address this limitation, this
+project introduces an intelligent logging pipeline integrated with
+NopayloadDB. The pipeline combines advanced log aggregation, scalable
+storage, and deep learning-based anomaly detection to reduce downtime
+and simplify operations. The result is enhanced reliability,
+maintainability, and scalability of conditions database services in
+modern HEP experiments.
+
+### Objectives
+
+The project extended NopayloadDB, the HSF-referenced conditions
+database [1], by introducing a centralized and intelligent logging
+pipeline. The main goals were centralized log aggregation, structured
+parsing and storage for easier querying, and sequence-based anomaly
+detection using DeepLog. The pipeline also aimed to support real-time
+monitoring and diagnostics for different stakeholders, detecting issues
+before they escalate and providing insights for tuning system
+parameters. The design emphasized scalability, modularity, and
+compatibility with OpenShift deployments.
+
+### Motivation
+
+I am passionate about contributing to systems that are accessible and
+exist in both open-source and open-science settings. My background in
+distributed systems, cloud computing, and data-intensive applications
+aligns closely with this project. I was excited to contribute my skills
+to help build a scalable and intelligent system and to share the
+results with the broader community. This opportunity also allowed me to
+deepen my expertise in log analysis and machine learning while advancing
+my commitment to keeping technology and scientific knowledge openly
+available and accessible.
+
+### Technical Walkthrough
+
+The logging pipeline is deployed on Minikube and is implemented around
+three containerized modules: Process, Monitor, and Predict. Each
+component, shown in Figure 1, is described below.
+ +Figure 1: Intelligent Logging Pipeline
+
+#### Process
+
+In the Process module, NopayloadDB generates log data from both the
+Django application and the PostgreSQL database. This data is collected,
+filtered, and parsed by Fluent Bit, and the processed logs are published
+to a Kafka topic. Alloy acts as a forwarding agent, consuming the
+structured logs from the Kafka topic and forwarding them to Loki.
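+
+To make the hand-off concrete, the snippet below is a minimal Python
+sketch of a consumer inspecting the structured records on that Kafka
+topic. It is for illustration only: the topic name, broker address,
+and record shape are assumptions, and in the deployed pipeline it is
+Alloy, not custom code, that consumes the topic.
+
+```python
+import json
+from kafka import KafkaConsumer  # pip install kafka-python
+
+consumer = KafkaConsumer(
+    "npdb-logs",                         # assumed topic name
+    bootstrap_servers="localhost:9092",  # assumed broker address
+    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
+)
+
+for message in consumer:
+    # Fluent Bit publishes JSON records; typical fields are a timestamp
+    # and the parsed log line.
+    print(message.value)
+```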
+
+#### Monitor
+
+The Monitor module provides scalable storage and visualization. Loki
+is a distributed, horizontally scalable log aggregation system composed
+of several key components [2]. The Distributor receives logs from Alloy,
+validates them, and routes them to Ingesters while balancing the load.
+The Ingester temporarily stores logs in memory, compresses them, and
+forwards them to long-term storage in MinIO. The Querier retrieves the
+required logs from MinIO, forwarding them to Drain3 for prediction and
+to Grafana for visualization. Figure 2 shows the flow of data within
+the customized Loki architecture.
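+
+As an illustration of the retrieval side, Loki exposes an HTTP query
+API that Grafana (or any other client) can call with a LogQL
+expression. The sketch below assumes a local Loki endpoint and an
+`app` label; both are placeholders rather than the project's actual
+values.
+
+```python
+import requests
+
+LOKI_URL = "http://localhost:3100"  # assumed Loki service address
+
+resp = requests.get(
+    f"{LOKI_URL}/loki/api/v1/query_range",
+    params={
+        "query": '{app="nopayloaddb"} |= "ERROR"',  # LogQL: error lines
+        "limit": 100,
+    },
+)
+resp.raise_for_status()
+for stream in resp.json()["data"]["result"]:
+    for timestamp, line in stream["values"]:
+        print(timestamp, line)
+```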
+ +Figure 2: Customized Loki
+
+#### Predict
+
+The Predict module adds intelligence to the pipeline. Drain3 parses raw
+logs into structured template IDs, and the resulting template ID
+sequences are buffered in Redis. These sequences are then processed by
+DeepLog for sequence-based anomaly detection, and any detected
+anomalies are visualized in Grafana for debugging and monitoring.
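+
+Drain3 is a Python library, so the template-mining step can be
+sketched in a few lines. The log lines below are invented for
+illustration; only the `TemplateMiner` API usage reflects the actual
+library.
+
+```python
+from drain3 import TemplateMiner  # pip install drain3
+
+miner = TemplateMiner()
+lines = [
+    "GET /api/payloadiovs/ 200",
+    "GET /api/payloadiovs/ 500",
+    "connection to database lost",
+]
+for line in lines:
+    result = miner.add_log_message(line)
+    # Each line maps to a cluster (template) ID; sequences of these IDs
+    # are what DeepLog consumes.
+    print(result["cluster_id"], result["template_mined"])
+```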
+
+### DeepLog
+
+DeepLog is an LSTM-based model that learns log patterns from normal
+execution and flags anomalies when new logs deviate from those learned
+patterns [3]. Over time, the model adapts to new log patterns and
+constructs workflows from the underlying system log, so that once an
+anomaly is detected, users can diagnose it and perform root cause
+analysis effectively. A key configuration parameter is top-k: the
+number of most likely predictions the model treats as "normal". If k is
+set to 2, the two events with the highest predicted probabilities are
+taken as the top-k most probable next events.
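+
+As a toy sketch of this idea (not the project's actual model), the
+next-event predictor can be pictured as a small LSTM over windows of
+template IDs; all sizes and names below are illustrative.
+
+```python
+import torch
+import torch.nn as nn
+
+class NextKeyLSTM(nn.Module):
+    def __init__(self, num_keys: int, hidden_size: int = 64):
+        super().__init__()
+        self.embed = nn.Embedding(num_keys, hidden_size)
+        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
+        self.head = nn.Linear(hidden_size, num_keys)
+
+    def forward(self, window: torch.Tensor) -> torch.Tensor:
+        out, _ = self.lstm(self.embed(window))
+        return self.head(out[:, -1, :])  # logits for the next template ID
+
+model = NextKeyLSTM(num_keys=5)           # untrained, for illustration
+window = torch.tensor([[0, 2, 1]])        # a window of template IDs
+top_k = model(window).topk(k=2).indices   # top-2 candidate next events
+```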
+
+Suppose a log system has several events, each represented by a unique
+ID. The model takes a sequence of past events and predicts what the
+next event is likely to be. For example, Table 1 shows unique IDs for
+the set of events of a user trying to upload a file. From the past
+events, DeepLog learns the upcoming event and assigns a probability to
+each unique event. Note that these events are consumed by DeepLog in
+no fixed order.
+
+| Unique ID | Event       | Probability |
+|-----------|-------------|-------------|
+| 0         | Login       | 0.7         |
+| 1         | Upload File | 0.4         |
+| 2         | Select File | 0.6         |
+| 3         | Logout      | 0.25        |
+| 4         | Submit File | 0.3         |
+
+*Table 1: Set of Events*
+
+Here the model considers "Login" the most likely next event, then
+"Select File", then "Upload File", and so on. Hence, ordered by
+probability, the sequence is [Login, Select File, Upload File, Submit
+File, Logout], or [0, 2, 1, 4, 3] in unique IDs. With k=2, the model
+predicts the top 2 event IDs as [Login, Select File], while the true
+event is Upload File. Since the true event does not appear in the top 2
+predictions, this case is flagged as an anomaly. With k=3, the top 3
+event IDs are [Login, Select File, Upload File], and the true event
+Upload File is included, so it is considered normal. In practice, the
+model checks whether the true event ID appears within the top-k
+predicted IDs: if it is not present, the sequence is labelled an
+anomaly; otherwise, it is treated as normal.
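+
+A few lines of Python make this check concrete, reusing the
+probabilities from Table 1:
+
+```python
+# Probabilities from Table 1, keyed by unique event ID.
+probabilities = {0: 0.7, 1: 0.4, 2: 0.6, 3: 0.25, 4: 0.3}
+
+def is_anomaly(true_event: int, probs: dict, k: int) -> bool:
+    """Anomalous if the true event is not among the top-k predictions."""
+    top_k = sorted(probs, key=probs.get, reverse=True)[:k]
+    return true_event not in top_k
+
+true_event = 1  # "Upload File"
+print(is_anomaly(true_event, probabilities, k=2))  # True  -> anomaly
+print(is_anomaly(true_event, probabilities, k=3))  # False -> normal
+```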
+
+### Results
+
+The intelligent logging pipeline demonstrated log collection and
+aggregation for Kubernetes-based clusters. Heterogeneous logs were
+parsed and formatted into structured sequences. DeepLog was integrated
+into the pipeline, showing the feasibility of automated real-time
+monitoring and anomaly detection, and Grafana dashboards provided
+tailored access for different user roles.
+
+### Future Work
+
+This work establishes a baseline for how the observability and
+diagnostics of a system can benefit from artificial intelligence, with
+value for the open-source community, scientific research, and
+enterprise applications. From the experiments' point of view, it will
+support more reliable and reproducible physics results and enable HEP
+to allocate resources efficiently based on insights gained from the
+system. It also paves the way for applying these techniques beyond
+HEP, for example in large-scale cloud applications and enterprise
+systems.
+
+### Final remarks
+
+I enjoyed my time working with my mentors Ruslan, Michel, and John.
+This project was my first contribution of this scale to CERN and BNL,
+and it gave me a real sense of accomplishment in my professional
+career. The consistent feedback on both the project and the publication
+helped me greatly in shaping the work. My mentors also opened a path to
+present the project on a larger stage, increasing its potential to be
+used by other experiments. I am happy to have been mentored by such
+experienced and knowledgeable professionals.
+
+### References
+
+[1] R. Mashinistov, L. Gerlach, P. Laycock, A. Formica, G. Govi, and C. Pinkenburg, "The HSF Conditions Database Reference Implementation," EPJ Web of Conferences, vol. 295, p. 01051, Jan. 2024, doi: https://doi.org/10.1051/epjconf/202429501051.
+
+[2] "Grafana Loki | Grafana Loki documentation," Grafana Labs, 2025. https://grafana.com/docs/loki/latest (accessed Jul. 30, 2025).
+
+[3] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, doi: https://doi.org/10.1145/3133956.3134015.
\ No newline at end of file
diff --git a/images/blog_authors/OsamaTahir.jpeg b/images/blog_authors/OsamaTahir.jpeg new file mode 100644 index 000000000..2e4b68b10 Binary files /dev/null and b/images/blog_authors/OsamaTahir.jpeg differ