---
title: Intelligent Logging Pipeline
project: Intelligent Log Analysis for the HSF Conditions Database
author: Osama Ahmed Tahir
date: 06.09.2025
year: 2025
layout: blog_post
intro: |
  An intelligent logging pipeline for NopayloadDB that integrates log
  aggregation, anomaly detection, and monitoring. It improves reliability
  and maintainability in large-scale HEP experiments. It is deployed on a
  Minikube cluster and uses a deep learning model for real-time anomaly
  detection.
---

##### Mentors: Ruslan Mashinistov, Michel Hernandez Villanueva, and John S. De Stefano Jr

##### Proposal: [Google Summer of Code Proposal](https://drive.google.com/file/d/1Gg9qLOUrT1eHkReHZSS9fJUuP3-znHuW/view?usp=sharing)

### Introduction
<p align="justify">
As experiments grow more complex, the demand for efficient access to
conditions data has increased. To address this, the HEP Software
Foundation (HSF) proposed a reference architecture, NopayloadDB, which
stores metadata and file URLs instead of payloads. However, NopayloadDB
lacks a centralized logging subsystem. This project addresses that gap
with an intelligent logging pipeline integrated with NopayloadDB. The
pipeline combines advanced log aggregation, scalable storage, and deep
learning-based anomaly detection to reduce downtime and improve
operation. The result is enhanced reliability, maintainability, and
scalability of conditions database services in modern HEP experiments.
</p>

### Objectives
<p align="justify">
The project extended NopayloadDB, the HSF reference conditions
database [1], by introducing a centralized and intelligent logging system.
The main goals were centralized log aggregation, structured parsing and
storage for easier querying, and sequence-based anomaly detection with
DeepLog. The system also aimed to support real-time monitoring and
diagnostics for different stakeholders, detecting issues before they
escalate and providing insights for tuning system parameters. The design
emphasized scalability, modularity, and compatibility with OpenShift
deployments.
</p>

### Motivation
<p align="justify">
I am passionate about contributing to systems that are accessible and
exist in both open-source and open-science settings. My background in
distributed systems, cloud computing, and data-intensive applications
aligns closely with this project. I was excited to contribute skills
that help build a scalable and intelligent system and to share the
results with the broader community. This opportunity also allowed me to
deepen my expertise in log analysis and machine learning while advancing
my passion for making technology and scientific knowledge openly
available and accessible.
</p>

### Technical Walkthrough
<p align="justify">
The logging pipeline is deployed on Minikube and is built around three
containerized modules: Process, Monitor, and Predict. Each component,
shown in Figure 1, is described below.
</p>

<div style="text-align: center;">
<img width="1023" height="213" alt="intelligent_logging_pipeline" src="https://gist.github.com/user-attachments/assets/b833ae51-6a7c-4e3c-924a-f7717ef8432b" /></div>
<p align="center"><i>Figure 1: Intelligent Logging Pipeline</i></p>

#### Process
<p align="justify">
In the Process module, NopayloadDB, which includes both the Django
application and the PostgreSQL database, generates log data. Fluent Bit
collects, filters, and parses this data and publishes the processed
logs to a Kafka topic. Alloy acts as a forwarding agent, consuming the
structured logs from the Kafka topic and forwarding them to Loki.
</p>

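To make the collection step concrete, the Fluent Bit side of this flow can be sketched as a configuration along the following lines. All of the specifics here (log path, tag, parser, broker address, topic name) are illustrative assumptions, not the project's actual settings:

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/nopayloaddb-*.log
    Tag     npdb.logs
    Parser  docker

[OUTPUT]
    Name     kafka
    Match    npdb.*
    Brokers  kafka:9092
    Topics   npdb-logs
```

Alloy would then subscribe to the same topic and push each record on to Loki.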
#### Monitor
<p align="justify">
The Monitor module focuses on scalable storage and visualization. Loki
is a distributed, horizontally scalable log aggregation system composed
of several key components [2]. The Distributor receives logs from Alloy,
validates them, and routes them to Ingesters while balancing the load.
The Ingester temporarily stores logs in memory, compresses them, and
forwards them to long-term storage in MinIO. The Querier retrieves the
required logs from MinIO, forwarding them to Drain3 for prediction and
to Grafana for visualization. Figure 2 shows the flow of data within the
customized Loki architecture.
</p>

<div style="text-align: center;">
<img width="462" height="418" alt="custom_loki" src="https://gist.github.com/user-attachments/assets/8f29e922-ece8-425d-aca4-8f981c52e3a9" /></div>
<p align="center"><i>Figure 2: Customized Loki</i></p>

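Once logs reach Loki, Grafana (or any client of the Querier) retrieves them with LogQL queries. As a rough sketch, assuming streams were tagged with an `app` label at ingestion (the label name is a hypothetical choice), the first query below returns error lines and the second graphs their rate over five-minute windows:

```
{app="nopayloaddb"} |= "ERROR"

sum(rate({app="nopayloaddb"} |= "ERROR" [5m]))
```
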
#### Predict
<p align="justify">
The Predict module adds intelligence to the pipeline. Drain3 parses raw
logs into structured template IDs, and Redis buffers these template IDs
in sequence. The template ID sequences are then processed by DeepLog for
sequence-based anomaly detection. Any detected anomalies are visualized
in Grafana for debugging and monitoring.
</p>

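The hand-off between Drain3 and DeepLog can be pictured as sliding a fixed-length window over the stream of template IDs: each window is a model input, and the ID that follows it is the next event DeepLog must predict. This is a minimal sketch of that step, with a plain Python list standing in for the Redis buffer:

```python
def make_windows(template_ids, window_size=3):
    """Slide a fixed-length window over a template-ID sequence.

    Each window of `window_size` IDs becomes a model input, and the
    ID that follows it becomes the next-event label to predict.
    """
    pairs = []
    for i in range(len(template_ids) - window_size):
        history = template_ids[i:i + window_size]
        next_id = template_ids[i + window_size]
        pairs.append((history, next_id))
    return pairs

# A short stream of template IDs, as it might arrive from Drain3 via Redis.
stream = [0, 2, 1, 4, 3, 0, 2]
for history, next_id in make_windows(stream):
    print(history, "->", next_id)  # e.g. [0, 2, 1] -> 4
```

In the real pipeline these (history, next event) pairs are what the LSTM is trained on; the window size is a tunable hyperparameter.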
### DeepLog
<p align="justify">
DeepLog is an LSTM-based model that learns log patterns from normal
execution [3]. It detects anomalies when incoming log patterns deviate
from the model trained on logs of normal execution. Over time, DeepLog
adapts to new log patterns and constructs workflows from the underlying
system log, so that once an anomaly is detected, users can diagnose it
and perform root cause analysis effectively. The main DeepLog
configuration parameter is top-k: the number of most likely predictions
that the model considers "normal". If k is set to 2, the two events with
the highest predicted probabilities are taken as the top-k most probable
next events.
</p>

<p align="justify">
Suppose a log system has several events, each represented by a unique
ID. The model takes a sequence of past events and predicts what the next
event is likely to be. For example, Table 1 shows unique IDs for the set
of events of a user trying to upload a file. From the past events,
DeepLog predicts the upcoming event and assigns a probability to each
unique event. Note that DeepLog consumes these sets of events in an
arbitrary sequence.
</p>

| Unique ID | Event        | Probability |
|-----------|--------------|-------------|
| 0         | Login        | 0.7         |
| 1         | Upload File  | 0.4         |
| 2         | Select File  | 0.6         |
| 3         | Logout       | 0.25        |
| 4         | Submit File  | 0.3         |

*Table 1: Set of Events*

<p align="justify">
Here the model considers "Login" the most likely next event, then
"Select File", then "Upload File", and so on. Hence, the sequence will
be [Login, Select File, Upload File, Submit File, Logout], or, with
their respective unique IDs, [0, 2, 1, 4, 3]. With k=2, the model
predicts the top 2 event IDs as [Login, Select File], while the true
event is Upload File. Since the true event does not appear in the top 2
predictions, this case is flagged as an anomaly. With k=3, the top 3
event IDs are [Login, Select File, Upload File], and the true event
Upload File is included, so it is considered normal. In practice, the
model checks whether the true event ID appears within the top-k
predicted IDs: if the true event is not present, the sequence is
labelled as an anomaly; otherwise, it is treated as normal.
</p>

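The top-k check described above fits in a few lines of Python. This is an illustrative sketch using the probabilities from Table 1, not the pipeline's actual code; the dictionary keys are the unique event IDs:

```python
def is_anomalous(probabilities, true_event, k):
    """Flag the true next event as anomalous if it is not among the
    k event IDs with the highest predicted probabilities."""
    top_k = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    return true_event not in top_k

# Predicted next-event probabilities from Table 1, keyed by unique ID.
probs = {0: 0.7, 1: 0.4, 2: 0.6, 3: 0.25, 4: 0.3}

# The true next event is "Upload File" (ID 1).
print(is_anomalous(probs, true_event=1, k=2))  # True: top-2 is {Login, Select File}
print(is_anomalous(probs, true_event=1, k=3))  # False: top-3 includes Upload File
```
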
### Results
<p align="justify">
The intelligent logging pipeline demonstrated log collection and
aggregation for Kubernetes-based clusters. Heterogeneous logs were
parsed and formatted into structured sequences. DeepLog was integrated
into the pipeline, showing the feasibility of automated real-time
monitoring and anomaly detection. The Grafana dashboards provided
tailored access for different user roles.
</p>

### Future Work
<p align="justify">
This research will establish a baseline for how the observability and
diagnostics of a system can benefit from artificial intelligence. It
will also be beneficial for the open source community, scientific
research, and enterprise applications. From the experiment's point of
view, it will support more reliable and reproducible physics results,
and it will enable HEP to allocate resources efficiently based on
insights gained from the system. Finally, it will pave the way for
applying these techniques beyond HEP, for example in large-scale cloud
applications and enterprise systems.
</p>

### Final remarks
<p align="justify">
I enjoyed my time working with my mentors Ruslan, Michel, and John. This
project was the first time that I contributed to CERN and BNL to such an
extent, and it gave me a sense of accomplishment in my professional
career. The consistent feedback on both the project and the publication
helped me a lot in shaping the work. My mentors gave me a path to
present the project on a larger stage, which increased its potential to
be used by other experiments. I am happy to have been mentored by such
experienced and knowledgeable professionals.
</p>

### References
[1] R. Mashinistov, L. Gerlach, P. Laycock, A. Formica, G. Govi, and C. Pinkenburg, "The HSF Conditions Database Reference Implementation," EPJ Web of Conferences, vol. 295, p. 01051, Jan. 2024, doi: 10.1051/epjconf/202429501051.

[2] "Grafana Loki | Grafana Loki documentation," Grafana Labs, 2025. https://grafana.com/docs/loki/latest (accessed Jul. 30, 2025).

[3] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Oct. 2017, doi: 10.1145/3133956.3134015.