Smart O&M Agent: An Anomaly Detection Architecture for System Operation and Maintenance Based on RAG

Smart O&M Agent is a system log anomaly detection framework that combines a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG). It understands complex log semantics, detects anomalies in high-density logs, and determines the current system status from a window of system logs.

Workflow

Experiment Results

Experimental Results on BGL, Liberty, and Thunderbird datasets. The best results are indicated using bold typeface.

Methods	BGL Prec.	BGL Rec.	BGL F1	Liberty Prec.	Liberty Rec.	Liberty F1	Thunderbird Prec.	Thunderbird Rec.	Thunderbird F1	Avg. F1
DeepLog	0.166	0.988	0.285	0.751	0.855	0.800	0.017	0.966	0.033	0.373
LogAnomaly	0.176	0.985	0.299	0.684	0.876	0.768	0.025	0.966	0.050	0.372
PLELog	0.595	0.880	0.710	0.795	0.874	0.832	0.808	0.724	0.764	0.769
FastLogAD	0.167	1.000	0.287	0.151	0.999	0.263	0.008	0.931	0.017	0.189
LogBERT	0.165	0.989	0.283	0.902	0.633	0.744	0.022	0.172	0.039	0.355
LogRobust	0.696	0.968	0.810	0.695	0.979	0.813	0.318	1.000	0.482	0.702
CNN	0.698	0.965	0.810	0.580	0.914	0.709	0.870	0.690	0.769	0.763
NeuralLog	0.792	0.884	0.835	0.875	0.926	0.900	0.794	0.931	0.857	0.864
RAPID	0.874	0.399	0.548	0.911	0.611	0.732	0.200	0.207	0.203	0.494
LogLLM	0.861	0.979	0.916	0.992	0.926	0.958	0.966	0.966	0.966	0.947
Smart O&M Agent (ours)	0.981	0.989	0.985	0.983	0.958	0.970	1.000	1.000	1.000	0.985
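
The F1 values in the table are consistent, up to rounding of the reported precision and recall, with the harmonic mean of the two, and the Avg. F1 column matches the unweighted mean of the three per-dataset F1 scores. A minimal sketch for recomputing them from the reported numbers follows; the function names are illustrative and not part of the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(f1_values: list[float]) -> float:
    """Unweighted mean of per-dataset F1 scores."""
    return sum(f1_values) / len(f1_values)


# Reported (precision, recall) for Smart O&M Agent, copied from the table above.
reported = {
    "BGL": (0.981, 0.989),
    "Liberty": (0.983, 0.958),
    "Thunderbird": (1.000, 1.000),
}

# Round to three decimals, as in the table.
per_dataset = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
avg = round(average_f1(list(per_dataset.values())), 3)

print(per_dataset)
print("Avg. F1:", avg)
```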

Datasets

Benchmark datasets used in the experiments.

Datasets	# Logs	Training # Logs	Training Normal	Training Abnormal	Testing # Logs	Testing Normal	Testing Abnormal
BGL	4,747,963	3,798,387	3,519,603	278,784	949,576	879,900	69,676
Liberty	5,000,000	4,000,007	2,719,580	1,280,427	999,993	679,895	320,098
Thunderbird	10,000,000	8,000,003	7,996,051	3,952	1,999,997	1,999,012	985
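
The split sizes above can be sanity-checked, and the class imbalance quantified, directly from the table. The sketch below copies the numbers into a hypothetical `SPLITS` dictionary (not a structure from the repository) and reports the abnormal fraction of each training split.

```python
# Each tuple: (total logs, train logs, train normal, train abnormal,
#              test logs, test normal, test abnormal), copied from the table.
SPLITS = {
    "BGL": (4_747_963, 3_798_387, 3_519_603, 278_784, 949_576, 879_900, 69_676),
    "Liberty": (5_000_000, 4_000_007, 2_719_580, 1_280_427, 999_993, 679_895, 320_098),
    "Thunderbird": (10_000_000, 8_000_003, 7_996_051, 3_952, 1_999_997, 1_999_012, 985),
}


def abnormal_fraction(name: str) -> float:
    """Fraction of abnormal logs in the training split of a dataset."""
    _, train_total, _, train_abnormal, *_ = SPLITS[name]
    return train_abnormal / train_total


for name, (total, tr, tr_n, tr_a, te, te_n, te_a) in SPLITS.items():
    # The table should be internally consistent.
    assert tr + te == total
    assert tr_n + tr_a == tr
    assert te_n + te_a == te
    print(f"{name}: {abnormal_fraction(name):.4%} of training logs are abnormal")
```

Thunderbird's training split contains roughly 0.05% abnormal logs, which is the kind of imbalance that the oversampling step inspected in oversampling.ipynb is meant to address.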

Using Our Code

Class UML Diagram

1. Setup

Python: 3.12.11
CUDA: 12
Download benchmarks
Download base LLMs

2. Install dependencies

pip install transformers bitsandbytes peft pandas torch scikit-learn pydantic matplotlib langgraph seaborn
pip install -U "huggingface_hub[cli]"
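
After installation, a quick check that every package resolves can save a failed training run. The sketch below uses only the standard library. The module list is the import-name equivalent of the pip commands above (scikit-learn is imported as `sklearn`), and the helper name `missing_modules` is illustrative.

```python
import importlib.util

# Import names for the packages installed above; scikit-learn is
# imported as "sklearn", the others use their package name.
REQUIRED_MODULES = [
    "transformers", "bitsandbytes", "peft", "pandas", "torch", "sklearn",
    "pydantic", "matplotlib", "langgraph", "seaborn", "huggingface_hub",
]


def missing_modules(modules: list[str]) -> list[str]:
    """Return the modules that cannot be located, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```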

3. Two-stage training of Smart O&M Agent

The training results will be stored in ./output/{CASE_NAME}

Set the following variables in train.py

# Smart O&M Agent Train settings
CASE_NAME: str = "bgl-cw-gemma2-9b" # customize your case name
DATASET_TYPE: types.DatasetTypes = "BGL" # "BGL" | "Liberty" | "Thunderbird" | "test" (based on BGL)
SAMPLING_TYPE: types.SamplingTypes = "our" # "our" | "logllm"
SLIDING_WIN_TYPE: types.SlidingWindowTypes = "count" # "count" | "time"
BASE_LLM: types.BaseLLMTypes = "gemma-2-9b" # "gemma-2-9b" | "gemma-3-4b-it" | "Llama-3.1-8B-Instruct" | "Llama-3.2-3B-Instruct"

# Stage one training settings
LEM_TRAIN_EPOCHS: int = 10 # Log Embed Model training epochs
LEM_TRAIN_LR: float = 5e-5 # Log Embed Model learning rate
LEM_SAFE_BATCH_SIZE: int = 256 # Log Embed Model safe batch size

# Stage two training settings
ADLLM_TRAIN_EPOCHS: int = 5 # Anomaly Detection LLM training epochs
ADLLM_TRAIN_LR: float = 5e-5 # Anomaly Detection LLM learning rate
ADLLM_SAFE_BATCH_SIZE: int = 3 # Anomaly Detection LLM batch size that safely fits in the memory of a single GPU
ADLLM_TOP_K_LOGS: int = 5 # Anomaly Detection LLM top K abnormal logs for each window

Run train.py from the root directory to train the models:
```
python train.py
```
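
To illustrate what SLIDING_WIN_TYPE and ADLLM_TOP_K_LOGS might control, here is a minimal sketch of count-based and time-based windowing followed by top-K selection inside a window. All names (`LogLine`, `count_windows`, `time_windows`, `top_k_logs`), the window sizes, the placeholder messages, and the per-log anomaly scores are assumptions made for illustration. The repository's own windowing and retrieval code may differ.

```python
from dataclasses import dataclass


@dataclass
class LogLine:
    timestamp: float    # seconds; logs are assumed sorted by timestamp
    message: str
    score: float = 0.0  # hypothetical anomaly score for this log


def count_windows(logs: list[LogLine], size: int, step: int) -> list[list[LogLine]]:
    """Windows of up to `size` consecutive logs, advancing by `step` logs."""
    windows = []
    for start in range(0, len(logs), step):
        windows.append(logs[start:start + size])
        if start + size >= len(logs):  # this window already reaches the end
            break
    return windows


def time_windows(logs: list[LogLine], seconds: float) -> list[list[LogLine]]:
    """Non-overlapping windows of fixed duration, anchored at the first log."""
    if not logs:
        return []
    origin = logs[0].timestamp
    buckets: dict[int, list[LogLine]] = {}
    for log in logs:
        buckets.setdefault(int((log.timestamp - origin) // seconds), []).append(log)
    return [buckets[key] for key in sorted(buckets)]


def top_k_logs(window: list[LogLine], k: int) -> list[LogLine]:
    """The k logs with the highest score, most suspicious first."""
    return sorted(window, key=lambda log: log.score, reverse=True)[:k]


# Placeholder data; messages and scores are made up for the example.
logs = [
    LogLine(0.0, "placeholder message 0", 0.10),
    LogLine(1.0, "placeholder message 1", 0.20),
    LogLine(2.0, "placeholder message 2", 0.90),
    LogLine(65.0, "placeholder message 3", 0.80),
    LogLine(66.0, "placeholder message 4", 0.95),
]

by_count = count_windows(logs, size=2, step=2)
by_time = time_windows(logs, seconds=60.0)
selected = top_k_logs(by_time[0], k=2)
print([len(w) for w in by_count], [len(w) for w in by_time])
print([log.message for log in selected])
```

Count-based windows keep the input length to the model bounded, while time-based windows follow wall-clock intervals and can vary widely in size when log density changes.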

3-1. [Optional]: Analysis

Run the Jupyter notebook oversampling.ipynb from the root directory to view the dataset oversampling distribution.
Run the Jupyter notebook metrics.ipynb from the root directory to view the training history of <CASE_NAME>.

4. Use for system anomaly detection

Set the following variables in use.py

CASE_NAME: str = "bgl-cw-gemma2-9b" # Name of the trained case
DATASET_TYPE: types.DatasetTypes = "BGL" # "BGL" | "Liberty" | "Thunderbird" | "test" (based on BGL)
BASE_LLM: types.BaseLLMTypes = "gemma-2-9b" # "gemma-2-9b" | "gemma-3-4b-it" | "Llama-3.1-8B-Instruct" | "Llama-3.2-3B-Instruct"

Run use.py from the root directory to perform system anomaly detection:
```
python use.py
```

Download benchmarks CLI

BGL

export DATA_DIR=data
export DATA_NAME=BGL
mkdir -p ${DATA_DIR}/${DATA_NAME} && curl -L "https://zenodo.org/records/8196385/files/BGL.zip?download=1" -o ${DATA_DIR}/${DATA_NAME}.zip && unzip ${DATA_DIR}/${DATA_NAME}.zip -d ${DATA_DIR}/${DATA_NAME} && rm ${DATA_DIR}/${DATA_NAME}.zip

Thunderbird

export DATA_DIR=data
export DATA_NAME=Thunderbird
mkdir -p ${DATA_DIR}/${DATA_NAME} && curl -L "https://zenodo.org/records/8196385/files/Thunderbird.tar.gz?download=1" -o ${DATA_DIR}/${DATA_NAME}.tar.gz && tar -xzvf ${DATA_DIR}/${DATA_NAME}.tar.gz -C ${DATA_DIR}/${DATA_NAME} && rm ${DATA_DIR}/${DATA_NAME}.tar.gz

Liberty

export DATA_DIR=data
export DATA_NAME=Liberty
mkdir -p ${DATA_DIR}/${DATA_NAME} && curl -L http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/hpc4/liberty2.gz -o ${DATA_DIR}/${DATA_NAME}.gz && gunzip -c ${DATA_DIR}/${DATA_NAME}.gz > ${DATA_DIR}/${DATA_NAME}/${DATA_NAME}.log && rm ${DATA_DIR}/${DATA_NAME}.gz

Download base LLMs CLI

1. Log in to Hugging Face

export HF_TOKEN=<your-huggingface-token>
huggingface-cli login --token ${HF_TOKEN}

BERT

export SAVE_PATH=hf_models/bert-base-uncased
export MODEL_NAME=google-bert/bert-base-uncased
nohup bash -c "huggingface-cli download ${MODEL_NAME} --local-dir ${SAVE_PATH}" &

Gemma2-9B

export SAVE_PATH=hf_models/gemma-2-9b
export MODEL_NAME=google/gemma-2-9b
nohup bash -c "huggingface-cli download ${MODEL_NAME} --local-dir ${SAVE_PATH}" &

Gemma3-4B-IT

export SAVE_PATH=hf_models/gemma-3-4b-it
export MODEL_NAME=google/gemma-3-4b-it
nohup bash -c "huggingface-cli download ${MODEL_NAME} --local-dir ${SAVE_PATH}" &

Llama3.2-3B

export SAVE_PATH=hf_models/Llama-3.2-3B-Instruct
export MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct
nohup bash -c "huggingface-cli download ${MODEL_NAME} --local-dir ${SAVE_PATH}" &

Llama3.1-8B

export SAVE_PATH=hf_models/Llama-3.1-8B-Instruct
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
nohup bash -c "huggingface-cli download ${MODEL_NAME} --local-dir ${SAVE_PATH}" &
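
The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which avoids repeating the export lines for each model. In this sketch the `MODEL_REPOS` list and the `local_dir_for` and `download_all` helpers are illustrative, and the `hf_models/<name>` layout mirrors the SAVE_PATH values above. Gated models such as Gemma and Llama still require an accepted license and a valid token.

```python
import os

# Repository IDs taken from the MODEL_NAME values above.
MODEL_REPOS = [
    "google-bert/bert-base-uncased",
    "google/gemma-2-9b",
    "google/gemma-3-4b-it",
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
]


def local_dir_for(repo_id: str, root: str = "hf_models") -> str:
    """Map 'org/name' to '<root>/name', matching the SAVE_PATH values above."""
    return f"{root}/{repo_id.split('/')[-1]}"


def download_all(repo_ids: list[str]) -> None:
    """Download each repository snapshot into its local directory."""
    # Imported lazily so local_dir_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    token = os.environ.get("HF_TOKEN")  # same variable as in the login step
    for repo_id in repo_ids:
        snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id), token=token)


# download_all(MODEL_REPOS)  # uncomment to start the downloads
```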
