This study introduces a BERT-based model designed for the token classification task. It extracts threat-related named entities from information-security text with an accuracy above 90%, and it outperforms the widely used GPT-3.5 in real-world comparisons.
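At inference time, the CRF layer on top of BERT selects the highest-scoring tag sequence with Viterbi decoding. The toy sketch below illustrates only that decoding step over per-token emission scores; it is an assumption-laden illustration, not the repository's implementation (the actual model lives in `train.py` / `predict.py`):

```python
# Minimal, dependency-free sketch of CRF Viterbi decoding over
# per-token emission scores (illustrative; tag names are made up).

def viterbi_decode(emissions, transitions):
    """emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, tag): score}. Returns the best tag path."""
    tags = list(emissions[0].keys())
    # best-path score ending in each tag at the first token
    score = {t: emissions[0][t] for t in tags}
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions.get((p, t), 0.0))
            ptr[t] = best_prev
            new_score[t] = score[best_prev] + transitions.get((best_prev, t), 0.0) + emit[t]
        backptr.append(ptr)
        score = new_score
    # backtrack from the best final tag
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition scores are what let the CRF forbid invalid sequences (e.g., an `I-` tag directly after `O`), which plain per-token softmax classification cannot do.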
Below is the structure of the folders and files used in the project:
```
.
├── README.md
├── datasets
│   ├── CTI-reports
│   ├── DNRTI
│   ├── MalwareTextDB
│   └── conll2003
├── images
│   ├── model_structure.png
│   └── ppt
├── outputs
│   └── ner_bert_crf_checkpoint.pt
├── predict.py
├── requirements.txt
├── stopwords
│   ├── stopWord_summar.txt
│   ├── stopWord_test.txt
│   └── stopwords.txt
├── train.py
├── cloab_bert_crf.ipynb
└── website
    ├── README.md
    ├── app.py
    ├── requirements.txt
    ├── saved_dictionary3.pkl
    ├── static
    ├── templates
    └── test.ipynb
```
As of Thu Mar 30, 2023, the following are required:
- Ubuntu 20.04
- NVIDIA RTX A5000
- NVIDIA-SMI: 515.86.01
- Driver Version: 515.86.01
- CUDA Version: 11.7
- Python: 3.8
If the name of the folder is "NER_BERT_OPEN_VERSION", move it to the Desktop:

```bash
mv NER_BERT_OPEN_VERSION ~/Desktop/
```
```bash
# install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh

# create and activate the environment
conda create --name bert_env python=3.8
conda activate bert_env

# install dependencies and train
pip3 install -r requirements.txt
python3 train.py
```
Note: You can also skip training and proceed to Step 6. If you skip it, download the pre-trained model from Google Drive, move it to the `outputs` folder, and keep the file name unchanged. Verify the file hash: `MD5 (ner_bert_crf_checkpoint.pt) = 4faa7b6cd4a44cd8ac829611c0920b08`.
Use the following command to predict a sentence:

```bash
python3 predict.py -I "INPUT SENTENCE"
# or
python3 predict.py --input "INPUT SENTENCE"
```
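The `-I` / `--input` flags above suggest an argparse-style interface; a minimal sketch of such a parser is shown below (the actual wiring to the model inside `predict.py` may differ):

```python
# Illustrative CLI parser matching the -I / --input usage shown above.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Predict named entities in a sentence.")
    # -I and --input share the dest "input"
    parser.add_argument("-I", "--input", required=True,
                        help="the sentence to tag")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(["--input", "example sentence"])
    print(args.input)
```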
You can also run colab_bert_crf.ipynb directly:

```bash
git clone https://github.com/stwater20/ner_bert_crf_open_version.git
```
- Upload all files to Google Drive.
- Link your Drive space in Colab.
- Run the code!
Recommendation: Use GPU to run the code.
Pre-processing has been performed on this dataset.
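Datasets such as conll2003 and DNRTI are commonly distributed in CoNLL/BIO format (one token and tag per line, blank line between sentences). The reader below is a hedged sketch of loading that layout; the exact column order in each dataset may differ:

```python
# Illustrative reader for CoNLL/BIO-style files: assumes the token is
# the first column and the tag the last, sentences separated by blanks.

def read_bio(lines):
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences
```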
The performance of various models on different datasets is summarized below:
| Model | DNRTI | CTI-reports | MalwareTextDB |
|---|---|---|---|
| BERT_CRF | 90.02% | 77.29% | 58.57% |
| secBERT_CRF | 88.62% | 72.52% | 62.53% |
| BERT_BiLSTM_CRF | 84.59% | 74.39% | 45.59% |
| secBERT_BiLSTM_CRF | 83.77% | 68.05% | 47.07% |
Here's a sample prediction:
We also compared our model against other approaches, including GPT-3.5 and DistilBERT base cased distilled SQuAD.
To gauge the model's real-world usability, we built a website for live comparison. Because benchmark results do not always reflect real-world scenarios, we manually annotated and evaluated approximately 9,000 OSINT records, confirming that our method outperforms the alternatives.
| Model | Score |
|---|---|
| BERT | 82.64% |
| GPT | 64.56% |
| BERT_QA | 36.68% |
If you find this code helpful or use it in your research, please consider citing our work. Here's the citation information for our accepted paper:
```bibtex
@inproceedings{chen2023enhancing,
  title={Enhancing Cyber Threat Intelligence with Named Entity Recognition Using BERT-CRF},
  author={Chen, Sheng-Shan and Hwang, Ren-Hung and Sun, Chin-Yu and Lin, Ying-Dar and Pai, Tun-Wen},
  booktitle={GLOBECOM 2023-2023 IEEE Global Communications Conference},
  pages={7532--7537},
  year={2023},
  organization={IEEE}
}
```