This study introduces a BERT-based model designed for the token classification task. It extracts threat-related named entities from information-security text with an accuracy above 90%, and it outperforms the widely used GPT-3.5 in real-world comparisons.
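At inference time, the CRF layer on top of BERT selects the highest-scoring tag sequence with Viterbi decoding. The toy sketch below illustrates only that decoding step over per-token emission scores; it is an assumption-laden illustration, not the repository's implementation (the actual model lives in `train.py` / `predict.py`):

```python
# Minimal, dependency-free sketch of CRF Viterbi decoding over
# per-token emission scores (illustrative; tag names are made up).

def viterbi_decode(emissions, transitions):
    """emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, tag): score}. Returns the best tag path."""
    tags = list(emissions[0].keys())
    # best-path score ending in each tag at the first token
    score = {t: emissions[0][t] for t in tags}
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions.get((p, t), 0.0))
            ptr[t] = best_prev
            new_score[t] = score[best_prev] + transitions.get((best_prev, t), 0.0) + emit[t]
        backptr.append(ptr)
        score = new_score
    # backtrack from the best final tag
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition scores are what let the CRF forbid invalid sequences (e.g., an `I-` tag directly after `O`), which plain per-token softmax classification cannot do.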
Below is the structure of the folders and files used in the project:
```
.
├── README.md
├── datasets
│   ├── CTI-reports
│   ├── DNRTI
│   ├── MalwareTextDB
│   └── conll2003
├── images
│   ├── model_structure.png
│   └── ppt
├── outputs
│   └── ner_bert_crf_checkpoint.pt
├── predict.py
├── requirements.txt
├── stopwords
│   ├── stopWord_summar.txt
│   ├── stopWord_test.txt
│   └── stopwords.txt
├── train.py
├── cloab_bert_crf.ipynb
└── website
    ├── README.md
    ├── app.py
    ├── requirements.txt
    ├── saved_dictionary3.pkl
    ├── static
    ├── templates
    └── test.ipynb
```
As of Thu Mar 30, 2023, the following are required:
- Ubuntu 20.04
- NVIDIA RTX A5000
- NVIDIA-SMI: 515.86.01
- Driver Version: 515.86.01
- CUDA Version: 11.7
- Python: 3.8
If the name of the folder is "NER_BERT_OPEN_VERSION", move it to the Desktop:

```bash
mv NER_BERT_OPEN_VERSION ~/Desktop/
```
```bash
# install Anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh

# create and activate the environment
conda create --name bert_env python=3.8
conda activate bert_env

# install dependencies and train
pip3 install -r requirements.txt
python3 train.py
```
Note: You can also skip training and proceed to Step 6. If you skip it, download the pre-trained model from Google Drive, move it to the `outputs` folder, and keep the file name unchanged. Verify the file hash: `MD5 (ner_bert_crf_checkpoint.pt) = 4faa7b6cd4a44cd8ac829611c0920b08`.
Use the following command to predict a sentence:

```bash
python3 predict.py -I "INPUT SENTENCE"
# or
python3 predict.py --input "INPUT SENTENCE"
```
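The `-I` / `--input` flags above suggest an argparse-style interface; a minimal sketch of such a parser is shown below (the actual wiring to the model inside `predict.py` may differ):

```python
# Illustrative CLI parser matching the -I / --input usage shown above.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Predict named entities in a sentence.")
    # -I and --input share the dest "input"
    parser.add_argument("-I", "--input", required=True,
                        help="the sentence to tag")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(["--input", "example sentence"])
    print(args.input)
```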
You can also run colab_bert_crf.ipynb directly:

```bash
git clone https://github.com/stwater20/ner_bert_crf_open_version.git
```
- Upload all files to Google Drive.
- Link your Drive space in Colab.
- Run the code!
Recommendation: Use GPU to run the code.
Pre-processing has been performed on this dataset.
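Datasets such as conll2003 and DNRTI are commonly distributed in CoNLL/BIO format (one token and tag per line, blank line between sentences). The reader below is a hedged sketch of loading that layout; the exact column order in each dataset may differ:

```python
# Illustrative reader for CoNLL/BIO-style files: assumes the token is
# the first column and the tag the last, sentences separated by blanks.

def read_bio(lines):
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences
```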
The performance of various models on different datasets is summarized below:
| Model | DNRTI | CTI-reports | MalwareTextDB |
|---|---|---|---|
| BERT_CRF | 90.02% | 77.29% | 58.57% |
| secBERT_CRF | 88.62% | 72.52% | 62.53% |
| BERT_BiLSTM_CRF | 84.59% | 74.39% | 45.59% |
| secBERT_BiLSTM_CRF | 83.77% | 68.05% | 47.07% |
Here's a sample prediction:
We also compared our model against other approaches, including GPT-3.5 and DistilBERT base cased distilled SQuAD.
To gauge the model's real-world usability, we built a website for live comparison. Because benchmark results do not always reflect real-world scenarios, we manually annotated and evaluated approximately 9,000 OSINT records, confirming that our method outperforms the alternatives.
| Model | Score |
|---|---|
| BERT | 82.64% |
| GPT | 64.56% |
| BERT_QA | 36.68% |
If you find this code helpful or use it in your research, please consider citing our work. Here's the citation information for our accepted paper:
```bibtex
@inproceedings{chen2023enhancing,
  title={Enhancing Cyber Threat Intelligence with Named Entity Recognition Using BERT-CRF},
  author={Chen, Sheng-Shan and Hwang, Ren-Hung and Sun, Chin-Yu and Lin, Ying-Dar and Pai, Tun-Wen},
  booktitle={GLOBECOM 2023-2023 IEEE Global Communications Conference},
  pages={7532--7537},
  year={2023},
  organization={IEEE}
}
```