CyberBERT is a deep learning model for network traffic classification built on the DistilBERT architecture. It processes network flow data and classifies traffic, from benign flows to several attack patterns, with high accuracy.
- ⚡ Deep learning-based network traffic classification
- 📊 Real-time & batch traffic analysis
- 🧮 Processes 84 network flow features (via CICFlowMeter)
- 🔤 Converts numerical flow features to BERT-compatible text
- 🖥️ Command-line interface for flexible configuration
- 🧠 Supports CPU & GPU training/inference
- 🧹 Data preprocessing & cleaning (via notebook)
- 📈 Training progress monitoring & model checkpointing
- 🏷️ Multi-class traffic classification (6 classes)
- 🏆 Feature selection for improved performance
- 🚀 Mixed precision training (if available)
- ⏹️ Early stopping to prevent overfitting
- ⚖️ Class weight balancing for imbalanced datasets
- 📉 Visualizations of training metrics & confusion matrices
- 🖥️ System monitoring for resource tracking
- 🖱️ Unified runner scripts for Windows & Linux/Mac
- ⚙️ Environment variable configuration via `.env`

Project structure:
```
cyberbert_project/
├── CICFlowMeter/                 # Network flow feature extraction
│   ├── CICFlowMeter/             # Core implementation
│   │   ├── __init__.py
│   │   ├── features.py
│   │   ├── flow_meter.py
│   │   ├── flow_session.py
│   │   ├── main.py
│   │   ├── utils.py
│   │   └── __pycache__/
│   ├── docs/                     # Feature documentation
│   │   └── feature_documentation.md
│   ├── requirements.txt          # CICFlowMeter dependencies
│   ├── setup.py                  # Setup script
│   └── README.md                 # Usage guide
├── data/
│   └── processed/
│       └── clean_data.csv        # Processed flow data
├── documents/
│   ├── CyberBERT_Classification_Performance_Slide.txt
│   ├── CyberBERT_Classification_Performance_Speech.txt
│   ├── cyberbert_ppt.pptx
│   ├── CyberBERT_Presentation_Slides.txt
│   ├── CyberBERT_Presentation_Speech.txt
│   ├── CyberBERT_Project_Documentation.txt
│   ├── CyberBERT_Technical_Implementation_Guide.txt
│   ├── CyberBERT_Tools_And_Technologies.txt
│   └── cyberbert.txt
├── logs/
│   ├── cyberbert_20250411_184056.log
│   ├── cyberbert_20250411_184225.log
│   ├── cyberbert_20250411_190129.log
│   ├── cyberbert_20250411_190929.log
│   ├── cyberbert_20250411_200734.log
│   ├── cyberbert_20250411_201357.log
│   ├── cyberbert_20250411_202333.log
│   ├── cyberbert_20250411_210155.log
│   ├── cyberbert_20250411_210851.log
│   └── cyberbert_20250411_212428.log
├── models/
│   ├── cyberbert_model/
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── vocab.txt
│   └── trained_cyberbert/
│       ├── label_mapping.json
│       ├── selected_features.txt
│       ├── training_config.json
│       ├── training_curves.png
│       ├── training_results.json
│       ├── best_model/
│       │   ├── config.json
│       │   ├── metadata.json
│       │   ├── model.safetensors
│       │   ├── special_tokens_map.json
│       │   ├── tokenizer_config.json
│       │   └── vocab.txt
│       ├── interrupted_checkpoint/
│       └── metrics/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── data_loader.py
│   │   ├── dataset.py
│   │   └── __pycache__/
│   ├── data_preprocessing/
│   │   └── data_cleaner.ipynb
│   ├── download_model/
│   │   └── p_c_d_check.py
│   ├── services/
│   │   ├── flow_labeler.py
│   │   └── __pycache__/
│   ├── training/
│   │   ├── trainer.py
│   │   └── __pycache__/
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logger.py
│       ├── metrics.py
│       ├── model_registry.py
│       ├── system_monitor.py
│       └── __pycache__/
├── consolidated_runner.bat       # Windows batch script
├── consolidated_runner.sh        # Linux/Mac shell script
├── flows.csv                     # Sample flow data
├── flows.db                      # Flow database
├── requirements_base.txt         # Project dependencies
├── train.py                      # Main training script
└── README.md                     # This file
```
Below is an example `.env` file for configuring CyberBERT. Place it in the project root to customize model, dataset, and training parameters:
```env
# CyberBERT Environment Configuration

# Model to download
MODEL_NAME=distilbert-base-uncased

# Dataset URL (leave empty if no dataset to download)
DATASET_URL=

# Training parameters
EPOCHS=1
BATCH_SIZE=8
FEATURE_COUNT=78
MAX_LENGTH=64
SAMPLE_FRACTION=0.05

# CPU-specific parameters (used when no GPU is available)
CPU_EPOCHS=3
CPU_BATCH_SIZE=4
CPU_MAX_LENGTH=64
CPU_FEATURE_COUNT=78
CPU_SAMPLE_FRACTION=0.5
```
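A minimal sketch of reading these values in Python, assuming the `python-dotenv` package (the project's own configuration loader may differ):

```python
# Sketch: reading .env values with python-dotenv (illustrative only;
# the project's own configuration loader may differ).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

model_name = os.environ.get("MODEL_NAME", "distilbert-base-uncased")
epochs = int(os.environ.get("EPOCHS", "1"))
batch_size = int(os.environ.get("BATCH_SIZE", "8"))
sample_fraction = float(os.environ.get("SAMPLE_FRACTION", "0.05"))

print(f"Training {model_name} for {epochs} epoch(s), batch size {batch_size}")
```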
Follow these steps to set up and run the project:
- Clone the repository:
  ```bash
  git clone https://github.com/agrawalchaitany/CyberBERT.git
  cd CyberBERT
  ```
- Create a virtual environment:
  - On Windows:
    ```bash
    python -m venv venv
    .\venv\Scripts\activate
    ```
  - On Linux/Mac:
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```
- Install dependencies:
  ```bash
  pip install -r requirements_base.txt
  pip install -r CICFlowMeter/requirements.txt
  ```
- (Optional) Edit the `.env` file: copy the example from this README and adjust parameters as needed.
- Run the consolidated runner script:
  - On Windows:
    ```bash
    .\consolidated_runner.bat
    ```
  - On Linux/Mac:
    ```bash
    chmod +x consolidated_runner.sh
    ./consolidated_runner.sh
    ```
Or, for manual training, see the Manual Usage section below.
The project uses consolidated runner scripts for easy environment setup, model download, dataset download, and training.

On Windows:
```bash
.\consolidated_runner.bat
```

On Linux/Mac:
```bash
chmod +x consolidated_runner.sh
./consolidated_runner.sh
```

The runner offers four modes:
- Setup environment and train
- Just train
- Download model only
- Download dataset only

Edit the `.env` file to customize the model, dataset URL, and training parameters before running the scripts. If it doesn't exist, the runner will create it with defaults.
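The `CPU_*` parameters apply only when no GPU is available. To confirm which set applies on your machine, a quick check with PyTorch:

```python
# Quick check: does PyTorch see a GPU? The runner and .env fall back to
# the CPU_* parameters on machines where this returns False.
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; CPU_* parameters will apply.")
```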
For manual training, invoke `train.py` directly. On a GPU machine:

```bash
python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 8 \
                --mixed-precision \
                --cache-tokenization \
                --feature-count 62 \
                --max-length 64
```

For CPU-only machines, use a smaller batch size and sample fraction:

```bash
python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 1 \
                --max-length 64 \
                --sample-frac 0.5 \
                --feature-count 62 \
                --no-cache-tokenization
```
- `--data`: Path to input CSV (default: `data/processed/clean_data.csv`)
- `--model`: Path to pre-trained model (default: `models/cyberbert_model`)
- `--output`: Directory to save the trained model (default: `models/trained_cyberbert`)
- `--epochs`: Number of training epochs (default: 3)
- `--batch-size`: Training batch size (default: 8)
- `--max-length`: Maximum sequence length (default: 64)
- `--feature-count`: Number of top features to select (default: 62)
- `--mixed-precision`: Enable mixed precision (GPU only)
- `--cache-tokenization`: Cache tokenized data for faster training (uses more memory)
- `--early-stopping`: Early stopping patience in epochs (default: 3)
- `--eval-steps`: Evaluate on the validation set every N steps (default: 100; 0 to disable)
- `--monitor-system`: Enable detailed system resource monitoring (default: True)
- `--monitor-interval`: Interval in seconds for system monitoring (default: 5.0)
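For orientation, a sketch of how these options map onto Python's `argparse` (illustrative only; the actual parser in `train.py` may define them differently):

```python
# Illustrative argparse setup mirroring the options above; train.py's
# real parser may differ. --monitor-system is omitted for brevity.
import argparse

parser = argparse.ArgumentParser(description="Train CyberBERT on flow data")
parser.add_argument("--data", default="data/processed/clean_data.csv")
parser.add_argument("--model", default="models/cyberbert_model")
parser.add_argument("--output", default="models/trained_cyberbert")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument("--max-length", type=int, default=64)
parser.add_argument("--feature-count", type=int, default=62)
parser.add_argument("--mixed-precision", action="store_true")
parser.add_argument("--cache-tokenization", action="store_true")
parser.add_argument("--early-stopping", type=int, default=3)
parser.add_argument("--eval-steps", type=int, default=100)
parser.add_argument("--monitor-interval", type=float, default=5.0)

args = parser.parse_args()
print(args)
```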
CyberBERT includes a system monitoring tool that tracks:
- CPU usage
- Memory usage
- GPU utilization (if available)
- Disk usage & I/O
- Process-specific resource utilization
System metrics are logged to the console and saved to the output directory.
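A minimal sketch of such a monitoring loop using `psutil` (illustrative; the project's `src/utils/system_monitor.py` may be implemented differently):

```python
# Sketch: sampling system and process resources with psutil, in the
# spirit of the built-in monitor. Not the project's actual implementation.
import psutil

def sample_resources(interval=5.0, samples=3):
    proc = psutil.Process()  # the current Python process
    for _ in range(samples):
        cpu = psutil.cpu_percent(interval=interval)  # system-wide CPU %
        mem = psutil.virtual_memory().percent        # system-wide RAM %
        rss = proc.memory_info().rss / 1024**2       # this process, in MB
        print(f"CPU {cpu:.1f}% | RAM {mem:.1f}% | process RSS {rss:.1f} MB")

sample_resources()
```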
CyberBERT distinguishes six traffic classes:
- BENIGN (Normal Traffic)
- DoS GoldenEye
- DoS Slowhttptest
- PortScan
- FTP-Patator
- SSH-Patator
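The trained model directory includes a `label_mapping.json` tying class indices to these names. A sketch of loading it, with the JSON schema as an assumption:

```python
# Sketch: loading the saved label mapping. The exact JSON schema is an
# assumption; inspect the file to confirm (it may map labels to ids instead).
import json

with open("models/trained_cyberbert/label_mapping.json") as f:
    label_mapping = json.load(f)

# Assumed shape: {"0": "BENIGN", "1": "DoS GoldenEye", ...}
id_to_label = {int(k): v for k, v in label_mapping.items()}
print(id_to_label)
```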
Training results:
- Epochs: 3
- Selected features: 62
- Final Train Accuracy: 97.5%
- Final Validation Accuracy: 96.0%
- Final Validation F1 Score: 0.96
- Classification Report (last validation, 50 samples):
  ```
                    precision    recall  f1-score   support

            BENIGN       0.94      0.94      0.94        17
     DoS GoldenEye       0.90      0.90      0.90        10
  DoS Slowhttptest       1.00      1.00      1.00         5
       FTP-Patator       1.00      1.00      1.00         2
          PortScan       1.00      1.00      1.00        13
       SSH-Patator       1.00      1.00      1.00         3

          accuracy                           0.96        50
         macro avg       0.97      0.97      0.97        50
      weighted avg       0.96      0.96      0.96        50
  ```
- Total training time: ~8 hours (CPU, 3 epochs)
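A report in this format can be produced with scikit-learn's `classification_report`; a self-contained sketch with placeholder labels (not the project's actual evaluation code):

```python
# Sketch: generating a classification report with scikit-learn.
# The labels below are placeholders, not real validation output.
from sklearn.metrics import classification_report

y_true = ["BENIGN", "BENIGN", "PortScan", "SSH-Patator", "DoS GoldenEye"]
y_pred = ["BENIGN", "PortScan", "PortScan", "SSH-Patator", "DoS GoldenEye"]

print(classification_report(y_true, y_pred, digits=2))
```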
Input data format:
- Input: CSV file with 84 numerical features (from CICFlowMeter)
- Each flow must have a `Label` column
- Features are normalized and converted to model-compatible tokens (see the sketch after this list)

Preprocessing steps:
- Feature extraction:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -f pcap_file.pcap -o features.csv
  ```
- Data cleaning:
  ```bash
  jupyter notebook src/data_preprocessing/data_cleaner.ipynb
  ```
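A minimal sketch of one way numeric flow features can be rendered as text for a BERT-style tokenizer, with hypothetical feature names (the project's actual encoding in `src/data/dataset.py` may differ):

```python
# Sketch: turning a row of numeric flow features into a text sequence.
# Feature names and the encoding scheme are illustrative assumptions.

def flow_to_text(feature_names, values, precision=2):
    """Render features as 'name value' pairs joined into one string."""
    parts = [f"{name} {round(float(v), precision)}"
             for name, v in zip(feature_names, values)]
    return " ".join(parts)

features = ["flow_duration", "total_fwd_packets", "fwd_packet_length_mean"]
row = [1.2345, 10, 512.0]
print(flow_to_text(features, row))
# -> "flow_duration 1.23 total_fwd_packets 10.0 fwd_packet_length_mean 512.0"
```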
Model architecture:
- Base: DistilBERT (`distilbert-base-uncased`)
- Hidden size: 768, attention heads: 12, layers: 6
- Parameters: ~66M
- Sequence classification head with dropout
- Output classes: 6 traffic types
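For reference, this architecture corresponds roughly to the following Hugging Face Transformers instantiation (a sketch, not necessarily the project's exact loading code):

```python
# Sketch: instantiating DistilBERT with a 6-class sequence classification
# head. The head is randomly initialized until fine-tuned.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,  # six traffic classes
)
print(model.config.num_labels, model.num_parameters())  # ~66M parameters
```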
Hardware requirements:

Minimum:
- CPU: 2 cores
- RAM: 2GB
- Storage: 5GB free space
Recommended:
- CPU: 4+ cores
- RAM: 8GB+
- GPU: NVIDIA with 4GB+ VRAM
- Storage: 10GB+ SSD
Training setup:
- Data split: 80% train, 20% validation
- Hyperparameters: batch size 1 (CPU) or 8+ (GPU), learning rate 2e-5, max sequence length 64, early-stopping patience of 3 epochs, feature selection enabled (62 features), sample fraction 0.5, evaluation every 100 steps
- Training progression: initial accuracy ~24%, final accuracy ~97%, final F1 score 0.96
- Training duration: CPU ~8 hours for 3 epochs
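Class weight balancing (see the feature list) compensates for the skewed label distribution; a sketch using scikit-learn, with the support counts from the validation report as example data:

```python
# Sketch: computing balanced class weights for an imbalanced dataset.
# Counts below mirror the validation report; real training uses the full data.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(["BENIGN"] * 17 + ["PortScan"] * 13 + ["DoS GoldenEye"] * 10
                  + ["DoS Slowhttptest"] * 5 + ["SSH-Patator"] * 3
                  + ["FTP-Patator"] * 2)
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights.round(2))))  # rarer classes get larger weights
```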
Note: Only CyberBERT (DistilBERT-based) was trained and evaluated in this project.
| Model | Accuracy | F1-Score | Training Time (CPU) |
|---|---|---|---|
| CyberBERT | 96.0% | 0.96 | ~8 hours |
To classify traffic with a trained model:

- Real-time classification:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -i wi-fi -r -m models/trained_cyberbert/best_model
  ```
- Batch processing:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -f capture.pcap -m models/trained_cyberbert/best_model -o predictions.csv
  ```
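Under the hood, classifying one encoded flow with the trained model looks roughly like this (a sketch; the flow-text encoding is the assumption described earlier):

```python
# Sketch: classifying a single pre-encoded flow with the trained model.
# The example text follows the assumed "name value" encoding above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/trained_cyberbert/best_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

text = "flow_duration 1.23 total_fwd_packets 10.0 fwd_packet_length_mean 512.0"
inputs = tokenizer(text, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted class id:", logits.argmax(dim=-1).item())
```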
Performance tips:
- Decrease batch size or max sequence length if you encounter memory issues
- Reduce sample fraction for faster development
- Use system monitoring to identify bottlenecks
- For CPU training, use a small sample fraction for testing
- If you see model architecture mismatch warnings, ensure the `.env` model type matches the actual model
Contributions are welcome:
- Fork the repository
- Create your feature branch
- Submit a pull request
If you use this project in your research, please cite:
```bibtex
@software{cyberbert2025,
  author    = {Chaitany Agrawal},
  title     = {CyberBERT: Network Traffic Classification Using BERT},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/agrawalchaitany/CyberBERT}
}
```
- CICIDS2017 dataset providers
- Hugging Face Transformers team
- CICFlowMeter developers
- Network security community