🚦 CyberBERT: Network Traffic Classification Using DistilBERT

CyberBERT is a deep learning model for network traffic classification, leveraging the DistilBERT architecture. It processes network flow data to classify various types of network traffic, including benign and malicious patterns, with high accuracy.


✨ Features

  • ⚡ Deep learning-based network traffic classification
  • 📊 Real-time & batch traffic analysis
  • 🧮 Processes 84 network flow features (via CICFlowMeter)
  • 🔤 Converts numerical flow features to BERT-compatible text
  • 🖥️ Command-line interface for flexible configuration
  • 🧠 Supports CPU & GPU training/inference
  • 🧹 Data preprocessing & cleaning (via notebook)
  • 📈 Training progress monitoring & model checkpointing
  • 🏷️ Multi-class traffic classification (6 classes)
  • 🏆 Feature selection for improved performance
  • 🚀 Mixed precision training (if available)
  • ⏹️ Early stopping to prevent overfitting
  • ⚖️ Class weight balancing for imbalanced datasets
  • 📉 Visualizations of training metrics & confusion matrices
  • 🖥️ System monitoring for resource tracking
  • 🖱️ Unified runner scripts for Windows & Linux/Mac
  • ⚙️ Environment variable configuration via .env

📁 Project Structure

cyberbert_project/
├── CICFlowMeter/            # Network flow feature extraction
│   ├── CICFlowMeter/        # Core implementation
│   │   ├── __init__.py
│   │   ├── features.py
│   │   ├── flow_meter.py
│   │   ├── flow_session.py
│   │   ├── main.py
│   │   ├── utils.py
│   │   └── __pycache__/
│   ├── docs/                # Feature documentation
│   │   └── feature_documentation.md
│   ├── requirements.txt     # CICFlowMeter dependencies
│   ├── setup.py             # Setup script
│   └── README.md            # Usage guide
├── data/
│   └── processed/
│       └── clean_data.csv   # Processed flow data
├── documents/
│   ├── CyberBERT_Classification_Performance_Slide.txt
│   ├── CyberBERT_Classification_Performance_Speech.txt
│   ├── cyberbert_ppt.pptx
│   ├── CyberBERT_Presentation_Slides.txt
│   ├── CyberBERT_Presentation_Speech.txt
│   ├── CyberBERT_Project_Documentation.txt
│   ├── CyberBERT_Technical_Implementation_Guide.txt
│   ├── CyberBERT_Tools_And_Technologies.txt
│   └── cyberbert.txt
├── logs/
│   ├── cyberbert_20250411_184056.log
│   ├── cyberbert_20250411_184225.log
│   ├── cyberbert_20250411_190129.log
│   ├── cyberbert_20250411_190929.log
│   ├── cyberbert_20250411_200734.log
│   ├── cyberbert_20250411_201357.log
│   ├── cyberbert_20250411_202333.log
│   ├── cyberbert_20250411_210155.log
│   ├── cyberbert_20250411_210851.log
│   └── cyberbert_20250411_212428.log
├── models/
│   ├── cyberbert_model/
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── vocab.txt
│   └── trained_cyberbert/
│       ├── label_mapping.json
│       ├── selected_features.txt
│       ├── training_config.json
│       ├── training_curves.png
│       ├── training_results.json
│       ├── best_model/
│       │   ├── config.json
│       │   ├── metadata.json
│       │   ├── model.safetensors
│       │   ├── special_tokens_map.json
│       │   ├── tokenizer_config.json
│       │   └── vocab.txt
│       ├── interrupted_checkpoint/
│       └── metrics/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── data_loader.py
│   │   ├── dataset.py
│   │   └── __pycache__/
│   ├── data_preprocessing/
│   │   └── data_cleaner.ipynb
│   ├── download_model/
│   │   └── p_c_d_check.py
│   ├── services/
│   │   ├── flow_labeler.py
│   │   └── __pycache__/
│   ├── training/
│   │   ├── trainer.py
│   │   └── __pycache__/
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logger.py
│       ├── metrics.py
│       ├── model_registry.py
│       ├── system_monitor.py
│       └── __pycache__/
├── consolidated_runner.bat  # Windows batch script
├── consolidated_runner.sh   # Linux/Mac shell script
├── flows.csv                # Sample flow data
├── flows.db                 # Flow database
├── requirements_base.txt    # Project dependencies
├── train.py                 # Main training script
└── README.md                # This file

🗂️ Example .env File

Below is an example .env file for configuring CyberBERT. Place this file in your project root to customize model, dataset, and training parameters:

# CyberBERT Environment Configuration

# Model to download
MODEL_NAME=distilbert-base-uncased

# Dataset URL (leave empty if no dataset to download)
DATASET_URL=

# Training parameters
EPOCHS=1
BATCH_SIZE=8
FEATURE_COUNT=78
MAX_LENGTH=64
SAMPLE_FRACTION=0.05

# CPU-specific parameters (used when no GPU is available)
CPU_EPOCHS=3
CPU_BATCH_SIZE=4
CPU_MAX_LENGTH=64
CPU_FEATURE_COUNT=78
CPU_SAMPLE_FRACTION=0.5
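
These values can be loaded at runtime. Below is a minimal sketch using the python-dotenv package (an assumption; the actual runner scripts may parse the file differently):

# Sketch: read the .env values above with python-dotenv (assumed dependency).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

MODEL_NAME = os.getenv("MODEL_NAME", "distilbert-base-uncased")
EPOCHS = int(os.getenv("EPOCHS", "1"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "8"))
FEATURE_COUNT = int(os.getenv("FEATURE_COUNT", "78"))
MAX_LENGTH = int(os.getenv("MAX_LENGTH", "64"))
SAMPLE_FRACTION = float(os.getenv("SAMPLE_FRACTION", "0.05"))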

🚦 Getting Started

Follow these steps to set up and run the project:

  1. Clone the repository
    git clone https://github.com/agrawalchaitany/CyberBERT.git
    cd CyberBERT
  2. Create a virtual environment
    • On Windows:
      python -m venv venv
      .\venv\Scripts\activate
    • On Linux/Mac:
      python3 -m venv venv
      source venv/bin/activate
  3. Install dependencies
    pip install -r requirements_base.txt
    pip install -r CICFlowMeter/requirements.txt
  4. (Optional) Edit the .env file
    • Copy the example from this README and adjust parameters as needed.
  5. Run the consolidated runner script
    • On Windows:
      .\consolidated_runner.bat
    • On Linux/Mac:
      chmod +x consolidated_runner.sh
      ./consolidated_runner.sh

Or, for manual training, see the Manual Usage section below.


🚀 Installation & Usage

The project uses consolidated runner scripts for easy environment setup, model download, dataset download, and training.

▶️ Using the Consolidated Runner Scripts

Windows

.\consolidated_runner.bat

Linux/Mac

chmod +x consolidated_runner.sh
./consolidated_runner.sh

Runner Options

  1. Setup environment and train
  2. Just train
  3. Download model only
  4. Download dataset only

⚙️ Configuration via .env

Edit the .env file to customize the model, dataset URL, and training parameters before running the scripts. If it doesn't exist, the runner will create it with defaults.


🛠️ Manual Usage (Alternative)

GPU Example

python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 8 \
                --mixed-precision \
                --cache-tokenization \
                --feature-count 62 \
                --max-length 64

CPU Example

python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 1 \
                --max-length 64 \
                --sample-frac 0.5 \
                --feature-count 62 \
                --no-cache-tokenization

Main Command Line Arguments

  • --data: Path to input CSV (default: data/processed/clean_data.csv)
  • --model: Path to pre-trained model (default: models/cyberbert_model)
  • --output: Directory to save trained model (default: models/trained_cyberbert)
  • --epochs: Number of training epochs (default: 3)
  • --batch-size: Training batch size (default: 8)
  • --max-length: Max sequence length (default: 64)
  • --feature-count: Number of top features to select (default: 62)
  • --mixed-precision: Enable mixed precision (GPU only)
  • --cache-tokenization: Cache tokenized data for faster training (uses more memory)
  • --early-stopping: Early stopping patience in epochs (default: 3)
  • --eval-steps: Evaluate on validation set every N steps (default: 100, 0 to disable)
  • --monitor-system: Enable detailed system resource monitoring (default: True)
  • --monitor-interval: Interval in seconds for system monitoring (default: 5.0)
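
For reference, the sketch below shows how flags like these could be declared with Python's argparse. Names and defaults mirror the list above, but this is an illustration, not the actual train.py source:

# Hypothetical argparse setup mirroring the documented flags; train.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Train CyberBERT on network flow data")
parser.add_argument("--data", default="data/processed/clean_data.csv")
parser.add_argument("--model", default="models/cyberbert_model")
parser.add_argument("--output", default="models/trained_cyberbert")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument("--max-length", type=int, default=64)
parser.add_argument("--feature-count", type=int, default=62)
parser.add_argument("--mixed-precision", action="store_true")
parser.add_argument("--early-stopping", type=int, default=3)
parser.add_argument("--eval-steps", type=int, default=100)
args = parser.parse_args()  # e.g. args.batch_size, args.mixed_precision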

🖥️ System Monitoring

CyberBERT includes a system monitoring tool that tracks:

  • CPU usage
  • Memory usage
  • GPU utilization (if available)
  • Disk usage & I/O
  • Process-specific resource utilization

System metrics are logged to the console and saved to the output directory.
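
A minimal sketch of the kind of sampling loop such a monitor might run, assuming the psutil package (the project's system_monitor.py may differ):

# Sketch: periodic resource sampling with psutil (assumed dependency).
import time
import psutil

def sample(interval=5.0, samples=3):
    proc = psutil.Process()  # this Python process
    for _ in range(samples):
        cpu = psutil.cpu_percent()                 # system-wide CPU %
        mem = psutil.virtual_memory().percent      # system memory %
        rss = proc.memory_info().rss / 1024**2     # process resident memory, MB
        print(f"cpu={cpu:.1f}% mem={mem:.1f}% rss={rss:.1f}MB")
        time.sleep(interval)

sample()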


🏷️ Supported Traffic Classes

  • BENIGN (Normal Traffic)
  • DoS GoldenEye
  • DoS Slowhttptest
  • PortScan
  • FTP-Patator
  • SSH-Patator

📊 Training Results (from latest log)

  • Epochs: 3
  • Selected features: 62
  • Final Train Accuracy: 97.5%
  • Final Validation Accuracy: 96.0%
  • Final Validation F1 Score: 0.96
  • Classification Report (last validation, 50 samples):
                          precision    recall  f1-score   support

                  BENIGN       0.94      0.94      0.94        17
           DoS GoldenEye       0.90      0.90      0.90        10
        DoS Slowhttptest       1.00      1.00      1.00         5
             FTP-Patator       1.00      1.00      1.00         2
                PortScan       1.00      1.00      1.00        13
             SSH-Patator       1.00      1.00      1.00         3

                accuracy                           0.96        50
               macro avg       0.97      0.97      0.97        50
            weighted avg       0.96      0.96      0.96        50
  • Total training time: ~8 hours (CPU, 3 epochs)
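
The report above follows scikit-learn's classification_report format. A minimal sketch of producing one (the labels here are placeholders, not project output):

# Sketch: generating a classification report with scikit-learn.
from sklearn.metrics import classification_report

y_true = ["BENIGN", "PortScan", "BENIGN", "SSH-Patator"]   # placeholder labels
y_pred = ["BENIGN", "PortScan", "DoS GoldenEye", "SSH-Patator"]
print(classification_report(y_true, y_pred, zero_division=0))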

📄 Data Format

  • Input: CSV file with 84 numerical features (from CICFlowMeter)
  • Each flow must have a Label column
  • Features are normalized and converted to model-compatible tokens
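
The exact serialization lives in the project's dataset code; the sketch below shows one plausible scheme for turning a normalized flow row into tokenizer-ready text (feature names here are hypothetical):

# Hypothetical sketch: serialize a normalized flow row as space-separated
# "name value" pairs that a BERT tokenizer can consume. The project's actual
# conversion may differ.
def flow_to_text(row: dict) -> str:
    return " ".join(f"{name} {value:.3f}" for name, value in row.items())

row = {"flow_duration": 0.42, "total_fwd_packets": 0.10, "flow_bytes_per_s": 0.87}
print(flow_to_text(row))
# -> flow_duration 0.420 total_fwd_packets 0.100 flow_bytes_per_s 0.870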

🧹 Data Preprocessing

  1. Feature extraction:
    python -m CICFlowMeter.CICFlowMeter.main -f pcap_file.pcap -o features.csv
  2. Data cleaning:
    jupyter notebook src/data_preprocessing/data_cleaner.ipynb

🏗️ Model Architecture

  • Base: DistilBERT (distilbert-base-uncased)
  • Hidden size: 768, Attention heads: 12, Layers: 6
  • Parameters: ~66M
  • Sequence classification head with dropout
  • Output classes: 6 traffic types
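
A minimal sketch of instantiating this architecture with the Hugging Face Transformers API (the checkpoints under models/ already bundle this configuration):

# Sketch: DistilBERT with a 6-class sequence classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,  # six traffic classes
)

inputs = tokenizer("flow_duration 0.420 total_fwd_packets 0.100",
                   truncation=True, max_length=64, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 6), one score per class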

💻 Hardware Requirements

Minimum:

  • CPU: 2 cores
  • RAM: 2GB
  • Storage: 5GB free space

Recommended:

  • CPU: 4+ cores
  • RAM: 8GB+
  • GPU: NVIDIA with 4GB+ VRAM
  • Storage: 10GB+ SSD

🏋️ Training Details

  • Data split: 80% train, 20% validation
  • Hyperparameters:
    • Batch size: 1 (CPU), 8+ (GPU)
    • Learning rate: 2e-5
    • Max sequence length: 64
    • Early stopping patience: 3 epochs
    • Feature selection: enabled (62 features)
    • Sample fraction: 0.5
    • Eval steps: 100
  • Training progression: initial accuracy ~24%, final accuracy ~97%, final F1 score 0.96
  • Training duration: CPU ~8 hours for 3 epochs
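
A minimal sketch of the 80/20 split above and the class-weight balancing listed under Features, using scikit-learn (the Label column is part of the data format; the rest is illustrative):

# Sketch: stratified 80/20 split plus balanced class weights.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv("data/processed/clean_data.csv")
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)

classes = np.unique(train_df["Label"])
weights = compute_class_weight("balanced", classes=classes, y=train_df["Label"])
print(dict(zip(classes, weights)))  # rarer classes get larger weights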

🏆 Benchmarks

Note: Only CyberBERT (DistilBERT-based) was trained and evaluated in this project.

Model        Accuracy   F1-Score   Training Time (CPU)
CyberBERT    96.0%      0.96       ~8 hours

🔎 Inference

Usage Example

  • Real-time classification:
    python -m CICFlowMeter.CICFlowMeter.main -i wi-fi -r -m models/trained_cyberbert/best_model
  • Batch processing:
    python -m CICFlowMeter.CICFlowMeter.main -f capture.pcap -m models/trained_cyberbert/best_model -o predictions.csv

🛠️ Troubleshooting

  • Decrease batch size or max sequence length if you encounter memory issues
  • Reduce sample fraction for faster development
  • Use system monitoring to identify bottlenecks
  • For CPU training, use a small sample fraction for testing
  • If you see model architecture mismatch warnings, ensure the .env model type matches the actual model

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Submit a pull request

📚 Citation

If you use this project in your research, please cite:

@software{cyberbert2025,
  author = {Chaitany Agrawal},
  title = {CyberBERT: Network Traffic Classification Using BERT},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/agrawalchaitany/CyberBERT}
}

🙏 Acknowledgments

  • CICIDS2017 dataset providers
  • Hugging Face Transformers team
  • CICFlowMeter developers
  • Network security community
