CyberBERT is a deep learning model for network traffic classification built on the DistilBERT architecture. It processes network flow data and classifies traffic, from benign flows to several attack patterns, with high accuracy.
- ⚡ Deep learning-based network traffic classification
- 📊 Real-time & batch traffic analysis
- 🧮 Processes 84 network flow features (via CICFlowMeter)
- 🔤 Converts numerical flow features to BERT-compatible text
- 🖥️ Command-line interface for flexible configuration
- 🧠 Supports CPU & GPU training/inference
- 🧹 Data preprocessing & cleaning (via notebook)
- 📈 Training progress monitoring & model checkpointing
- 🏷️ Multi-class traffic classification (6 classes)
- 🏆 Feature selection for improved performance
- 🚀 Mixed precision training (if available)
- ⏹️ Early stopping to prevent overfitting
- ⚖️ Class weight balancing for imbalanced datasets
- 📉 Visualizations of training metrics & confusion matrices
- 🖥️ System monitoring for resource tracking
- 🖱️ Unified runner scripts for Windows & Linux/Mac
- ⚙️ Environment variable configuration via `.env`

Project structure:
```
cyberbert_project/
├── CICFlowMeter/                 # Network flow feature extraction
│   ├── CICFlowMeter/             # Core implementation
│   │   ├── __init__.py
│   │   ├── features.py
│   │   ├── flow_meter.py
│   │   ├── flow_session.py
│   │   ├── main.py
│   │   ├── utils.py
│   │   └── __pycache__/
│   ├── docs/                     # Feature documentation
│   │   └── feature_documentation.md
│   ├── requirements.txt          # CICFlowMeter dependencies
│   ├── setup.py                  # Setup script
│   └── README.md                 # Usage guide
├── data/
│   └── processed/
│       └── clean_data.csv        # Processed flow data
├── documents/
│   ├── CyberBERT_Classification_Performance_Slide.txt
│   ├── CyberBERT_Classification_Performance_Speech.txt
│   ├── cyberbert_ppt.pptx
│   ├── CyberBERT_Presentation_Slides.txt
│   ├── CyberBERT_Presentation_Speech.txt
│   ├── CyberBERT_Project_Documentation.txt
│   ├── CyberBERT_Technical_Implementation_Guide.txt
│   ├── CyberBERT_Tools_And_Technologies.txt
│   └── cyberbert.txt
├── logs/
│   ├── cyberbert_20250411_184056.log
│   ├── cyberbert_20250411_184225.log
│   ├── cyberbert_20250411_190129.log
│   ├── cyberbert_20250411_190929.log
│   ├── cyberbert_20250411_200734.log
│   ├── cyberbert_20250411_201357.log
│   ├── cyberbert_20250411_202333.log
│   ├── cyberbert_20250411_210155.log
│   ├── cyberbert_20250411_210851.log
│   └── cyberbert_20250411_212428.log
├── models/
│   ├── cyberbert_model/
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── vocab.txt
│   └── trained_cyberbert/
│       ├── label_mapping.json
│       ├── selected_features.txt
│       ├── training_config.json
│       ├── training_curves.png
│       ├── training_results.json
│       ├── best_model/
│       │   ├── config.json
│       │   ├── metadata.json
│       │   ├── model.safetensors
│       │   ├── special_tokens_map.json
│       │   ├── tokenizer_config.json
│       │   └── vocab.txt
│       ├── interrupted_checkpoint/
│       └── metrics/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── data_loader.py
│   │   ├── dataset.py
│   │   └── __pycache__/
│   ├── data_preprocessing/
│   │   └── data_cleaner.ipynb
│   ├── download_model/
│   │   └── p_c_d_check.py
│   ├── services/
│   │   ├── flow_labeler.py
│   │   └── __pycache__/
│   ├── training/
│   │   ├── trainer.py
│   │   └── __pycache__/
│   └── utils/
│       ├── __init__.py
│       ├── config.py
│       ├── logger.py
│       ├── metrics.py
│       ├── model_registry.py
│       ├── system_monitor.py
│       └── __pycache__/
├── consolidated_runner.bat       # Windows batch script
├── consolidated_runner.sh        # Linux/Mac shell script
├── flows.csv                     # Sample flow data
├── flows.db                      # Flow database
├── requirements_base.txt         # Project dependencies
├── train.py                      # Main training script
└── README.md                     # This file
```
Below is an example `.env` file for configuring CyberBERT. Place it in the project root to customize model, dataset, and training parameters:
```env
# CyberBERT Environment Configuration

# Model to download
MODEL_NAME=distilbert-base-uncased

# Dataset URL (leave empty if no dataset to download)
DATASET_URL=

# Training parameters
EPOCHS=1
BATCH_SIZE=8
FEATURE_COUNT=78
MAX_LENGTH=64
SAMPLE_FRACTION=0.05

# CPU-specific parameters (used when no GPU is available)
CPU_EPOCHS=3
CPU_BATCH_SIZE=4
CPU_MAX_LENGTH=64
CPU_FEATURE_COUNT=78
CPU_SAMPLE_FRACTION=0.5
```
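A minimal sketch of reading these values in Python, assuming the `python-dotenv` package (the project's own configuration loader may differ):

```python
# Sketch: reading .env values with python-dotenv (illustrative only;
# the project's own configuration loader may differ).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

model_name = os.environ.get("MODEL_NAME", "distilbert-base-uncased")
epochs = int(os.environ.get("EPOCHS", "1"))
batch_size = int(os.environ.get("BATCH_SIZE", "8"))
sample_fraction = float(os.environ.get("SAMPLE_FRACTION", "0.05"))

print(f"Training {model_name} for {epochs} epoch(s), batch size {batch_size}")
```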
Follow these steps to set up and run the project:
- Clone the repository:
  ```bash
  git clone https://github.com/agrawalchaitany/CyberBERT.git
  cd CyberBERT
  ```
- Create a virtual environment:
  - On Windows:
    ```bash
    python -m venv venv
    .\venv\Scripts\activate
    ```
  - On Linux/Mac:
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```
- Install dependencies:
  ```bash
  pip install -r requirements_base.txt
  pip install -r CICFlowMeter/requirements.txt
  ```
- (Optional) Edit the `.env` file: copy the example from this README and adjust parameters as needed.
- Run the consolidated runner script:
  - On Windows:
    ```bash
    .\consolidated_runner.bat
    ```
  - On Linux/Mac:
    ```bash
    chmod +x consolidated_runner.sh
    ./consolidated_runner.sh
    ```
Or, for manual training, see the Manual Usage section below.
The project uses consolidated runner scripts for easy environment setup, model download, dataset download, and training.

On Windows:
```bash
.\consolidated_runner.bat
```

On Linux/Mac:
```bash
chmod +x consolidated_runner.sh
./consolidated_runner.sh
```

The runner offers four modes:
- Setup environment and train
- Just train
- Download model only
- Download dataset only

Edit the `.env` file to customize the model, dataset URL, and training parameters before running the scripts. If it doesn't exist, the runner will create it with defaults.
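The `CPU_*` parameters apply only when no GPU is available. To confirm which set applies on your machine, a quick check with PyTorch:

```python
# Quick check: does PyTorch see a GPU? The runner and .env fall back to
# the CPU_* parameters on machines where this returns False.
import torch

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; CPU_* parameters will apply.")
```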
For manual training, invoke `train.py` directly. On a GPU machine:

```bash
python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 8 \
                --mixed-precision \
                --cache-tokenization \
                --feature-count 62 \
                --max-length 64
```

For CPU-only machines, use a smaller batch size and sample fraction:

```bash
python train.py --data "data/processed/clean_data.csv" \
                --epochs 3 \
                --batch-size 1 \
                --max-length 64 \
                --sample-frac 0.5 \
                --feature-count 62 \
                --no-cache-tokenization
```
- `--data`: Path to input CSV (default: `data/processed/clean_data.csv`)
- `--model`: Path to pre-trained model (default: `models/cyberbert_model`)
- `--output`: Directory to save the trained model (default: `models/trained_cyberbert`)
- `--epochs`: Number of training epochs (default: 3)
- `--batch-size`: Training batch size (default: 8)
- `--max-length`: Maximum sequence length (default: 64)
- `--feature-count`: Number of top features to select (default: 62)
- `--mixed-precision`: Enable mixed precision (GPU only)
- `--cache-tokenization`: Cache tokenized data for faster training (uses more memory)
- `--early-stopping`: Early stopping patience in epochs (default: 3)
- `--eval-steps`: Evaluate on the validation set every N steps (default: 100; 0 to disable)
- `--monitor-system`: Enable detailed system resource monitoring (default: True)
- `--monitor-interval`: Interval in seconds for system monitoring (default: 5.0)
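For orientation, a sketch of how these options map onto Python's `argparse` (illustrative only; the actual parser in `train.py` may define them differently):

```python
# Illustrative argparse setup mirroring the options above; train.py's
# real parser may differ. --monitor-system is omitted for brevity.
import argparse

parser = argparse.ArgumentParser(description="Train CyberBERT on flow data")
parser.add_argument("--data", default="data/processed/clean_data.csv")
parser.add_argument("--model", default="models/cyberbert_model")
parser.add_argument("--output", default="models/trained_cyberbert")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument("--max-length", type=int, default=64)
parser.add_argument("--feature-count", type=int, default=62)
parser.add_argument("--mixed-precision", action="store_true")
parser.add_argument("--cache-tokenization", action="store_true")
parser.add_argument("--early-stopping", type=int, default=3)
parser.add_argument("--eval-steps", type=int, default=100)
parser.add_argument("--monitor-interval", type=float, default=5.0)

args = parser.parse_args()
print(args)
```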
CyberBERT includes a system monitoring tool that tracks:
- CPU usage
- Memory usage
- GPU utilization (if available)
- Disk usage & I/O
- Process-specific resource utilization
System metrics are logged to the console and saved to the output directory.
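A minimal sketch of such a monitoring loop using `psutil` (illustrative; the project's `src/utils/system_monitor.py` may be implemented differently):

```python
# Sketch: sampling system and process resources with psutil, in the
# spirit of the built-in monitor. Not the project's actual implementation.
import psutil

def sample_resources(interval=5.0, samples=3):
    proc = psutil.Process()  # the current Python process
    for _ in range(samples):
        cpu = psutil.cpu_percent(interval=interval)  # system-wide CPU %
        mem = psutil.virtual_memory().percent        # system-wide RAM %
        rss = proc.memory_info().rss / 1024**2       # this process, in MB
        print(f"CPU {cpu:.1f}% | RAM {mem:.1f}% | process RSS {rss:.1f} MB")

sample_resources()
```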
CyberBERT distinguishes six traffic classes:
- BENIGN (Normal Traffic)
- DoS GoldenEye
- DoS Slowhttptest
- PortScan
- FTP-Patator
- SSH-Patator
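The trained model directory includes a `label_mapping.json` tying class indices to these names. A sketch of loading it, with the JSON schema as an assumption:

```python
# Sketch: loading the saved label mapping. The exact JSON schema is an
# assumption; inspect the file to confirm (it may map labels to ids instead).
import json

with open("models/trained_cyberbert/label_mapping.json") as f:
    label_mapping = json.load(f)

# Assumed shape: {"0": "BENIGN", "1": "DoS GoldenEye", ...}
id_to_label = {int(k): v for k, v in label_mapping.items()}
print(id_to_label)
```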
Training results:
- Epochs: 3
- Selected features: 62
- Final Train Accuracy: 97.5%
- Final Validation Accuracy: 96.0%
- Final Validation F1 Score: 0.96
- Classification Report (last validation, 50 samples):
  ```
                    precision    recall  f1-score   support

            BENIGN       0.94      0.94      0.94        17
     DoS GoldenEye       0.90      0.90      0.90        10
  DoS Slowhttptest       1.00      1.00      1.00         5
       FTP-Patator       1.00      1.00      1.00         2
          PortScan       1.00      1.00      1.00        13
       SSH-Patator       1.00      1.00      1.00         3

          accuracy                           0.96        50
         macro avg       0.97      0.97      0.97        50
      weighted avg       0.96      0.96      0.96        50
  ```
- Total training time: ~8 hours (CPU, 3 epochs)
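A report in this format can be produced with scikit-learn's `classification_report`; a self-contained sketch with placeholder labels (not the project's actual evaluation code):

```python
# Sketch: generating a classification report with scikit-learn.
# The labels below are placeholders, not real validation output.
from sklearn.metrics import classification_report

y_true = ["BENIGN", "BENIGN", "PortScan", "SSH-Patator", "DoS GoldenEye"]
y_pred = ["BENIGN", "PortScan", "PortScan", "SSH-Patator", "DoS GoldenEye"]

print(classification_report(y_true, y_pred, digits=2))
```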
Input data format:
- Input: CSV file with 84 numerical features (from CICFlowMeter)
- Each flow must have a `Label` column
- Features are normalized and converted to model-compatible tokens (see the sketch after this list)

Preprocessing steps:
- Feature extraction:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -f pcap_file.pcap -o features.csv
  ```
- Data cleaning:
  ```bash
  jupyter notebook src/data_preprocessing/data_cleaner.ipynb
  ```
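A minimal sketch of one way numeric flow features can be rendered as text for a BERT-style tokenizer, with hypothetical feature names (the project's actual encoding in `src/data/dataset.py` may differ):

```python
# Sketch: turning a row of numeric flow features into a text sequence.
# Feature names and the encoding scheme are illustrative assumptions.

def flow_to_text(feature_names, values, precision=2):
    """Render features as 'name value' pairs joined into one string."""
    parts = [f"{name} {round(float(v), precision)}"
             for name, v in zip(feature_names, values)]
    return " ".join(parts)

features = ["flow_duration", "total_fwd_packets", "fwd_packet_length_mean"]
row = [1.2345, 10, 512.0]
print(flow_to_text(features, row))
# -> "flow_duration 1.23 total_fwd_packets 10.0 fwd_packet_length_mean 512.0"
```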
Model architecture:
- Base: DistilBERT (`distilbert-base-uncased`)
- Hidden size: 768, attention heads: 12, layers: 6
- Parameters: ~66M
- Sequence classification head with dropout
- Output classes: 6 traffic types
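For reference, this architecture corresponds roughly to the following Hugging Face Transformers instantiation (a sketch, not necessarily the project's exact loading code):

```python
# Sketch: instantiating DistilBERT with a 6-class sequence classification
# head. The head is randomly initialized until fine-tuned.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=6,  # six traffic classes
)
print(model.config.num_labels, model.num_parameters())  # ~66M parameters
```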
Hardware requirements:

Minimum:
- CPU: 2 cores
- RAM: 2GB
- Storage: 5GB free space
Recommended:
- CPU: 4+ cores
- RAM: 8GB+
- GPU: NVIDIA with 4GB+ VRAM
- Storage: 10GB+ SSD
Training setup:
- Data split: 80% train, 20% validation
- Hyperparameters: batch size 1 (CPU) or 8+ (GPU), learning rate 2e-5, max sequence length 64, early-stopping patience of 3 epochs, feature selection enabled (62 features), sample fraction 0.5, evaluation every 100 steps
- Training progression: initial accuracy ~24%, final accuracy ~97%, final F1 score 0.96
- Training duration: CPU ~8 hours for 3 epochs
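Class weight balancing (see the feature list) compensates for the skewed label distribution; a sketch using scikit-learn, with the support counts from the validation report as example data:

```python
# Sketch: computing balanced class weights for an imbalanced dataset.
# Counts below mirror the validation report; real training uses the full data.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(["BENIGN"] * 17 + ["PortScan"] * 13 + ["DoS GoldenEye"] * 10
                  + ["DoS Slowhttptest"] * 5 + ["SSH-Patator"] * 3
                  + ["FTP-Patator"] * 2)
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights.round(2))))  # rarer classes get larger weights
```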
Note: Only CyberBERT (DistilBERT-based) was trained and evaluated in this project.
| Model | Accuracy | F1-Score | Training Time (CPU) |
|---|---|---|---|
| CyberBERT | 96.0% | 0.96 | ~8 hours |
To classify traffic with a trained model:

- Real-time classification:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -i wi-fi -r -m models/trained_cyberbert/best_model
  ```
- Batch processing:
  ```bash
  python -m CICFlowMeter.CICFlowMeter.main -f capture.pcap -m models/trained_cyberbert/best_model -o predictions.csv
  ```
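Under the hood, classifying one encoded flow with the trained model looks roughly like this (a sketch; the flow-text encoding is the assumption described earlier):

```python
# Sketch: classifying a single pre-encoded flow with the trained model.
# The example text follows the assumed "name value" encoding above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "models/trained_cyberbert/best_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

text = "flow_duration 1.23 total_fwd_packets 10.0 fwd_packet_length_mean 512.0"
inputs = tokenizer(text, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted class id:", logits.argmax(dim=-1).item())
```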
Performance tips:
- Decrease batch size or max sequence length if you encounter memory issues
- Reduce sample fraction for faster development
- Use system monitoring to identify bottlenecks
- For CPU training, use a small sample fraction for testing
- If you see model architecture mismatch warnings, ensure the `.env` model type matches the actual model
Contributions are welcome:
- Fork the repository
- Create your feature branch
- Submit a pull request
If you use this project in your research, please cite:
```bibtex
@software{cyberbert2025,
  author    = {Chaitany Agrawal},
  title     = {CyberBERT: Network Traffic Classification Using BERT},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/agrawalchaitany/CyberBERT}
}
```
- CICIDS2017 dataset providers
- Hugging Face Transformers team
- CICFlowMeter developers
- Network security community