| 1 | | -# networkSecuritySystem |
| 1 | +# 🛡️ Network Security System - ML-Powered Phishing Detection |
| 2 | + |
| 3 | +[Python](https://www.python.org/downloads/) |
| 4 | +[FastAPI](https://fastapi.tiangolo.com/) |
| 5 | +[MLflow](https://mlflow.org/) |
| 6 | +[License: MIT](https://opensource.org/licenses/MIT) |
| 7 | + |
| 8 | +An end-to-end machine learning system for detecting phishing websites using network security data. Built with production-grade MLOps practices, including automated training pipelines, experiment tracking, and a real-time inference API. |
| 9 | + |
| 10 | +## 🎯 Project Highlights |
| 11 | + |
| 12 | +- **Production-Ready ML Pipeline**: Modular architecture with data ingestion, validation, transformation, and training components |
| 13 | +- **Model Performance**: 97.6% F1-score on test data with ensemble learning (XGBoost, Random Forest, Gradient Boosting) |
| 14 | +- **MLOps Integration**: Experiment tracking with MLflow, model versioning, and automated retraining capabilities |
| 15 | +- **RESTful API**: FastAPI-based inference service with Swagger documentation |
| 16 | +- **Data Quality Assurance**: Automated data validation and drift detection using statistical tests |
| 17 | +- **Scalable Design**: Configuration-driven architecture supporting multiple environments |
| 18 | + |
| 19 | +## 📊 Model Performance |
| 20 | + |
| 21 | +| Metric | Train | Test | |
| 22 | +|--------|-------|------| |
| 23 | +| **F1 Score** | 0.991 | 0.976 | |
| 24 | +| **Precision** | 0.987 | 0.966 | |
| 25 | +| **Recall** | 0.994 | 0.985 | |
| 26 | + |
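F1 is the harmonic mean of precision and recall, so the rows of the table above can be cross-checked against each other. The sketch below does that check; the helper name `f1_from` is illustrative and not part of the project. Because the table values are rounded to three decimals, the recomputed score only agrees with the reported one to within about 0.001.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above (already rounded to three decimals).
train_f1 = f1_from(0.987, 0.994)
test_f1 = f1_from(0.966, 0.985)
print(f"train F1 ~ {train_f1:.3f}, test F1 ~ {test_f1:.3f}")
```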
| 27 | +## 🏗️ Architecture |
| 28 | + |
| 29 | +### ML Training Pipeline |
| 30 | +``` |
| 31 | +MongoDB → Data Ingestion → Data Validation → Feature Engineering → Model Training → Model Registry |
| 32 | + ↓ ↓ ↓ ↓ ↓ ↓ |
| 33 | +Raw Data CSV Export Schema/Drift Check Preprocessing GridSearchCV MLflow Tracking |
| 34 | +``` |
| 35 | + |
| 36 | +### Inference Pipeline |
| 37 | +``` |
| 38 | +API Request → File Upload → Data Preprocessing → Model Prediction → JSON Response |
| 39 | + ↓ ↓ ↓ ↓ ↓ |
| 40 | +FastAPI CSV/Excel Saved Preprocessor Trained Model Predictions |
| 41 | +``` |
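The inference flow above boils down to applying the saved preprocessor and then the trained model. The sketch below shows that composition with a small wrapper. The class name `InferenceWrapper` and the stand-in preprocessor and model are hypothetical; they only mimic the `transform`/`predict` interface that scikit-learn objects expose. The project's own wrapper lives in `network_security/utils/ml_utils/model/estimator.py` and may differ.

```python
from typing import Any, List, Sequence


class InferenceWrapper:
    """Hypothetical wrapper: preprocess first, then predict."""

    def __init__(self, preprocessor: Any, model: Any) -> None:
        self.preprocessor = preprocessor
        self.model = model

    def predict(self, rows: Sequence[Sequence[float]]) -> List[int]:
        transformed = self.preprocessor.transform(rows)
        return list(self.model.predict(transformed))


class HalveFeatures:
    """Stand-in preprocessor (assumption): scales every value by 0.5."""

    def transform(self, rows):
        return [[value * 0.5 for value in row] for row in rows]


class SumThresholdModel:
    """Stand-in model (assumption): predicts 1 when the feature sum is positive."""

    def predict(self, rows):
        return [1 if sum(row) > 0 else 0 for row in rows]


wrapper = InferenceWrapper(HalveFeatures(), SumThresholdModel())
predictions = wrapper.predict([[1, 1, -1], [-1, -1, 1]])
print(predictions)
```

Keeping both objects behind one `predict` call means the API layer cannot accidentally skip the preprocessing step that was used during training.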
| 42 | + |
| 43 | +## 🚀 Key Features |
| 44 | + |
| 45 | +### 1. **Modular ML Pipeline** |
| 46 | +- **Data Ingestion**: Automated data extraction from MongoDB with connection pooling |
| 47 | +- **Data Validation**: |
| 48 | + - Schema validation (31 numerical features) |
| 49 | + - Column presence checks |
| 50 | + - Data drift detection using Kolmogorov-Smirnov test |
| 51 | + - Automated drift report generation |
| 52 | +- **Data Transformation**: |
| 53 | + - Feature scaling using StandardScaler |
| 54 | + - Robust preprocessing pipeline |
| 55 | + - Saved transformers for inference consistency |
| 56 | +- **Model Training**: |
| 57 | + - Comparison of 7 ML algorithms (Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost) |
| 58 | + - Automated hyperparameter tuning with GridSearchCV |
| 59 | + - Best model selection based on F1-score |
| 60 | + - Model serialization with pickle |
| 61 | + |
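To make the drift check concrete: the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a reference column and a new column. The project lists SciPy for its statistical tests; the sketch below is a dependency-free stand-in, not the project's code. The function names, the 0.05 significance level, and the use of the large-sample critical value are all assumptions made for illustration.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(base: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    base_sorted, current_sorted = sorted(base), sorted(current)
    n, m = len(base_sorted), len(current_sorted)
    gap = 0.0
    # Both ECDFs are step functions, so checking every observed value suffices.
    for x in base_sorted + current_sorted:
        cdf_base = bisect_right(base_sorted, x) / n
        cdf_current = bisect_right(current_sorted, x) / m
        gap = max(gap, abs(cdf_base - cdf_current))
    return gap


def drift_detected(base: Sequence[float], current: Sequence[float],
                   alpha: float = 0.05) -> bool:
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(base), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(base, current) > critical


train_col = [i / 100 for i in range(100)]
same_col = list(train_col)
shifted_col = [x + 5 for x in train_col]
print(drift_detected(train_col, same_col), drift_detected(train_col, shifted_col))
```

With SciPy available, `scipy.stats.ks_2samp` returns the same statistic together with a p-value, and comparing that p-value to a threshold is the more common form of this check.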
| 62 | +### 2. **MLOps & Experiment Tracking** |
| 63 | +- **MLflow Integration**: |
| 64 | + - Experiment tracking with DagsHub |
| 65 | + - Model versioning and registry |
| 66 | + - Hyperparameter logging |
| 67 | + - Metric visualization |
| 68 | +- **Artifact Management**: |
| 69 | + - Timestamped artifact directories |
| 70 | + - Model checkpointing |
| 71 | + - Preprocessor versioning |
| 72 | + |
| 73 | +### 3. **Production-Grade API** |
| 74 | +- **FastAPI Implementation**: |
| 75 | + - RESTful endpoints for training and prediction |
| 76 | + - File upload support (CSV/Excel) |
| 77 | + - CORS middleware for cross-origin requests |
| 78 | + - Automatic API documentation (Swagger/ReDoc) |
| 79 | +- **Model Serving**: |
| 80 | + - Real-time predictions |
| 81 | + - Batch inference support |
| 82 | + - HTML table rendering for results |
| 83 | + |
| 84 | +### 4. **Error Handling & Logging** |
| 85 | +- Custom exception handling throughout pipeline |
| 86 | +- Comprehensive logging with timestamps |
| 87 | +- Detailed error messages with line numbers |
| 88 | + |
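One common way to attach file names and line numbers to error messages is to read them from the active traceback. The sketch below assumes a hypothetical `PipelineError` class; the project's own classes are in `network_security/exceptions/exception.py` and may be structured differently.

```python
import sys


class PipelineError(Exception):
    """Hypothetical exception that records where the original error surfaced."""

    def __init__(self, error: Exception) -> None:
        super().__init__(str(error))
        # sys.exc_info() is only populated while an exception is being handled.
        _, _, tb = sys.exc_info()
        # tb points at the frame that is handling the error (the try block).
        self.file_name = tb.tb_frame.f_code.co_filename if tb else "<unknown>"
        self.line_number = tb.tb_lineno if tb else -1

    def __str__(self) -> str:
        return (
            f"Error in [{self.file_name}] at line [{self.line_number}]: "
            f"{self.args[0]}"
        )


def risky_step() -> float:
    return 1 / 0


try:
    try:
        risky_step()
    except ZeroDivisionError as exc:
        raise PipelineError(exc) from exc
except PipelineError as wrapped:
    message = str(wrapped)
    print(message)
```

Raising with `from exc` keeps the original traceback chained to the wrapped error, so the full stack is still available in logs.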
| 89 | +## 📁 Project Structure |
| 90 | + |
| 91 | +``` |
| 92 | +networkSecuritySystem/ |
| 93 | +├── network_security/ |
| 94 | +│ ├── components/ |
| 95 | +│ │ ├── data_ingestion.py # MongoDB data extraction |
| 96 | +│ │ ├── data_validation.py # Schema & drift validation |
| 97 | +│ │ ├── data_transformation.py # Feature engineering |
| 98 | +│ │ └── model_trainer.py # Model training & evaluation |
| 99 | +│ ├── entity/ |
| 100 | +│ │ ├── config_entity.py # Configuration dataclasses |
| 101 | +│ │ └── artifact_entity.py # Pipeline artifact definitions |
| 102 | +│ ├── constants/ |
| 103 | +│ │ └── training_pipeline.py # Pipeline constants & configs |
| 104 | +│ ├── utils/ |
| 105 | +│ │ ├── main_utils/ |
| 106 | +│ │ │ └── utils.py # Helper functions (save/load, GridSearchCV) |
| 107 | +│ │ └── ml_utils/ |
| 108 | +│ │ ├── model/estimator.py # NetworkModel wrapper class |
| 109 | +│ │ └── metric/ # Evaluation metrics |
| 110 | +│ ├── exceptions/ |
| 111 | +│ │ └── exception.py # Custom exception classes |
| 112 | +│ └── logging/ |
| 113 | +│ └── logger.py # Logging configuration |
| 114 | +├── data_schema/ |
| 115 | +│ └── schema.yaml # Data schema definition (31 features) |
| 116 | +├── app.py # FastAPI application |
| 117 | +├── main.py # Training pipeline orchestration |
| 118 | +├── requirements.txt # Python dependencies |
| 119 | +└── README.md |
| 120 | +``` |
| 121 | + |
| 122 | +## 🛠️ Technology Stack |
| 123 | + |
| 124 | +### Core ML/Data Science |
| 125 | +- **Python 3.12**: Primary language |
| 126 | +- **Pandas & NumPy**: Data manipulation |
| 127 | +- **Scikit-learn**: ML algorithms, preprocessing, metrics |
| 128 | +- **XGBoost**: Gradient boosting framework |
| 129 | +- **SciPy**: Statistical tests for drift detection |
| 130 | + |
| 131 | +### MLOps & Tracking |
| 132 | +- **MLflow**: Experiment tracking, model registry |
| 133 | +- **DagsHub**: Remote MLflow server |
| 134 | +- **Pickle/Dill**: Model serialization |
| 135 | + |
| 136 | +### Database & Data |
| 137 | +- **MongoDB**: Data storage |
| 138 | +- **PyMongo**: MongoDB driver |
| 139 | +- **Certifi**: SSL certificate verification |
| 140 | + |
| 141 | +### API & Deployment |
| 142 | +- **FastAPI**: Web framework |
| 143 | +- **Uvicorn**: ASGI server |
| 144 | +- **Jinja2**: Template rendering |
| 145 | +- **Python-dotenv**: Environment management |
| 146 | + |
| 147 | +## 📦 Installation |
| 148 | + |
| 149 | +### Prerequisites |
| 150 | +- Python 3.12+ |
| 151 | +- MongoDB instance (local or Atlas) |
| 152 | +- Git |
| 153 | + |
| 154 | +### Setup |
| 155 | + |
| 156 | +1. **Clone the repository** |
| 157 | +```bash |
| 158 | +git clone https://github.com/pycoder49/networkSecuritySystem.git |
| 159 | +cd networkSecuritySystem |
| 160 | +``` |
| 161 | + |
| 162 | +2. **Create virtual environment** |
| 163 | +```bash |
| 164 | +# Using conda |
| 165 | +conda create -p ./venv python=3.12 -y |
| 166 | +conda activate ./venv |
| 167 | + |
| 168 | +# Or using venv |
| 169 | +python -m venv venv |
| 170 | +source venv/bin/activate # On Windows: venv\Scripts\activate |
| 171 | +``` |
| 172 | + |
| 173 | +3. **Install dependencies** |
| 174 | +```bash |
| 175 | +pip install -r requirements.txt |
| 176 | +``` |
| 177 | + |
| 178 | +4. **Configure environment variables** |
| 179 | +```bash |
| 180 | +# Create .env file |
| 181 | +echo 'MONGODB_URI="your_mongodb_connection_string"' > .env |
| 182 | +``` |
| 183 | + |
| 184 | +5. **Verify MongoDB connection** |
| 185 | +```bash |
| 186 | +python test_mongodb.py |
| 187 | +``` |
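Step 4 stores the connection string in `.env`; inside the application it then has to be read back from the environment. A minimal sketch using only the standard library follows. The helper name `get_mongodb_uri` is hypothetical, and the `.env` file is assumed to have been loaded into the environment already (for example with python-dotenv's `load_dotenv()`).

```python
import os


def get_mongodb_uri() -> str:
    """Return MONGODB_URI from the environment or fail with a clear message."""
    uri = os.environ.get("MONGODB_URI", "").strip()
    if not uri:
        raise RuntimeError("MONGODB_URI is not set; add it to your .env file")
    return uri


# Demo value only, so the sketch runs without a real database.
os.environ.setdefault("MONGODB_URI", "mongodb://localhost:27017")
print(get_mongodb_uri())
```

Failing early with an explicit message is usually easier to debug than a connection timeout deep inside the ingestion step.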
| 188 | + |
| 189 | +## 🚀 Usage |
| 190 | + |
| 191 | +### Training the Model |
| 192 | + |
| 193 | +```bash |
| 194 | +# Run complete training pipeline |
| 195 | +python main.py |
| 196 | +``` |
| 197 | + |
| 198 | +This will: |
| 199 | +1. Ingest data from MongoDB |
| 200 | +2. Validate data quality and detect drift |
| 201 | +3. Transform features and create preprocessor |
| 202 | +4. Train multiple models with hyperparameter tuning |
| 203 | +5. Log experiments to MLflow |
| 204 | +6. Save best model to `final_model/` |
| 205 | + |
| 206 | +### Starting the API Server |
| 207 | + |
| 208 | +```bash |
| 209 | +# Start FastAPI server |
| 210 | +uvicorn app:app --reload --host localhost --port 8000 |
| 211 | +``` |
| 212 | + |
| 213 | +Access the API: |
| 214 | +- **Swagger UI**: http://localhost:8000/docs |
| 215 | +- **ReDoc**: http://localhost:8000/redoc |
| 216 | + |
| 217 | +### Making Predictions |
| 218 | + |
| 219 | +#### Via API (Swagger UI) |
| 220 | +1. Navigate to http://localhost:8000/docs |
| 221 | +2. Click on the `/predict` endpoint |
| 222 | +3. Upload a CSV file with network features |
| 223 | +4. View the predictions in HTML table format |
| 224 | + |
| 225 | +#### Via cURL |
| 226 | +```bash |
| 227 | +curl -X POST "http://localhost:8000/predict" \ |
| 228 | + -H "accept: application/json" \ |
| 229 | + -H "Content-Type: multipart/form-data" \ |
| 230 | + -F "file=@test.csv" |
| 231 | +``` |
| 232 | + |
| 233 | +#### Via Python |
| 234 | +```python |
| 235 | +import requests |
| 236 | + |
| 237 | +url = "http://localhost:8000/predict" |
| 238 | +with open("test.csv", "rb") as f:  # closes the file once the upload finishes |
| 239 | +    response = requests.post(url, files={"file": f}) |
| 240 | +print(response.json()) |
| 241 | +``` |
| 242 | + |
| 243 | +### Training via API |
| 244 | + |
| 245 | +```bash |
| 246 | +curl -X GET "http://localhost:8000/train" |
| 247 | +``` |
| 248 | + |
| 249 | +## 📊 Data Schema |
| 250 | + |
| 251 | +The system expects 31 numerical features related to network security: |
| 252 | + |
| 253 | +| Feature | Type | Description | |
| 254 | +|---------|------|-------------| |
| 255 | +| having_IP_Address | int64 | IP address present in URL | |
| 256 | +| URL_Length | int64 | Length of URL | |
| 257 | +| Shortining_Service | int64 | URL shortening service used | |
| 258 | +| having_At_Symbol | int64 | '@' symbol present | |
| 259 | +| double_slash_redirecting | int64 | '//' after protocol | |
| 260 | +| ... | ... | ... (31 total features) | |
| 261 | +| Result | int64 | Target variable (0: Safe, 1: Phishing) | |
| 262 | + |
| 263 | +Full schema: `data_schema/schema.yaml` |
| 264 | + |
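The column presence check can be pictured as comparing a file's header against the schema. The sketch below is an assumption about how such a check might look, not the project's implementation. `EXPECTED_COLUMNS` lists only the columns named in the table above (the full list is in `data_schema/schema.yaml`), and `missing_columns` is a hypothetical helper.

```python
import csv
import io
from typing import List

# Partial list for illustration: only the columns shown in the table above.
EXPECTED_COLUMNS = [
    "having_IP_Address",
    "URL_Length",
    "Shortining_Service",
    "having_At_Symbol",
    "double_slash_redirecting",
    "Result",
]


def missing_columns(csv_text: str, expected: List[str]) -> List[str]:
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


sample = "having_IP_Address,URL_Length,Result\n1,0,1\n"
print(missing_columns(sample, EXPECTED_COLUMNS))
```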
| 265 | +## 🔧 Configuration |
| 266 | + |
| 267 | +### Pipeline Configuration |
| 268 | +Located in `network_security/constants/training_pipeline.py`: |
| 269 | + |
| 270 | +```python |
| 271 | +# Data Ingestion |
| 272 | +DATA_INGESTION_COLLECTION_NAME = "NetworkData" |
| 273 | +DATA_INGESTION_DATABASE_NAME = "aryan" |
| 274 | +DATA_INGESTION_TRAIN_TEST_SPLIT_RATIO = 0.2 |
| 275 | + |
| 276 | +# Model Training |
| 277 | +MODEL_TRAINER_EXPECTED_SCORE = 0.6 |
| 278 | +MODEL_TRAINER_OVERFITTING_UNDERFITTING_THRESHOLD = 0.05 |
| 279 | +``` |
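The two model-training constants are most naturally read as acceptance gates: a minimum score and a maximum gap between train and test scores. The sketch below is one plausible interpretation, not the code in `model_trainer.py`; the function name `is_model_acceptable` and the exact comparison rules are assumptions.

```python
MODEL_TRAINER_EXPECTED_SCORE = 0.6
MODEL_TRAINER_OVERFITTING_UNDERFITTING_THRESHOLD = 0.05


def is_model_acceptable(train_f1: float, test_f1: float) -> bool:
    """Accept a model that scores well enough and generalizes to test data."""
    good_enough = test_f1 >= MODEL_TRAINER_EXPECTED_SCORE
    gap = abs(train_f1 - test_f1)
    small_gap = gap <= MODEL_TRAINER_OVERFITTING_UNDERFITTING_THRESHOLD
    return good_enough and small_gap


# F1 scores from the Model Performance table: gap of 0.015, test score 0.976.
print(is_model_acceptable(0.991, 0.976))
# A model with a large train/test gap would be rejected.
print(is_model_acceptable(0.99, 0.80))
```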
| 280 | + |
| 281 | +### Hyperparameter Grids |
| 282 | +Configure in `model_trainer.py` for each algorithm: |
| 283 | +- Logistic Regression: penalty, C, solver, max_iter |
| 284 | +- KNN: n_neighbors, weights, algorithm |
| 285 | +- Random Forest: n_estimators, max_depth, criterion |
| 286 | +- XGBoost: learning_rate, max_depth, n_estimators, subsample |
| 287 | +- And more... |
| 288 | + |
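Grid search evaluates every combination of the listed values, so the size of a grid grows multiplicatively with each parameter. The sketch below enumerates such a grid with the standard library. The parameter names come from the XGBoost entry above, while the candidate values are made up for illustration and are not the project's grid; `expand_grid` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, Iterator, List

# Hypothetical values; only the parameter names come from the list above.
xgb_grid: Dict[str, List[float]] = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200],
    "subsample": [0.8, 1.0],
}


def expand_grid(grid: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield one parameter dict per combination, as an exhaustive search would."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(xgb_grid))
print(len(candidates), candidates[0])
```

With k-fold cross-validation each candidate is fitted k times, so a 24-candidate grid with 5 folds already means 120 cross-validation fits per algorithm.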
| 289 | +## 📈 MLflow Tracking |
| 290 | + |
| 291 | +View experiments at: https://dagshub.com/pycoder49/networkSecuritySystem.mlflow |
| 292 | + |
| 293 | +Logged metrics: |
| 294 | +- Training & test F1-scores |
| 295 | +- Precision & Recall |
| 296 | +- Model parameters |
| 297 | +- Training artifacts |
| 298 | + |
| 299 | +## 🧪 Testing |
| 300 | + |
| 301 | +```bash |
| 302 | +# Test individual components |
| 303 | +python -m network_security.components.data_ingestion |
| 304 | +python -m network_security.components.data_validation |
| 305 | +python -m network_security.components.model_trainer |
| 306 | + |
| 307 | +# Test API endpoints |
| 308 | +pytest tests/ # (if test suite exists) |
| 309 | +``` |
| 310 | + |
| 311 | +## 🐛 Troubleshooting |
| 312 | + |
| 313 | +### MongoDB Connection Issues |
| 314 | +```python |
| 315 | +# Verify connection with certifi; MONGODB_URI is the connection string from your .env |
| 316 | +import certifi |
| 317 | +import pymongo |
| 318 | +client = pymongo.MongoClient(MONGODB_URI, tlsCAFile=certifi.where()) |
| 319 | +``` |
| 320 | + |
| 321 | +### Model Loading Errors |
| 322 | +Ensure preprocessor and model are in `final_model/`: |
| 323 | +``` |
| 324 | +final_model/ |
| 325 | +├── preprocessor.pkl |
| 326 | +└── model.pkl |
| 327 | +``` |
| 328 | + |
| 329 | +### API Server Issues |
| 330 | +```bash |
| 331 | +# Check if port is already in use |
| 332 | +netstat -ano | findstr :8000 # Windows |
| 333 | +lsof -i :8000 # Linux/Mac |
| 334 | + |
| 335 | +# Use different port |
| 336 | +uvicorn app:app --port 8001 |
| 337 | +``` |
| 338 | + |
| 339 | +## 🚦 Development Workflow |
| 340 | + |
| 341 | +1. **Data Exploration**: Jupyter notebooks for EDA |
| 342 | +2. **Component Development**: Build and test individual pipeline components |
| 343 | +3. **Integration**: Connect components in `main.py` |
| 344 | +4. **Experimentation**: Use MLflow to track experiments |
| 345 | +5. **API Development**: Implement endpoints in `app.py` |
| 346 | +6. **Deployment**: Deploy to cloud (AWS, Azure, GCP) |
| 347 | + |
| 348 | +## 🎯 Future Enhancements |
| 349 | + |
| 350 | +- [ ] Add CI/CD pipeline with GitHub Actions |
| 351 | +- [ ] Implement real-time data streaming with Kafka |
| 352 | +- [ ] Add model monitoring and alerting |
| 353 | +- [ ] Containerize with Docker |
| 354 | +- [ ] Deploy on Kubernetes |
| 355 | +- [ ] Add A/B testing framework |
| 356 | +- [ ] Implement model explainability (SHAP, LIME) |
| 357 | +- [ ] Create web dashboard for predictions |
| 358 | +- [ ] Add automated retraining on data drift detection |
| 359 | +- [ ] Implement feature store for better feature management |
| 360 | + |
| 361 | +## 👨‍💻 Author |
| 362 | + |
| 363 | +**Aryan Ahuja** |
| 364 | + |
| 365 | +- GitHub: [@pycoder49](https://github.com/pycoder49) |
| 366 | +- DagsHub: [pycoder49/networkSecuritySystem](https://dagshub.com/pycoder49/networkSecuritySystem) |
| 367 | + |
| 368 | +## 📝 License |
| 369 | + |
| 370 | +This project is licensed under the MIT License - see the LICENSE file for details. |
| 371 | + |
| 372 | +## 🙏 Acknowledgments |
| 373 | + |
| 374 | +- Dataset: Network security phishing detection dataset |
| 375 | +- MLflow for experiment tracking |
| 376 | +- DagsHub for remote tracking server |
| 377 | +- FastAPI community for excellent documentation |
| 378 | + |
| 379 | +--- |
| 380 | + |
| 381 | +**Note**: This is a portfolio project demonstrating end-to-end ML engineering skills including pipeline design, MLOps practices, API development, and production-ready code organization. |