Commit f1549ac: added readme (README.md: +381 -1, parent 12fbbea)

# 🛡️ Network Security System - ML-Powered Phishing Detection

[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-00a393.svg)](https://fastapi.tiangolo.com/)
[![MLflow](https://img.shields.io/badge/MLflow-2.15-0194E2.svg)](https://mlflow.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

An end-to-end machine learning system for detecting phishing websites using network security data. Built with production-grade MLOps practices, including automated training pipelines, experiment tracking, and a real-time inference API.

## 🎯 Project Highlights

- **Production-Ready ML Pipeline**: Modular architecture with data ingestion, validation, transformation, and training components
- **Model Performance**: 97.6% F1-score on test data with ensemble learning (XGBoost, Random Forest, Gradient Boosting)
- **MLOps Integration**: Experiment tracking with MLflow, model versioning, and automated retraining capabilities
- **RESTful API**: FastAPI-based inference service with Swagger documentation
- **Data Quality Assurance**: Automated data validation and drift detection using statistical tests
- **Scalable Design**: Configuration-driven architecture supporting multiple environments

## 📊 Model Performance

| Metric | Train | Test |
|--------|-------|------|
| **F1 Score** | 0.991 | 0.976 |
| **Precision** | 0.987 | 0.966 |
| **Recall** | 0.994 | 0.985 |

## 🏗️ Architecture

### ML Training Pipeline
```
MongoDB → Data Ingestion → Data Validation → Feature Engineering → Model Training → Model Registry
   ↓            ↓                 ↓                    ↓                   ↓               ↓
Raw Data    CSV Export   Schema/Drift Check     Preprocessing       GridSearchCV   MLflow Tracking
```

### Inference Pipeline
```
API Request → File Upload → Data Preprocessing → Model Prediction → JSON Response
     ↓             ↓                ↓                    ↓                ↓
  FastAPI      CSV/Excel   Saved Preprocessor     Trained Model     Predictions
```

## 🚀 Key Features

### 1. **Modular ML Pipeline**
- **Data Ingestion**: Automated data extraction from MongoDB with connection pooling
- **Data Validation**:
  - Schema validation (31 numerical features)
  - Column presence checks
  - Data drift detection using the Kolmogorov-Smirnov test (see the sketch after this list)
  - Automated drift report generation
- **Data Transformation**:
  - Feature scaling using StandardScaler
  - Robust preprocessing pipeline
  - Saved transformers for inference consistency
- **Model Training**:
  - Comparison of 7 ML algorithms (Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost)
  - Automated hyperparameter tuning with GridSearchCV
  - Best model selection based on F1-score
  - Model serialization with pickle
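
A minimal sketch of the drift check's core idea, using SciPy's two-sample Kolmogorov-Smirnov test; the per-column loop and the 0.05 threshold are illustrative assumptions, not the exact logic in `data_validation.py`:

```python
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(base_df: pd.DataFrame, current_df: pd.DataFrame,
                 threshold: float = 0.05) -> dict:
    """Compare each column's distributions in two samples with the KS test."""
    report = {}
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        report[column] = {
            "p_value": float(result.pvalue),
            # A small p-value means the distributions likely differ, i.e. drift.
            "drift_detected": bool(result.pvalue < threshold),
        }
    return report
```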

### 2. **MLOps & Experiment Tracking**
- **MLflow Integration**:
  - Experiment tracking with DagHub
  - Model versioning and registry
  - Hyperparameter logging
  - Metric visualization
- **Artifact Management**:
  - Timestamped artifact directories
  - Model checkpointing
  - Preprocessor versioning

### 3. **Production-Grade API**
- **FastAPI Implementation**:
  - RESTful endpoints for training and prediction
  - File upload support (CSV/Excel)
  - CORS middleware for cross-origin requests
  - Automatic API documentation (Swagger/ReDoc)
- **Model Serving** (a minimal endpoint sketch follows this list):
  - Real-time predictions
  - Batch inference support
  - HTML table rendering for results
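
A stripped-down sketch of the prediction endpoint's shape, assuming the pickled artifacts in `final_model/` described under Troubleshooting; the actual `app.py` adds templating, CORS, and error handling on top:

```python
import pickle

import pandas as pd
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# Load the saved preprocessor and model once at startup.
with open("final_model/preprocessor.pkl", "rb") as f:
    preprocessor = pickle.load(f)
with open("final_model/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Batch inference: every row of the uploaded CSV gets a prediction.
    df = pd.read_csv(file.file)
    predictions = model.predict(preprocessor.transform(df))
    return {"predictions": predictions.tolist()}
```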

### 4. **Error Handling & Logging**
- Custom exception handling throughout the pipeline (sketched below)
- Comprehensive logging with timestamps
- Detailed error messages with file names and line numbers
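
A hedged sketch of how such an exception class can capture the file and line number; the real `exceptions/exception.py` may structure this differently:

```python
import sys

class NetworkSecurityException(Exception):
    """Wraps an error with the file name and line number where it occurred."""

    def __init__(self, error: Exception):
        super().__init__(str(error))
        _, _, tb = sys.exc_info()  # traceback of the exception being handled
        self.file_name = tb.tb_frame.f_code.co_filename if tb else "<unknown>"
        self.line_no = tb.tb_lineno if tb else -1

    def __str__(self):
        return f"Error in [{self.file_name}] at line [{self.line_no}]: {self.args[0]}"

# Usage: re-raise any failure with location context attached.
# try:
#     risky_operation()
# except Exception as e:
#     raise NetworkSecurityException(e) from e
```
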
## 📁 Project Structure

```
networkSecuritySystem/
├── network_security/
│   ├── components/
│   │   ├── data_ingestion.py          # MongoDB data extraction
│   │   ├── data_validation.py         # Schema & drift validation
│   │   ├── data_transformation.py     # Feature engineering
│   │   └── model_trainer.py           # Model training & evaluation
│   ├── entity/
│   │   ├── config_entity.py           # Configuration dataclasses
│   │   └── artifact_entity.py         # Pipeline artifact definitions
│   ├── constants/
│   │   └── training_pipeline.py       # Pipeline constants & configs
│   ├── utils/
│   │   ├── main_utils/
│   │   │   └── utils.py               # Helper functions (save/load, GridSearchCV)
│   │   └── ml_utils/
│   │       ├── model/estimator.py     # NetworkModel wrapper class
│   │       └── metric/                # Evaluation metrics
│   ├── exceptions/
│   │   └── exception.py               # Custom exception classes
│   └── logging/
│       └── logger.py                  # Logging configuration
├── data_schema/
│   └── schema.yaml                    # Data schema definition (31 features)
├── app.py                             # FastAPI application
├── main.py                            # Training pipeline orchestration
├── requirements.txt                   # Python dependencies
└── README.md
```

## 🛠️ Technology Stack

### Core ML/Data Science
- **Python 3.12**: Primary language
- **Pandas & NumPy**: Data manipulation
- **Scikit-learn**: ML algorithms, preprocessing, metrics
- **XGBoost**: Gradient boosting framework
- **SciPy**: Statistical tests for drift detection

### MLOps & Tracking
- **MLflow**: Experiment tracking, model registry
- **DagHub**: Remote MLflow server
- **Pickle/Dill**: Model serialization

### Database & Data
- **MongoDB**: Data storage
- **PyMongo**: MongoDB driver
- **Certifi**: SSL certificate verification

### API & Deployment
- **FastAPI**: Web framework
- **Uvicorn**: ASGI server
- **Jinja2**: Template rendering
- **Python-dotenv**: Environment management

## 📦 Installation

### Prerequisites
- Python 3.12+
- MongoDB instance (local or Atlas)
- Git

### Setup

1. **Clone the repository**
   ```bash
   git clone https://github.com/pycoder49/networkSecuritySystem.git
   cd networkSecuritySystem
   ```

2. **Create virtual environment**
   ```bash
   # Using conda
   conda create -p ./venv python=3.12 -y
   conda activate ./venv

   # Or using venv
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**
   ```bash
   # Create .env file
   echo 'MONGODB_URI="your_mongodb_connection_string"' > .env
   ```

5. **Verify MongoDB connection**
   ```bash
   python test_mongodb.py
   ```

## 🚀 Usage

### Training the Model

```bash
# Run complete training pipeline
python main.py
```

This will:
1. Ingest data from MongoDB
2. Validate data quality and detect drift
3. Transform features and create the preprocessor
4. Train multiple models with hyperparameter tuning
5. Log experiments to MLflow
6. Save best model to `final_model/`

### Starting the API Server

```bash
# Start FastAPI server
uvicorn app:app --reload --host localhost --port 8000
```

Access the API:
- **Swagger UI**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

### Making Predictions

#### Via API (Swagger UI)
1. Navigate to http://localhost:8000/docs
2. Click on the `/predict` endpoint
3. Upload a CSV file with network features
4. View predictions in HTML table format

#### Via cURL
```bash
curl -X POST "http://localhost:8000/predict" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@test.csv"
```

#### Via Python
```python
import requests

url = "http://localhost:8000/predict"
files = {"file": open("test.csv", "rb")}
response = requests.post(url, files=files)
print(response.json())
```

### Training via API

```bash
curl -X GET "http://localhost:8000/train"
```

## 📊 Data Schema

The system expects 31 numerical features related to network security:

| Feature | Type | Description |
|---------|------|-------------|
| having_IP_Address | int64 | IP address present in URL |
| URL_Length | int64 | Length of URL |
| Shortining_Service | int64 | URL shortening service used |
| having_At_Symbol | int64 | '@' symbol present |
| double_slash_redirecting | int64 | '//' after protocol |
| ... | ... | ... (31 total features) |
| Result | int64 | Target variable (0: Safe, 1: Phishing) |

Full schema: `data_schema/schema.yaml` (a column-check sketch follows)
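
A short sketch of validating an incoming frame against the schema; the assumption that `schema.yaml` exposes a top-level `columns` mapping is illustrative, so check the actual file layout:

```python
import pandas as pd
import yaml

def validate_columns(df: pd.DataFrame,
                     schema_path: str = "data_schema/schema.yaml") -> bool:
    """Return True if every column named in the schema is present in df."""
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    expected = set(schema["columns"])  # assumed key; adjust to the real YAML
    return expected.issubset(df.columns)
```
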
## 🔧 Configuration

### Pipeline Configuration
Located in `network_security/constants/training_pipeline.py`:

```python
# Data Ingestion
DATA_INGESTION_COLLECTION_NAME = "NetworkData"
DATA_INGESTION_DATABASE_NAME = "aryan"
DATA_INGESTION_TRAIN_TEST_SPLIT_RATIO = 0.2

# Model Training
MODEL_TRAINER_EXPECTED_SCORE = 0.6
MODEL_TRAINER_OVERFITTING_UNDERFITTING_THRESHOLD = 0.05
```

### Hyperparameter Grids
Configure in `model_trainer.py` for each algorithm (a selection-loop sketch follows this list):
- Logistic Regression: penalty, C, solver, max_iter
- KNN: n_neighbors, weights, algorithm
- Random Forest: n_estimators, max_depth, criterion
- XGBoost: learning_rate, max_depth, n_estimators, subsample
- And more...
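
A condensed sketch of the compare-and-select pattern with GridSearchCV; the two grids shown are illustrative examples, not the project's full search spaces:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

# Illustrative subset: the project compares 7 algorithms in total.
MODELS = {
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [100, 200], "max_depth": [None, 10]}),
}

def select_best_model(X_train, y_train, X_test, y_test):
    """Grid-search each candidate and keep the one with the best test F1."""
    best = (None, None, -1.0)  # (name, fitted model, f1)
    for name, (model, grid) in MODELS.items():
        search = GridSearchCV(model, grid, scoring="f1", cv=3)
        search.fit(X_train, y_train)
        f1 = f1_score(y_test, search.best_estimator_.predict(X_test))
        if f1 > best[2]:
            best = (name, search.best_estimator_, f1)
    return best
```
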
## 📈 MLflow Tracking

View experiments at: https://dagshub.com/pycoder49/networkSecuritySystem.mlflow

Logged metrics (a logging sketch follows):
- Training & test F1-scores
- Precision & Recall
- Model parameters
- Training artifacts
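
A minimal sketch of how a run's parameters and metrics can be logged to the DagHub-hosted tracking server; the metric values are the headline scores from the table above, the params are placeholders, and DagHub credentials are assumed to be configured in the environment:

```python
import mlflow

# Public tracking URI for this project's experiments.
mlflow.set_tracking_uri("https://dagshub.com/pycoder49/networkSecuritySystem.mlflow")

with mlflow.start_run():
    mlflow.log_params({"model": "XGBoost", "learning_rate": 0.1})  # example params
    mlflow.log_metrics({
        "train_f1": 0.991,
        "test_f1": 0.976,
        "test_precision": 0.966,
        "test_recall": 0.985,
    })
```
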
## 🧪 Testing

```bash
# Test individual components
python -m network_security.components.data_ingestion
python -m network_security.components.data_validation
python -m network_security.components.model_trainer

# Test API endpoints
pytest tests/  # (if test suite exists)
```

## 🐛 Troubleshooting

### MongoDB Connection Issues
```python
# Verify the connection with certifi's CA bundle (MONGODB_URI from your .env)
import os
import certifi
import pymongo

ca = certifi.where()
client = pymongo.MongoClient(os.environ["MONGODB_URI"], tlsCAFile=ca)
```

### Model Loading Errors
Ensure preprocessor and model are in `final_model/`:
```
final_model/
├── preprocessor.pkl
└── model.pkl
```

### API Server Issues
```bash
# Check if port is already in use
netstat -ano | findstr :8000   # Windows
lsof -i :8000                  # Linux/Mac

# Use different port
uvicorn app:app --port 8001
```

## 🚦 Development Workflow

1. **Data Exploration**: Jupyter notebooks for EDA
2. **Component Development**: Build and test individual pipeline components
3. **Integration**: Connect components in `main.py`
4. **Experimentation**: Use MLflow to track experiments
5. **API Development**: Implement endpoints in `app.py`
6. **Deployment**: Deploy to cloud (AWS, Azure, GCP)

## 🎯 Future Enhancements

- [ ] Add CI/CD pipeline with GitHub Actions
- [ ] Implement real-time data streaming with Kafka
- [ ] Add model monitoring and alerting
- [ ] Containerize with Docker
- [ ] Deploy on Kubernetes
- [ ] Add A/B testing framework
- [ ] Implement model explainability (SHAP, LIME)
- [ ] Create web dashboard for predictions
- [ ] Add automated retraining on data drift detection
- [ ] Implement feature store for better feature management

## 👨‍💻 Author

**Aryan Ahuja**

- GitHub: [@pycoder49](https://github.com/pycoder49)
- DagHub: [pycoder49/networkSecuritySystem](https://dagshub.com/pycoder49/networkSecuritySystem)

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Dataset: Network security phishing detection dataset
- MLflow for experiment tracking
- DagHub for remote tracking server
- FastAPI community for excellent documentation

---

**Note**: This is a portfolio project demonstrating end-to-end ML engineering skills including pipeline design, MLOps practices, API development, and production-ready code organization.
