Commit 02b1dbe

Merge pull request #1434 from DarshAgrawal14/main
Added PDF detection malware
2 parents 23cc007 + d2bf888 commit 02b1dbe

6 files changed

+11416
-0
lines changed

Detection Models/PDF_Malware_Detection/Dataset/PDFMalware2022.csv

Lines changed: 10027 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# PDF Malware Detection

This project implements a machine learning-based system for detecting potential malware in PDF files. It includes feature extraction from PDF files, model training, and a prediction script that classifies PDFs as potentially malicious or benign.

## Components

1. **Feature Extraction** (`pdf_feature_extraction.py`)
   - Extracts features from PDF files using PyMuPDF and pdfid.
   - Features include metadata, structural elements, and the presence of potentially risky elements.

2. **Model Training** (`pdf_malware_dataset_training.py`)
   - Prepares the dataset, including data cleaning and preprocessing.
   - Trains a Random Forest classifier for malware detection.
   - Includes code for hyperparameter tuning (commented out).

3. **Prediction Script** (`predict_malware.py`)
   - Uses the trained model to predict whether a given PDF file is potentially malicious.
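
To make the "presence of potentially risky elements" idea concrete, here is a minimal sketch that counts a few PDF names in a file's raw bytes, in the spirit of what pdfid reports. It is not the code in `pdf_feature_extraction.py`: the keyword list, the function names, and the `file_size` feature are assumptions chosen for illustration, and it does not handle hex-escaped names or compressed object streams.

```python
import re
from pathlib import Path

# Hypothetical keyword list, loosely modeled on names that pdfid reports.
RISKY_KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")


def count_keywords(data: bytes) -> dict[str, int]:
    """Count occurrences of each PDF name in the raw bytes of a file."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # The lookahead stops a short name such as "/AA" from matching
        # inside a longer name that merely starts with the same characters.
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


def extract_features(path: str) -> dict[str, int]:
    """Build a flat feature dictionary for one PDF file (illustrative only)."""
    data = Path(path).read_bytes()
    features = {"file_size": len(data)}
    features.update(count_keywords(data))
    return features
```

A dictionary like this can be turned into one row of a feature table, provided the columns match the ones the model was trained on.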

## Setup

1. Install the required dependencies:
```
pip install numpy pandas matplotlib scikit-learn imbalanced-learn PyMuPDF pdfid joblib
```

2. Ensure the dataset file `PDFMalware2022.csv` is in the `Dataset` folder.

## Usage

### Training the Model

1. Run the `pdf_malware_dataset_training.py` script to train the model:
```
python pdf_malware_dataset_training.py
```
This will create a `random_forest_model.pkl` file containing the trained model.
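
As a rough illustration of what a training step like this involves, the sketch below fits a scikit-learn `RandomForestClassifier` on a feature matrix and saves it with joblib. It is not the contents of `pdf_malware_dataset_training.py`: the function name `train_and_save`, the 80/20 split, and the hyperparameters are assumptions, and it leaves out the data cleaning, class-imbalance handling, and hyperparameter tuning that the real script may perform.

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_save(X, y, model_path="random_forest_model.pkl"):
    """Fit a Random Forest on (X, y), save it, and return held-out accuracy."""
    # Hold out 20% of the rows, keeping the class ratio the same in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # Hyperparameters here are placeholders, not the project's tuned values.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # Persist the fitted model so a separate script can load it later.
    joblib.dump(model, model_path)
    return accuracy_score(y_test, model.predict(X_test))
```

Here `X` would be the numeric feature columns from the dataset and `y` the malicious/benign labels, after whatever cleaning the dataset needs.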

### Predicting Malware

To classify a file, run the script with the path to the PDF file:
```
python predict_malware.py path/to/your/pdf_file.pdf
```
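
A command-line entry point of this shape could be sketched as follows. This is an illustration, not the code in `predict_malware.py`: the function names are hypothetical, and the label encoding (1 for malicious, anything else for benign) is an assumption that would need to match how the model was trained.

```python
import argparse


def parse_args(argv=None):
    """Parse the single positional argument: the path to the PDF file."""
    parser = argparse.ArgumentParser(description="Classify a PDF file.")
    parser.add_argument("pdf_path", help="path to the PDF file to check")
    return parser.parse_args(argv)


def classify(model, feature_row):
    """Label one feature row using any object with a scikit-learn-style predict()."""
    prediction = model.predict([feature_row])[0]
    # Assumption: the model was trained with 1 meaning "malicious".
    return "potentially malicious" if prediction == 1 else "benign"
```

In a full script, the model would typically be loaded with `joblib.load("random_forest_model.pkl")` and the feature row produced by the same extraction code that built the training data, so the columns line up.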

## Note

This project is for educational and research purposes only. It should not be used as the sole means of determining file safety. Always use caution when handling potentially malicious files, and consult cybersecurity professionals for comprehensive security measures.

## Future Improvements

- Implement more advanced feature extraction techniques.
- Explore other machine learning algorithms for potentially better performance.
- Add a user-friendly interface for easier interaction with the prediction system.
- Incorporate regular model updates with new malware samples to keep the detection current.
