| 1 | +# PDF Malware Detection |
| 2 | + |
| 3 | +This project implements a machine learning-based system for detecting potential malware in PDF files. It includes feature extraction from PDF files, model training, and a prediction script for classifying PDFs as potentially malicious or benign. |
| 4 | + |
| 5 | +## Components |
| 6 | + |
| 7 | +1. **Feature Extraction** (`pdf_feature_extraction.py`) |
| 8 | + - Extracts various features from PDF files using PyMuPDF and pdfid. |
| 9 | + - Features include metadata, structural elements, and presence of potentially risky elements. |
| 10 | + |
| 11 | +2. **Model Training** (`pdf_malware_dataset_training.py`) |
| 12 | + - Prepares the dataset, including data cleaning and preprocessing. |
| 13 | + - Trains a Random Forest classifier for malware detection. |
| 14 | + - Includes code for hyperparameter tuning (commented out). |
| 15 | + |
| 16 | +3. **Prediction Script** (`predict_malware.py`) |
| 17 | + - Uses the trained model to predict whether a given PDF file is potentially malicious. |
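To make the "presence of potentially risky elements" feature concrete, here is a minimal sketch that counts risky name tokens in the raw bytes of a PDF, in the spirit of what pdfid reports. It uses only the standard library. The keyword list and the function names `count_risky_keywords` and `extract_keyword_features` are assumptions for illustration and are not taken from `pdf_feature_extraction.py`.

```python
import re
from pathlib import Path

# PDF name tokens that pdfid-style scanners commonly flag as risky.
# This list is an illustrative assumption, not the project's feature set.
RISKY_KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")


def count_risky_keywords(data: bytes) -> dict[str, int]:
    """Count occurrences of each risky name token in raw PDF bytes."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # Require a non-alphanumeric character (or end of data) after the
        # token so that "/AA" does not also match the start of "/AAPL".
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, data))
    return counts


def extract_keyword_features(pdf_path: str) -> dict[str, int]:
    """Read a PDF from disk and return its keyword counts."""
    return count_risky_keywords(Path(pdf_path).read_bytes())
```

A plain byte scan like this will miss tokens hidden by name obfuscation or stored inside compressed object streams, so it should be read as a rough illustration of the idea rather than a complete extractor.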
| 18 | + |
| 19 | +## Setup |
| 20 | + |
| 21 | +1. Install required dependencies: |
| 22 | + ``` |
| 23 | + pip install numpy pandas matplotlib scikit-learn imblearn PyMuPDF pdfid joblib |
| 24 | + ``` |
| 25 | + |
| 26 | +2. Ensure you have the dataset file `PDFMalware2022.csv` in the `Dataset` folder. |
| 27 | + |
| 28 | +## Usage |
| 29 | + |
| 30 | +### Training the Model |
| 31 | + |
| 32 | +1. Run the `pdf_malware_dataset_training.py` script to train the model: |
| 33 | + ``` |
| 34 | + python pdf_malware_dataset_training.py |
| 35 | + ``` |
| 36 | + This will create a `random_forest_model.pkl` file containing the trained model. |
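The train-then-save flow described above can be sketched roughly as follows. This is not the contents of `pdf_malware_dataset_training.py`: it trains on synthetic data from `make_classification` instead of `PDFMalware2022.csv`, and the function name `train_and_save` and all hyperparameter values are assumptions chosen for illustration.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_and_save(model_path: str) -> float:
    """Train a Random Forest on synthetic data, save it, return test accuracy."""
    # Synthetic stand-in for the real feature table (assumption for this sketch).
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Hold out 20% of the rows, keeping the class balance in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Serialise the fitted model so a separate script can load it later.
    joblib.dump(model, model_path)
    return model.score(X_test, y_test)
```

Calling `train_and_save("random_forest_model.pkl")` writes the model file, and `joblib.load("random_forest_model.pkl")` restores it for prediction. Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.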
| 37 | + |
| 38 | +### Predicting Malware |
| 39 | + |
| 40 | +To classify a PDF, run the prediction script with the path to the file: |
| 41 | + ``` |
| 42 | + python predict_malware.py path/to/your/pdf_file.pdf |
| 43 | + ``` |
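Because the script takes a file path on the command line, it can help to validate that path before extracting features. The sketch below shows one possible way to do so with the standard library only. The names `looks_like_pdf`, `parse_args`, and `main` are hypothetical, and nothing here is taken from `predict_malware.py`.

```python
import argparse
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file exists and has a %PDF- marker near its start."""
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        # Check a window rather than offset 0 only, since some readers
        # tolerate a few bytes before the header.
        head = handle.read(1024)
    return b"%PDF-" in head


def parse_args(argv=None) -> Path:
    """Parse the single positional argument: the path to the PDF."""
    parser = argparse.ArgumentParser(description="Classify a PDF file.")
    parser.add_argument("pdf_path", type=Path, help="path to the PDF to classify")
    return parser.parse_args(argv).pdf_path


def main(argv=None) -> int:
    pdf_path = parse_args(argv)
    if not looks_like_pdf(pdf_path):
        print(f"error: {pdf_path} is missing or is not a PDF")
        return 2
    # A real script would extract features and call the trained model here.
    print(f"{pdf_path} passed the basic checks")
    return 0
```

Calling `main(["path/to/your/pdf_file.pdf"])` returns 0 when the file passes the checks and 2 otherwise, which a wrapper can pass on as the process exit code.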
| 44 | + |
| 45 | +## Note |
| 46 | + |
| 47 | +This project is for educational and research purposes only. It should not be used as the sole means of determining whether a file is safe. Always use caution when handling potentially malicious files, and consult cybersecurity professionals for comprehensive security measures. |
| 48 | + |
| 49 | +## Future Improvements |
| 50 | + |
| 51 | +- Implement more advanced feature extraction techniques. |
| 52 | +- Explore other machine learning algorithms for potentially better performance. |
| 53 | +- Add a user-friendly interface for easier interaction with the prediction system. |
| 54 | +- Incorporate regular model updates with new malware samples to keep the detection current. |