Microsoft Malware Classification

  • Domain: Cyber-Security
  • Type of problem: Multi-class Classification
  • Objective: Predict the probability of a data-point belonging to each of the nine classes of malware.
  • Performance Metric: Multi-class log loss (lower is better).
  • Solution:
  1. Each malware sample comes with two files: an .asm file and a .bytes file.
  2. The dataset totals 200 GB, making it one of the largest publicly available datasets.
  3. Extracted features from the files (unigrams, bigrams, opcodes, pixel intensity, size). Text extraction and preprocessing ran in batches to avoid running out of memory.
  4. Used a chi-squared hypothesis test for feature selection to reduce dimensionality.
  5. Built and, where needed, calibrated KNN, Logistic Regression, Random Forest, and XGBoost models on different subsets of features.
  6. Achieved a Multi-Class log loss of 0.018.
  • Used Google Cloud Platform (GCP) Compute Engine for memory-intensive tasks and Colab for tasks that required a GPU.
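
For reference, the performance metric named above can be written out in a few lines. This is a minimal, standard-library sketch of multi-class log loss, not the project's own evaluation code; the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions made for illustration.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices (0-based).
    y_prob: list of per-sample probability lists, one entry per class.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0), then renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in probs]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)
```

A model that assigns equal probability to all nine classes scores ln 9 ≈ 2.197 under this metric, which gives a baseline for reading the reported 0.018.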
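
The byte-unigram extraction in step 3 could look roughly like the sketch below. It assumes each line of a .bytes file holds an address token followed by two-character hex byte tokens, with placeholders such as `??` for unreadable bytes; that layout, and the helper names `byte_unigram_counts` and `to_feature_vector`, are assumptions for illustration rather than the project's actual code.

```python
from collections import Counter

def byte_unigram_counts(lines):
    """Count byte tokens from an iterable of .bytes-style lines.

    Assumed line layout: an address token followed by two-character
    hex byte tokens, e.g. "00401000 56 8D 44 24".
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Skip the leading address; count the remaining byte tokens.
        counts.update(token.upper() for token in tokens[1:])
    return counts

def to_feature_vector(counts):
    """Map counts onto 257 slots: 00..FF plus one slot for any other token."""
    vector = [0] * 257
    for token, n in counts.items():
        try:
            index = int(token, 16) if len(token) == 2 else 256
        except ValueError:
            index = 256  # non-hex placeholder such as "??"
        vector[index] += n
    return vector
```

Because the function accepts any iterable of lines, an open file handle can be passed directly and is read one line at a time, so a large file never has to be loaded into memory at once.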
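
The chi-squared feature selection in step 4 can be illustrated with a small standard-library sketch. It scores each non-negative count feature by comparing the feature mass observed in each class with the mass expected if the feature were independent of the class, then keeps the top `k` features. The names `chi2_score` and `select_top_k` are hypothetical, and the project may well have used a library routine instead.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between one non-negative feature and class labels."""
    total = sum(feature)
    n = len(labels)
    if total == 0:
        return 0.0  # a feature that never occurs carries no signal
    observed = {}
    class_size = {}
    for value, label in zip(feature, labels):
        observed[label] = observed.get(label, 0.0) + value
        class_size[label] = class_size.get(label, 0) + 1
    score = 0.0
    for label, size in class_size.items():
        # Expected mass if the feature were spread evenly across samples.
        expected = total * size / n
        score += (observed[label] - expected) ** 2 / expected
    return score

def select_top_k(columns, labels, k):
    """Return sorted indices of the k feature columns with the highest score."""
    scores = [chi2_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Here `columns` is a list of feature columns (one list of values per feature), so a feature that appears only in one class scores high while one spread evenly across classes scores zero.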