A complete malware detection & analysis system built using Machine Learning, PE static analysis, and multi-interface scanning tools — designed to scan .exe files for malicious behavior using Random Forest, XGBoost, and ANN models.
This project uses static PE (Portable Executable) features to train AI models that detect malicious Windows binaries. It supports scanning via:
- ✅ Command Line Interface (CLI)
- ✅ Desktop GUI (Tkinter)
- ✅ Browser App (Streamlit)
Trained on the EMBER 2018 malware dataset with over 600,000 samples, the models reach between 94.9% and 96.8% accuracy. The project is structured like a modern malware lab project and is designed to align with malware analyst roles such as SonicWall’s internship program.
- Language: Python 3.10
- ML Libraries: Scikit-learn, XGBoost, TensorFlow/Keras
- PE Analysis: LIEF (for static binary feature extraction)
- UI: Streamlit (Web), Tkinter (GUI)
- Data: EMBER 2018 dataset (600k malware/goodware samples)
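
To give a feel for what "static" features are, here is a minimal standard-library sketch that computes a byte histogram and Shannon entropy from raw file bytes, two byte-level statistics of the kind EMBER-style feature sets include. It is not the project's extractor, which uses LIEF to parse PE structure; the function names below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return the normalized frequency of each byte value (256 bins)."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Return the entropy of the data in bits per byte (0.0 to 8.0)."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


def looks_like_pe(data: bytes) -> bool:
    """Cheap sanity check: PE files start with the DOS magic bytes 'MZ'."""
    return data[:2] == b"MZ"


def static_features(data: bytes) -> dict:
    """Bundle a few illustrative byte-level features (hypothetical layout)."""
    return {
        "is_pe": looks_like_pe(data),
        "size": len(data),
        "entropy": shannon_entropy(data),
        "histogram": byte_histogram(data),
    }
```

Nothing is executed at any point: the file is only read as bytes, for example with `static_features(Path("sample.exe").read_bytes())`. Very high entropy often hints at packed or encrypted content, though on its own it does not prove a file is malicious.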
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Forest | 96.8% | 97.5% | 96.1% | 96.78% |
| XGBoost | 94.9% | 94.5% | 95.3% | 94.95% |
| ANN (Keras) | 96.4% | 96.6% | 96.1% | 96.39% |
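
For readers who want to see how the four columns relate, the sketch below computes them from true and predicted labels in plain Python. It assumes label `1` means malware and `0` means benign; that encoding and the function names are assumptions for illustration, not necessarily what `compare_models.py` does.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(y_true, y_pred) -> dict:
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumes 1 = malware (positive class) and 0 = benign.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed malware

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }
```

Plugging the Random Forest row's precision (97.5%) and recall (96.1%) into `f1_score` gives roughly 96.8%, which agrees with the table up to rounding of the inputs.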
- 🔍 `.exe` binary scanning using PE features
- 🤖 Trained 3 ML models on 600k sample dataset
- 🧪 Real-world prediction on unknown `.exe` files
- 🖥️ CLI, GUI, and browser-based interfaces
- 📊 Model comparison and performance graphs
- ⚙️ Supports `.pkl` and `.h5` model loading
- 🧰 Easily extendable for Cuckoo Sandbox (dynamic analysis)
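
Supporting both `.pkl` and `.h5` models usually comes down to dispatching on the file extension. The sketch below is one possible way to do that, assuming the `.pkl` files were written with Python's `pickle` module and the `.h5` files are Keras models; the helper name `load_model` is hypothetical and the project's own loader may differ (for example, it may use `joblib`).

```python
import pickle
from pathlib import Path


def load_model(model_path):
    """Load a trained model, choosing the loader from the file extension."""
    path = Path(model_path)
    suffix = path.suffix.lower()

    if suffix == ".pkl":
        # Assumption: scikit-learn / XGBoost models saved with pickle.
        # Only unpickle files you trust: unpickling can run arbitrary code.
        with path.open("rb") as handle:
            return pickle.load(handle)

    if suffix == ".h5":
        # Imported lazily so the .pkl path works without TensorFlow installed.
        from tensorflow.keras.models import load_model as load_keras_model

        return load_keras_model(str(path))

    raise ValueError(f"Unsupported model format: {suffix!r}")
```

Importing TensorFlow lazily keeps startup fast when only the tree-based models are used.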
Please download all large files (models, datasets, binaries) from this Google Drive link:
🔗 📁 Google Drive – Data + Models
- Unzip `ember_data.zip` into the project root as: `ember_data/`
- Place all `.npy` files in the root directory
- Place trained model files (`.pkl`, `.h5`) inside `models/`
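
After placing the files, a quick sanity check of the layout can save a confusing error later. The following sketch verifies the three placement steps above; the function name `check_project_layout` is hypothetical and this script is not part of the repository.

```python
from pathlib import Path


def check_project_layout(root=".") -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(root)
    if not root.is_dir():
        return [f"project root not found: {root}"]

    problems = []

    # Step 1: ember_data.zip unzipped as ember_data/
    if not (root / "ember_data").is_dir():
        problems.append("missing directory: ember_data/")

    # Step 2: preprocessed .npy arrays in the project root
    if not any(root.glob("*.npy")):
        problems.append("no .npy files found in the project root")

    # Step 3: trained models (.pkl or .h5) inside models/
    models_dir = root / "models"
    model_files = []
    if models_dir.is_dir():
        model_files = [
            p for p in models_dir.iterdir() if p.suffix in {".pkl", ".h5"}
        ]
    if not model_files:
        problems.append("no .pkl or .h5 model files found in models/")

    return problems
```

Running `print(check_project_layout())` from the project root lists anything that still needs to be downloaded or moved.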
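Scan a file from the command line:

```bash
python src/predict_exe.py "path/to/sample.exe" --model models/random_forest.pkl
```

A simple interface to upload a `.exe` file and detect whether it is malware or benign using your trained models. Launch the Tkinter desktop GUI:

```bash
python ui/gui_app.py
```

Or start the Streamlit browser app:

```bash
streamlit run ui/web_app.py
```

📁 Folder Structure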
```
AI-Malware-Detector/
│
├── src/                    # ML logic & data handling
│   ├── train_model.py      # Random Forest & XGBoost trainer
│   ├── train_ann.py        # ANN model trainer
│   ├── ember_loader.py     # Loads + processes EMBER data
│   ├── predict_exe.py      # Predicts malware from .exe
│   └── compare_models.py   # Evaluates & visualizes model metrics
│
├── ui/
│   ├── gui_app.py          # Tkinter-based GUI
│   └── web_app.py          # Streamlit browser app
│
├── models/                 # Trained model files (.pkl, .h5)
├── ember_data/             # EMBER dataset (JSONL files)
├── *.npy                   # Preprocessed feature arrays
├── requirements.txt
└── README.md
```
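
The `predict_exe.py` command shown earlier takes a file path plus a `--model` option. A command-line interface of that shape could be sketched with `argparse` as follows. The argument name `exe_path`, the default model path, and the 0.5 decision threshold are assumptions made for illustration; the actual script may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching: predict_exe.py <exe_path> --model <model>."""
    parser = argparse.ArgumentParser(
        prog="predict_exe.py",
        description="Classify a Windows executable as malware or benign.",
    )
    parser.add_argument("exe_path", help="path to the .exe file to scan")
    parser.add_argument(
        "--model",
        default="models/random_forest.pkl",  # assumed default, see lead-in
        help="trained model file (.pkl or .h5)",
    )
    return parser


def label_from_probability(probability: float, threshold: float = 0.5) -> str:
    """Map a predicted malware probability to a verdict string.

    The 0.5 threshold is an assumption; raising it reduces false alarms
    at the cost of missing more malware.
    """
    return "malware" if probability >= threshold else "benign"
```

A real entry point would call `build_parser().parse_args()`, extract features from `exe_path`, run the selected model, and print the verdict.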
- Add dynamic analysis via Cuckoo Sandbox (API, registry, behavior)
- Combine static + dynamic features for hybrid model
- Deploy web app online (e.g., Streamlit Cloud or Render)
- Email alert/report for detected threats
- Real-time dashboard with threat stats
- EMBER 2018 Dataset – Endgame Inc.
- LIEF Library



