A machine learning framework for Distributed Denial of Service (DDoS) attack detection that has so far achieved a 100% F1-score on the `01-12/DrDoS_DNS.csv` file of CICDDoS2019. Features hyperparameter optimization across nine classifiers (Random Forest, SVM, XGBoost, Logistic Regression, KNN, Nearest Centroid, Gradient Boosting, LightGBM, and MLP), WGAN-GP for synthetic data generation, multi-method feature selection (Genetic Algorithms, RFE, PCA), and stacking ensemble evaluation. Validated on CICDDoS2019 benchmark datasets with full reproducibility and cross-platform support.
This project provides a complete end-to-end machine learning pipeline for DDoS (Distributed Denial of Service) attack detection and classification using network flow data. The framework integrates state-of-the-art techniques for data preprocessing, feature engineering, model optimization, and evaluation to achieve robust and accurate intrusion detection across multiple benchmark datasets.
The system is organized into several interconnected modules, each addressing a critical aspect of the machine learning workflow:
1. Data Preparation and Exploration
- Dataset Converter (`dataset_converter.py`): Multi-format conversion utility supporting ARFF, CSV, Parquet, and TXT formats. Performs lightweight structural cleaning and maintains directory hierarchy during conversion.
- Dataset Descriptor (`dataset_descriptor.py`): Generates comprehensive metadata reports including feature types, missing values, class distributions, and 2D t-SNE visualizations for data separability analysis. Produces cross-dataset compatibility reports comparing feature unions and intersections (see the sketch after this list).
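The descriptor's output can be approximated with standard pandas/scikit-learn calls. Below is a minimal, illustrative sketch (not the module's actual code): it assumes a CICDDoS2019-style CSV with a `Label` column, and the file path, sample sizes, and plot settings are placeholders.

```python
# Minimal sketch of a dataset "descriptor": feature types, missing values,
# class distribution, and a 2D t-SNE projection for separability.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical path; any CICDDoS2019-style flow CSV with a "Label" column works.
df = pd.read_csv("Datasets/CICDDoS2019/CSV-01-12/DrDoS_DNS.csv", nrows=20_000)
df.columns = df.columns.str.strip()  # CIC CSVs often ship with padded column names

print(df.dtypes.value_counts())                               # feature types
print(df.isna().sum().sort_values(ascending=False).head(10))  # missing values
print(df["Label"].value_counts())                             # class distribution

# 2D t-SNE on a small numeric sample to gauge class separability.
numeric = df.select_dtypes("number").replace([np.inf, -np.inf], np.nan).dropna(axis=1)
sample = df.sample(n=min(2_000, len(df)), random_state=42)
embedding = TSNE(n_components=2, random_state=42).fit_transform(
    StandardScaler().fit_transform(numeric.loc[sample.index]))

for label in sample["Label"].unique():
    mask = (sample["Label"] == label).to_numpy()
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=4, label=label)
plt.legend()
plt.savefig("tsne_preview.png", dpi=150)
```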
2. Feature Engineering
- Genetic Algorithm (`genetic_algorithm.py`): DEAP-based binary-mask genetic algorithm for optimal feature selection. Uses RandomForest-based fitness evaluation with multi-objective metrics (accuracy, precision, recall, F1, FPR, FNR). Supports population sweeps and exports consolidated results with feature importance rankings.
- Recursive Feature Elimination (`rfe.py`): Automated RFE workflow using RandomForestClassifier to iteratively eliminate less important features. Exports structured run results with feature rankings and performance metrics.
- Principal Component Analysis (`pca.py`): PCA-based dimensionality reduction with configurable component counts. Performs 10-fold Stratified CV evaluation and saves PCA objects for reproducibility (see the sketch after this list).
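For orientation, here is a minimal sketch of the RFE and PCA steps with a 10-fold Stratified CV check, using synthetic data in place of real flows; the estimator choices and parameter values are illustrative, not the exact configuration used by `rfe.py` or `pca.py`.

```python
# Minimal sketch of RFE- and PCA-style feature reduction with a 10-fold
# Stratified CV evaluation. Synthetic data stands in for real flow features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2_000, n_features=40, n_informative=10, random_state=42)

# RFE: iteratively drop the least important features according to a RandomForest.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=15)
rfe.fit(X, y)
print("RFE-selected feature indices:", np.where(rfe.support_)[0])

# PCA: project onto a fixed number of components, then evaluate with 10-fold Stratified CV.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=10),
                          RandomForestClassifier(n_estimators=100, random_state=42))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pca_model, X, y, cv=cv, scoring="f1_macro", n_jobs=-1)
print(f"PCA(10) F1-macro: {scores.mean():.4f} +/- {scores.std():.4f}")
```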
3. Data Augmentation
- WGAN-GP (`wgangp.py`): Wasserstein Generative Adversarial Network with Gradient Penalty for generating synthetic network flow data. Implements conditional generation with residual blocks (DRCGAN-style architecture) for multi-class attack scenarios. Produces high-quality synthetic samples to balance datasets and augment training data (see the sketch below).
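The defining ingredient of WGAN-GP is the gradient penalty that enforces the critic's 1-Lipschitz constraint. The PyTorch sketch below shows only that term on tabular vectors; the critic, batch, and penalty weight (lambda = 10) are illustrative, and the project's conditional generator with residual blocks is not reproduced here.

```python
# Minimal PyTorch sketch of the WGAN-GP gradient penalty on tabular flow vectors.
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, device="cpu"):
    # Interpolate between real and generated samples.
    alpha = torch.rand(real.size(0), 1, device=device)
    mixed = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(outputs=scores, inputs=mixed,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    # Penalize deviation of the gradient norm from 1 (the Lipschitz constraint).
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

n_features = 40                                  # illustrative feature count
critic = nn.Sequential(nn.Linear(n_features, 128), nn.LeakyReLU(0.2),
                       nn.Linear(128, 1))

real = torch.randn(64, n_features)               # stand-in for real flow vectors
fake = torch.randn(64, n_features)               # stand-in for generator output
gp = gradient_penalty(critic, real, fake)
loss_critic = critic(fake).mean() - critic(real).mean() + 10.0 * gp
print(float(gp), float(loss_critic))
```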
4. Model Optimization and Evaluation
- Hyperparameter Optimization (
hyperparameters_optimization.py): Comprehensive hyperparameter tuning for nine classifiers (Random Forest, SVM, XGBoost, Logistic Regression, KNN, Nearest Centroid, Gradient Boosting, LightGBM, MLP). Features parallel evaluation with ThreadPoolExecutor, progress caching, memory-safe worker allocation, and detailed metric tracking (F1, accuracy, precision, recall, MCC, Cohen's kappa, ROC-AUC, FPR, FNR, TPR, TNR). - Stacking Ensemble (
stacking.py): Evaluates individual classifiers and stacking meta-classifiers across GA, RFE, and PCA feature sets. Produces consolidated CSV results with hardware metadata for reproducibility.
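As a point of reference, the sketch below evaluates a small stacking ensemble and derives the confusion-based rates (TPR, TNR, FPR, FNR) together with MCC and Cohen's kappa for a binary task; the base learners, meta-classifier, and synthetic data are illustrative stand-ins, not the project's tuned configuration.

```python
# Minimal sketch of a stacking ensemble evaluation with confusion-based rates,
# MCC, and Cohen's kappa for a binary classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (cohen_kappa_score, confusion_matrix, f1_score,
                             matthews_corrcoef)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=3_000, n_features=30, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5, n_jobs=-1)
stack.fit(X_tr, y_tr)
y_pred = stack.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print(f"F1   : {f1_score(y_te, y_pred):.4f}")
print(f"MCC  : {matthews_corrcoef(y_te, y_pred):.4f}")
print(f"Kappa: {cohen_kappa_score(y_te, y_pred):.4f}")
print(f"TPR  : {tp / (tp + fn):.4f}   FNR: {fn / (fn + tp):.4f}")
print(f"TNR  : {tn / (tn + fp):.4f}   FPR: {fp / (fp + tn):.4f}")
```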
5. Utilities and Infrastructure
- Logger (`Logger.py`): Dual-channel logger preserving ANSI color codes for terminal output while maintaining clean log files (see the sketch after this list).
- Telegram Bot (`telegram_bot.py`): Notification system for long-running experiments, supporting message splitting for Telegram's character limits.
- Makefile: Automation for all pipeline stages with cross-platform support (Windows, Linux, macOS) and detached execution modes.
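The dual-channel idea behind `Logger.py` can be illustrated with the standard `logging` module: the console handler keeps ANSI colors, while the file handler's formatter strips them. The class and handler names below are illustrative, not the project's actual API.

```python
# Minimal sketch of a dual-channel logger: ANSI colors reach the terminal,
# while the file handler's formatter strips escape codes for a clean log file.
import logging
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

class StripAnsiFormatter(logging.Formatter):
    """Formatter that removes ANSI escape codes before writing to file."""
    def format(self, record):
        return ANSI_RE.sub("", super().format(record))

logger = logging.getLogger("ddos-detector")
logger.setLevel(logging.INFO)

console = logging.StreamHandler()              # terminal keeps the colors
console.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

file_handler = logging.FileHandler("run.log")  # log file stays clean
file_handler.setFormatter(StripAnsiFormatter("%(asctime)s %(levelname)s %(message)s"))

logger.addHandler(console)
logger.addHandler(file_handler)

logger.info("\x1b[32mTraining finished successfully\x1b[0m")
```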
- Multi-Dataset Support: Designed for CICDDoS2019, CIC-IDS-2017, and compatible datasets with shared feature definitions
- Feature Reusability: GA-selected features can be reused across compatible datasets without retraining
- Comprehensive Metrics: Tracks standard metrics (accuracy, precision, recall, F1) plus confusion-based rates (FPR, FNR, TPR, TNR), MCC, Cohen's kappa, and ROC-AUC
- Progress Persistence: Checkpoint saving and caching for resumable long-running optimizations
- Hardware Awareness: Automatic detection of CPU model, cores, RAM, GPU availability (ThunderSVM), and memory-safe parallel worker allocation
- Cross-Platform: Unified codebase with OS-specific adaptations for sound notifications, path handling, and system information retrieval
The typical pipeline execution follows this sequence:
- Download/Convert Datasets: Obtain raw data using `download_datasets.sh` or convert existing formats with `dataset_converter.py`
- Describe Datasets: Generate metadata reports and t-SNE visualizations using `dataset_descriptor.py`
- Feature Selection: Run `genetic_algorithm.py`, `rfe.py`, and `pca.py` to extract optimal feature subsets
- Hyperparameter Tuning: Optimize individual classifiers with `hyperparameters_optimization.py` using GA-selected features
- Ensemble Evaluation: Compare stacking and individual models across feature sets with `stacking.py`
- Optional Augmentation: Generate synthetic samples using `wgangp.py` for dataset balancing
- Results Analysis: Consolidated CSV outputs in the `Feature_Analysis/` and `Classifiers_Hyperparameters/` directories
This modular architecture enables researchers to execute the complete pipeline end-to-end or run individual components for targeted analysis.
This section provides instructions for installing Git, Python, Pip, and Make, cloning the repository (if not done yet), and installing all required project dependencies.
Git is a distributed version control system that is widely used for tracking changes in source code during software development. In this project, Git is used to clone the repository and its submodules. To install Git, follow the instructions below based on your operating system:
To install Git on Linux, run:

```bash
sudo apt install git -y # For Debian-based distributions (e.g., Ubuntu)
```

To install Git on macOS, you can use Homebrew:

```bash
brew install git
```

If you don't have Homebrew installed, you can install it by running the following command in your terminal:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

On Windows, you can download Git from the official website here and follow the installation instructions provided there.
Now that Git is installed, clone this repository with all required submodules:

```bash
git clone --recurse-submodules https://github.com/BrenoFariasdaSilva/DDoS-Detector.git
```

If you clone without submodules (not recommended):

```bash
git clone https://github.com/BrenoFariasdaSilva/DDoS-Detector
```

To initialize submodules manually:

```bash
cd DDoS-Detector # Only if not in the repository root directory yet
git submodule init
git submodule update
```

You must have Python 3, Pip, and the venv module installed.
To install them on Linux, run:

```bash
sudo apt install python3 python3-pip python3-venv -y
```

On macOS, use Homebrew:

```bash
brew install python3
```

On Windows, if you do not have Chocolatey installed, you can install it by running the following command in an elevated PowerShell (Run as Administrator):

```powershell
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
```

Once Chocolatey is installed, you can install Python using:

```powershell
choco install python3
```

Or download the installer from the official website here and follow the installation instructions provided there. Make sure to check the option "Add Python to PATH" during installation and restart your terminal/computer.
Make is used to run automated tasks defined in the project's Makefile, such as setting up environments, executing scripts, and managing Python dependencies.
On Linux:

```bash
sudo apt install make -y
```

On macOS:

```bash
brew install make
```

On Windows, Make is available via Cygwin, MSYS2, or WSL.
Install the project dependencies with the following command:

```bash
cd DDoS-Detector # Only if not in the repository root directory yet
make dependencies
```

This command will create a virtual environment in the `.venv` folder and install all required dependencies listed in the `requirements.txt` file.
This repository includes a shell script `download_datasets.sh` (at the repository root) that can automatically download and extract several datasets used by the project. The current script downloads and extracts:
- CICDDoS2019 (two CSV ZIPs: `CSV-01-12.zip` and `CSV-03-11.zip`) into `Datasets/CICDDoS2019`
- CIC-IDS-2017 (the labelled flows ZIP) into `Datasets/CICIDS2017`
The script behavior:
- Creates the main `Datasets` directory if it does not exist.
- For each configured dataset, creates the target directory (from `DATASET_DIRS`).
- Downloads the configured ZIP file(s) with `wget -c` (resumable downloads).
- Extracts each ZIP using `unzip -o` into the dataset directory.
Requirements: `wget` and `unzip` must be installed and available on your PATH, and you must have an active internet connection.
Usage (from the repository root):
```bash
make download_datasets
```

After a successful run, the files will be available under the `Datasets` subfolders, for example:

- `Datasets/CICDDoS2019/CSV-01-12.zip` (unzipped CSV files will be inside this directory)
- `Datasets/CICDDoS2019/CSV-03-11.zip`
- `Datasets/CICIDS2017/GeneratedLabelledFlows.zip`
Note: the ZIP filenames and the exact extraction layout depend on the upstream archive contents; check the target folder after extraction.
Configuring the script
- The script is driven by two associative arrays at the top of `download_datasets.sh`:
  - `DATASET_URLS`: maps a short dataset key to the download URL.
  - `DATASET_DIRS`: maps the same dataset key to the local target directory under `Datasets/`.
- To select which datasets the script downloads, edit `download_datasets.sh` and comment/uncomment the relevant entries in `DATASET_URLS` (or remove entries you don't want). The script iterates over the keys present in `DATASET_URLS` and downloads whatever is configured there.
- To change where a dataset is extracted, update the corresponding value in `DATASET_DIRS` for that key. Example:

  ```bash
  # in download_datasets.sh
  DATASET_URLS=( [CICDDoS2019_CSV_01_12]="http://.../CSV-01-12.zip" )
  DATASET_DIRS=( [CICDDoS2019_CSV_01_12]="Datasets/CICDDoS2019" )
  ```

- To add another dataset: add a new key to `DATASET_URLS` with its URL, and add the same key to `DATASET_DIRS` with the desired target folder. Save the file and re-run `./download_datasets.sh`.
Manually downloaded datasets
If you prefer to download datasets manually, create the `Datasets` directory (if needed):

```bash
mkdir -p Datasets
```

Then create a subfolder per dataset and place the downloaded CSV(s) or extracted files there. Example structure:

```
Datasets/
  CICDDoS2019/
    CSV-01-12/         # Extracted CSVs from the first archive
    CSV-03-11/         # Extracted CSVs from the second archive
  CICIDS2017/
    TrafficLabelling/  # Extracted CSVs
```

Primary datasets used in this project:
- https://www.unb.ca/cic/datasets/ddos-2019.html (CICDDoS2019)
- https://www.unb.ca/cic/datasets/ids-2017.html (CIC-IDS-2017)
These datasets were chosen because they share similar feature definitions. This allows feature subsets extracted via the Genetic Algorithm to be reused across multiple datasets, avoiding the need to retrain models from scratch for each dataset/file.
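As an illustration of this reuse, the sketch below loads a previously saved feature list and applies it to a compatible dataset; the file names, JSON layout, and `Label` column are assumptions for illustration, not the project's actual output format.

```python
# Minimal sketch of reusing a GA-selected feature subset on a compatible dataset.
# Paths and the JSON layout of the saved feature list are hypothetical.
import json
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

with open("Feature_Analysis/ga_selected_features.json") as fh:   # hypothetical file
    selected = json.load(fh)                                     # e.g. ["Flow Duration", ...]

df = pd.read_csv("Datasets/CICIDS2017/TrafficLabelling/some_day.csv")  # hypothetical file
df.columns = df.columns.str.strip()

missing = [c for c in selected if c not in df.columns]
if missing:
    raise ValueError(f"Dataset is missing expected features: {missing}")

X = df[selected].apply(pd.to_numeric, errors="coerce").fillna(0)
y = df["Label"]

clf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
clf.fit(X, y)
print(f"Train accuracy on reused feature subset: {clf.score(X, y):.4f}")
```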
Using other datasets
You may use additional datasets as long as they are compatible with the project's preprocessing pipeline. Ensure any new dataset is adapted to match the expected CSV format, column names (features), and label conventions used by the project. If necessary, use the provided dataset utilities (e.g., `dataset_converter.py` / `dataset_descriptor.py`) to convert or normalize new datasets to the project's expected format; a minimal normalization sketch follows.
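For example, assuming a hypothetical external CSV whose label column is named `class` and whose benign label is `normal`, a normalization pass might look like this (adapt the paths and mappings to your data):

```python
# Minimal sketch of normalizing an external dataset toward the project's
# expected conventions: trimmed column names, a unified "Label" column,
# and finite numeric values. Paths and the renaming map are illustrative.
import numpy as np
import pandas as pd

df = pd.read_csv("Datasets/MyNewDataset/flows.csv")   # hypothetical file

# Trim whitespace and unify the label column name.
df.columns = df.columns.str.strip()
df = df.rename(columns={"class": "Label"})            # assumed original name

# Harmonize label conventions (example mapping only).
df["Label"] = df["Label"].replace({"normal": "BENIGN"})

# Coerce features to numeric, replace infinities, and drop unusable rows.
numeric_cols = df.columns.drop("Label")
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
df = df.replace([np.inf, -np.inf], np.nan).dropna()

df.to_csv("Datasets/MyNewDataset/flows_normalized.csv", index=False)
```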
📊 For detailed experimental results and performance benchmarks, please see RESULTS.md.
The Results document contains comprehensive outputs from all modules including:
- Dataset preparation and cross-dataset compatibility analysis
- Feature engineering results (Genetic Algorithm, RFE, PCA) with actual performance metrics
- Model optimization results across nine classifiers (Random Forest, SVM, XGBoost, Logistic Regression, KNN, Nearest Centroid, Gradient Boosting, LightGBM, MLP)
- Stacking ensemble evaluation results
- Benchmark performance summary with detailed comparison tables
- Methodological notes on reproducibility and deterministic methods
Note: The results section has been moved to a separate file as it contains extensive technical details and experimental data that may not be of interest to all readers.
If you use the DDoS-Detector in your research, please cite it using the following BibTeX entry:
```bibtex
@misc{softwareDDoS-Detector:2025,
  title = {A Framework for DDoS Attack Detection Using Hyperparameter Optimization, WGAN-GP–Based Data Augmentation, Feature Extraction via Genetic Algorithms, RFE, and PCA, with Ensemble Classifiers and Multi-Dataset Evaluation},
  author = {Breno Farias da Silva},
  year = {2025},
  howpublished = {https://github.com/BrenoFariasdaSilva/DDoS-Detector},
  note = {Accessed on October 6, 2026}
}
```
Additionally, a `main.bib` file is available in the root directory of this repository, which contains the BibTeX entry for this project.
If you find this repository valuable, please don't forget to give it a ⭐ to show your support! Contributions are highly encouraged, whether by creating issues for feedback or submitting pull requests (PRs) to improve the project. For details on how to contribute, please refer to the Contributing section below.
Thank you for your support and for recognizing the contribution of this tool to your work!
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated, and suggestions for improving the code are highly welcome. To contribute, please follow the guidelines below or read the CONTRIBUTING.md file, which describes the commit standards and the entire pull request process. Following these guidelines will make your contributions smooth and effective:
- Set Up Your Environment: Ensure you've followed the setup instructions in the Setup section to prepare your development environment.
- Make Your Changes:
  - Create a Branch: `git checkout -b feature/YourFeatureName`
  - Implement Your Changes: Make sure to test your changes thoroughly.
  - Commit Your Changes: Use clear commit messages, for example:
    - For new features: `git commit -m "FEAT: Add some AmazingFeature"`
    - For bug fixes: `git commit -m "FIX: Resolve Issue #123"`
    - For documentation: `git commit -m "DOCS: Update README with new instructions"`
    - For refactorings: `git commit -m "REFACTOR: Enhance component for better aspect"`
    - For snapshots: `git commit -m "SNAPSHOT: Temporary commit to save the current state for later reference"`
  - See more about crafting commit messages in the CONTRIBUTING.md file.
- Submit Your Contribution:
  - Push Your Changes: `git push origin feature/YourFeatureName`
  - Open a Pull Request (PR): Navigate to the repository on GitHub and open a PR with a detailed description of your changes.
- Stay Engaged: Respond to any feedback from the project maintainers and make necessary adjustments to your PR.
- Celebrate: Once your PR is merged, celebrate your contribution to the project!
We thank the following people who contributed to this project:
- Breno Farias da Silva
This project is licensed under the Apache License 2.0. This license permits use, modification, distribution, and sublicense of the code for both private and commercial purposes, provided that the original copyright notice and a disclaimer of warranty are included in all copies or substantial portions of the software. It also requires a clear attribution back to the original author(s) of the repository. For more details, see the LICENSE file in this repository.
