Project for the IoT Data Analytics course at the University of Salerno.
This project provides a framework to analyze and compare the performance of different clustering algorithms on simulated data streams. It evaluates how traditional batch models and specialized streaming algorithms perform in both stable (stationary) and changing (concept drift) environments.
You can run the experiment in two different modes to test the models under different conditions.
The stationary mode simulates a stable environment in which the underlying data distribution does not change over time.
- Dataset: Uses the Digits dataset, which is included in the `scikit-learn` library and requires no download.
- Goal: Evaluate the accuracy and efficiency of the algorithms in a predictable and well-defined environment.
The concept drift mode simulates a dynamic environment in which the data properties change over time, mimicking real-world situations such as sensor degradation or shifting user behavior.
- Dataset: Uses the Gas Sensor Array Drift dataset, which must be provided as a `.zip` archive.
- Goal: Test the adaptability of the algorithms and their ability to handle changes.
The project compares four distinct clustering strategies:
- K-Means: A classic K-Means model trained only once on the initial data. It's fast but not adaptive.
- K-Means (with Retraining): A K-Means model that is periodically retrained on new data.
- CluStream: A streaming algorithm designed for high-speed data, using micro-clusters to summarize the stream.
- DenStream: A density-based streaming algorithm that can find clusters of arbitrary shapes and is robust to noise.
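To make the difference between the first two strategies concrete, here is a minimal sketch that runs both over the Digits data treated as a stream. It is not the project's implementation: the chunk sizes, the choice to retrain on the most recent chunk only, and every name in it (`INITIAL_SIZE`, `RETRAIN_EVERY`, `fit_kmeans`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

# Assumed values for illustration; the project's real settings may differ.
INITIAL_SIZE = 300    # points used for the initial fit
RETRAIN_EVERY = 300   # size of each stream chunk / retraining interval
N_CLUSTERS = 10       # Digits contains ten classes

X, y = load_digits(return_X_y=True)
order = np.random.default_rng(0).permutation(len(X))  # emulate arrival order
X, y = X[order], y[order]


def fit_kmeans(data):
    """Fit a fresh K-Means model on the given points."""
    return KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(data)


static_model = fit_kmeans(X[:INITIAL_SIZE])  # trained once, never updated
retrained_model = static_model               # starts identical, then refreshed

static_scores, retrained_scores = [], []
for start in range(INITIAL_SIZE, len(X), RETRAIN_EVERY):
    chunk, labels = X[start:start + RETRAIN_EVERY], y[start:start + RETRAIN_EVERY]
    # Score each chunk separately: cluster ids change after every retraining,
    # so predictions from different model versions must not be mixed.
    static_scores.append(adjusted_rand_score(labels, static_model.predict(chunk)))
    retrained_scores.append(adjusted_rand_score(labels, retrained_model.predict(chunk)))
    # Periodic retraining (assumption: on the most recent chunk only).
    retrained_model = fit_kmeans(chunk)

print("static    mean ARI:", float(np.mean(static_scores)))
print("retrained mean ARI:", float(np.mean(retrained_scores)))
```

On stationary data the two variants should score similarly; the retrained variant is expected to pull ahead only when the distribution actually drifts.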
Follow these steps to set up your environment and run the analysis.
It is highly recommended to use a Python virtual environment to keep dependencies isolated.
With the environment activated, run this single command to install all required libraries:
pip install pandas numpy scikit-learn matplotlib river requests
- Prepare the Datasets:
- Stationary Scenario: No preparation required. The Digits dataset ships with `scikit-learn`, so the script loads it without any download.
- Concept Drift Scenario: Place the `.zip` archive containing the Gas Sensor Array Drift dataset in the project folder. You must extract the archive to create the `batch*.dat` files needed by the script.
- Choose your scenario: Open the `script.py` file and edit the `RUN_STATIONARY_SCENARIO` flag in the `ExperimentSettings` class.
- Run the script
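As a rough illustration of the scenario switch, the settings class might look like the sketch below. Only the names `ExperimentSettings`, `RUN_STATIONARY_SCENARIO`, and `DATASET_FRACTION` come from this README; the default values, the assumed meaning of the fraction, and the `scenario_name` helper are hypothetical, so check `script.py` for the real layout.

```python
class ExperimentSettings:
    """Hypothetical layout; only the attribute names are taken from the README."""

    # True  -> stationary scenario (Digits dataset)
    # False -> concept drift scenario (Gas Sensor Array Drift dataset)
    RUN_STATIONARY_SCENARIO = True

    # Assumed to be the share of the dataset to process, in (0, 1].
    DATASET_FRACTION = 1.0


def scenario_name(settings=ExperimentSettings):
    """Hypothetical helper: report which scenario the flag selects."""
    return "stationary" if settings.RUN_STATIONARY_SCENARIO else "concept drift"


print(scenario_name())
```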
After the script finishes, a summary chart will be displayed. This chart allows you to compare the models across three key performance indicators:
- Adjusted Rand Index (ARI): Measures the accuracy of the clustering against the true labels. A score of 1.0 is a perfect match, while a score near 0.0 is no better than random assignment.
- Silhouette Score: Measures the quality of the clusters without using the true labels. It assesses how dense and well-separated the clusters are; scores range from -1 to 1, with higher values being better.
- Throughput (points/s): Measures the efficiency of the algorithm as the number of data points processed per second.
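For readers who want to see what two of these indicators compute, the following standard-library sketch implements ARI from its pair-counting definition and times a per-point callback to obtain throughput. It is meant as an explanation only; the function names are illustrative and the project itself may simply rely on `scikit-learn` for the scores.

```python
import time
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts over the contingency table of the two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_in_true * same_in_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_in_true + same_in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_in_both - expected) / (maximum - expected)


def throughput(process_point, points):
    """Points processed per second for a per-point callback."""
    start = time.perf_counter()
    for point in points:
        process_point(point)
    elapsed = time.perf_counter() - start
    return len(points) / elapsed if elapsed > 0 else float("inf")


# Same grouping under different cluster ids still counts as a perfect match.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(throughput(lambda p: p * p, list(range(10_000))))
```

Because ARI compares groupings rather than label values, renaming the clusters does not change the score, which is why it suits clustering output whose ids are arbitrary.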
Here are some solutions for common issues you might encounter.
- If computation takes too long, reduce the `DATASET_FRACTION` variable.
- The online algorithms are highly sensitive to their parameters. If you switch to a new dataset, you must tune their hyperparameters in `ExperimentSettings` to match the new data's scale and density.