Instance Space Analysis is a methodology for the assessment of the strengths and weaknesses of an algorithm, and an approach to objectively compare algorithmic power without bias introduced by restricted choice of test instances. At its core is the modelling of the relationship between structural properties of an instance and the performance of a group of algorithms. Instance Space Analysis allows the construction of footprints for each algorithm, defined as regions in the instance space where we statistically infer good performance. Other insights that can be gathered from Instance Space Analysis include:
- Objective metrics of each algorithm’s footprint across the instance space as a measure of algorithmic power;
- Explanation through visualisation of how instance features correlate with algorithm performance in various regions of the instance space;
- Visualisation of the distribution and diversity of existing benchmark and real-world instances;
- Assessment of the adequacy of the features used to characterise an instance;
- Partitioning of the instance space into recommended regions for automated algorithm selection;
- Distinguishing areas of the instance space where it may be useful to generate additional instances to gain further insights.
The unique advantage of visualising algorithm performance in the instance space, rather than as a small set of summary statistics averaged across a selected collection of instances, is the nuanced analysis it enables: explaining strengths and weaknesses, and examining interesting variations in performance that may be hidden by tables of summary statistics.
This repository provides a set of Python tools to carry out a complete Instance Space Analysis in an automated pipeline. We expect it to become the computational engine that powers the Melbourne Algorithm Test Instance Library with Data Analytics (MATILDA) web tools for online analysis. Further information on the Instance Space Analysis methodology can be found here.
If you follow the Instance Space Analysis methodology, please cite as follows:
K. Smith-Miles and M.A. Muñoz. Instance Space Analysis for Algorithm Testing: Methodology and Software Tools. ACM Comput. Surv. 55(12:255),1-31 DOI:10.1145/3572895, 2023.
Also, if you specifically use this code, please cite as follows:
TBD
DISCLAIMER: This repository contains research code. On occasion, new features will be added or changes made that may result in crashes. Although we have made every effort to reduce bugs, this code comes with NO GUARANTEES. If you find issues, let us know ASAP through the contact methods described at the end of this document.
** To be expanded **
Run `pip install ./matilda-0.1.0.tar.gz`
An example run can be found in `integration_demo.py`.
An example plugin can be found in `example_plugin.py`.
Run `pdoc matilda`
See the pdoc documentation for instructions on exporting static HTML files for hosting on GitHub Pages.
REQUIREMENTS:
- Python 3.12 installed
- Be inside the repository directory
Linux, Mac, WSL
`curl -sSL https://install.python-poetry.org | python3 -`
Windows
`(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -`
`poetry shell`
`poetry install`
We will update this explanation in the coming months. Below is a copy of the description of the several inputs, which are passed as a JSON file.
The metadata.csv file should contain a table where each row corresponds to a problem instance, and each column must strictly follow the naming convention mentioned below:
- `instances` instance identifier. We expect the instance identifier to be of type "String". This column is mandatory.
- `source` instance source. This column is optional.
- `feature_name` the keyword "feature_" concatenated with the feature name. For instance, if the feature name is "density", the column header should be "feature_density". If the name consists of more than one word, each word should be separated by "_" (spaces are not allowed). There must be more than two features for the software to work. We expect the features to be of type "Double".
- `algo_name` the keyword "algo_" concatenated with the algorithm name. For instance, if the algorithm name is "Greedy", the column header should be "algo_greedy". If the name consists of more than one word, each word should be separated by "_" (spaces are not allowed). You can add the performance of more than one algorithm in the same .csv file. We expect the algorithm performance to be of type "Double".
Moreover, empty cells, NaN or null values are allowed but not recommended. We expect you to handle missing values in your data before processing. You may use this file as a reference.
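To make the naming convention concrete, here is a small hypothetical metadata.csv built with pandas; the instance identifiers, feature names, and values are made up purely for illustration:

```python
# Hypothetical metadata.csv following the naming convention above.
# Instance names, features, and performance values are illustrative only.
import pandas as pd

metadata = pd.DataFrame(
    {
        "instances": ["inst_001", "inst_002", "inst_003"],
        "source": ["benchmark_a", "benchmark_a", "generated"],
        "feature_density": [0.42, 0.77, 0.13],
        "feature_avg_degree": [3.1, 5.6, 2.0],
        "feature_clustering_coeff": [0.25, 0.61, 0.09],
        "algo_greedy": [0.88, 0.95, 0.40],
        "algo_local_search": [0.91, 0.89, 0.55],
    }
)
metadata.to_csv("metadata.csv", index=False)
```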
The script example.m constructs a structure that contains all the settings used by the code. Broadly, there are settings required for the analysis itself, settings for the pre-processing of the data, and output settings. The first are divided into general, dimensionality reduction, bound estimation, algorithm selection and footprint construction settings. For the second, the toolkit has routines for bounding outliers, scaling the data and selecting features.
- `opts.perf.MaxPerf` determines whether the algorithm performance values provided are efficiency measures that should be maximised (set as `TRUE`), or cost measures that should be minimised (set as `FALSE`).
- `opts.perf.AbsPerf` determines whether good performance is defined absolutely, e.g., misclassification error is lower than 20% (set as `TRUE`), or relative to the best performing algorithm, e.g., misclassification error is within 5% of that of the best algorithm (set as `FALSE`).
- `opts.perf.epsilon` corresponds to the threshold used to calculate good performance. It must be of type "Double".
- `opts.general.betaThreshold` corresponds to the fraction of algorithms in the portfolio that must have good performance on an instance for it to be considered an easy instance. It must be a value between 0 and 1.
- `opts.parallel.flag` determines whether parallel processing will be available (set as `TRUE`) or not (set as `FALSE`). The toolkit makes use of MATLAB's `parpool` functionality to create a multisession environment on the local machine.
- `opts.parallel.ncores` number of available cores for parallel processing.
- `opts.selvars.smallscaleflag` by setting this flag as `TRUE`, you can carry out a small-scale experiment using a randomly selected fraction of the original data. This is useful if you have a large dataset with more than 1000 instances and you want to explore the parameters of the model.
- `opts.selvars.smallscale` fraction of the original data used in the small-scale experiment.
- `opts.selvars.fileidxflag` by setting this flag as `TRUE`, you can carry out a small-scale experiment. This time you must provide a .csv file that contains, in one column, the indices of the instances to be taken. This may be useful if you want to make a more controlled experiment than just randomly selecting instances.
- `opts.selvars.fileidx` name of the file containing the indices of the instances.
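As a hypothetical example, the fragment below shows how these fields might be laid out in the options file, expressed as a Python dictionary ready to be serialised to JSON; the key names mirror the MATLAB-style field names above and all values are placeholders:

```python
# Hypothetical fragment of the options file; key names mirror the MATLAB-style
# fields described above and all values are placeholders.
options_fragment = {
    "perf": {"MaxPerf": False, "AbsPerf": True, "epsilon": 0.20},
    "general": {"betaThreshold": 0.55},
    "parallel": {"flag": False, "ncores": 4},
    "selvars": {
        "smallscaleflag": False,
        "smallscale": 0.50,
        "fileidxflag": False,
        "fileidx": "selected_instances.csv",
    },
}
```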
The toolkit uses PILOT as a dimensionality reduction method, with BFGS as the numerical solver. Technical details can be found here.
- `opts.pilot.analytic` determines whether the analytic (set as `TRUE`) or the numerical (set as `FALSE`) solution to the dimensionality reduction problem should be used. We recommend leaving this setting as `FALSE`, because the analytical solution can be unstable due to possible poor conditioning.
- `opts.pilot.ntries` number of times the numerical solution is attempted.
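A hypothetical fragment for these two settings, with placeholder values:

```python
# Hypothetical PILOT fragment of the options file; values are placeholders.
pilot_fragment = {"pilot": {"analytic": False, "ntries": 5}}
```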
The toolkit uses CLOISTER, an algorithm based on correlation to detect the empirical bounds of the Instance Space.
- `opts.cloister.cthres` determines the maximum Pearson correlation coefficient that would indicate non-correlated variables. The lower this value is, the more stringent the algorithm; hence, it is less likely to produce a good bound.
- `opts.cloister.pval` determines the p-value of the Pearson correlation coefficient that indicates no correlation.
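Purely as an illustration of the test these two thresholds describe (not the toolkit's code), the sketch below uses scipy to decide whether two synthetic variables should be treated as uncorrelated:

```python
# Illustration of the correlation test behind cthres/pval (not the toolkit's code):
# two variables are treated as uncorrelated when |r| is below the threshold or the
# p-value is above the chosen significance level. Thresholds are placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

r, p = pearsonr(x, y)
cthres, pval = 0.7, 0.05                      # placeholder thresholds
uncorrelated = (abs(r) < cthres) or (p > pval)
print(f"r={r:.2f}, p={p:.3f}, treated as uncorrelated: {uncorrelated}")
```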
The toolkit uses SVMs with radial basis kernels as algorithm selection models, through MATLAB's Statistics and Machine Learning Toolbox or LIBSVM.
- `opts.pythia.uselibsvm` determines whether to use LIBSVM (set as `TRUE`) or MATLAB's implementation of an SVM (set as `FALSE`); a different method is used to fine-tune the parameters depending on the choice. For the former, tuning is achieved using 30 iterations of a random search algorithm, using a Latin hypercube design within predefined bounds as sample points, with k-fold stratified cross-validation (CV), and using model error as the loss function. For the latter, tuning is achieved using 30 iterations of a Bayesian optimisation algorithm within predefined bounds, with k-fold stratified CV.
- `opts.pythia.cvfolds` number of folds of the CV experiment.
- `opts.pythia.ispolykrnl` determines whether to use a polynomial (set as `TRUE`) or Gaussian (set as `FALSE`) kernel. Usually, the latter is significantly faster to calculate and more accurate; however, it also has the disadvantage of producing discontinuous areas of good performance, which may look overfitted. We tend to recommend a polynomial kernel if the dataset is larger than 1000 instances.
- `opts.pythia.useweights` determines whether weighted (set as `TRUE`) or unweighted (set as `FALSE`) classification is performed.
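The Python port's implementation is not described here; as an illustration of the general approach only (an RBF-kernel SVM tuned by random search with stratified k-fold cross-validation), here is a minimal scikit-learn sketch with synthetic data and hypothetical parameter bounds:

```python
# Illustrative sketch (not the toolkit's code): RBF-kernel SVM tuned by random
# search with stratified k-fold CV, predicting "good performance" labels.
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 2))                  # synthetic 2-D instance-space coordinates
y_good = Z[:, 0] + 0.5 * Z[:, 1] > 0           # placeholder "good performance" labels

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(2**-10, 2**4),      # placeholder bounds
                         "gamma": loguniform(2**-10, 2**4)},  # placeholder bounds
    n_iter=30,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(Z, y_good)
print(search.best_params_, search.best_score_)
```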
The toolkit uses TRACE, an algorithm based on MATLAB's polyshapes, to define the regions in the space where we statistically infer good algorithm performance. The polyshapes are then pruned to remove those sections for which the evidence, as defined by a minimum purity value, is poor or non-existent.
- `opts.trace.usesim` makes use of the actual (set as `FALSE`) or simulated data from the SVM results (set as `TRUE`) to produce the footprints.
- `opts.trace.PI` minimum purity required for a section of a footprint.
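TRACE itself relies on MATLAB polyshapes; the sketch below is only a rough, simplified illustration of the footprint idea, assuming a convex-hull region and taking purity as the fraction of enclosed instances with good performance:

```python
# Rough illustration of the footprint/purity idea (a convex-hull simplification,
# not TRACE itself): build a hull over "good" instances and compute the fraction
# of enclosed instances that are actually good.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
Z = rng.normal(size=(400, 2))                 # synthetic 2-D instance-space coordinates
good = Z[:, 0] > 0                            # placeholder "good performance" labels

hull = Delaunay(Z[good])                      # triangulated hull over good instances
inside = hull.find_simplex(Z) >= 0            # instances enclosed by the footprint
purity = good[inside].mean()                  # fraction of enclosed instances that are good
print(f"purity of the footprint: {purity:.2f}")
```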
The toolkit implements simple routines to bound outliers and scale the data. These routines are by no means perfect, and users should pre-process their data independently if preferred. However, the automatic bounding and scaling routines should give some idea of the kind of results that may be achieved. In general, we recommend that the data be transformed to become close to normally distributed, due to the linear nature of PILOT's optimal projection algorithm.
- `opts.auto.preproc` turns on (set as `TRUE`) the automatic pre-processing.
- `opts.bound.flag` turns on (set as `TRUE`) data bounding. This sub-routine calculates the median and the interquartile range (IQR) of each feature and performance measure, and bounds the data to the median plus or minus five times the IQR.
- `opts.norm.flag` turns on (set as `TRUE`) scaling. This sub-routine scales each feature and performance measure into a positive range. Then it calculates a Box-Cox transformation to stabilise the variance, and a Z-transformation to standardise the data. The result is features and performance measures that are close to normally distributed.
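A minimal sketch of the pre-processing described above, assuming median ± 5×IQR bounding followed by a shift into a positive range, a Box-Cox transformation, and a Z-transformation (not necessarily the toolkit's exact implementation):

```python
# Sketch of the described pre-processing (not necessarily the toolkit's exact code):
# bound each column to median +/- 5*IQR, shift to a positive range, Box-Cox, z-score.
import numpy as np
from scipy import stats

def bound_and_scale(X: np.ndarray) -> np.ndarray:
    X = X.copy().astype(float)
    med = np.median(X, axis=0)
    iqr = stats.iqr(X, axis=0)
    X = np.clip(X, med - 5 * iqr, med + 5 * iqr)  # bound outliers
    X = X - X.min(axis=0) + 1.0                   # shift into a strictly positive range
    for j in range(X.shape[1]):
        X[:, j], _ = stats.boxcox(X[:, j])        # stabilise variance
    return stats.zscore(X, axis=0)                # standardise to zero mean, unit variance

features = np.random.default_rng(2).lognormal(size=(100, 4))   # synthetic feature matrix
print(bound_and_scale(features).std(axis=0))      # roughly 1 for each column
```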
The toolkit implements SIFTED, a routine to select features given their cross-correlation and correlation to performance. Ideally, we want the smallest number of orthogonal and predictive features. This routine is by no means perfect, and users should select features independently if preferred. In general, we recommend using no more than 10 features as input to PILOT's optimal projection algorithm, due to the numerical nature of its solution and issues in identifying meaningful linear trends.
- `opts.sifted.flag` turns on (set as `TRUE`) the automatic feature selection. SIFTED is composed of two sub-processes. In the first, SIFTED calculates the Pearson correlation coefficient between each feature and the performance, takes its absolute value, and sorts the features from largest to lowest. It then takes all features that have a correlation above the threshold, automatically bounding itself to a minimum of 3 features. In the second, SIFTED uses the Pearson correlation coefficient as a dissimilarity metric between features, and k-means clustering is used to identify groups of similar features. To select one feature per group, the algorithm first projects the subset of selected features into two dimensions using Principal Component Analysis (PCA), and then uses Random Forests to predict whether an instance is easy or not for a given algorithm. The subset of features that gives the most accurate models is selected. This section of the routine is potentially very expensive computationally due to the multi-layer training process; however, it is our currently recommended approach to select the most relevant features. This routine tests all possible combinations if there are fewer than 1000, or uses a combination of a Genetic Algorithm and a look-up table otherwise.
- `opts.sifted.rho` correlation threshold indicating the lowest acceptable absolute correlation between a feature and performance. It should be a value between 0 and 1.
- `opts.sifted.K` number of clusters, which corresponds to the final number of features returned. The routine assumes at least 3 clusters and no more than the number of features. Ideally it should not be a value larger than 10.
- `opts.sifted.NTREES` number of trees used by the Random Forest models. Usually, this setting does not need tuning.
- `opts.sifted.MaxIter` number of iterations used to converge the k-means algorithm. Usually, this setting does not need tuning.
- `opts.sifted.Replicates` number of repeats of the k-means algorithm. Usually, this setting does not need tuning.
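As a simplified illustration of the first stages described above (the PCA/Random-Forest evaluation stage is omitted), the sketch below filters features by their absolute correlation with performance and then groups the survivors with k-means on a 1 − |correlation| dissimilarity, keeping one representative per cluster; all data, thresholds, and the representative-selection rule are placeholders:

```python
# Illustration (not the toolkit's code) of SIFTED's filtering and clustering stages:
# keep features correlated with performance, group similar features with k-means on a
# 1 - |corr| dissimilarity, and keep one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
F = rng.normal(size=(200, 8))                            # synthetic feature matrix
perf = F[:, 0] + 0.5 * F[:, 1] + rng.normal(size=200)    # synthetic performance measure

rho, k = 0.1, 3                                          # placeholder threshold and cluster count
corr_fp = np.array([abs(np.corrcoef(F[:, j], perf)[0, 1]) for j in range(F.shape[1])])
keep = np.argsort(corr_fp)[::-1]                         # features sorted by |corr|, largest first
keep = keep[corr_fp[keep] >= rho]                        # drop weakly correlated features
if keep.size < 3:                                        # bound to a minimum of 3 features
    keep = np.argsort(corr_fp)[::-1][:3]

D = 1 - np.abs(np.corrcoef(F[:, keep], rowvar=False))    # feature dissimilarity matrix
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(D)
selected = [keep[labels == c][np.argmax(corr_fp[keep][labels == c])] for c in range(k)]
print("selected feature indices:", sorted(int(i) for i in selected))
```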
These settings result in more information being stored in files or presented in the console output.
- `opts.outputs.csv` this flag produces the output CSV files for post-processing and analysis. It is recommended to leave this setting as `TRUE`.
- `opts.outputs.png` this flag produces the output figure files for post-processing and analysis. It is recommended to leave this setting as `TRUE`.
- `opts.outputs.web` this flag produces the output files employed to draw the figures in MATILDA's web tools (click here to open an account). It is recommended to leave this setting as `FALSE`.
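As a hypothetical example, the fragment below shows how these flags might appear and how an options file could be written out as JSON; key names mirror the MATLAB-style fields and may differ in the Python port:

```python
# Hypothetical outputs fragment and a sketch of writing the options file as JSON;
# key names mirror the MATLAB-style fields and may differ in the Python port.
import json

options_fragment = {"outputs": {"csv": True, "png": True, "web": False}}
with open("options.json", "w") as f:
    json.dump(options_fragment, f, indent=2)
```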
If you have any suggestions or ideas (e.g., for new features), or if you encounter any problems while running the code, please use the issue tracker or contact us through MATILDA's Queries and Feedback page.
Partial funding for the development of this code was provided by the Australian Research Council through the Industrial Transformation Training Centre grant IC200100009.
This code was developed as part of the subject SWEN90017-18, by students Junheng Chen, Yusuf Berdan Guzel, Kushagra Khare, Dong Hyeog Jang, Kian Dsouza, Nathan Harvey, Tao Yu, Xin Xiang, Jiaying Yi, and Cheng Ze Lam. The team was mentored by Ben Golding, and the subject was coordinated by Mansooreh Zahedi.