299 changes: 127 additions & 172 deletions README.md
@@ -1,224 +1,179 @@
<!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:1 -->

1. [What is ORCA-python?](#what-is-orca-python)
2. [Installing ORCA-python](#installing-orca-python)
1. [Installation Requirements](#installation-requirements)
2. [Download ORCA-Python](#download-orca-python)
3. [Algorithms Compilation](#algorithms-compilation)
4. [Installation in Python Environnement](#installation-in-python-environnement)
5. [Installation Testing](#installation-testing)
3. [How to use ORCA-python](#how-to-use-orca-python)
1. [Configuration Files](#configuration-files)
1. [general-conf](#general-conf)
2. [configurations](#configurations)
2. [Running an Experiment](#running-an-experiment)

<!-- /TOC -->
# ORCA-python

## What is ORCA-python?

ORCA-python is an experimental framework, built entirely in Python (integrated with the scikit-learn and sacred modules),
that seeks to automate the running of machine learning experiments through simple-to-understand configuration files.

ORCA-python was initially created to test ordinal classification, but it can handle regular classification algorithms,
as long as they are implemented in scikit-learn, or self-implemented following the compatibility guidelines from scikit-learn.

In this README, we will explain how to use ORCA-python, and what you need to install in order to run it. A Jupyter notebook is also available in [Spanish](https://github.com/ayrna/orca-python/blob/master/doc/spanish_user_manual.md).


# Installing ORCA-python

ORCA-python has been developed and tested in GNU/Linux systems. It has been tested with Python 3.8.
| Overview | |
|-----------|------------------------------------------------------------------------------------------------------------------------------------------|
| **CI/CD** | [![Run Tests](https://github.com/ayrna/orca-python/actions/workflows/pr_pytest.yml/badge.svg?branch=main)](https://github.com/ayrna/orca-python/actions/workflows/pr_pytest.yml) [![!python](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10%20%7C%203.11-blue)](https://www.python.org/) |
| **Code** | [![!black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Linter: Ruff](https://img.shields.io/badge/Linter-Ruff-brightgreen?style=flat-square)](https://github.com/charliermarsh/ruff) [![License - BSD 3-Clause](https://img.shields.io/pypi/l/pandas.svg)](https://github.com/ayrna/orca-python/blob/main/LICENSE) |

## Installation Requirements

Besides the aforementioned Python interpreter, you will need to install the following Python modules
in order to run an experiment (a recent version of scikit-learn, >=1.0.0, is required):
## What is ORCA-python?

- numpy (tested with version 2.2.2)
- pandas (tested with version 2.2.3)
- sacred (tested with version 0.8.7)
- scikit-learn (tested with version 1.6.1)
- scipy (tested with version 1.15.1)
**ORCA-python** is an experimental framework built on Python that seamlessly integrates with scikit-learn and sacred modules to automate machine learning experiments through simple JSON configuration files. Initially designed for ordinal classification, it supports regular classification algorithms as long as they are compatible with scikit-learn, making it easy to run reproducible experiments across multiple datasets and classification methods.

To install Python, you can use the package management system you like the most.\
For the installation of the modules, you may follow this [Python's Official Guide](https://docs.python.org/2/installing/index.html).
## Table of Contents

All dependencies and build configurations are managed through the `pyproject.toml` file. This simplifies the setup process by allowing you to install the framework and its dependencies with a single command.
- [Installation](#installation)
- [Requirements](#requirements)
- [Setup](#setup)
- [Testing Installation](#testing-installation)
- [Quick Start](#quick-start)
- [Configuration Files](#configuration-files)
- [general-conf](#general-conf)
- [configurations](#configurations)
- [Running Experiments](#running-experiments)
- [Basic Usage](#basic-usage)
- [Recommended Usage](#recommended-usage)
- [Example Output](#example-output)
- [License](#license)

## Download ORCA-Python
## Installation

To download ORCA-python you can simply clone this GitHub repository by using the following commands:
### Requirements

`$ git clone https://github.com/ayrna/orca-python`
ORCA-python requires Python 3.8 or higher and is tested on Python 3.8, 3.9, 3.10, and 3.11.

All the contents of the repository can also be downloaded from the GitHub site by using the "Download ZIP" button.
All dependencies are managed through `pyproject.toml` and include:
- numpy (>=1.24.4)
- pandas (>=2.0.3)
- sacred (>=0.8.7)
- scikit-learn (>=1.3.2)
- scipy (>=1.10.1)

## Installation in Python Environnement
### Setup

Inside the ORCA-python root, execute the following command to install the framework along with its dependencies: `pip install .`
1. **Clone the repository**:
```bash
git clone https://github.com/ayrna/orca-python
cd orca-python
```

All dependencies and build configurations are managed through the `pyproject.toml` file, simplifying the installation process. For development or testing purposes, you can use the `--editable` option to allow modifications without reinstalling: `pip install --editable .`
2. **Install the framework**:
```bash
pip install .
```

Additionally, optional dependencies for development (e.g., black) can be installed using the corresponding groups defined in the `pyproject.toml` file. For example: `pip install -e .[dev]`
For development purposes, use editable installation:
```bash
pip install -e .
```

Note: The editable mode is required for running tests due to automatic dependency resolution.
Optional dependencies for development:
```bash
pip install -e .[dev]
```

## Installation Testing
> **Note:** The editable mode is required for running tests due to automatic dependency resolution.

We provide a pre-made experiment (dataset and configuration file) to test if everything has been correctly installed.\
The way to run this test (and all experiments) is the following:
### Testing Installation

```
# Go to framework main folder
$ python config.py with orca_python/configurations/full_functionality_test.json -l ERROR
```
Test your installation with the provided example:

```bash
python config.py with orca_python/configurations/full_functionality_test.json -l ERROR
```

# How to use ORCA-python
## Quick Start

ORCA-python includes sample datasets with pre-partitioned train/test splits using a 30-holdout experimental design.

**Basic experiment configuration:**

```json
{
"general_conf": {
"basedir": "orca_python/datasets/data",
"datasets": ["balance-scale", "contact-lenses", "tae"],
"hyperparam_cv_nfolds": 3,
"output_folder": "results/",
"metrics": ["ccr", "mae", "amae"],
"cv_metric": "mae"
},
"configurations": {
"SVM": {
"classifier": "sklearn.svm.SVC",
"parameters": {
"C": [0.001, 0.1, 1, 10, 100],
"gamma": [0.1, 1, 10]
}
},
"SVMOP": {
"classifier": "orca_python.classifiers.OrdinalDecomposition",
"parameters": {
"dtype": "ordered_partitions",
"decision_method": "frank_hall",
"base_classifier": "sklearn.svm.SVC",
"parameters": {
"C": [0.01, 0.1, 1, 10],
"gamma": [0.01, 0.1, 1, 10],
"probability": ["True"]
}
}
}
}
}
```

**Run the experiment:**
```bash
python config.py with my_experiment.json -l ERROR
```

This tutorial uses three small datasets (balance-scale, contact-lenses and tae) contained in the "datasets" folder.
The datasets are already partitioned with a 30-holdout experimental design (train and test pairs for each partition).
Results are saved in the `results/` folder with performance metrics for each dataset-classifier combination. The framework automatically performs cross-validation, hyperparameter tuning, and evaluation on test sets.
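To make the hyperparameter tuning step more concrete, here is a minimal sketch of how a `parameters` block can be expanded into the individual combinations that a grid search would evaluate. It uses only the standard library; `expand_grid` is a hypothetical helper written for this illustration, not a function of ORCA-python, which delegates the actual search to scikit-learn's GridSearchCV according to this README.

```python
from itertools import product


def expand_grid(parameters):
    """Return every combination of the values listed in `parameters`.

    Hypothetical helper for illustration only. A value that is not a list
    is treated as a single fixed value. Nested `parameters` dictionaries
    (as used with OrdinalDecomposition) are not expanded here.
    """
    names = list(parameters)
    value_lists = [v if isinstance(v, list) else [v] for v in parameters.values()]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# The "SVM" entry from the example configuration.
svm_parameters = {"C": [0.001, 0.1, 1, 10, 100], "gamma": [0.1, 1, 10]}
grid = expand_grid(svm_parameters)

print(len(grid))  # 5 values of C times 3 values of gamma
print(grid[0])
```

Each combination is then scored by cross-validation, so the number of model fits grows with the product of the list lengths and the number of folds.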

## Configuration Files

All experiments are run through configuration files, which are written in JSON format, and consist of two well differentiated
sections:

- **`general-conf`**: indicates basic information to run the experiment, such as the location to datasets, the names of the different datasets to run, etc.
- **`configurations`**: tells the framework what classification algorithms to apply over all the datasets, with the collection of hyper-parameters to tune.

Each of these sections is a dictionary, with the section name as its key.

To better understand how these files work, it helps to follow an example, which can be found in: [configurations/full_functionality_test.json](https://github.com/ayrna/orca-python/blob/master/configurations/full_functionality_test.json).
Experiments are defined using JSON configuration files with two main sections: `general_conf` for experiment settings and `configurations` for classifier definitions.
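Before launching a long experiment, it can help to check a configuration for the basics. The following is a minimal sketch, assuming the two section names shown in this README's examples and the two `general_conf` keys it documents as required (`basedir` and `datasets`). `check_experiment_config` is a hypothetical helper and not part of ORCA-python.

```python
import json
from pathlib import Path

REQUIRED_SECTIONS = ("general_conf", "configurations")
REQUIRED_KEYS = ("basedir", "datasets")


def check_experiment_config(config):
    """Return a list of human-readable problems found in `config`.

    Hypothetical helper for illustration only. An empty list means the
    basic structure looks right; it does not validate classifier names
    or hyperparameter values.
    """
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")

    general = config.get("general_conf", {})
    for key in REQUIRED_KEYS:
        if key not in general:
            problems.append(f"missing general_conf key: {key}")

    # Each dataset name is expected to be a subfolder of basedir.
    if "basedir" in general:
        basedir = Path(general["basedir"])
        for name in general.get("datasets", []):
            if not (basedir / name).is_dir():
                problems.append(f"dataset folder not found: {basedir / name}")
    return problems


def check_experiment_file(path):
    """Load a JSON configuration file and check it."""
    with open(path, encoding="utf-8") as handle:
        return check_experiment_config(json.load(handle))
```

Calling `check_experiment_file("my_experiment.json")` (the file name is only an example) returns an empty list when both sections, both required keys, and all dataset subfolders are present.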

### general-conf

```
"general_conf": {

"basedir": "ordinal-datasets/ordinal-regression/",
"datasets": ["tae", "balance-scale", "contact-lenses"],
"hyperparam_cv_folds": 3,
"jobs": 10,
"input_preprocessing": "std",
"output_folder": "my_runs/",
"metrics": ["ccr", "mae", "amae", "mze"],
"cv_metric": "mae"
}
```
*Note that all keys (variable names) must be strings, and all key: value pairs must be separated by commas.*
Controls global experiment parameters.

**Required parameters:**
- **`basedir`**: folder containing all dataset subfolders; only one folder can be given. It can be specified as a full path or as a path relative to the framework folder.
- **`datasets`**: name of datasets that will be experimented with. A subfolder with the same name must exist inside `basedir`.

**Optional parameters:**
- **`hyperparam_cv_folds`**: number of folds used while cross-validating.
- **`jobs`**: number of jobs used for GridSearchCV during cross-validation.
- **`input_preprocessing`**: type of preprocessing to apply to the data, **`std`** for standardization and **`norm`** for normalization. Assigning an empty string will skip preprocessing.
- **`input_preprocessing`**: data preprocessing (`"std"` for standardization, `"norm"` for normalization, `""` for none)
- **`output_folder`**: name of the folder where all experiment results will be stored.
- **`metrics`**: name of the accuracy metrics to measure the train and test performance of the classifier.
- **`cv_metric`**: error measure used for GridSearchCV to find the best set of hyper-parameters.

Most of these variables have default values (specified in [config.py](https://github.com/ayrna/orca-python/blob/master/config.py)), but "basedir" and "datasets" must always be provided for the experiment to run. Keep in mind that the variable names in "general-conf" cannot be modified; otherwise the experiment will fail.
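The metric names used in the examples (`ccr`, `mae`, `amae`, `mze`) are not defined in this README. The sketch below assumes the definitions commonly used in ordinal classification: correct classification rate, mean absolute error between integer class labels, MAE averaged per class, and mean zero-one error. These are illustrative implementations under that assumption, not ORCA-python's own code.

```python
from collections import defaultdict


def ccr(y_true, y_pred):
    """Correct classification rate: fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mze(y_true, y_pred):
    """Mean zero-one error: fraction of misclassified samples."""
    return 1.0 - ccr(y_true, y_pred)


def mae(y_true, y_pred):
    """Mean absolute error between integer-coded ordinal labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def amae(y_true, y_pred):
    """Average MAE: MAE computed per true class, then averaged over classes."""
    errors_by_class = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors_by_class[t].append(abs(t - p))
    per_class = [sum(e) / len(e) for e in errors_by_class.values()]
    return sum(per_class) / len(per_class)


# Small made-up example with three ordered classes coded 1, 2, 3.
y_true = [1, 1, 1, 2, 3]
y_pred = [1, 1, 2, 2, 1]
print(ccr(y_true, y_pred), mae(y_true, y_pred), amae(y_true, y_pred))
```

Under these definitions `amae` weights every class equally, so it penalizes errors on rare classes more heavily than `mae` does, which is why the two can rank configurations differently on imbalanced datasets.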


### configurations

This dictionary contains one nested dictionary for each configuration to try over the datasets during the experiment, that is, a classifier with specific hyper-parameters to tune. (Keep in mind that if two or more configurations share the same name, the later ones will be omitted.)
Defines classifiers and their hyperparameters for GridSearchCV. Each configuration has a name and consists of:

```
"configurations": {
"SVM": {

"classifier": "sklearn.svm.SVC",
"parameters": {
"C": [0.001, 0.1, 1, 10, 100],
"gamma": [0.1, 1, 10]
}
},
"SVMOP": {

"classifier": "orca_python.classifiers.OrdinalDecomposition",
"parameters": {
"dtype": "ordered_partitions",
"decision_method": "frank_hall",
"base_classifier": "sklearn.svm.SVC",
"parameters": {
"C": [0.01, 0.1, 1, 10],
"gamma": [0.01, 0.1, 1, 10],
"probability": ["True"]
}

}
},
"LR": {

"classifier": "orca_python.classifiers.OrdinalDecomposition",
"parameters": {
"dtype": ["ordered_partitions", "one_vs_next"],
"decision_method": "exponential_loss",
"base_classifier": "sklearn.linear_model.LogisticRegression",
"parameters": {
"solver": ["liblinear"],
"C": [0.01, 0.1, 1, 10],
"penalty": ["l1","l2"]
}

}
},
"REDSVM": {

"classifier": "orca_python.classifiers.REDSVM",
"parameters": {
"t": 2,
"c": [0.1, 1, 10],
"g": [0.1, 1, 10],
"r": 0,
"m": 100,
"e": 0.001,
"h": 1
}

},
"SVOREX": {

"classifier": "orca_python.classifiers.SVOREX",
"parameters": {
"kernel_type": 0,
"c": [0.1, 1, 10],
"k": [0.1, 1, 10],
"t": 0.001
}

}
}
```
- **`classifier`**: scikit-learn path or built-in ORCA-python classifier
- **`parameters`**: hyperparameters for grid search (nested for ensemble methods)
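As an illustration of what a dotted `classifier` value such as `sklearn.svm.SVC` refers to, the sketch below resolves a dotted path to a Python class with the standard `importlib` module. `load_class` is a hypothetical helper, and this is not necessarily how ORCA-python resolves classifier names internally.

```python
import importlib


def load_class(dotted_path):
    """Import and return the object named by a dotted path.

    Hypothetical helper for illustration only. The part before the last
    dot is treated as the module, the part after it as the class name.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the example runs anywhere.
# With scikit-learn installed, load_class("sklearn.svm.SVC") would return
# the SVC class in the same way.
ordered_dict_class = load_class("collections.OrderedDict")
print(ordered_dict_class.__name__)
```

A misspelled module raises `ImportError` and a misspelled class name raises `AttributeError`, so a check like this can surface configuration typos before any training starts.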

Each configuration has a name (whatever you want), and consists of:
## Running Experiments

- **`classifier`**: tells the framework which classifier to use. Can be specified in two different ways:
- A relative path to the classifier in sklearn module.
- The name of a built-in class in Classifiers folder (found in the main folder of the project).
- **`parameters`**: hyper-parameters to tune, each with a list of values to cross-validate (a list is not strictly necessary; a single value is also accepted).

*In ensemble methods, such as `OrdinalDecomposition`, you must nest another classifier (the base classifier, which doesn't have a configuration name), with its respective parameters to tune.*
### Basic Usage

```bash
python config.py with experiment_file.json
```

### Recommended Usage

## Running an Experiment
For reproducible results with minimal output:

As shown in [Installation Testing](#installation-testing), running an experiment is as simple as executing config.py
with the Python interpreter and telling it which configuration file to use, resulting in the following command:
```bash
python config.py with experiment_file.json seed=12345 -l ERROR
```

`$ python config.py with experiment_file.json`
**Parameters:**
- `seed`: fixed random seed for reproducibility
- `-l ERROR`: reduces Sacred framework verbosity

Running an experiment this way has two problems though, one of them being an excessive verbosity from Sacred,
while the other consists of the non-reproducibility of the results of the experiment, due to the lack of a fixed seed.
### Example Output

Both problems can be easily fixed. The seed can be specified after "with" in the command:
Results are stored in the specified output folder with detailed performance metrics and hyperparameter information for each dataset and configuration combination.

`$ python config.py with experiment_file.json seed=12345`
## License
[BSD 3](LICENSE)

while we can silence Sacred just by adding "-l ERROR" to the command (it does not have to be at the end of the line).
<hr>

`$ python config.py with experiment_file.json seed=12345 -l ERROR`
[Go to Top](#table-of-contents)