
Commit ac090ee

DOC: Update README.md (#30)
* DOC: Reorder, rename and refine section headings in README
* DOC: Update What is ORCA-python? section
* DOC: Update Installation section
* DOC: Update Quick Start section
* DOC: Update Configuration Files section
* DOC: Update Running Experiments section
* DOC: Add License section
* DOC: Add badges
1 parent f78ce45 commit ac090ee

File tree

1 file changed: +127, -172 lines


README.md

Lines changed: 127 additions & 172 deletions
@@ -1,224 +1,179 @@
-<!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:1 -->
-
-1. [What is ORCA-python?](#what-is-orca-python)
-2. [Installing ORCA-python](#installing-orca-python)
-	1. [Installation Requirements](#installation-requirements)
-	2. [Download ORCA-Python](#download-orca-python)
-	3. [Algorithms Compilation](#algorithms-compilation)
-	4. [Installation in Python Environnement](#installation-in-python-environnement)
-	5. [Installation Testing](#installation-testing)
-3. [How to use ORCA-python](#how-to-use-orca-python)
-	1. [Configuration Files](#configuration-files)
-		1. [general-conf](#general-conf)
-		2. [configurations](#configurations)
-	2. [Running an Experiment](#running-an-experiment)
-
-<!-- /TOC -->
+# ORCA-python
 
-## What is ORCA-python?
-
-ORCA-python is an experimental framework, completely built on Python (integrated with scikit-learn and sacred modules),
-that seeks to automatize the run of machine learning experiments through simple-to-understand configuration files.
-
-ORCA-python has been initially created to test ordinal classification, but it can handle regular classification algorithms,
-as long as they are implemented in scikit-learn, or self-implemented following compatibility guidelines form scikit-learn.
-
-In this README, we will explain how to use ORCA-python, and what you need to install in order to run it. A Jupyter notebook is also avaible in [spanish](https://github.com/ayrna/orca-python/blob/master/doc/spanish_user_manual.md).
-
-
-# Installing ORCA-python
-
-ORCA-python has been developed and tested in GNU/Linux systems. It has been tested with Python 3.8.
+| Overview | |
+|-----------|------------------------------------------------------------------------------------------------------------------------------------------|
+| **CI/CD** | [![Run Tests](https://github.com/ayrna/orca-python/actions/workflows/pr_pytest.yml/badge.svg?branch=main)](https://github.com/ayrna/orca-python/actions/workflows/pr_pytest.yml) [![!python](https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C%203.10%20%7C%203.11-blue)](https://www.python.org/) |
+| **Code** | [![!black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Linter: Ruff](https://img.shields.io/badge/Linter-Ruff-brightgreen?style=flat-square)](https://github.com/charliermarsh/ruff) [![License - BSD 3-Clause](https://img.shields.io/pypi/l/pandas.svg)](https://github.com/ayrna/orca-python/blob/main/LICENSE) |
 
-## Installation Requirements
 
-Besides the need for the aforementioned Python interpreter, you will need to install the next Python modules
-in order to run an experiment (needs recent versions of scikit-learn >=1.0.0):
+## What is ORCA-python?
 
-- numpy (tested with version 2.2.2)
-- pandas (tested with version 2.2.3)
-- sacred (tested with version 0.8.7)
-- scikit-learn (tested with version 1.6.1)
-- scipy (tested with version 1.15.1)
+**ORCA-python** is an experimental framework built on Python that seamlessly integrates with scikit-learn and sacred modules to automate machine learning experiments through simple JSON configuration files. Initially designed for ordinal classification, it supports regular classification algorithms as long as they are compatible with scikit-learn, making it easy to run reproducible experiments across multiple datasets and classification methods.
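The new description keeps the old requirement that external classifiers follow scikit-learn's compatibility conventions. A minimal sketch of what that means in practice (a hypothetical toy estimator, not part of orca-python): `fit`/`predict` methods plus the `get_params`/`set_params` machinery inherited from `BaseEstimator`, which is what `GridSearchCV` relies on.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_array, check_X_y


class MajorityClassClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)  # scikit-learn style input validation
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes  # trailing underscore marks a learned attribute
        self.majority_ = classes[np.argmax(counts)]
        return self  # fit must return self for pipeline/grid-search use

    def predict(self, X):
        X = check_array(X)
        return np.full(X.shape[0], self.majority_)
```

Because it subclasses `BaseEstimator`, an instance of such a class can be cross-validated by the same `GridSearchCV` machinery the framework applies to the classifiers named in its configuration files.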
 
-To install Python, you can use the package management system you like the most.\
-For the installation of the modules, you may follow this [Python's Official Guide](https://docs.python.org/2/installing/index.html).
+## Table of Contents
 
-All dependencies and build configurations are managed through `pyproject.toml` file. This simplifies the setup process by allowing you to install the framework and its dependencies.
+- [Installation](#installation)
+  - [Requirements](#requirements)
+  - [Setup](#setup)
+  - [Testing Installation](#testing-installation)
+- [Quick Start](#quick-start)
+- [Configuration Files](#configuration-files)
+  - [general-conf](#general-conf)
+  - [configurations](#configurations)
+- [Running Experiments](#running-experiments)
+  - [Basic Usage](#basic-usage)
+  - [Recommended Usage](#recommended-usage)
+  - [Example Output](#example-output)
+- [License](#license)
 
-## Download ORCA-Python
+## Installation
 
-To download ORCA-python you can simply clone this GitHub repository by using the following commands:
+### Requirements
 
-`$ git clone https://github.com/ayrna/orca-python`
+ORCA-python requires Python 3.8 or higher and is tested on Python 3.8, 3.9, 3.10, and 3.11.
 
-All the contents of the repository can also be downloaded from the GitHub site by using the "Download ZIP" button.
+All dependencies are managed through `pyproject.toml` and include:
+- numpy (>=1.24.4)
+- pandas (>=2.0.3)
+- sacred (>=0.8.7)
+- scikit-learn (>=1.3.2)
+- scipy (>=1.10.1)
 
-## Installation in Python Environnement
+### Setup
 
-Inside the ORCA-python root, execute the following command to install the framework along with its dependencies: `pip install .`
+1. **Clone the repository**:
+   ```bash
+   git clone https://github.com/ayrna/orca-python
+   cd orca-python
+   ```
 
-All dependencies and build configurations are managed through the `pyproject.toml` file, simplifying the installation process. FOr development or testing purposes, you can use the `--editable` option to allow modifications without reinstalling: `pip install --editable .`
+2. **Install the framework**:
+   ```bash
+   pip install .
+   ```
 
-Additionally. optional dependencies for development (e.g., black) can be installed using the corresponding groups defined in the `pyproject.toml` file. For example: `pip install -e .[dev]`
+For development purposes, use editable installation:
+```bash
+pip install -e .
+```
 
-Note: The editable mode is required for running tests due to automatic dependency resolution.
+Optional dependencies for development:
+```bash
+pip install -e .[dev]
+```
 
-## Installation Testing
+> **Note:** The editable mode is required for running tests due to automatic dependency resolution.
 
-We provide a pre-made experiment (dataset and configuration file) to test if everything has been correctly installed.\
-The way to run this test (and all experiments) is the following:
+### Testing Installation
 
-```
-# Go to framework main folder
-$ python config.py with orca_python/configurations/full_functionality_test.json -l ERROR
-```
+Test your installation with the provided example:
 
+```bash
+python config.py with orca_python/configurations/full_functionality_test.json -l ERROR
+```
 
-# How to use ORCA-python
+## Quick Start
+
+ORCA-python includes sample datasets with pre-partitioned train/test splits using a 30-holdout experimental design.
+
+**Basic experiment configuration:**
+
+```json
+{
+    "general_conf": {
+        "basedir": "orca_python/datasets/data",
+        "datasets": ["balance-scale", "contact-lenses", "tae"],
+        "hyperparam_cv_nfolds": 3,
+        "output_folder": "results/",
+        "metrics": ["ccr", "mae", "amae"],
+        "cv_metric": "mae"
+    },
+    "configurations": {
+        "SVM": {
+            "classifier": "sklearn.svm.SVC",
+            "parameters": {
+                "C": [0.001, 0.1, 1, 10, 100],
+                "gamma": [0.1, 1, 10]
+            }
+        },
+        "SVMOP": {
+            "classifier": "orca_python.classifiers.OrdinalDecomposition",
+            "parameters": {
+                "dtype": "ordered_partitions",
+                "decision_method": "frank_hall",
+                "base_classifier": "sklearn.svm.SVC",
+                "parameters": {
+                    "C": [0.01, 0.1, 1, 10],
+                    "gamma": [0.01, 0.1, 1, 10],
+                    "probability": ["True"]
+                }
+            }
+        }
+    }
+}
+```
 
+**Run the experiment:**
+```bash
+python config.py with my_experiment.json -l ERROR
+```
 
-This tutorial uses three small datasets (balance-scale, contact-lenses and tae) contained in "datasets" folder.
-The datasets are already partitioned with a 30-holdout experimental design (train and test pairs for each partition).
+Results are saved in the `results/` folder with performance metrics for each dataset-classifier combination. The framework automatically performs cross-validation, hyperparameter tuning, and evaluation on test sets.
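The `metrics` entry above lists ordinal-classification measures by name only. As a rough guide, their usual textbook definitions look like this (a sketch of the standard formulas; orca-python's own implementations may differ in detail):

```python
def ccr(y_true, y_pred):
    """Correct Classification Rate: fraction of exactly matched labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error between integer-coded ordinal labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def amae(y_true, y_pred):
    """Average MAE: MAE per true class, then averaged (robust to imbalance)."""
    per_class = []
    for c in sorted(set(y_true)):
        errors = [abs(t - p) for t, p in zip(y_true, y_pred) if t == c]
        per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)
```

For example, with `y_true = [1, 2, 3, 3]` and `y_pred = [1, 2, 2, 3]`, `ccr` gives 0.75 and `mae` gives 0.25.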
 
 ## Configuration Files
 
-All experiments are run through configuration files, which are written in JSON format, and consist of two well differentiated
-sections:
-
-- **`general-conf`**: indicates basic information to run the experiment, such as the location to datasets, the names of the different datasets to run, etc.
-- **`configurations`**: tells the framework what classification algorithms to apply over all the datasets, with the collection of hyper-parameters to tune.
-
-Each one of this sections will be inside a dictionary, having the said section names as keys.
-
-For a better understanding of the way this files works, it's better to follow an example, that can be found in: [configurations/full_functionality_test.json](https://github.com/ayrna/orca-python/blob/master/configurations/full_functionality_test.json).
+Experiments are defined using JSON configuration files with two main sections: `general_conf` for experiment settings and `configurations` for classifier definitions.
 
 ### general-conf
 
-```
-"general_conf": {
-
-	"basedir": "ordinal-datasets/ordinal-regression/",
-	"datasets": ["tae", "balance-scale", "contact-lenses"],
-	"hyperparam_cv_folds": 3,
-	"jobs": 10,
-	"input_preprocessing": "std",
-	"output_folder": "my_runs/",
-	"metrics": ["ccr", "mae", "amae", "mze"],
-	"cv_metric": "mae"
-}
-```
-*note that all the keys (variable names) must be strings, while all pair: value elements are separated by commas.*
+Controls global experiment parameters.
 
+**Required parameters:**
 - **`basedir`**: folder containing all dataset subfolders, it doesn't allow more than one folder at a time. It can be indicated using a full path, or a relative one to the framework folder.
 - **`datasets`**: name of datasets that will be experimented with. A subfolder with the same name must exist inside `basedir`.
+
+**Optional parameters:**
 - **`hyperparam_cv_folds`**: number of folds used while cross-validating.
 - **`jobs`**: number of jobs used for GridSearchCV during cross-validation.
-- **`input_preprocessing`**: type of preprocessing to apply to the data, **`std`** for standardization and **`norm`** for normalization. Assigning an empty srtring will omit the preprocessing process.
+- **`input_preprocessing`**: data preprocessing (`"std"` for standardization, `"norm"` for normalization, `""` for none)
 - **`output_folder`**: name of the folder where all experiment results will be stored.
 - **`metrics`**: name of the accuracy metrics to measure the train and test performance of the classifier.
 - **`cv_metric`**: error measure used for GridSearchCV to find the best set of hyper-parameters.
 
-Most of this variables do have default values (specified in [config.py](https://github.com/ayrna/orca-python/blob/master/config.py)), but "basedir" and "datasets" must always be written for the experiment to be run. Take into account, that all variable names in "general-conf" cannot be modified, otherwise the experiment will fail.
-
-
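Since an experiment file is plain JSON with `general_conf` and `configurations` at the top level, a hand-written file can be sanity-checked before launching a run. The snippet below parses an illustrative configuration (values adapted from the example block shown above) and confirms that the two required `general_conf` keys are present:

```python
import json

# Illustrative experiment file; key names follow the parameter list above.
CONFIG_TEXT = """
{
    "general_conf": {
        "basedir": "ordinal-datasets/ordinal-regression/",
        "datasets": ["tae", "balance-scale", "contact-lenses"],
        "hyperparam_cv_folds": 3,
        "jobs": 10,
        "input_preprocessing": "std",
        "output_folder": "my_runs/",
        "metrics": ["ccr", "mae", "amae", "mze"],
        "cv_metric": "mae"
    },
    "configurations": {
        "SVM": {
            "classifier": "sklearn.svm.SVC",
            "parameters": {"C": [0.001, 0.1, 1, 10, 100], "gamma": [0.1, 1, 10]}
        }
    }
}
"""

config = json.loads(CONFIG_TEXT)  # json.JSONDecodeError here means a typo

# "basedir" and "datasets" have no defaults and must always be present.
assert {"basedir", "datasets"} <= config["general_conf"].keys()
```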
 ### configurations
 
-this dictionary will contain, at the same time, one dictionary for each configuration to try over the datasets during the experiment. This is, a classifier with some specific hyper-parameters to tune. (Keep in mind, that if two or more configurations share the same name, the later ones will be omitted)
+Defines classifiers and their hyperparameters for GridSearchCV. Each configuration has a name and consists of:
 
-```
-"configurations": {
-	"SVM": {
-
-		"classifier": "sklearn.svm.SVC",
-		"parameters": {
-			"C": [0.001, 0.1, 1, 10, 100],
-			"gamma": [0.1, 1, 10]
-		}
-	},
-	"SVMOP": {
-
-		"classifier": "orca_python.classifiers.OrdinalDecomposition",
-		"parameters": {
-			"dtype": "ordered_partitions",
-			"decision_method": "frank_hall",
-			"base_classifier": "sklearn.svm.SVC",
-			"parameters": {
-				"C": [0.01, 0.1, 1, 10],
-				"gamma": [0.01, 0.1, 1, 10],
-				"probability": ["True"]
-			}
-
-		}
-	},
-	"LR": {
-
-		"classifier": "orca_python.classifiers.OrdinalDecomposition",
-		"parameters": {
-			"dtype": ["ordered_partitions", "one_vs_next"],
-			"decision_method": "exponential_loss",
-			"base_classifier": "sklearn.linear_model.LogisticRegression",
-			"parameters": {
-				"solver": ["liblinear"],
-				"C": [0.01, 0.1, 1, 10],
-				"penalty": ["l1","l2"]
-			}
-
-		}
-	},
-	"REDSVM": {
-
-		"classifier": "orca_python.classifiers.REDSVM",
-		"parameters": {
-			"t": 2,
-			"c": [0.1, 1, 10],
-			"g": [0.1, 1, 10],
-			"r": 0,
-			"m": 100,
-			"e": 0.001,
-			"h": 1
-		}
-
-	},
-	"SVOREX": {
-
-		"classifier": "orca_python.classifiers.SVOREX",
-		"parameters": {
-			"kernel_type": 0,
-			"c": [0.1, 1, 10],
-			"k": [0.1, 1, 10],
-			"t": 0.001
-		}
-
-	}
-}
-```
+- **`classifier`**: scikit-learn path or built-in ORCA-python classifier
+- **`parameters`**: hyperparameters for grid search (nested for ensemble methods)
 
-Each configuration has a name (whatever you want), and consists of:
+## Running Experiments
 
-- **`classifier`**: tells the framework which classifier to use. Can be specified in two different ways:
-	- A relative path to the classifier in sklearn module.
-	- The name of a built-in class in Classifiers folder (found in the main folder of the project).
-- **`parameters`**: hyper-parameters to tune, having each one of them a list of values to cross-validate (not really necessary, can be just one value).
-
-*In ensemble methods, as `OrdinalDecomposition`, you must nest another classifier (the base classifier, which doesn't have a configuration name), with it's respective parameters to tune.*
+### Basic Usage
 
+```bash
+python config.py with experiment_file.json
+```
 
+### Recommended Usage
 
-## Running an Experiment
+For reproducible results with minimal output:
 
-As viewed in [Installation Testing](#installation-testing), running an experiment is as simple as executing config.py
-with the python interpreter, and tell what configuration file to use for this experiment, resulting in the next command:
+```bash
+python config.py with experiment_file.json seed=12345 -l ERROR
+```
 
-`$ python config.py with experiment_file.json`
+**Parameters:**
+- `seed`: fixed random seed for reproducibility
+- `-l ERROR`: reduces Sacred framework verbosity
 
-Running an experiment this way has two problems though, one of them being an excessive verbosity from Sacred,
-while the other consists of the non-reproducibility of the results of the experiment, due to the lack of a fixed seed.
+### Example Output
 
-Both problems can be easily fixed. The seed can be specified after "with" in the command:
+Results are stored in the specified output folder with detailed performance metrics and hyperparameter information for each dataset and configuration combination.
 
-`$ python config.py with experiment_file.json seed=12345`
+## License
+[BSD 3](LICENSE)
 
-while we can silence Sacred just by adding "-l ERROR" at the end of the line (not necessarily at the end).
+<hr>
 
-`$ python config.py with experiment_file.json seed=12345 -l ERROR`
+[Go to Top](#table-of-contents)

0 commit comments
