3 changes: 2 additions & 1 deletion docs/concepts/concepts.md
Expand Up @@ -5,12 +5,13 @@ This section explains the core concepts and methodologies used in Octopus to hel
## What You'll Learn

- **Nested Cross-Validation** - Understand the nested CV approach that makes Octopus suitable for small datasets
- **Core Concepts** - Key terms, architecture, and how Octopus works internally
- **Workflow & Modules** - How to chain feature selection and ML modules into multi-step pipelines
- **Understanding Results** - How to interpret and use the predictions and metrics from Octopus

## Quick Navigation

- [Nested Cross-Validation](nested_cv.md) - Learn about the unique CV strategy that prevents overfitting
- [Workflow & Modules](workflow/index.md) - Build pipelines that progressively reduce features and train models
- [Understanding Results](understanding_results.md) - How to read and use Octopus outputs

If you're new to Octopus, we recommend starting with "Nested Cross-Validation" to understand why this tool is different.
12 changes: 12 additions & 0 deletions docs/concepts/workflow/SUMMARY.md
@@ -0,0 +1,12 @@
- [Workflow & Modules](index.md)
- Feature Selection
- [Boruta](boruta.md)
- [EFS](efs.md)
- [MRMR](mrmr.md)
- [RFE](rfe.md)
- [RFE2](rfe2.md)
- [ROC](roc.md)
- [SFS](sfs.md)
- Machine Learning
- [AutoGluon](autogluon.md)
- [Octo](octo.md)
98 changes: 98 additions & 0 deletions docs/concepts/workflow/autogluon.md
@@ -0,0 +1,98 @@
# AutoGluon

*Based on: [AutoGluon](https://github.com/autogluon/autogluon)*

AutoGluon wraps the [AutoGluon TabularPredictor](https://auto.gluon.ai/) to
provide fully automated model selection, hyperparameter tuning, and
stacking/ensembling within an Octopus workflow. Unlike Octo, which exposes
fine-grained control over optimization, AutoGluon aims for a hands-off
experience: you configure a quality preset and a time budget, and AutoGluon
handles the rest.

## How it works

1. **Initialize the TabularPredictor.** A `TabularPredictor` is created with the
target column, evaluation metric (mapped from Octopus metric names to
AutoGluon scorers), and verbosity level.

2. **Fit on training data.** AutoGluon's `fit()` method is called with the
combined feature + target DataFrame. Internally, AutoGluon:
- Performs automatic feature engineering (type inference, missing value
handling, encoding).
- Trains a portfolio of model types (controlled by `included_model_types` or
the full default set).
- Tunes hyperparameters using the strategy defined by the `presets`.
- Builds multi-layer stacking ensembles when using higher-quality presets
(`"good_quality"` and above).
- Uses `num_bag_folds` for bagging/cross-validation within each model.

3. **Evaluate performance.** After training, the module evaluates on train, dev
(out-of-fold), and test partitions. Scores are computed using both
AutoGluon's built-in metrics and Octopus's metric implementations for
cross-comparison.

4. **Feature importance.** Permutation feature importance is computed on the test
set using AutoGluon's `feature_importance()` method with confidence bands
(15 shuffle sets, 95% confidence). If feature groups are defined, group-level
importances are also calculated.

5. **Sklearn-compatible model.** The fitted AutoGluon predictor is wrapped in a
sklearn-compatible class (`SklearnClassifier` or `SklearnRegressor`) so that
downstream Octopus code (e.g., feature importance methods) can use it
seamlessly.

6. **No feature selection.** AutoGluon does not perform feature selection -- it
returns all input features. To select features, place AutoGluon after a
feature-selection module in the workflow.

## Supported model types

When `included_model_types` is not set, AutoGluon considers all available
model families:

| Code | Model |
|------|-------|
| `GBM` | LightGBM |
| `CAT` | CatBoost |
| `XGB` | XGBoost |
| `RF` | Random Forest |
| `XT` | Extra Trees |
| `KNN` | K-Nearest Neighbors |
| `LR` | Linear/Logistic Regression |
| `NN_TORCH` | PyTorch Neural Network |
| `FASTAI` | FastAI Neural Network |

## Key parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `presets` | `["medium_quality"]` | Quality presets: `"best_quality"`, `"high_quality"`, `"good_quality"`, `"medium_quality"` |
| `time_limit` | `None` | Total training time in seconds |
| `num_bag_folds` | `5` | Bagging folds |
| `included_model_types` | `None` | Restrict to specific model types (see table above) |
| `fit_strategy` | `"sequential"` | `"sequential"` or `"parallel"` |
| `verbosity` | `2` | Logging level (0--4) |
| `num_cpus` | `"auto"` | CPUs to allocate |
| `memory_limit` | `"auto"` | Memory limit in GB |

## When to use

AutoGluon is ideal when:

- You want a **fully automated baseline** with minimal configuration effort.
- You want to **compare** Octo's manually configured pipeline against an
  AutoML approach.
- You need access to model types not available in Octo (e.g., neural networks,
  KNN, linear models, LightGBM).
- You are **time-constrained** and a `time_limit` plus a `presets` level is
  all the configuration you need.

## Limitations

- AutoGluon **does not perform feature selection**. All input features are passed
through. Combine it with upstream feature-selection modules if needed.
- Requires the `autogluon` optional dependency (`pip install octopus[autogluon]`).
- Higher-quality presets (`"best_quality"`, `"high_quality"`) use multi-layer
stacking which is memory-intensive and can be slow.
- The module integrates with Ray for resource management, which can conflict with
Octo's own Ray usage if not configured carefully.
72 changes: 72 additions & 0 deletions docs/concepts/workflow/boruta.md
@@ -0,0 +1,72 @@
# Boruta -- Shadow-Feature Statistical Test


Boruta is a statistically principled, "all-relevant" feature selection method.
Unlike most other modules that select a fixed-size subset, Boruta asks a
different question: *which features are genuinely more important than random
noise?* It answers this by creating "shadow" copies of every feature, training a
model on both real and shadow features, and using a statistical test to decide
which real features carry true signal.

## How it works

1. **Hyperparameter optimization.** A `GridSearchCV` tunes the tree-based model
(RandomForest, ExtraTrees, or XGBoost) on the full feature set. Only
tree-based models are supported because Boruta relies on
`feature_importances_` from the trained model.

2. **Shadow feature generation.** For every real feature, a "shadow" copy is
created by randomly permuting its values across samples. This destroys any
relationship with the target while preserving the marginal distribution.

3. **Iterative importance comparison.** Over multiple rounds:
- A model is trained on the combined real + shadow feature set.
- The maximum importance among all shadow features in this round is recorded
(the "shadow max").
- Each real feature's importance is compared to the shadow max.
- A hit counter tracks how often each real feature exceeds the shadow max.

4. **Statistical testing.** After all rounds, a binomial test (with Bonferroni
correction for multiple testing) is applied to each real feature's hit count:
- **Confirmed**: The feature is significantly more important than random
noise at the `alpha` significance level.
- **Tentative**: The evidence is inconclusive.
- **Rejected**: The feature is not significantly better than noise.

Only *Confirmed* features are returned.

5. **Post-selection evaluation.** The selected features are evaluated on dev
(cross-validated) and test sets using both a refit and a grid-search + refit
strategy, matching the pattern used by RFE and SFS.
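
Steps 2--4 can be sketched end to end with scikit-learn and scipy (a simplified stand-in for the BorutaPy-based implementation; the dataset and round count are illustrative):

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Features 0-3 carry signal, 4-9 are pure noise (shuffle=False keeps order).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=0, n_repeated=0, shuffle=False,
                           random_state=0)

rng = np.random.default_rng(0)
n_rounds = 30
hits = np.zeros(X.shape[1], dtype=int)

for r in range(n_rounds):
    # Step 2: shadow copies -- permute each column independently.
    shadows = rng.permuted(X, axis=0)
    # Step 3: train on real + shadow features, compare importances.
    model = RandomForestClassifier(n_estimators=100, random_state=r, n_jobs=-1)
    model.fit(np.hstack([X, shadows]), y)
    real = model.feature_importances_[:X.shape[1]]
    shadow_max = model.feature_importances_[X.shape[1]:].max()
    hits += (real > shadow_max).astype(int)  # hit counter per real feature

# Step 4: one-sided binomial test per feature, Bonferroni-corrected.
alpha = 0.05 / X.shape[1]
confirmed = [i for i, h in enumerate(hits)
             if binomtest(int(h), n_rounds, 0.5,
                          alternative="greater").pvalue < alpha]
```

Features that neither reach significance nor clearly fail it would be marked *Tentative* in the full implementation; this sketch only reports *Confirmed*.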

## Key parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model` | `""` (auto) | Tree-based model only (`RandomForest`, `ExtraTrees`, or `XGB`) |
| `cv` | `5` | Cross-validation folds for hyperparameter tuning |
| `perc` | `100` | Percentile threshold for shadow-feature comparison (100 = max shadow importance) |
| `alpha` | `0.05` | Significance level for the statistical test |

## When to use

Boruta is particularly well-suited when:

- You want to find **all relevant features** rather than a fixed-size subset.
This is valuable for interpretability or when downstream models benefit from
having every informative feature available.
- The dataset has many noise features and you want a principled way to separate
signal from noise.
- You are uncertain about how many features to keep and prefer letting a
statistical test decide.

## Limitations

- Only supports tree-based models (RandomForest, ExtraTrees, XGBoost). CatBoost
is not supported because the BorutaPy implementation requires sklearn-style
`feature_importances_`.
- Runtime grows with the number of features (shadow features double the feature
space) and the number of Boruta iterations.
- The `perc` parameter (percentile of shadow importances) affects sensitivity:
  lowering it below 100 lowers the comparison threshold, making the test *less*
  conservative (more features are confirmed).
- Does not support time-to-event targets.
75 changes: 75 additions & 0 deletions docs/concepts/workflow/efs.md
@@ -0,0 +1,75 @@
# EFS -- Ensemble Feature Selection


EFS takes a fundamentally different approach to feature selection: instead of
evaluating features individually, it trains many models on random feature
subsets, then uses ensemble optimization to find the best *combination* of
models. Features that appear in the winning ensemble are selected. This
diversity-driven approach is especially effective for high-dimensional datasets
where individual feature rankings may be unstable.

## How it works

1. **Generate random feature subsets.** EFS creates `n_subsets` (default 100)
random subsets, each containing `subset_size` (default 30) features drawn
from the full feature set.

2. **Train a model per subset.** For each subset, a `GridSearchCV` tunes and
trains the chosen model (CatBoost, XGBoost, RandomForest, or ExtraTrees).
Cross-validated predictions are collected for every training sample.

3. **Build a model table.** Each trained model is recorded along with its CV
performance, the features it used (excluding those with zero importance), and
its out-of-fold predictions. Models are sorted by performance.

4. **Ensemble scan (hill-climbing).** Starting from the single best model, the
module incrementally adds the next-best model and computes the ensemble
performance (averaged predictions across models). This scan identifies the
number of top models that, when ensembled, give the best combined score.

5. **Ensemble optimization with replacement.** Starting from the models found in
the scan, the optimizer iteratively tests adding each of the top
`max_n_models` models (with replacement) to the ensemble. At each iteration,
the model that improves ensemble performance the most is added. The process
repeats for up to `max_n_iterations` or until no improvement is found.

6. **Feature aggregation.** The final optimized ensemble is a weighted
collection of models (weights = number of times each model appears). The
union of all features used by the ensemble models becomes the selected
feature set. Feature importance is reported as *counts* (how many times a
feature appeared across ensemble models) and *relative counts* (counts /
total models).
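
The scan and optimization in steps 4--5 amount to a greedy ensemble search with replacement (Caruana-style selection). A self-contained sketch on mock out-of-fold predictions (the model count and data are synthetic):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_models, n_samples = 12, 300
y = rng.integers(0, 2, n_samples)
# One row of mock out-of-fold predictions per trained subset model:
# true label plus noise, so models vary in quality.
preds = np.clip(y[None, :] * 0.6
                + rng.normal(0.2, 0.35, (n_models, n_samples)), 0, 1)

def ensemble_score(indices):
    """Ensemble score = metric on the averaged member predictions."""
    return roc_auc_score(y, preds[list(indices)].mean(axis=0))

# Step 4 -- scan: rank models by solo score, find the best top-k ensemble.
order = np.argsort([-ensemble_score([i]) for i in range(n_models)])
k_best = max(range(1, n_models + 1), key=lambda k: ensemble_score(order[:k]))
ensemble = list(order[:k_best])

# Step 5 -- greedy optimization with replacement: repeatedly add whichever
# model (duplicates allowed) improves the averaged score the most.
for _ in range(50):                       # max_n_iterations
    base = ensemble_score(ensemble)
    gains = [ensemble_score(ensemble + [m]) - base for m in range(n_models)]
    if max(gains) <= 0:                   # no improvement -> stop
        break
    ensemble.append(int(np.argmax(gains)))

# Step 6 -- weights = how often each model appears in the final ensemble.
weights = np.bincount(ensemble, minlength=n_models)
```

In the real module each "prediction row" comes from a grid-searched model on a random feature subset, and the selected features are the union of the features used by models with nonzero weight.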

## Key parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model` | `""` (auto) | Model name -- `CatBoost`, `XGB`, `RandomForest`, or `ExtraTrees` |
| `subset_size` | `30` | Number of features per random subset |
| `n_subsets` | `100` | Number of random subsets to create |
| `cv` | `5` | Cross-validation folds |
| `max_n_iterations` | `50` | Iterations for ensemble optimization |
| `max_n_models` | `30` | Maximum models to consider in optimization |

## When to use

EFS is ideal when:

- The dataset is **high-dimensional** (hundreds to thousands of features) and
individual feature rankings are noisy or inconsistent.
- You want a **diversity-driven** selection that captures complementary sets of
features rather than just the top-ranked ones.
- Compute resources are available for training many models in parallel.

## Limitations

- Computationally heavy: `n_subsets` models are trained, each with a grid
search. With 100 subsets and a 4-parameter grid this can mean thousands of
model fits.
- The random subset generation means results are seed-dependent. Different seeds
may produce different feature sets, though the ensemble optimization helps
stabilize this.
- Does not produce scores or predictions in the standard format (scores and
predictions DataFrames are empty); it primarily returns feature counts as
importance measures.
- Does not support time-to-event targets.
116 changes: 116 additions & 0 deletions docs/concepts/workflow/index.md
@@ -0,0 +1,116 @@
# Workflow & Modules

## Overview

Real-world datasets often contain many columns, but only a subset of them actually helps
a machine-learning model make accurate predictions. Finding that subset -- **feature selection** --
is a core goal of Octopus.

A **workflow** is an ordered list of **tasks** that are executed one after another.
Each task wraps a module, and each module either selects features, trains models, or both.
By chaining tasks together you build a pipeline that progressively narrows the feature set:
start with cheap, fast filters to discard obvious noise, then hand the reduced set to more
expensive methods for further refinement.

### Module types

Octopus ships two kinds of modules:

| Type | Purpose | Examples |
|------|---------|----------|
| **Feature Selection** | Reduce the number of features | [ROC](roc.md), [MRMR](mrmr.md), [RFE](rfe.md), [RFE2](rfe2.md), [SFS](sfs.md), [Boruta](boruta.md), [EFS](efs.md) |
| **Machine Learning** | Train models, optimize hyperparameters, and optionally select features | [Octo](octo.md), [AutoGluon](autogluon.md) |

Both types return a list of **selected features** that the next task in the workflow can consume.

### How tasks are connected

Every task has a `task_id` (starting at 0) and an optional `depends_on` parameter pointing to
the `task_id` of a prior task.

- The **first task** (`depends_on=None`) receives all columns listed in `feature_cols`.
- A **dependent task** (`depends_on=N`) receives only the features selected by task *N*,
plus any scores, predictions, and feature-importance tables that task *N* produced.

### Example workflow

A typical three-step pipeline looks like this:

```
Task 0 (Octo) all 30 features
|
v selected_features (e.g. 20)
Task 1 (MRMR) receives 20 features from Task 0
|
v selected_features (e.g. 15)
Task 2 (Octo) receives 15 features from Task 1
```

In Python this translates to:

```python
from octopus import OctoClassification
from octopus.modules import Mrmr, Octo

study = OctoClassification(
...,
workflow=[
Octo(
task_id=0,
depends_on=None,
description="step1_octo_full",
models=["ExtraTreesClassifier"],
n_trials=100,
n_folds_inner=5,
max_features=30,
),
Mrmr(
task_id=1,
depends_on=0,
description="step2_mrmr",
n_features=15,
),
Octo(
task_id=2,
depends_on=1,
description="step3_octo_reduced",
models=["ExtraTreesClassifier"],
n_trials=100,
n_folds_inner=5,
ensemble_selection=True,
),
],
)

study.fit(data=df)
```

!!! tip
Ordering matters: tasks with `depends_on=None` must appear before tasks that reference
them, and `task_id` values must form a contiguous sequence starting at 0.

---

## Feature Selection Modules

The table below lists all feature-selection modules roughly ordered from cheapest to most
expensive:

| Module | Wraps | Description |
|--------|-------|-------------|
| **[ROC](roc.md)** | scipy, networkx (custom) | Removes correlated features using graph-based grouping |
| **[MRMR](mrmr.md)** | Custom implementation | Maximum Relevance Minimum Redundancy filter |
| **[RFE](rfe.md)** | sklearn `RFECV` | Recursive Feature Elimination with cross-validation |
| **[RFE2](rfe2.md)** | Extends Octo (custom) | RFE using Octo's Optuna-based models |
| **[SFS](sfs.md)** | mlxtend / sklearn | Sequential forward/backward selection |
| **[Boruta](boruta.md)** | Custom (based on BorutaPy) | Shadow-feature statistical test |
| **[EFS](efs.md)** | Custom implementation | Ensemble of models on random feature subsets |

---

## Machine Learning Modules

| Module | Description |
|--------|-------------|
| **[Octo](octo.md)** | Core ML module with HPO, ensembling, and feature importance |
| **[AutoGluon](autogluon.md)** | AutoGluon TabularPredictor wrapper |