80 changes: 80 additions & 0 deletions src/jabs/resources/docs/user_guide/classifier-types.md
@@ -0,0 +1,80 @@
# Choosing a Classifier Type

JABS supports three machine learning classifier types: **Random Forest**, **CatBoost**, and **XGBoost**. Each has different characteristics that may make it more suitable for your specific use case.

## Random Forest (Default)

Random Forest is the default classifier and a good starting point for most users.

**Pros:**

- ✅ **Fast training** - Trains quickly, even on large datasets
- ✅ **Well-established** - Mature algorithm with extensive validation
- ✅ **Good baseline performance** - Reliable results across many behavior types
- ✅ **Low memory footprint** - Efficient with system resources

**Cons:**

- ⚠️ **May plateau** - Sometimes reaches a performance ceiling compared to gradient boosting methods
- ⚠️ **Less flexible** - Fewer tuning options than boosting methods
- ⚠️ **Does not handle missing data** - Random Forest does not natively handle missing (NaN) values, so JABS currently replaces all NaNs with 0 during training and classification (see the sketch below). This may be a poor fit if your data has many missing values.

**Best for:** Quick iterations, initial exploration, behaviors with simpler decision boundaries, or when training time is a priority.
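
To make the zero-imputation concrete, here is a minimal sketch assuming scikit-learn's `RandomForestClassifier` and a NumPy feature matrix; the feature values are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-frame feature matrix with gaps (e.g., frames where a
# keypoint was not detected). Names and values are illustrative only.
features = np.array([
    [0.52, np.nan, 1.20],
    [0.71, 0.33, np.nan],
    [np.nan, 0.90, 0.41],
    [0.48, 0.27, 1.05],
])
labels = np.array([0, 1, 0, 1])  # 1 = behavior present, 0 = absent

# Replace NaNs with 0 before training, mirroring the imputation described
# above; this is a sketch of the idea, not JABS's actual code.
features = np.nan_to_num(features, nan=0.0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
```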

## CatBoost

CatBoost is a gradient boosting algorithm that can achieve excellent performance, particularly for complex behaviors.

**Pros:**

- ✅ **High accuracy** - Often achieves the best classification performance
- ✅ **Handles missing data natively** - No imputation needed for NaN values (illustrated below)
- ✅ **Robust to overfitting** - Built-in regularization techniques
- ✅ **No external dependencies** - Installs cleanly without additional libraries

**Cons:**

- ⚠️ **Slower training** - Takes significantly longer to train than Random Forest
- ⚠️ **Higher memory usage** - May require more RAM during training
- ⚠️ **Slower classification** - Prediction can be slower on very large datasets

**Best for:** Final production classifiers where accuracy is paramount, complex behaviors with subtle patterns, or when you have time for longer training sessions.
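
As a quick illustration of the native missing-data handling, a minimal sketch assuming the `catboost` package (same invented feature matrix as above, with the NaNs left in place):

```python
import numpy as np
from catboost import CatBoostClassifier

# NaNs are left in the feature matrix: CatBoost treats them as missing
# values natively, so no imputation step is needed.
features = np.array([
    [0.52, np.nan, 1.20],
    [0.71, 0.33, np.nan],
    [np.nan, 0.90, 0.41],
    [0.48, 0.27, 1.05],
])
labels = np.array([0, 1, 0, 1])

clf = CatBoostClassifier(iterations=200, verbose=False)
clf.fit(features, labels)
```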

## XGBoost

XGBoost is another gradient boosting algorithm, known for its strong track record in machine learning competitions.

**Pros:**

- ✅ **Excellent performance** - Typically matches or exceeds Random Forest accuracy
- ✅ **Handles missing data natively** - Like CatBoost, works with NaN values
- ✅ **Faster than CatBoost** - Typically trains more quickly than CatBoost
- ✅ **Widely used** - Extensive community support and documentation

**Cons:**

- ⚠️ **Dependency on libomp** - On macOS, may require a separate installation of the OpenMP runtime (libomp)
- ⚠️ **Slower than Random Forest** - Training takes longer than Random Forest
- ⚠️ **May be unavailable** - If libomp is not installed, XGBoost won't be offered as a choice in JABS (see the availability sketch below)

**Best for:** When you need better accuracy than Random Forest but faster training than CatBoost, or when you're familiar with gradient boosting methods.
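
Because the import itself can fail when libomp is missing, an application has to probe for XGBoost before offering it. A sketch of that general approach (an assumption about how such a check might look, not JABS's actual detection code):

```python
import numpy as np

# Guard the import: on macOS, importing xgboost can raise a library load
# error (not just ImportError) when the libomp runtime is absent.
try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except Exception:
    XGBOOST_AVAILABLE = False

if XGBOOST_AVAILABLE:
    # XGBoost treats NaN as "missing" by default, so like CatBoost it
    # needs no imputation step.
    features = np.array([
        [0.52, np.nan, 1.20],
        [0.71, 0.33, np.nan],
    ])
    labels = np.array([0, 1])
    clf = XGBClassifier(n_estimators=200)
    clf.fit(features, labels)
```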

## Quick Comparison

| Feature | Random Forest | CatBoost | XGBoost |
|---------|--------------|----------|---------|
| **Training Speed** | ⚡⚡⚡ Fast | 🐌 Slow | ⚡⚡ Moderate |
| **Accuracy** | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐ Very Good |
| **Missing Data Handling** | Imputation to 0 | Native support | Native support |
| **Setup Complexity** | ✅ Simple | ✅ Simple | ⚠️ May need libomp |
| **Best Use Case** | Quick iterations | Production accuracy | Balanced performance |

## Recommendations

**Getting Started:** Start with **Random Forest** to quickly iterate and establish a baseline. It trains fast, allowing you to experiment with different labeling strategies and window sizes.

**Optimizing Performance:** Once you've refined your labels and parameters, try **CatBoost** for a potential accuracy boost. The longer training time is worthwhile for your final classifier.

**Alternative:** If CatBoost is too slow or you want something between Random Forest and CatBoost, try **XGBoost** (if available on your system).

**Note:** The actual performance difference between classifiers varies by behavior type and dataset. We recommend testing multiple classifiers on your specific data to find the best option for your use case.
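
One way to run such a comparison, sketched with scikit-learn's cross-validation utilities on synthetic stand-in data; both CatBoost and XGBoost expose a scikit-learn-compatible interface, so they drop into the same loop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; substitute your own labeled feature matrix.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))
labels = rng.integers(0, 2, size=200)

candidates = {"random_forest": RandomForestClassifier(n_estimators=100)}
try:
    from catboost import CatBoostClassifier
    candidates["catboost"] = CatBoostClassifier(iterations=200, verbose=False)
except Exception:
    pass  # catboost not installed
try:
    from xgboost import XGBClassifier
    candidates["xgboost"] = XGBClassifier(n_estimators=200)
except Exception:
    pass  # xgboost not installed (or libomp missing on macOS)

for name, clf in candidates.items():
    scores = cross_val_score(clf, features, labels, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Note that for frame-based behavior data, plain k-fold splits can leak information between adjacent frames of the same video; grouping folds by video (for example with scikit-learn's `GroupKFold`) gives a more honest estimate.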
78 changes: 1 addition & 77 deletions src/jabs/resources/docs/user_guide/gui.md
@@ -34,84 +34,8 @@

### Choosing a Classifier Type

JABS supports three machine learning classifier types: **Random Forest**, **CatBoost**, and **XGBoost**. Each has different characteristics that may make it more suitable for your specific use case.
JABS offers several choices for the underlying classifier implementation. See the [Classifier Types guide](classifier-types.md) for more information.

#### Random Forest (Default)

Random Forest is the default classifier and a good starting point for most users.

**Pros:**

- ✅ **Fast training** - Trains quickly, even on large datasets
- ✅ **Well-established** - Mature algorithm with extensive validation
- ✅ **Good baseline performance** - Reliable results across many behavior types
- ✅ **Low memory footprint** - Efficient with system resources

**Cons:**

- ⚠️ **May plateau** - Sometimes reaches a performance ceiling compared to gradient boosting methods
- ⚠️ **Less flexible** - Fewer tuning options than boosting methods
- ⚠️ **Does not handle missing data** - Random Forest does not natively handle missing (NaN) values, so JABS currently replaces all NaNs with 0 during training and classification. This may be a poor fit if your data has many missing values.

**Best for:** Quick iterations, initial exploration, behaviors with simpler decision boundaries, or when training time is a priority.

#### CatBoost

CatBoost is a gradient boosting algorithm that can achieve excellent performance, particularly for complex behaviors.

**Pros:**

- ✅ **High accuracy** - Often achieves the best classification performance
- ✅ **Handles missing data natively** - No imputation needed for NaN values
- ✅ **Robust to overfitting** - Built-in regularization techniques
- ✅ **No external dependencies** - Installs cleanly without additional libraries

**Cons:**

- ⚠️ **Slower training** - Takes significantly longer to train than Random Forest
- ⚠️ **Higher memory usage** - May require more RAM during training
- ⚠️ **Slower classification** - Prediction can be slower on very large datasets

**Best for:** Final production classifiers where accuracy is paramount, complex behaviors with subtle patterns, or when you have time for longer training sessions.

#### XGBoost

XGBoost is another gradient boosting algorithm, known for its strong track record in machine learning competitions.

**Pros:**

- ✅ **Excellent performance** - Typically matches or exceeds Random Forest accuracy
- ✅ **Handles missing data natively** - Like CatBoost, works with NaN values
- ✅ **Faster than CatBoost** - Typically trains more quickly than CatBoost
- ✅ **Widely used** - Extensive community support and documentation

**Cons:**

- ⚠️ **Dependency on libomp** - On macOS, may require a separate installation of the OpenMP runtime (libomp)
- ⚠️ **Slower than Random Forest** - Training takes longer than Random Forest
- ⚠️ **May be unavailable** - If libomp is not installed, XGBoost won't be available as a choice in JABS

**Best for:** When you need better accuracy than Random Forest but faster training than CatBoost, or when you're familiar with gradient boosting methods.

#### Quick Comparison

| Feature | Random Forest | CatBoost | XGBoost |
|---------|--------------|----------|---------|
| **Training Speed** | ⚡⚡⚡ Fast | 🐌 Slow | ⚡⚡ Moderate |
| **Accuracy** | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐ Very Good |
| **Missing Data Handling** | Imputation to 0 | Native support | Native support |
| **Setup Complexity** | ✅ Simple | ✅ Simple | ⚠️ May need libomp |
| **Best Use Case** | Quick iterations | Production accuracy | Balanced performance |

#### Recommendations

**Getting Started:** Start with **Random Forest** to quickly iterate and establish a baseline. It trains fast, allowing you to experiment with different labeling strategies and window sizes.

**Optimizing Performance:** Once you've refined your labels and parameters, try **CatBoost** for a potential accuracy boost. The longer training time is worthwhile for your final classifier.

**Alternative:** If CatBoost is too slow or you want something between Random Forest and CatBoost, try **XGBoost** (if available on your system).

**Note:** The actual performance difference between classifiers varies by behavior type and dataset. We recommend testing multiple classifiers on your specific data to find the best option for your use case.

### Training Reports

1 change: 1 addition & 0 deletions src/jabs/ui/user_guide_dialog.py
@@ -202,6 +202,7 @@ def _build_tree(self) -> None:
"Prediction File": "file-formats.md#prediction-file",
"Feature File": "file-formats.md#feature-file",
},
"Choosing a Classifier": "classifier-types.md",
"Features Reference": "features.md",
"Keyboard Shortcuts Reference": "keyboard-shortcuts.md",
}