Commit d5c874d

Merge pull request #263 from KumarLabJax/reorganize-userguide
make the information about classifier types its own page in the userguide
2 parents: 2970ba5 + 15d6fd6

File tree: 3 files changed (+82, −77 lines)

src/jabs/resources/docs/user_guide/classifier-types.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Choosing a Classifier Type

JABS supports three machine learning classifier types: **Random Forest**, **CatBoost**, and **XGBoost**. Each has different characteristics that may make it more suitable for your specific use case.

## Random Forest (Default)

Random Forest is the default classifier and a good starting point for most users.

**Pros:**

- **Fast training** - Trains quickly, even on large datasets
- **Well-established** - Mature algorithm with extensive validation
- **Good baseline performance** - Reliable results across many behavior types
- **Low memory footprint** - Efficient with system resources

**Cons:**

- ⚠️ **May plateau** - Sometimes reaches a performance ceiling compared to gradient boosting methods
- ⚠️ **Less flexible** - Fewer tuning options than boosting methods
- ⚠️ **Does not handle missing data** - Random Forest does not natively handle missing (NaN) values, so JABS currently replaces all NaNs with 0 during training and classification (sketched at the end of this subsection). It may be a poor choice if your data contains many missing values.

**Best for:** Quick iterations, initial exploration, behaviors with simpler decision boundaries, or when training time is a priority.
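
As a rough illustration of the zero-fill imputation described above, the sketch below replaces NaN values before fitting a scikit-learn `RandomForestClassifier`. This is a minimal sketch, not JABS's actual training code; the `features` and `labels` arrays are hypothetical placeholders for exported JABS features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-frame feature matrix containing missing values (NaN)
# and binary behavior labels (1 = behavior, 0 = not behavior).
features = np.array(
    [
        [0.5, np.nan, 1.2],
        [0.7, 0.3, np.nan],
        [0.6, 0.2, 1.1],
        [0.9, 0.4, 0.8],
    ]
)
labels = np.array([1, 0, 1, 0])

# Random Forest does not accept NaNs here, so missing values are
# replaced with 0 before training (mirroring the behavior described above).
features_imputed = np.nan_to_num(features, nan=0.0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features_imputed, labels)

# New frames must be imputed the same way before prediction.
new_frames = np.nan_to_num(np.array([[0.8, np.nan, 1.0]]), nan=0.0)
print(clf.predict(new_frames))
```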

## CatBoost

CatBoost is a gradient boosting algorithm that can achieve excellent performance, particularly for complex behaviors.

**Pros:**

- **High accuracy** - Often achieves the best classification performance
- **Handles missing data natively** - No imputation needed for NaN values
- **Robust to overfitting** - Built-in regularization techniques
- **No external dependencies** - Installs cleanly without additional libraries

**Cons:**

- ⚠️ **Slower training** - Takes significantly longer to train than Random Forest
- ⚠️ **Higher memory usage** - May require more RAM during training
- ⚠️ **Slower classification** - Prediction can be slower on very large datasets

**Best for:** Final production classifiers where accuracy is paramount, complex behaviors with subtle patterns, or when you have time for longer training sessions.
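
For contrast, here is a minimal sketch of training a CatBoost model on data that still contains NaN values; no imputation step is needed because CatBoost handles missing numeric values natively. It assumes the `catboost` Python package is installed and reuses the hypothetical `features`/`labels` arrays from the Random Forest sketch above; it is not JABS's actual training code.

```python
import numpy as np
from catboost import CatBoostClassifier

# Same hypothetical feature matrix as above, with NaNs left in place.
features = np.array(
    [
        [0.5, np.nan, 1.2],
        [0.7, 0.3, np.nan],
        [0.6, 0.2, 1.1],
        [0.9, 0.4, 0.8],
    ]
)
labels = np.array([1, 0, 1, 0])

# CatBoost accepts NaN in numeric features directly (by default,
# missing values are treated as a minimum value at each split).
clf = CatBoostClassifier(iterations=200, verbose=False, random_seed=0)
clf.fit(features, labels)

print(clf.predict(np.array([[0.8, np.nan, 1.0]])))
```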

## XGBoost

XGBoost is another gradient boosting algorithm, known for its strong record in machine learning competitions.

**Pros:**

- **Excellent performance** - Typically matches or exceeds Random Forest accuracy
- **Handles missing data natively** - Like CatBoost, works with NaN values
- **Faster than CatBoost** - Trains more quickly than CatBoost
- **Widely used** - Extensive community support and documentation

**Cons:**

- ⚠️ **Depends on libomp** - On macOS, may require a separate installation of the OpenMP library
- ⚠️ **Slower than Random Forest** - Training takes longer than Random Forest
- ⚠️ **May be unavailable** - If libomp is not installed, XGBoost won't be offered as a choice in JABS

**Best for:** When you need better accuracy than Random Forest but faster training than CatBoost, or when you're familiar with gradient boosting methods.
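
Because XGBoost can be missing at runtime (for example, when libomp is not installed on macOS), a guarded import like the sketch below is one way to check whether it is usable. This is an illustrative pattern, not the availability check JABS itself performs.

```python
def xgboost_available() -> bool:
    """Return True if the xgboost package can be imported and used.

    On macOS, importing xgboost fails if the OpenMP runtime (libomp)
    is missing, so a broad exception catch is used here on purpose.
    """
    try:
        import xgboost  # noqa: F401
        return True
    except Exception:
        return False


if xgboost_available():
    print("XGBoost is available")
else:
    print("XGBoost is unavailable (is libomp installed?)")
```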

## Quick Comparison

| Feature | Random Forest | CatBoost | XGBoost |
|---------|---------------|----------|---------|
| **Training Speed** | ⚡⚡⚡ Fast | 🐌 Slow | ⚡⚡ Moderate |
| **Accuracy** | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐ Very Good |
| **Missing Data Handling** | Imputation to 0 | Native support | Native support |
| **Setup Complexity** | ✅ Simple | ✅ Simple | ⚠️ May need libomp |
| **Best Use Case** | Quick iterations | Production accuracy | Balanced performance |

## Recommendations

**Getting Started:** Start with **Random Forest** to quickly iterate and establish a baseline. It trains fast, allowing you to experiment with different labeling strategies and window sizes.

**Optimizing Performance:** Once you've refined your labels and parameters, try **CatBoost** for a potential accuracy boost. The longer training time is worthwhile for your final classifier.

**Alternative:** If CatBoost is too slow, or you want something between Random Forest and CatBoost, try **XGBoost** (if it is available on your system).

**Note:** The actual performance difference between classifiers varies by behavior type and dataset. We recommend testing multiple classifiers on your specific data to find the best option for your use case; the sketch below shows one way to run such a comparison.
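
The following sketch compares the three classifier types on the same labeled data using scikit-learn cross-validation. It assumes the optional `xgboost` and `catboost` packages are installed, and the `features`/`labels` arrays are hypothetical stand-ins for your own training data; this is not the evaluation JABS runs internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Hypothetical feature matrix and labels standing in for real JABS features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))
labels = rng.integers(0, 2, size=200)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
    "CatBoost": CatBoostClassifier(iterations=200, verbose=False, random_seed=0),
}

# 5-fold cross-validated accuracy for each candidate classifier.
for name, clf in candidates.items():
    scores = cross_val_score(clf, features, labels, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```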

src/jabs/resources/docs/user_guide/gui.md

Lines changed: 1 addition & 77 deletions
@@ -34,84 +34,8 @@

### Choosing a Classifier Type

- JABS supports three machine learning classifier types: **Random Forest**, **CatBoost**, and **XGBoost**. Each has different characteristics that may make it more suitable for your specific use case.
+ JABS offers several choices for the underlying classifier implementation. See the [Classifier Types guide](classifier-types.md) for more information.
- [76 further removed lines: the "Random Forest (Default)", "CatBoost", "XGBoost", "Quick Comparison", and "Recommendations" subsections, identical to the content of the new classifier-types.md file shown above.]

### Training Reports

src/jabs/ui/user_guide_dialog.py

Lines changed: 1 addition & 0 deletions
@@ -202,6 +202,7 @@ def _build_tree(self) -> None:
      "Prediction File": "file-formats.md#prediction-file",
      "Feature File": "file-formats.md#feature-file",
  },
+ "Choosing a Classifier": "classifier-types.md",
  "Features Reference": "features.md",
  "Keyboard Shortcuts Reference": "keyboard-shortcuts.md",
}
