34 | 34 |
35 | 35 | ### Choosing a Classifier Type
36 | 36 |
37 | | -JABS supports three machine learning classifier types: **Random Forest**, **CatBoost**, and **XGBoost**. Each has different characteristics that may make it better suited to your specific use case.
| 37 | +JABS offers several choices for the underlying classifier implementation. See the [Classifier Types guide](classifier-types.md) for more information.
38 | 38 |
39 | | -#### Random Forest (Default)
40 | | -
41 | | -Random Forest is the default classifier and a good starting point for most users.
42 | | -
43 | | -**Pros:**
44 | | -
45 | | -- ✅ **Fast training** - Trains quickly, even on large datasets
46 | | -- ✅ **Well-established** - Mature algorithm with extensive validation
47 | | -- ✅ **Good baseline performance** - Reliable results across many behavior types
48 | | -- ✅ **Low memory footprint** - Efficient with system resources
49 | | -
50 | | -**Cons:**
51 | | -
52 | | -- ⚠️ **May plateau** - Sometimes reaches a performance ceiling compared to gradient boosting methods
53 | | -- ⚠️ **Less flexible** - Fewer tuning options than boosting methods
54 | | -- ⚠️ **Does not handle missing data** - Random Forest does not natively handle missing (NaN) values, so JABS currently replaces all NaNs with 0 during training and classification. This may be a poor choice if your data has many missing values (see the sketch below).
55 | | -
56 | | -**Best for:** Quick iterations, initial exploration, behaviors with simpler decision boundaries, or when training time is a priority.
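A minimal sketch of what that imputation looks like, assuming scikit-learn's `RandomForestClassifier` and a small made-up feature matrix. This is illustrative only, not JABS's actual training code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up per-frame feature matrix; NaN marks a feature that could not be
# computed for that frame.
X = np.array([
    [0.5, np.nan, 1.2],
    [0.4, 0.9, 1.1],
    [np.nan, 0.8, 0.7],
    [0.6, 1.0, 0.9],
])
y = np.array([0, 1, 0, 1])  # 0 = not behavior, 1 = behavior

# Replace NaNs with 0 before fitting, mirroring the imputation strategy
# described above.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.nan_to_num(X, nan=0.0), y)

# The same replacement has to happen at prediction time.
new_frames = np.array([[0.55, np.nan, 1.0]])
print(clf.predict(np.nan_to_num(new_frames, nan=0.0)))
```

Whether 0 is a sensible stand-in depends on the feature, which is part of why the native NaN handling of the gradient boosting options below can matter when your data has many gaps.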
57 | | -
58 | | -#### CatBoost
59 | | -
60 | | -CatBoost is a gradient boosting algorithm that can achieve excellent performance, particularly for complex behaviors.
61 | | -
62 | | -**Pros:**
63 | | -
64 | | -- ✅ **High accuracy** - Often achieves the best classification performance
65 | | -- ✅ **Handles missing data natively** - No imputation needed for NaN values
66 | | -- ✅ **Robust to overfitting** - Built-in regularization techniques
67 | | -- ✅ **No external dependencies** - Installs cleanly without additional libraries
68 | | -
69 | | -**Cons:**
70 | | -
71 | | -- ⚠️ **Slower training** - Takes significantly longer to train than Random Forest
72 | | -- ⚠️ **Higher memory usage** - May require more RAM during training
73 | | -- ⚠️ **Slower prediction** - Classification can take longer on very large datasets
74 | | -
75 | | -**Best for:** Final production classifiers where accuracy is paramount, complex behaviors with subtle patterns, or when you have time for longer training sessions.
76 | | -
77 | | -#### XGBoost
78 | | -
79 | | -XGBoost is another gradient boosting algorithm, widely known for its strong performance in machine learning competitions.
80 | | -
81 | | -**Pros:**
82 | | -
83 | | -- ✅ **Excellent performance** - Typically matches or exceeds Random Forest accuracy
84 | | -- ✅ **Handles missing data natively** - Like CatBoost, works with NaN values
85 | | -- ✅ **Faster than CatBoost** - Trains noticeably faster than CatBoost
86 | | -- ✅ **Widely used** - Extensive community support and documentation
87 | | -
88 | | -**Cons:**
89 | | -
90 | | -- ⚠️ **Dependency on libomp** - On macOS, may require separate installation of the OpenMP library (libomp)
91 | | -- ⚠️ **Slower than Random Forest** - Training takes longer than Random Forest
92 | | -- ⚠️ **May be unavailable** - If libomp is not installed, XGBoost won't be offered as a choice in JABS (see the availability check sketched below)
93 | | -
94 | | -**Best for:** When you need better accuracy than Random Forest but faster training than CatBoost, or when you're familiar with gradient boosting methods.
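As a rough illustration of the availability caveat above, the sketch below simply probes whether the XGBoost Python package can be imported. This is an assumption about how such a check could look, not JABS's actual detection logic, and `xgboost_available` is a hypothetical helper name:

```python
def xgboost_available() -> bool:
    """Return True if the xgboost package imports successfully.

    On macOS the import can fail when the OpenMP runtime (libomp) is missing.
    """
    try:
        import xgboost  # noqa: F401
        return True
    except Exception:
        # ImportError, or an XGBoost-specific error if the shared
        # library cannot be loaded.
        return False


if __name__ == "__main__":
    print("XGBoost available:", xgboost_available())
```

If the check fails on macOS, installing libomp (for example via Homebrew) typically resolves it.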
95 | | -
96 | | -#### Quick Comparison
97 | | -
98 | | -| Feature | Random Forest | CatBoost | XGBoost |
99 | | -|---------|--------------|----------|---------|
100 | | -| **Training Speed** | ⚡⚡⚡ Fast | 🐌 Slow | ⚡⚡ Moderate |
101 | | -| **Accuracy** | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐ Very Good |
102 | | -| **Missing Data Handling** | Imputation to 0 | Native support | Native support |
103 | | -| **Setup Complexity** | ✅ Simple | ✅ Simple | ⚠️ May need libomp |
104 | | -| **Best Use Case** | Quick iterations | Production accuracy | Balanced performance |
105 | | -
106 | | -#### Recommendations
107 | | -
108 | | -**Getting Started:** Start with **Random Forest** to quickly iterate and establish a baseline. It trains fast, allowing you to experiment with different labeling strategies and window sizes.
109 | | -
110 | | -**Optimizing Performance:** Once you've refined your labels and parameters, try **CatBoost** for a potential accuracy boost. The longer training time is worthwhile for your final classifier.
111 | | -
112 | | -**Alternative:** If CatBoost is too slow or you want something between Random Forest and CatBoost, try **XGBoost** (if available on your system).
113 | | -
114 | | -**Note:** The actual performance difference between classifiers varies by behavior type and dataset. We recommend testing multiple classifiers on your specific data to find the best option for your use case.
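If you can assemble a feature matrix and labels outside the GUI, a quick head-to-head comparison is easy to script. The sketch below is one way to do it under a few assumptions: the random data stands in for exported features, the hyperparameters are placeholders rather than values JABS uses, and the `catboost` and `xgboost` packages are installed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Placeholder data standing in for exported features and frame labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "CatBoost": CatBoostClassifier(iterations=200, verbose=0, random_state=0),
    "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
}

# 5-fold cross-validated accuracy gives a rough, like-for-like comparison.
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

In practice, the comparison that matters is on held-out labeled frames from your own videos rather than synthetic data.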
115 | 39 |
116 | 40 | ### Training Reports
117 | 41 |