
Commit 92992d5

add page on automl theory
1 parent 7d029db commit 92992d5

2 files changed: +243 -1 lines changed
Lines changed: 243 additions & 0 deletions
AutoML and Hyperparameter Optimization
======================================

This section provides a deep dive into the theoretical foundations of automated machine learning (AutoML) and hyperparameter optimization as implemented in AutoIntent.

The Hyperparameter Optimization Problem
---------------------------------------

**The Core Problem**

Hyperparameter optimization is about finding the configuration of settings that maximizes model performance. Think of it as searching through all possible combinations of hyperparameters (learning rates, model sizes, regularization strengths) to find the combination that gives the best results on validation data.

The performance metric is typically estimated through cross-validation to avoid overfitting: we want configurations that work well on unseen data, not just the training data.

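One way to state this formally: if :math:`\Lambda` is the space of hyperparameter configurations and :math:`\mathcal{M}(\lambda)` is the validation (or cross-validated) score of a model trained with configuration :math:`\lambda`, the problem is

.. math::

    \lambda^{*} = \operatorname*{arg\,max}_{\lambda \in \Lambda} \; \mathcal{M}(\lambda)

The objective :math:`\mathcal{M}` is expensive to evaluate (each evaluation trains and validates a model) and provides no useful gradients, which is why black-box methods such as TPE are used instead of gradient-based optimization.
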
**The Challenge of Combinatorial Explosion**

In AutoIntent's three-stage pipeline, the total search space grows multiplicatively across all stages. If we have:

- 10 different embedding models to choose from
- 20 different scoring configurations
- 5 different decision strategies

then we have 10 × 20 × 5 = 1,000 total combinations. In realistic scenarios, this can easily exceed 1,000,000 configurations, making it impossible to test every combination within a reasonable time and computational budget.

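A rough back-of-the-envelope illustration of why exhaustive search does not scale, using the counts above (the numbers are illustrative, not AutoIntent defaults):

.. code-block:: python

    from itertools import product

    embeddings = [f"embedder_{i}" for i in range(10)]   # 10 embedding models
    scorers = [f"scorer_{i}" for i in range(20)]        # 20 scoring configurations
    deciders = [f"decision_{i}" for i in range(5)]      # 5 decision strategies

    # Exhaustive grid: every combination of all three stages
    full_grid = list(product(embeddings, scorers, deciders))
    print(len(full_grid))  # 1000 pipelines to train and evaluate

    # Sequential (greedy) budget: each stage is searched on its own
    print(len(embeddings) + len(scorers) + len(deciders))  # 35 evaluations
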
Hierarchical Optimization Strategy
----------------------------------

AutoIntent addresses combinatorial explosion through a **hierarchical greedy optimization** approach that optimizes modules sequentially.

**Sequential Module Optimization**

The optimization proceeds in three stages, where each stage builds on the results of the previous one (a code sketch of this loop follows the list):

1. **Embedding Optimization**: First, find the best embedding model configuration by testing different models and settings, evaluating them using retrieval or classification metrics.

2. **Scoring Optimization**: Using the best embedding model from step 1, optimize the scoring module by testing different classifiers (KNN, linear, neural networks, etc.) with various hyperparameters.

3. **Decision Optimization**: Using the best embedding and scoring combination from steps 1-2, optimize the decision module by finding optimal thresholds and decision strategies for final predictions.

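The following sketch shows the shape of this greedy loop. The stage names, the ``stage_candidates`` structure, and the ``evaluate_stage`` callback are hypothetical placeholders for illustration, not AutoIntent's actual internals:

.. code-block:: python

    def greedy_pipeline_search(stage_candidates, evaluate_stage):
        """Pick one configuration per stage, reusing earlier winners.

        ``stage_candidates`` maps a stage name to a list of candidate configs;
        ``evaluate_stage(fixed, candidate)`` returns the stage's proxy metric
        for ``candidate`` given the already-fixed earlier modules.
        """
        best = {}
        # Stages are optimized one after another; later stages reuse earlier winners.
        for stage in ("embedding", "scoring", "decision"):
            scored = [
                (evaluate_stage(best, candidate), candidate)
                for candidate in stage_candidates[stage]
            ]
            best[stage] = max(scored, key=lambda pair: pair[0])[1]
        return best  # one configuration per stage, chosen greedily
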
**Proxy Metrics**

Each stage uses specialized proxy metrics that correlate with final performance:

- **Embedding Stage**: Retrieval metrics (NDCG, hit rate) or lightweight classification accuracy
- **Scoring Stage**: Classification metrics (F1, ROC-AUC) on validation data
- **Decision Stage**: Threshold-specific metrics for multi-label/OOS scenarios

**Trade-offs**

- ✅ **Computational Efficiency**: Instead of testing all possible combinations (whose number grows multiplicatively across stages), we only test combinations within each stage separately, making optimization much faster and more manageable.
- ✅ **Parallelization**: Each stage can be parallelized independently, allowing multiple configurations to be tested simultaneously.
- ⚠️ **Local Optimality**: May miss globally optimal combinations due to greedy choices - the best embedding might work better with a different scorer than the one we pick, but we won't discover this combination.

Tree-Structured Parzen Estimators (TPE)
----------------------------------------

AutoIntent uses Optuna's TPE algorithm for hyperparameter optimization within each module. This is a form of Bayesian optimization that learns from previous trials to make smarter choices about which hyperparameters to try next.

**How TPE Works**

TPE builds two separate models:

- **Good Configuration Model**: Learns the distribution of hyperparameters that led to good performance (typically the top 25% of trials)
- **Bad Configuration Model**: Learns the distribution of hyperparameters that led to poor performance (the remaining 75% of trials)

The algorithm then suggests new hyperparameters by finding configurations that are likely under the "good" model but unlikely under the "bad" model. This naturally balances exploration (trying untested areas) with exploitation (focusing on promising regions).

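In symbols, if :math:`l(\lambda)` is the density fitted to the good trials and :math:`g(\lambda)` the density fitted to the bad ones, TPE proposes the candidate that maximizes their ratio, which under TPE's assumptions is equivalent to maximizing expected improvement:

.. math::

    \lambda_{\text{next}} = \operatorname*{arg\,max}_{\lambda} \; \frac{l(\lambda)}{g(\lambda)}
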
**Benefits of TPE**

- **Smart Sampling**: After initial random trials, TPE makes increasingly informed decisions about which hyperparameters to try
- **Handles Different Parameter Types**: Works well with categorical, continuous, and integer parameters
- **Robust to Noisy Evaluations**: Can handle situations where the same hyperparameters might give slightly different results due to randomness
- **No Prior Knowledge Required**: Works without needing to specify complex relationships between parameters

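As a minimal, standalone illustration of the mechanism (not AutoIntent's internal code), this is how a TPE-driven study looks in plain Optuna; the toy objective stands in for training and validating one module configuration:

.. code-block:: python

    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Stand-in for "train a module with these hyperparameters and
        # return its validation metric".
        k = trial.suggest_int("k", 1, 20)
        weights = trial.suggest_categorical("weights", ["uniform", "distance"])
        return -abs(k - 7) + (0.5 if weights == "distance" else 0.0)

    # After 10 random startup trials, TPE starts steering the search
    sampler = optuna.samplers.TPESampler(n_startup_trials=10, seed=0)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=50)
    print(study.best_params)
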
Search Space Design
-------------------

**Parameter Types**

AutoIntent supports several types of hyperparameters, each with its own sampling strategy:

**Categorical Parameters**: These are discrete choices from a fixed set of options, like choosing between different model types ("knn", "linear", "bert") or activation functions ("relu", "tanh", "sigmoid"). The optimizer samples uniformly from the available choices.

**Continuous Parameters**: These are real-valued parameters like learning rates, regularization strengths, or temperature values. The optimizer can sample from uniform distributions (for parameters like dropout rates between 0.0 and 1.0) or log-uniform distributions (for parameters like learning rates that work better on logarithmic scales).

**Integer Parameters**: These are whole-number parameters like the number of neighbors in KNN, hidden dimensions in neural networks, or batch sizes. The optimizer can specify step sizes and bounds to ensure valid configurations.

**Conditional Parameters**: Some parameters only make sense when certain other parameters have specific values. For example, LoRA-specific parameters (like lora_alpha and lora_r) only apply when the model type is "lora". AutoIntent handles these dependencies automatically in the search space configuration.

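In Optuna's define-by-run style, these four parameter types map onto the ``suggest_*`` family. The snippet below is a generic sketch of that mapping (the parameter names and ranges are illustrative, not AutoIntent's own search-space code):

.. code-block:: python

    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Categorical: uniform choice over a fixed set of options
        model_type = trial.suggest_categorical("model_type", ["knn", "linear", "lora"])
        # Continuous on a log scale: learning rates span orders of magnitude
        learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
        # Integer with bounds and an optional step size
        hidden_dim = trial.suggest_int("hidden_dim", 64, 512, step=64)
        # Conditional: only sampled when the relevant branch is active
        if model_type == "lora":
            lora_r = trial.suggest_int("lora_r", 4, 64)
            lora_alpha = trial.suggest_float("lora_alpha", 1.0, 64.0)
        return 0.0  # placeholder for the real validation metric
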
**Search Space Configuration**

.. code-block:: yaml

    search_space:
      - node_type: scoring
        target_metric: scoring_f1
        search_space:
          - module_name: knn
            k:
              low: 1
              high: 20
            weights: [uniform, distance, closest]
          - module_name: linear
            cv: [3, 5, 10]

Cross-Validation and Data Splitting
-----------------------------------

**Validation Schemes**

AutoIntent supports multiple validation strategies to ensure robust hyperparameter selection:

**Hold-out Validation (HO)**

Split the data into training and validation sets once. Train the model on the training set and evaluate performance on the validation set. This gives a single performance score for each hyperparameter configuration.

**Cross-Validation (CV)**

Split the data into K folds (typically 3-5). For each fold, train on the remaining folds and validate on the current fold. Average the performance scores across all K folds to get a more robust estimate of how well the hyperparameters work.

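The two schemes differ only in how the score for a single configuration is computed. A generic scikit-learn sketch of the K-fold averaging described above (not AutoIntent's internal splitting code; the toy data and classifier are placeholders):

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = np.random.rand(200, 16), np.random.randint(0, 3, size=200)  # toy data

    # One hyperparameter configuration under evaluation
    model = LogisticRegression(C=1.0, max_iter=1000)

    # 5-fold stratified CV: train on 4 folds, validate on the 5th, repeat, average
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(scores.mean())  # the single number the optimizer compares across configs
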
**Stratified Splitting**

For imbalanced datasets, AutoIntent uses stratified sampling to maintain class distributions:

.. code-block:: python

    from autointent.configs import DataConfig

    data_config = DataConfig(
        scheme="cv",           # Cross-validation
        n_folds=5,             # 5-fold CV
        validation_size=0.2,   # 20% for validation in HO
        separation_ratio=0.5,  # Prevent data leakage between modules
    )

**Data Leakage Prevention**

The ``separation_ratio`` parameter prevents information leakage between scoring and decision modules by using different data subsets for each stage.

**Hyperparameter Bounds**

Search spaces include reasonable bounds to prevent extreme configurations:

.. code-block:: yaml

    learning_rate:
      low: 1.0e-5   # Prevent too slow learning
      high: 1.0e-2  # Prevent instability
      log: true     # Log-uniform sampling

Multi-Objective Optimization Considerations
--------------------------------------------

While AutoIntent primarily optimizes single metrics, it considers multiple objectives implicitly:

**Performance vs. Efficiency Trade-offs**

- **Model size**: Smaller models for deployment efficiency
- **Training time**: Faster models for rapid iteration
- **Inference speed**: Optimized for production latency

**Presets as Multi-Objective Solutions**

AutoIntent provides presets that balance different objectives:

.. code-block:: python

    from autointent import Pipeline

    # Different computational budgets
    pipeline_light = Pipeline.from_preset("classic-light")  # Speed-focused
    pipeline_heavy = Pipeline.from_preset("classic-heavy")  # Performance-focused

    # Different model types
    pipeline_zero_shot = Pipeline.from_preset("zero-shot-transformers")  # No training data

Bayesian Optimization Theory
-----------------------------

**Gaussian Process Surrogate Models**

While TPE uses tree-structured models, the general Bayesian optimization framework uses Gaussian Processes as surrogate models. These are probabilistic models that learn to predict performance based on previous trials, including uncertainty estimates about unexplored regions of the hyperparameter space.

**Exploration vs. Exploitation**

Bayesian optimization balances:

- **Exploitation**: Sampling near known good configurations
- **Exploration**: Sampling in uncertain regions of the space

The acquisition function mathematically encodes this trade-off.

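A common choice is expected improvement (EI), which rewards candidates that are likely to beat the best score observed so far, :math:`f(\lambda^{+})`:

.. math::

    \mathrm{EI}(\lambda) = \mathbb{E}\left[\max\bigl(f(\lambda) - f(\lambda^{+}),\, 0\bigr)\right]

Candidates with a high predicted mean (exploitation) or high predictive uncertainty (exploration) both yield large expected improvement; TPE's ratio criterion above can be shown to maximize a form of EI.
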
**Convergence Properties**

TPE and related algorithms have theoretical guarantees for convergence to global optima under certain conditions, though practical performance depends on:

- Search space dimensionality
- Function smoothness
- Available computational budget

Practical Optimization Strategies
----------------------------------

**Budget Allocation**

.. code-block:: python

    hpo_config = HPOConfig(
        sampler="tpe",
        n_trials=50,          # Total optimization budget
        n_startup_trials=10,  # Random initialization
        timeout=3600,         # 1-hour time limit
        n_jobs=4,             # Parallel trials
    )

**Warm Starting**

AutoIntent can resume interrupted optimization. This is the approximate code we use for creating Optuna studies:

.. code-block:: python

    import optuna

    # Optimization state is automatically saved
    study = optuna.create_study(
        study_name="intent_classification",
        storage="sqlite:///optuna.db",
        load_if_exists=True,
    )

Advanced Topics
---------------

**Meta-Learning**

AutoIntent's presets can be viewed as meta-learning solutions - configurations that work well across diverse datasets based on empirical analysis.

**Neural Architecture Search (NAS)**

While not fully implemented, AutoIntent's modular design supports architecture search within model families (e.g., different CNN configurations).

**Automated Feature Engineering**

AutoIntent's embedding-centric design can be seen as automated feature engineering: the system automatically learns relevant representations by selecting the best-fitting embedding model.

docs/source/learn/optimization.rst

Lines changed: 0 additions & 1 deletion
@@ -43,4 +43,3 @@ This is similar to random search over a subset, but during the search, we attemp

 This approach is more sophisticated and can lead to better results by intelligently exploring the hyperparameter space.

-The implementation of Bayesian optimization is planned for release v0.1.0.
