A custom implementation of a matrix-free Newton Trust-Region optimizer with Conjugate Gradient (Steihaug) for neural network training, applied to MLB home run prediction. This project demonstrates that second-order optimization methods can achieve 4.3× faster training than SGD while maintaining comparable accuracy.
Author: Ryan Lutz (Duke University, ME 555 - Numerical Optimization)
Neural networks typically rely on first-order optimization methods (SGD, Adam) that use only gradient information. This project implements a second-order Newton Trust-Region optimizer that leverages curvature information through Hessian-vector products, avoiding the O(N²) computational cost of computing the full Hessian matrix.
Key Innovation: Matrix-free Newton method using Hessian-vector products computed in O(N) time through automatic differentiation, enabling practical second-order optimization for neural networks.
Task: Binary classification of batted balls in MLB games
- Input: 10 Statcast features (launch speed, launch angle, pitch characteristics, etc.)
- Output: Home run (1) or other outcome (0)
- Challenge: Train a neural network using custom second-order optimization instead of standard first-order methods
Structure: 4-layer feedforward network
- Input layer: 10 features
- Hidden layer 1: 64 neurons (ReLU activation)
- Hidden layer 2: 32 neurons (ReLU activation)
- Output layer: 1 neuron (Sigmoid activation)
- Total parameters: 2,817 weights
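A minimal PyTorch sketch of this architecture; the class and variable names are illustrative rather than taken from the project files:

```python
import torch
import torch.nn as nn

class HomeRunNet(nn.Module):
    """Feedforward classifier matching the 10-64-32-1 architecture above."""
    def __init__(self, n_features: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 64),  # 10*64 + 64 bias = 704 parameters
            nn.ReLU(),
            nn.Linear(64, 32),          # 64*32 + 32 bias = 2,080 parameters
            nn.ReLU(),
            nn.Linear(32, 1),           # 32*1 + 1 bias   = 33 parameters
            nn.Sigmoid(),               # total           = 2,817 parameters
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = HomeRunNet()
print(sum(p.numel() for p in model.parameters()))  # 2817
```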
Source: MLB Statcast data via PyBaseball library
Period: June 2024 MLB season
Size: ~14,000 samples
Features: 10 engineered parameters
| Feature | Description |
|---|---|
| Release Speed | Pitch velocity (mph) |
| Batter Stance | Right/Left handed |
| Pitcher Arm | Right/Left handed |
| Batter Advantage | Count favorability (1=ahead, 0=behind/tied) |
| Launch Speed | Exit velocity off bat |
| Launch Angle | Vertical departure angle |
| Effective Speed | Effective velocity at contact |
| Spin Rate | Ball rotation (rpm) |
| Pitch Number | Pitch sequence number within the at-bat |
| Pitch Type | Fastball/Breaking/Offspeed/Other |
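For context, raw Statcast data for this window can be pulled with PyBaseball roughly as sketched below; the column names are standard Statcast fields, but the exact cleaning and feature engineering in `Preprocessing_baseball.py` may differ:

```python
from pybaseball import statcast

# Pitch-level Statcast data for June 2024 (the window used in this project)
raw = statcast(start_dt="2024-06-01", end_dt="2024-06-30")

# Standard Statcast column names; keep batted balls with measured contact data
cols = ["release_speed", "stand", "p_throws", "launch_speed", "launch_angle",
        "effective_speed", "release_spin_rate", "pitch_number", "pitch_type",
        "balls", "strikes", "events"]
df = raw[cols].dropna(subset=["launch_speed", "launch_angle"])

# Label: 1 for home runs, 0 for every other batted-ball outcome
df["is_home_run"] = (df["events"] == "home_run").astype(int)

# Example engineered feature: batter ahead in the count (1) vs. behind/tied (0)
df["batter_advantage"] = (df["balls"] > df["strikes"]).astype(int)
```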
Core Algorithm: Trust-Region method with Conjugate Gradient subproblem solver
At each iteration k, solve the trust-region subproblem:
min_{p ∈ ℝⁿ} m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p
subject to: ||p|| ≤ Δ_k
Where:
- f_k = current loss
- g_k = gradient
- B_k = Hessian approximation
- Δ_k = trust-region radius
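For reference, the model m_k(p) can be evaluated using only a Hessian-vector-product closure (here called `hvp`, an assumed helper returning B_k·v) rather than the matrix B_k itself:

```python
import torch

def quadratic_model(f_k: torch.Tensor, g_k: torch.Tensor, hvp, p: torch.Tensor) -> torch.Tensor:
    """Evaluate m_k(p) = f_k + g_k^T p + 0.5 * p^T B_k p using only a B_k·v product."""
    return f_k + g_k @ p + 0.5 * (p @ hvp(p))
```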
Key Innovation - Hessian-Free Implementation:
Instead of computing the full Hessian B_k (O(N²) operations), we compute Hessian-vector products v^T H in O(N) time using the relationship:
v^T H = v^T ∇(∇E) = ℛ{∇E}
Where ℛ{·} denotes the directional derivative operator. This is implemented through:
- Forward propagation with ℛ operator:

  ℛ{a_j} = Σ_i v_ji x_i
  ℛ{z_j} = h'(a_j) ℛ{a_j}
  ℛ{y_k} = Σ_j w_kj ℛ{z_j} + Σ_j v_kj z_j

- Backward propagation with ℛ operator:

  ℛ{δ_k} = ℛ{y_k}
  ℛ{δ_j} = h''(a_j) ℛ{a_j} Σ_k w_kj δ_k + h'(a_j) Σ_k v_kj δ_k + h'(a_j) Σ_k w_kj ℛ{δ_k}

- Hessian-vector product elements:

  ℛ{∂E/∂w_kj} = ℛ{δ_k} z_j + δ_k ℛ{z_j}
  ℛ{∂E/∂w_ji} = x_i ℛ{δ_j}

Here v_ji and v_kj denote components of the vector v being multiplied (i.e., ℛ{w} = v), following Bishop (2006).
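In PyTorch, the same Hessian-vector product can be obtained without hand-coding the ℛ forward/backward recursions, via a double-backward pass. A minimal sketch (function and variable names are illustrative, not the project's):

```python
import torch
from torch.nn.utils import parameters_to_vector

def hessian_vector_product(loss_fn, params, v):
    """Return H·v for the loss Hessian w.r.t. params, flattened into one vector.

    loss_fn : zero-argument callable that rebuilds the loss (and its graph)
    params  : list of parameter tensors with requires_grad=True
    v       : flat tensor with as many entries as total parameters
    """
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = parameters_to_vector(grads)
    # Differentiating the scalar (grad · v) a second time yields H·v
    hv = torch.autograd.grad(flat_grad @ v, params)
    return parameters_to_vector(hv)
```

Each call costs roughly one extra forward/backward pass, i.e. about twice a gradient evaluation, which is the O(N) behavior described above.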
CG-Steihaug Truncation: The conjugate gradient subproblem is truncated when:
- Trust-region boundary is reached: ||p_k|| > Δ_k
- Negative curvature detected: p^T B_k p < 0
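A compact sketch of a CG-Steihaug inner solver with both truncation rules, written against a generic Hessian-vector-product closure `hvp` (names and tolerances are illustrative, not taken from `NewtonTR_project.py`):

```python
import torch

def cg_steihaug(g, hvp, delta, tol=1e-8, max_iter=250):
    """Approximately minimize g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    p = torch.zeros_like(g)
    r = g.clone()          # residual B p + g (equals g at p = 0)
    d = -r                 # search direction
    if r.norm() < tol:
        return p
    for _ in range(max_iter):
        Bd = hvp(d)
        dBd = d @ Bd
        if dBd <= 0:                       # negative curvature: move to the boundary
            return _to_boundary(p, d, delta)
        alpha = (r @ r) / dBd
        p_next = p + alpha * d
        if p_next.norm() >= delta:         # step would leave the trust region
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Bd
        if r_next.norm() < tol:
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        p, r = p_next, r_next
    return p

def _to_boundary(p, d, delta):
    """Return p + tau*d with tau >= 0 chosen so that ||p + tau*d|| = delta."""
    a, b, c = d @ d, 2 * (p @ d), p @ p - delta**2
    tau = (-b + torch.sqrt(b**2 - 4 * a * c)) / (2 * a)
    return p + tau * d
```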
Adaptive Trust-Region: Radius Δ_k is adjusted based on the ratio:
ρ_k = [f(x_k) - f(x_k + p_k)] / [m_k(0) - m_k(p_k)]
- If ρ_k ≤ 0: Reject step, decrease Δ_k
- If ρ_k ≈ 1: Accept step, expand Δ_k
- Otherwise: Accept step, maintain Δ_k
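The corresponding outer accept/reject logic might look like the sketch below; the 0.25/0.75 thresholds and shrink/expand factors are common textbook defaults (Nocedal & Wright), not necessarily the values used in this project:

```python
def trust_region_update(x, p, f, grad, hvp, delta, delta_max=10.0):
    """Accept/reject the candidate step p and adapt the radius delta."""
    g = grad(x)
    actual = f(x) - f(x + p)                   # f(x_k) - f(x_k + p_k)
    predicted = -(g @ p + 0.5 * (p @ hvp(p)))  # m_k(0) - m_k(p_k)
    rho = actual / predicted

    if rho < 0.25:                             # poor model agreement: shrink the region
        delta = 0.25 * delta
    elif rho > 0.75 and p.norm() >= delta - 1e-12:
        delta = min(2.0 * delta, delta_max)    # step hit the boundary and fit well: expand
    if rho > 0:                                # any actual decrease: accept the step
        x = x + p
    return x, delta
```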
Training Time (100 epochs):
| Optimizer | Time | Speedup |
|---|---|---|
| Newton TR-CG | 4.42 seconds | 4.3× faster |
| SGD | 18.83 seconds | baseline |
Accuracy Metrics:
| Metric | Newton TR-CG | SGD | Difference |
|---|---|---|---|
| Accuracy | 95.6% | 96.2% | -0.6% |
| L2 Error | 22.432 | 19.3 | +16.2% |
| MSE | 0.0362 | 0.2685 | -86.5% |
Newton TR-CG:
- Rapid initial convergence (first 10 epochs)
- Noticeable fluctuations due to adaptive trust-region sizing
- Final loss: ~0.07
SGD:
- Smooth, monotonic decrease
- Consistent convergence throughout training
- Final loss: ~0.01
- Computational Efficiency: Newton TR-CG achieves 4.3× speedup over SGD by exploiting second-order curvature information through efficient Hessian-vector products
- Comparable Accuracy: Only 0.6 percentage point accuracy difference (95.6% vs 96.2%), demonstrating that second-order methods can match first-order performance
- Convergence Characteristics:
  - Newton TR-CG shows faster early convergence but more oscillations
  - SGD exhibits smoother, more stable convergence
  - Trust-region adaptation causes characteristic fluctuations in the Newton method
- Practical Tradeoff: For larger-scale problems, the 4.3× speedup may justify the slight accuracy decrease, especially when training time is a bottleneck
```
Homerun-Predictor/
├── NewtonTR_project.py        # Main Newton TR-CG implementation
├── NN_with_TR.py              # Neural network with Trust-Region optimizer
├── Num_Opt_SGD.py             # SGD baseline implementation
├── Preprocessing_baseball.py  # Statcast data preprocessing
├── baseball_processing.py     # Feature engineering
└── README.md
```
```
pip install torch numpy pandas
pip install pybaseball   # For MLB Statcast data
```

Train with Newton TR-CG:

```
python NewtonTR_project.py
```

Train with SGD baseline:

```
python Num_Opt_SGD.py
```

Preprocess data:

```
python Preprocessing_baseball.py
```

Trust-Region Formulation:
- Quadratic model approximation of loss landscape
- Adaptive region sizing based on model quality
- Guaranteed convergence to local minimum
Hessian-Free Optimization:
- O(N) Hessian-vector products vs. O(N²) full Hessian
- Enables second-order methods for large-scale problems
- Leverages PyTorch automatic differentiation
CG-Steihaug Solver:
- Iteratively solves trust-region subproblem
- Truncates at boundary or negative curvature
- Combines benefits of Newton's method with trust-region safety
ReLU (Hidden Layers):
f(x) = max(0, x)
- Computationally efficient
- Maintains non-linearity
- Mitigates vanishing gradient
Sigmoid (Output Layer):
f(x) = 1 / (1 + e^(-x))
- Maps to [0,1] range
- Ideal for binary classification
- Provides probabilistic interpretation
This second-order optimization approach is particularly valuable for:
- Limited training budget: When faster convergence matters more than final accuracy
- Expensive forward passes: When gradient computation dominates training time
- Small to medium networks: Where Hessian-vector products remain tractable
- Well-conditioned problems: Where second-order information improves convergence
It is less well suited for:
- Very large networks: Where even O(N) Hessian-vector products become expensive
- Maximum accuracy required: When that extra 0.5-1% matters
- Noisy gradients: When mini-batch variance dominates
- Established pipelines: When using proven, battle-tested optimizers
1. Quasi-Newton Methods
- Implement L-BFGS with limited memory Hessian approximation
- Compare computational efficiency with full Newton TR-CG
- Evaluate accuracy-speed tradeoffs
2. Stochastic Newton Methods
- Extend to mini-batch training with subsampled Hessians
- Implement variance reduction techniques
- Balance per-iteration cost with convergence speed
3. Hybrid Approaches
- Warm start with SGD, finish with Newton TR-CG
- Switch optimizers based on gradient norm
- Adaptive optimizer selection
4. Larger Datasets
- Scale to full MLB season (150,000+ samples)
- Evaluate performance on deeper networks
- Test computational scalability
5. Alternative Applications
- Image classification (MNIST, CIFAR-10)
- Regression tasks (housing prices, stock prediction)
- Multi-class classification
First-order methods (SGD, Adam):
- Use only gradient information: ∇f(x)
- Update: x_{k+1} = x_k - α∇f(x_k)
- Linear convergence near optimum
Second-order methods (Newton):
- Use gradient and curvature: ∇f(x), ∇²f(x)
- Update: x_{k+1} = x_k - [∇²f(x_k)]^(-1) ∇f(x_k)
- Quadratic convergence near optimum
- Traditional bottleneck: O(N³) Hessian inversion
Trust-Region Newton:
- Avoids Hessian inversion through subproblem solving
- Adaptive step sizing via trust-region radius
- Guaranteed descent with sufficient model quality
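To make the convergence contrast concrete, here is a toy comparison on a 2-D quadratic, where a single Newton step reaches the minimizer exactly while gradient descent only approaches it (illustrative, not part of the project code):

```python
import torch

A = torch.tensor([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive-definite Hessian
b = torch.tensor([1.0, 2.0])
x_star = torch.linalg.solve(A, b)            # true minimizer of 0.5 x^T A x - b^T x

def grad(x):
    return A @ x - b

# Gradient descent: linear convergence, many small steps
x = torch.zeros(2)
for _ in range(50):
    x = x - 0.1 * grad(x)
print("GD error after 50 steps:  ", (x - x_star).norm().item())

# Newton: solves the quadratic exactly in one step
x = torch.zeros(2)
x = x - torch.linalg.solve(A, grad(x))
print("Newton error after 1 step:", (x - x_star).norm().item())
```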
The key insight is that we never need the full Hessian matrix B_k. We only need to compute products of the form B_k v for various vectors v during CG iterations.
Using automatic differentiation:
∇(∇f · v) = [∇²f] v = Hv
This can be computed with one forward pass and one backward pass, making it only ~2× the cost of a gradient computation instead of N× (where N is the number of parameters).
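A quick numerical check of this identity on a small function; the full Hessian is built explicitly here only for verification, something the optimizer itself never does:

```python
import torch
from torch.autograd.functional import hessian

def f(w):
    return torch.sin(w).sum() + 0.5 * (w @ w) ** 2

w = torch.randn(5, dtype=torch.double, requires_grad=True)
v = torch.randn(5, dtype=torch.double)

# Hv via two backward passes: differentiate the scalar (grad f · v)
(g,) = torch.autograd.grad(f(w), w, create_graph=True)
(hv,) = torch.autograd.grad(g @ v, w)

# Reference: explicitly build the 5x5 Hessian (O(N^2), for checking only)
H = hessian(f, w)
print(torch.allclose(hv, H @ v))  # True
```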
Current Implementation:
- Single dataset (MLB June 2024)
- Fixed network architecture
- No hyperparameter tuning for Newton method
- Limited comparison (only vs. vanilla SGD)
Newton TR-CG Challenges:
- More complex implementation than SGD
- Hyperparameters (initial Δ, expansion/contraction factors) require tuning
- CG subproblem may require many iterations
- Memory overhead for CG solver state
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer New York.
  - Theoretical foundation for neural networks and optimization
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization. Springer.
  - Trust-region methods and conjugate gradient algorithms
- Cornell Optimization Wiki - Trust-Region Methods
- PyBaseball Library - MLB Statcast Data Access
This project is available under the MIT License.
Ryan Lutz
ryanjohnlutz@gmail.com
Duke University - Mechanical Engineering and Materials Science
GitHub
This work was completed as a final project for ME 555 (Numerical Optimization) at Duke University, demonstrating the practical application of advanced optimization theory to machine learning problems.
Acknowledgments: Special thanks to the ME 555 course staff for guidance on trust-region methods and the PyBaseball community for providing accessible MLB data.