Homerun Predictor: Neural Networks with Newton Trust-Region Optimization

A custom implementation of a matrix-free Newton Trust-Region optimizer with Conjugate Gradient (Steihaug) for neural network training, applied to MLB home run prediction. This project demonstrates that second-order optimization methods can achieve 4.3× faster training than SGD while maintaining comparable accuracy.

Author: Ryan Lutz (Duke University, ME 555 - Numerical Optimization)

Overview

Neural networks typically rely on first-order optimization methods (SGD, Adam) that use only gradient information. This project implements a second-order Newton Trust-Region optimizer that leverages curvature information through Hessian-vector products, avoiding the O(N²) computational cost of computing the full Hessian matrix.

Key Innovation: Matrix-free Newton method using Hessian-vector products computed in O(N) time through automatic differentiation, enabling practical second-order optimization for neural networks.

Problem Statement

Task: Binary classification of batted balls in MLB games

  • Input: 10 Statcast features (launch speed, launch angle, pitch characteristics, etc.)
  • Output: Home run (1) or other outcome (0)
  • Challenge: Train a neural network using custom second-order optimization instead of standard first-order methods

Methodology

Neural Network Architecture

Structure: 4-layer feedforward network (see the code sketch after this list)

  • Input layer: 10 features
  • Hidden layer 1: 64 neurons (ReLU activation)
  • Hidden layer 2: 32 neurons (ReLU activation)
  • Output layer: 1 neuron (Sigmoid activation)
  • Total parameters: 2,817 weights
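
A minimal PyTorch sketch of this architecture (class and variable names are illustrative, not taken from the repository's code):

```python
import torch
import torch.nn as nn

class HomerunNet(nn.Module):
    """10 -> 64 -> 32 -> 1 feedforward classifier (2,817 parameters)."""

    def __init__(self, n_features: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),  # 10*64 + 64 = 704 weights + biases
            nn.ReLU(),
            nn.Linear(64, 32),          # 64*32 + 32 = 2,080
            nn.ReLU(),
            nn.Linear(32, 1),           # 32*1 + 1 = 33
            nn.Sigmoid(),               # probability that the batted ball is a home run
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = HomerunNet()
print(sum(p.numel() for p in model.parameters()))  # 2817
```

The printed parameter count matches the 2,817 weights listed above.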

Dataset

Source: MLB Statcast data via PyBaseball library
Period: June 2024 MLB season
Size: ~14,000 samples
Features: 10 engineered parameters

  • Release Speed: pitch velocity (mph)
  • Batter Stance: right- or left-handed
  • Pitcher Arm: right- or left-handed
  • Batter Advantage: count favorability (1 = ahead, 0 = behind/tied)
  • Launch Speed: exit velocity off the bat (mph)
  • Launch Angle: vertical departure angle (degrees)
  • Effective Speed: effective velocity at contact (mph)
  • Spin Rate: ball rotation (rpm)
  • Pitch Number: pitch count in the at-bat
  • Pitch Type: fastball / breaking / offspeed / other
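
As a rough sketch of how such a dataset can be pulled with PyBaseball (the repository's preprocessing scripts may select and engineer columns differently; the column names below are standard Statcast fields):

```python
from pybaseball import statcast

# Pitch-level Statcast data for June 2024
raw = statcast(start_dt="2024-06-01", end_dt="2024-06-30")

# Keep batted balls (rows with a measured launch) and label home runs
batted = raw.dropna(subset=["launch_speed", "launch_angle"]).copy()
batted["is_home_run"] = (batted["events"] == "home_run").astype(int)

# Raw columns underlying the 10 engineered features listed above
cols = ["release_speed", "stand", "p_throws", "launch_speed", "launch_angle",
        "effective_speed", "release_spin_rate", "pitch_number", "pitch_type"]
print(batted[cols + ["is_home_run"]].head())
```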

Newton Trust-Region CG Optimizer

Core Algorithm: Trust-Region method with Conjugate Gradient subproblem solver

At each iteration k, solve the trust-region subproblem:

min_{p ∈ ℝⁿ} m_k(p) = f_k + g_k^T p + (1/2) p^T B_k p
subject to: ||p|| ≤ Δ_k

Where:

  • f_k = current loss
  • g_k = gradient
  • B_k = Hessian approximation
  • Δ_k = trust-region radius

Key Innovation - Hessian-Free Implementation:

Instead of computing the full Hessian B_k (O(N²) operations), we compute Hessian-vector products v^T H in O(N) time using the relationship:

v^T H = v^T ∇(∇E) = ℛ{∇E}

Where ℛ{·} ≡ v^T ∇(·) denotes the directional derivative operator along v. This is implemented through:

  1. Forward propagation with ℛ operator (here v_ji and v_kj denote the components of the vector v corresponding to the weights w_ji and w_kj):

    ℛ{a_j} = Σ_i v_ji x_i
    ℛ{z_j} = h'(a_j) ℛ{a_j}
    ℛ{y_k} = Σ_j w_kj ℛ{z_j} + Σ_j v_kj z_j
    
  2. Backward propagation with ℛ operator:

    ℛ{δ_k} = ℛ{y_k}
    ℛ{δ_j} = h''(a_j) ℛ{a_j} Σ_k w_kj δ_k + h'(a_j) Σ_k v_kj δ_k + h'(a_j) Σ_k w_kj ℛ{δ_k}
    
  3. Hessian-vector product elements:

    ℛ{∂E/∂w_kj} = ℛ{δ_k} z_j + δ_k ℛ{z_j}
    ℛ{∂E/∂w_ji} = x_i ℛ{δ_j}
    

CG-Steihaug Truncation: The conjugate gradient iteration on the subproblem is truncated when either (see the sketch below):

  • The trust-region boundary is reached: the trial step would give ||p|| ≥ Δ_k
  • Negative curvature is detected along the current search direction d: d^T B_k d ≤ 0
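
A compact sketch of the CG-Steihaug inner loop, written against a generic hvp(v) callback that returns B_k v; this is an illustrative rendering of the textbook algorithm, not the repository's implementation:

```python
import torch

def cg_steihaug(grad, hvp, delta, tol=1e-6, max_iter=250):
    """Approximately minimize m(p) = g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    p = torch.zeros_like(grad)
    r = grad.clone()          # residual r = B p + g (equals g at p = 0)
    d = -r                    # first search direction: steepest descent
    if r.norm() < tol:
        return p
    for _ in range(max_iter):
        Bd = hvp(d)
        dBd = torch.dot(d, Bd)
        if dBd <= 0:
            # Negative curvature: follow d out to the trust-region boundary
            return _to_boundary(p, d, delta)
        alpha = torch.dot(r, r) / dBd
        p_next = p + alpha * d
        if p_next.norm() >= delta:
            # Step would leave the trust region: truncate at the boundary
            return _to_boundary(p, d, delta)
        r_next = r + alpha * Bd
        if r_next.norm() < tol:
            return p_next
        beta = torch.dot(r_next, r_next) / torch.dot(r, r)
        p, r, d = p_next, r_next, -r_next + beta * d
    return p

def _to_boundary(p, d, delta):
    """Return p + tau*d with ||p + tau*d|| = delta and tau >= 0."""
    a = torch.dot(d, d)
    b = 2 * torch.dot(p, d)
    c = torch.dot(p, p) - delta ** 2
    tau = (-b + torch.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
    return p + tau * d
```

The two early exits correspond to the boundary and negative-curvature truncation rules listed above.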

Adaptive Trust-Region: The radius Δ_k is adjusted based on the ratio of actual to predicted reduction (a sketch of this update follows the list):

ρ_k = [f(x_k) - f(x_k + p_k)] / [m_k(0) - m_k(p_k)]

  • If ρ_k ≤ 0: Reject the step, decrease Δ_k
  • If ρ_k ≈ 1: Accept the step, expand Δ_k
  • Otherwise: Accept the step, maintain Δ_k
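
An outer-loop sketch of this radius update, written against hypothetical loss_fn / grad_fn / hvp_fn callbacks and reusing the cg_steihaug sketch above; the 0.25/0.75 thresholds and the doubling factor are common textbook defaults, not values taken from the repository:

```python
import torch

def trust_region_step(x, loss_fn, grad_fn, hvp_fn, delta, delta_max=10.0):
    """One Newton TR-CG outer iteration: solve the subproblem, then adapt the radius."""
    g = grad_fn(x)
    p = cg_steihaug(g, lambda v: hvp_fn(x, v), delta)   # inner solver from the sketch above
    predicted = -(torch.dot(g, p) + 0.5 * torch.dot(p, hvp_fn(x, p)))  # m_k(0) - m_k(p_k)
    actual = loss_fn(x) - loss_fn(x + p)                               # f(x_k) - f(x_k + p_k)
    rho = actual / predicted
    if rho <= 0:
        return x, 0.25 * delta                      # poor model fit: reject step, shrink radius
    if rho > 0.75 and p.norm() >= 0.99 * delta:
        return x + p, min(2.0 * delta, delta_max)   # very good fit at the boundary: expand
    return x + p, delta                             # acceptable fit: keep the radius
```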

Results

Performance Comparison

Training Time (100 epochs):

  • Newton TR-CG: 4.42 seconds (4.3× faster)
  • SGD: 18.83 seconds (baseline)

Accuracy Metrics:

  • Accuracy: Newton TR-CG 95.6% vs. SGD 96.2% (0.6 percentage points lower)
  • L2 Error: Newton TR-CG 22.432 vs. SGD 19.3 (+16.2%)
  • MSE: Newton TR-CG 0.0362 vs. SGD 0.2685 (-86.5%)

Convergence Behavior

Newton TR-CG:

  • Rapid initial convergence (first 10 epochs)
  • Noticeable fluctuations due to adaptive trust-region sizing
  • Final loss: ~0.07

SGD:

  • Smooth, monotonic decrease
  • Consistent convergence throughout training
  • Final loss: ~0.01

Key Findings

  1. Computational Efficiency: Newton TR-CG achieves 4.3× speedup over SGD by exploiting second-order curvature information through efficient Hessian-vector products

  2. Comparable Accuracy: Only 0.6 percentage point accuracy difference (95.6% vs 96.2%), demonstrating that second-order methods can match first-order performance

  3. Convergence Characteristics:

    • Newton TR-CG shows faster early convergence but more oscillations
    • SGD exhibits smoother, more stable convergence
    • Trust-region adaptation causes characteristic fluctuations in the Newton method's loss curve
  4. Practical Tradeoff: For larger-scale problems, the 4.3× speedup may justify the slight accuracy decrease, especially when training time is a bottleneck

Implementation Details

File Structure

Homerun-Predictor/
├── NewtonTR_project.py          # Main Newton TR-CG implementation
├── NN_with_TR.py                # Neural network with Trust-Region optimizer
├── Num_Opt_SGD.py               # SGD baseline implementation
├── Preprocessing_baseball.py    # Statcast data preprocessing
├── baseball_processing.py       # Feature engineering
└── README.md

Dependencies

pip install torch numpy pandas
pip install pybaseball  # For MLB Statcast data

Usage

Train with Newton TR-CG:

python NewtonTR_project.py

Train with SGD baseline:

python Num_Opt_SGD.py

Preprocess data:

python Preprocessing_baseball.py

Technical Highlights

Mathematical Foundation

Trust-Region Formulation:

  • Quadratic model approximation of loss landscape
  • Adaptive region sizing based on model quality
  • Guaranteed convergence to local minimum

Hessian-Free Optimization:

  • O(N) Hessian-vector products vs. O(N²) full Hessian
  • Enables second-order methods for large-scale problems
  • Leverages PyTorch automatic differentiation

CG-Steihaug Solver:

  • Iteratively solves trust-region subproblem
  • Truncates at boundary or negative curvature
  • Combines benefits of Newton's method with trust-region safety

Activation Functions

ReLU (Hidden Layers):

f(x) = max(0, x)
  • Computationally efficient
  • Maintains non-linearity
  • Mitigates vanishing gradient

Sigmoid (Output Layer):

f(x) = 1 / (1 + e^(-x))
  • Maps to [0,1] range
  • Ideal for binary classification
  • Provides probabilistic interpretation
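
A quick numerical check of both formulas (purely illustrative):

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0])
print(torch.relu(x))     # tensor([0., 0., 3.])            -> max(0, x) elementwise
print(torch.sigmoid(x))  # ~tensor([0.1192, 0.5, 0.9526])  -> always inside (0, 1)
```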

Applications

This second-order optimization approach is particularly valuable for:

When to Use Newton TR-CG

  • Limited training budget: When faster convergence matters more than final accuracy
  • Expensive forward passes: When gradient computation dominates training time
  • Small to medium networks: Where Hessian-vector products remain tractable
  • Well-conditioned problems: Where second-order information improves convergence

When to Use SGD

  • Very large networks: Where even O(N) Hessian-vector products become expensive
  • Maximum accuracy required: When that extra 0.5-1% matters
  • Noisy gradients: When mini-batch variance dominates
  • Established pipelines: When using proven, battle-tested optimizers

Extensions and Future Work

1. Quasi-Newton Methods

  • Implement L-BFGS with limited memory Hessian approximation
  • Compare computational efficiency with full Newton TR-CG
  • Evaluate accuracy-speed tradeoffs

2. Stochastic Newton Methods

  • Extend to mini-batch training with subsampled Hessians
  • Implement variance reduction techniques
  • Balance per-iteration cost with convergence speed

3. Hybrid Approaches

  • Warm start with SGD, finish with Newton TR-CG
  • Switch optimizers based on gradient norm
  • Adaptive optimizer selection

4. Larger Datasets

  • Scale to full MLB season (150,000+ samples)
  • Evaluate performance on deeper networks
  • Test computational scalability

5. Alternative Applications

  • Image classification (MNIST, CIFAR-10)
  • Regression tasks (housing prices, stock prediction)
  • Multi-class classification

Theoretical Background

Second-Order Optimization

First-order methods (SGD, Adam):

  • Use only gradient information: ∇f(x)
  • Update: x_{k+1} = x_k - α∇f(x_k)
  • Linear convergence near optimum

Second-order methods (Newton):

  • Use gradient and curvature: ∇f(x), ∇²f(x)
  • Update: x_{k+1} = x_k - [∇²f(x_k)]^(-1) ∇f(x_k)
  • Quadratic convergence near optimum
  • Traditional bottleneck: O(N³) Hessian inversion
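
As a toy illustration of these two update rules (for intuition only, not part of the project code), a single Newton step lands exactly on the minimizer of a quadratic, while fixed-rate gradient descent approaches it gradually:

```python
import torch

# Quadratic f(x) = 0.5 x^T A x - b^T x, whose minimizer is x* = A^{-1} b
A = torch.tensor([[3.0, 0.5], [0.5, 1.0]])
b = torch.tensor([1.0, -2.0])
grad = lambda x: A @ x - b          # gradient of f; A is also the (constant) Hessian

x0 = torch.zeros(2)
x_newton = x0 - torch.linalg.solve(A, grad(x0))   # one Newton step: exactly A^{-1} b

x_gd = torch.zeros(2)
for _ in range(100):                # many first-order steps with a fixed learning rate
    x_gd = x_gd - 0.1 * grad(x_gd)

print(x_newton)  # the exact minimizer
print(x_gd)      # close, but only after 100 iterations
```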

Trust-Region Newton:

  • Avoids Hessian inversion through subproblem solving
  • Adaptive step sizing via trust-region radius
  • Guaranteed descent with sufficient model quality

Why Hessian-Vector Products Work

The key insight is that we never need the full Hessian matrix B_k. We only need to compute products of the form B_k v for various vectors v during CG iterations.

Using automatic differentiation:

∇(∇f · v) = [∇²f] v = Hv

This can be computed with one forward pass and one backward pass, making it only ~2× the cost of a gradient computation instead of N× (where N is the number of parameters).
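
A minimal sketch of this identity using PyTorch's automatic differentiation (double backward); the helper name and the toy function are illustrative:

```python
import torch

def hessian_vector_product(loss, params, v):
    """Compute H v without ever forming H, via the gradient of (gradient . v)."""
    g = torch.autograd.grad(loss, params, create_graph=True)   # gradient, kept on the graph
    g_dot_v = sum((gi * vi).sum() for gi, vi in zip(g, v))     # scalar g . v
    return torch.autograd.grad(g_dot_v, params)                # d(g . v)/dparams = H v

# Tiny check on f(x) = x1^2 * x2, whose Hessian is [[2*x2, 2*x1], [2*x1, 0]]
x = torch.tensor([1.0, 2.0], requires_grad=True)
f = x[0] ** 2 * x[1]
v = (torch.tensor([1.0, 0.0]),)
print(hessian_vector_product(f, (x,), v))   # first Hessian column: tensor([4., 2.])
```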

Limitations

Current Implementation:

  • Single dataset (MLB June 2024)
  • Fixed network architecture
  • No hyperparameter tuning for Newton method
  • Limited comparison (only vs. vanilla SGD)

Newton TR-CG Challenges:

  • More complex implementation than SGD
  • Hyperparameters (initial Δ, expansion/contraction factors) require tuning
  • CG subproblem may require many iterations
  • Memory overhead for CG solver state

References

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer New York.

    • Theoretical foundation for neural networks and optimization
  2. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization. Springer.

    • Trust-region methods and conjugate gradient algorithms
  3. Cornell Optimization Wiki - Trust-Region Methods

  4. PyBaseball Library - MLB Statcast Data Access

License

This project is available under the MIT License.

Contact

Ryan Lutz
ryanjohnlutz@gmail.com
Duke University - Mechanical Engineering and Materials Science
GitHub

This work was completed as a final project for ME 555 (Numerical Optimization) at Duke University, demonstrating the practical application of advanced optimization theory to machine learning problems.


Acknowledgments: Special thanks to the ME 555 course staff for guidance on trust-region methods and the PyBaseball community for providing accessible MLB data.
