Skip to content

High‑performance compilation of statistical formulas to zero‑allocation evaluators with FD/AD derivative support, serving as a stable computational foundation for post-estimation analysis and simulation with formula-based linear models in Julia.

License

Notifications You must be signed in to change notification settings

emfeltham/FormulaCompiler.jl

Repository files navigation

FormulaCompiler.jl

CI Documentation Docs Docs

Efficient per-row model matrix evaluation for Julia statistical models using position-mapped compilation and compile-time specialization.

Key Features

  • Memory efficiency: Optimized evaluation approach minimizes memory allocation during computation
  • Performance: Faster single-row evaluation than building full model matrices
  • Compatibility: Supports StatsModels.jl formulas, including interactions and transformations
  • Scenario analysis: Memory-efficient variable override system for counterfactual analysis
  • Unified architecture: Single compilation pipeline accommodates diverse formula structures
  • Ecosystem integration: Compatible with GLM.jl, MixedModels.jl, and StandardizedPredictors.jl

How It Works

Fit a model, convert data to a column table, compile once, then evaluate rows quickly without allocations.

Installation

using Pkg
Pkg.add(url="https://github.com/emfeltham/FormulaCompiler.jl")

Quick Start

using FormulaCompiler, GLM, DataFrames, Tables

# Fit your model normally
df = DataFrame(
    y = randn(1000),
    x = randn(1000),
    z = abs.(randn(1000)) .+ 0.1,
    group = categorical(rand(["A", "B", "C"], 1000))
)

model = lm(@formula(y ~ x * group + log(z)), df)

# Compile once for efficient repeated evaluation  
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Memory-efficient evaluation suitable for repeated calls
compiled(row_vec, data, 1)  # Zero allocations after warmup

Core Interfaces

Optimized Interface (Recommended for Performance-Critical Applications)

# Pre-compile for optimal performance
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Evaluate individual rows
compiled(row_vec, data, 1)    # Row 1
compiled(row_vec, data, 100)  # Row 100

# Evaluate multiple rows
matrix = Matrix{Float64}(undef, 10, length(compiled))
modelrow!(matrix, compiled, data, 1:10)

Convenient Interface

# Single row (allocating)
row_values = modelrow(model, data, 1)

# Multiple rows (allocating)
matrix = modelrow(model, data, [1, 5, 10, 50])

# In-place with automatic caching
row_vec = Vector{Float64}(undef, size(modelmatrix(model), 2))
modelrow!(row_vec, model, data, 1)  # Uses cache

Object-Based Interface

# Create evaluator object
evaluator = ModelRowEvaluator(model, df)

# Efficient evaluation
result = evaluator(1)           # Row 1
evaluator(row_vec, 1)          # In-place evaluation

Advanced Scenario Analysis

Create data scenarios with variable overrides for policy analysis and counterfactuals:

data = Tables.columntable(df)

# Create policy scenarios
baseline = create_scenario("baseline", data)
treatment = create_scenario("treatment", data; 
    treatment = true,
    dose = 100.0
)
policy_change = create_scenario("policy", data;
    x = mean(df.x),           # Set to population mean
    group = "A",              # Override categorical
    regulatory = true         # Add policy variable
)

# Evaluate scenarios
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

compiled(row_vec, baseline.data, 1)      # Original data
compiled(row_vec, treatment.data, 1)     # With treatment
compiled(row_vec, policy_change.data, 1) # Policy scenario

Scenario Grids

Generate comprehensive scenario combinations:

# Create all combinations
policy_grid = create_scenario_grid("policy_analysis", data, Dict(
    :treatment => [false, true],
    :dose => [50.0, 100.0, 150.0],
    :region => ["North", "South"]
))

# Evaluates 2×3×2 = 12 scenarios
results = Matrix{Float64}(undef, length(policy_grid), length(compiled))
for (i, scenario) in enumerate(policy_grid)
    compiled(view(results, i, :), scenario.data, 1)
end

Dynamic Scenario Modification

scenario = create_scenario("dynamic", data; x = 1.0)

# Modify scenarios iteratively
set_override!(scenario, :y, 100.0)           # Add override
update_scenario!(scenario; x = 2.0, z = 0.5) # Bulk update  
remove_override!(scenario, :y)               # Remove override

Ecosystem Integration

GLM.jl Models

using GLM

# Linear models
linear_model = lm(@formula(mpg ~ hp * cyl + log(wt)), mtcars)
data_mtcars = Tables.columntable(mtcars)
compiled = compile_formula(linear_model, data_mtcars)

# Generalized linear models
logit_model = glm(@formula(vs ~ hp + wt), mtcars, Binomial(), LogitLink())
compiled_logit = compile_formula(logit_model, data_mtcars)

# Both work identically
row_vec = Vector{Float64}(undef, length(compiled))
compiled(row_vec, data_mtcars, 1)

MixedModels.jl Integration

Automatically extracts fixed effects from mixed models:

using MixedModels

# Mixed model with random effects
mixed_model = fit(MixedModel, @formula(y ~ x + z + (1|group) + (1+x|cluster)), df)

# Compiles only the fixed effects: y ~ x + z
compiled = compile_formula(mixed_model, Tables.columntable(df))

# Random effects are automatically stripped
fixed_form = fixed_effects_form(mixed_model)  # Returns: y ~ x + z

StandardizedPredictors.jl Integration

Standardized predictors are integrated (currently only ZScore()):

using StandardizedPredictors

contrasts = Dict(:x => ZScore(), :z => ZScore())
model = lm(@formula(y ~ x + z + group), df, contrasts=contrasts)
compiled = compile_formula(model, Tables.columntable(df))  # Standardization built-in

Derivatives and Marginal Effects

FormulaCompiler provides memory-efficient computation of derivatives and marginal effects with standard errors:

# Build derivative evaluator
vars = [:x, :z]
de = build_derivative_evaluator(compiled, data; vars=vars)
β = coef(model)

# Single-row marginal effect gradient (η case)= Vector{Float64}(undef, length(compiled))
me_eta_grad_beta!(gβ, de, β, 1, :x)  # Minimal memory allocation

# Standard error via delta method
Σ = vcov(model)
se = delta_method_se(gβ, Σ)  # Efficient computation

# Average marginal effects with backend selection
rows = 1:100
gβ_ame = Vector{Float64}(undef, length(compiled))
accumulate_ame_gradient!(gβ_ame, de, β, rows, :x; backend=:fd)  # Memory-efficient computation
se_ame = delta_method_se(gβ_ame, Σ)

println("AME standard error for x: ", se_ame)

Key capabilities:

  • Dual backends: :fd (memory-efficient) and :ad (higher numerical accuracy)
  • η and μ cases: Linear predictors and link function transformations
  • Delta method: Standard error computation for marginal effects
  • Validated implementation: Cross-validated against reference implementations

Advanced Features

Memory Efficiency

The scenario system employs OverrideVector for efficient data representation:

# Traditional approach: memory allocation for large datasets
traditional = fill(42.0, 1_000_000)  # ~8MB allocation

# FormulaCompiler approach: reduced memory overhead
efficient = OverrideVector(42.0, 1_000_000)  # ~32 bytes allocation

# Identical computational interface
traditional[500_000] == efficient[500_000]  # true

Supported Formula Features

  • Basic terms: x, log(z), x^2, (x > 0)
  • Boolean variables: Vector{Bool} treated as continuous (false → 0.0, true → 1.0)
  • Categorical variables: All contrast types, ordered/unordered
  • Interactions: x * group, x * y * z, log(z) * group
  • Functions: log, exp, sqrt, sin, cos, abs, ^, boolean operators
  • Complex formulas: x * log(z) * group + sqrt(abs(y)) + (x > mean(x))
  • Standardized predictors: ZScore, custom transformations
  • Mixed models: Automatic fixed-effects extraction

Performance Tips

  1. Pre-compile formulas for repeated evaluation:

    data = Tables.columntable(df)
    compiled = compile_formula(model, data)  # Do once
    # Then call compiled() millions of times
  2. Use column-table format for best performance:

    data = Tables.columntable(df)  # Convert once, reuse many times
  3. Pre-allocate output vectors:

    row_vec = Vector{Float64}(undef, length(compiled))  # Reuse across calls
  4. Batch operations when possible:

    # Better: batch evaluation
    matrix = Matrix{Float64}(undef, 1000, length(compiled))
    modelrow!(matrix, compiled, data, 1:1000)
    
    # Avoid: many single evaluations with allocation
    results = [modelrow(model, data, i) for i in 1:1000]

Performance Characteristics

Comparative performance evaluation across formula types:

using BenchmarkTools

# Traditional approach (full model matrix construction)
@benchmark modelmatrix(model)[1, :]
# Note: Constructs entire model matrix, computationally intensive for large datasets

# FormulaCompiler optimized approach
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

@benchmark compiled(row_vec, data, 1)
# Performance improvement with reduced memory allocation

Derivative computation:

  • ForwardDiff-based operations involve some per-call allocations due to automatic differentiation requirements
  • Finite difference backend provides alternative with validation against automatic differentiation results
  • Marginal effects computations utilize preallocated buffers

The automatic differentiation backend for batch gradient operations (accumulate_ame_gradient!) involves allocations. Users with strict memory constraints should utilize the :fd backend, which provides mathematically equivalent results with reduced memory overhead.

Architecture

FormulaCompiler achieves efficiency through a unified compilation pipeline that transforms statistical formulas into specialized, type-stable execution paths:

  • Position mapping: Operations utilize compile-time position specialization
  • Adaptive dispatch: Threshold-based approach (≤25 operations: recursive dispatch, >25 operations: generated functions) chosen based on Julia's compilation limits
  • Unified design: Single compilation system accommodates diverse formula structures without special-case handling

Use Cases

  • Monte Carlo simulations: Millions of model evaluations
  • Bootstrap resampling: Repeated matrix construction
  • Marginal effects: Numerical derivatives require many evaluations
  • Policy analysis: Evaluate many counterfactual scenarios
  • Real-time applications: Low-latency prediction serving
  • Large-scale inference: Memory-efficient batch processing

Contributing

Contributions are welcome!

Related Packages

Documentation

  • DIAGRAMS.md: Complete visual guide with system architecture, technical implementation, and usage workflows
  • categorical_handling.md: Detailed explanation of categorical variable and interaction handling
  • docs/diagrams/: Individual diagram files for embedding in documentation

License

MIT License. See LICENSE for details.

About

High‑performance compilation of statistical formulas to zero‑allocation evaluators with FD/AD derivative support, serving as a stable computational foundation for post-estimation analysis and simulation with formula-based linear models in Julia.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published