Commit 12dbc5c

Merge pull request #18 from apoorvalal/feature/estimator_classes
Add estimator API, full extras, and Quarto docs
2 parents 75184d9 + 1543409

29 files changed: +6677 -132 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -166,3 +166,9 @@ cython_debug/
 .DS_Store
 .claude/settings.local.json
 tmp.md
+ensmallen/
+docs/_site/
+.quarto/
+synthlearners_repo/
+docs/reference/
+docs/objects.json

ECONOMETRICS_ML_ROADMAP.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
+# Econometrics and Supervised Learning Roadmap
+
+This document collects proposed functionality expansions for `pyensmallen`, based on the existing notebooks and current API surface.
+
+## First Tranche
+
+The first set of items to prioritize:
+
+1. Estimator classes for common supervised models
+2. First-class regularization support
+3. Proper stochastic / mini-batch training support
+
+These are the highest-leverage additions for making `pyensmallen` useful beyond optimizer demos and low-level objective wrappers.
+
+## Full Proposal List
+
+### 1. Estimator classes for common supervised models
+
+Add estimator APIs for standard econometrics and ML models:
+
+- `LinearRegression`
+- `LogisticRegression`
+- `PoissonRegression`
+- `MultinomialLogit`
+- `Probit`
+- `NegativeBinomial`
+- optionally `CoxPH`
+
+Each estimator should expose a workflow-level API:
+
+- `fit`
+- `predict`
+- `predict_proba` where applicable
+- `score`
+- fitted coefficients and intercept
+- convergence diagnostics
+- optional standard errors and summaries
+
+Rationale:
+The current API is objective-first. Real workflows usually want model objects, not raw closures.
+
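To make the proposed workflow-level surface concrete, here is a minimal sketch of what a `fit` / `predict` / `score` estimator could look like. The class name `SketchLinearRegression` and its attributes are hypothetical and are not taken from the commit; the sketch also solves least squares with NumPy rather than an ensmallen optimizer, so it illustrates only the shape of the API.

```python
import numpy as np


class SketchLinearRegression:
    """Hypothetical estimator showing a fit / predict / score workflow.

    Assumption: this is NOT pyensmallen's implementation. It uses
    numpy.linalg.lstsq in place of an ensmallen optimizer.
    """

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def _design(self, X):
        # Prepend a column of ones when an intercept is requested.
        X = np.asarray(X, dtype=float)
        if self.fit_intercept:
            return np.column_stack([np.ones(X.shape[0]), X])
        return X

    def fit(self, X, y):
        Z = self._design(X)
        beta, *_ = np.linalg.lstsq(Z, np.asarray(y, dtype=float), rcond=None)
        if self.fit_intercept:
            self.intercept_, self.coef_ = beta[0], beta[1:]
        else:
            self.intercept_, self.coef_ = 0.0, beta
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_

    def score(self, X, y):
        # Coefficient of determination (R^2) on the supplied data.
        y = np.asarray(y, dtype=float)
        resid = y - self.predict(X)
        return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + 0.1 * rng.normal(size=200)

model = SketchLinearRegression().fit(X, y)
r2 = model.score(X, y)
```

Returning `self` from `fit` allows the chained `Estimator().fit(X, y)` call used above, which is a common convention for model objects.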
+### 2. First-class regularization support
+
+Add penalized estimation support across core models:
+
+- L1
+- L2
+- elastic net
+- regularization paths
+- cross-validated penalty selection
+
+This should work naturally with existing constrained optimization ideas already present in the package.
+
+Rationale:
+This is central to both supervised learning and modern econometrics, especially in high-dimensional settings.
+
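As an illustration of the penalties listed above, the following sketch writes the elastic net as a single function that reduces to L1 at `l1_ratio=1` and to L2 at `l1_ratio=0`, and traces a ridge regularization path in closed form. The function names and the `lam` / `l1_ratio` parameterization are assumptions made for this example, not an API from the commit.

```python
import numpy as np


def elastic_net_penalty(beta, lam, l1_ratio):
    """lam * (l1_ratio * ||beta||_1 + (1 - l1_ratio) * 0.5 * ||beta||_2^2).

    Hypothetical helper: l1_ratio=1 gives the lasso penalty,
    l1_ratio=0 gives the ridge penalty.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum(beta ** 2)
    return lam * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


def ridge_path(X, y, lambdas):
    """Closed-form ridge solutions for 0.5 * mean(resid^2) + lam * 0.5 * ||beta||^2."""
    n, k = X.shape
    gram, xty = X.T @ X / n, X.T @ y / n
    return np.array(
        [np.linalg.solve(gram + lam * np.eye(k), xty) for lam in lambdas]
    )


beta = np.array([1.0, -2.0, 0.0])
lasso_value = elastic_net_penalty(beta, lam=0.1, l1_ratio=1.0)
ridge_value = elastic_net_penalty(beta, lam=0.1, l1_ratio=0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

lambdas = np.logspace(-3, 1, 5)          # increasing penalty strength
path = ridge_path(X, y, lambdas)         # one coefficient vector per lambda
norms = np.linalg.norm(path, axis=1)     # shrinks as the penalty grows
```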
+### 3. Productized JAX bridge
+
+Turn the current notebook pattern into a supported API:
+
+- `JaxObjective`
+- `AutoDiffObjective`
+- or `AutoDiffEstimator`
+
+The wrapper should accept a JAX loss function and automatically provide:
+
+- objective evaluation
+- gradients
+- shape handling
+- low-boilerplate integration with ensmallen optimizers
+
+Rationale:
+The multinomial logit notebook already shows this is useful. It should be library functionality, not notebook glue code.
+
+### 4. Proper stochastic / mini-batch training support
+
+Expose true separable-objective support for first-order optimizers:
+
+- mini-batch iteration
+- batch indexing
+- data shuffling
+- epoch-level callbacks
+- objective tracking
+- early stopping hooks
+
+This is especially important for:
+
+- large supervised-learning problems
+- neural-style differentiable objectives
+- scalable generalized linear models
+
+Rationale:
+The Adam-family bindings exist, but the current wrapper behaves like full-batch optimization. That limits the ML use case substantially.
+
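The mini-batch ingredients listed in item 4 (shuffling, batch indexing, per-epoch objective tracking, early stopping) can be sketched in plain NumPy as follows. This is an illustrative loop using plain SGD on a least-squares objective, not the ensmallen Adam-family bindings; every function and parameter name here is hypothetical.

```python
import numpy as np


def iterate_minibatches(n_obs, batch_size, rng):
    """Yield index arrays covering one shuffled pass over the data."""
    order = rng.permutation(n_obs)
    for start in range(0, n_obs, batch_size):
        yield order[start:start + batch_size]


def sgd_least_squares(X, y, batch_size=32, lr=0.05, max_epochs=200,
                      tol=1e-8, patience=5, seed=0):
    """Plain mini-batch SGD with epoch-level tracking and early stopping."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    history = []                      # full-data objective after each epoch
    best, stale = np.inf, 0
    for _ in range(max_epochs):
        for idx in iterate_minibatches(X.shape[0], batch_size, rng):
            resid = X[idx] @ beta - y[idx]
            beta -= lr * X[idx].T @ resid / len(idx)   # batch gradient step
        loss = 0.5 * np.mean((X @ beta - y) ** 2)
        history.append(loss)
        if loss < best - tol:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:     # no improvement for `patience` epochs
                break
    return beta, history


rng = np.random.default_rng(1)
true_beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ true_beta + 0.1 * rng.normal(size=500)

beta_hat, history = sgd_least_squares(X, y)
one_epoch = np.concatenate(
    list(iterate_minibatches(10, 3, np.random.default_rng(0)))
)
```

Each epoch reshuffles the data, so every observation is visited exactly once per pass; `history` is the hook where an epoch-level callback would attach.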
+### 5. Inference utilities beyond point estimation
+
+Expand the econometrics side with reusable inference tools:
+
+- sandwich covariance
+- HC0-HC3 robust standard errors
+- clustered standard errors
+- HAC / Newey-West
+- Wald, likelihood-ratio, and score tests
+- delta method
+- marginal effects
+- bootstrap helpers for MLE models
+
+Rationale:
+The package already goes in this direction for GMM. Extending it to MLE models would make it much more useful for empirical work.
+
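For reference, the sandwich covariance and the HC0 / HC1 variants mentioned above have a compact form for OLS: bread `(X'X)^{-1}`, meat `sum_i e_i^2 x_i x_i'`, with HC1 applying the `n / (n - k)` degrees-of-freedom correction. The sketch below is a standalone NumPy illustration; the function name `ols_sandwich` is hypothetical and does not describe how the package computes robust errors.

```python
import numpy as np


def ols_sandwich(X, y, kind="HC1"):
    """OLS coefficients with a heteroskedasticity-robust covariance.

    Hypothetical helper supporting HC0 and HC1 only.
    """
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # Meat: sum over observations of e_i^2 * x_i x_i'.
    meat = (X * resid[:, None] ** 2).T @ X
    cov = bread @ meat @ bread
    if kind == "HC1":
        cov = cov * n / (n - k)       # small-sample correction
    elif kind != "HC0":
        raise ValueError(f"unsupported kind: {kind}")
    return beta, cov


rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Heteroskedastic noise: the error scale grows with |x|.
y = 1.0 + 2.0 * x + (0.5 + np.abs(x)) * rng.normal(size=n)

beta_hat, cov_hc0 = ols_sandwich(X, y, kind="HC0")
_, cov_hc1 = ols_sandwich(X, y, kind="HC1")
robust_se = np.sqrt(np.diag(cov_hc1))
```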
+### 6. Model selection and evaluation tools
+
+Add workflow-level evaluation and tuning utilities:
+
+- train / validation splitting
+- K-fold cross-validation
+- time-series cross-validation
+- standard supervised metrics
+- calibration diagnostics
+- hyperparameter search
+- early stopping support
+
+Metrics should include at least:
+
+- RMSE
+- MAE
+- log loss
+- AUC
+
+Rationale:
+Several notebooks currently hand-roll comparison and tuning logic that should live in the library.
+
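A rough sketch of the kind of utilities item 6 describes is shown below: a shuffled K-fold index generator plus three of the listed metrics. The names `kfold_indices`, `rmse`, `mae`, and `log_loss` are placeholders chosen for this example and are not claimed to match anything in the repository; AUC is omitted to keep the sketch short.

```python
import numpy as np


def kfold_indices(n_obs, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_obs), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))


def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))


def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy; probabilities are clipped away from 0 and 1."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


splits = list(kfold_indices(n_obs=10, n_folds=3))
all_test = np.concatenate([test for _, test in splits])

example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_log_loss = log_loss([1, 0], [0.5, 0.5])
```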
+### 7. Higher-level causal and panel estimators
+
+Potential estimator layer additions include:
+
+- `SyntheticControl`
+- balancing weights estimators
+- ridge-augmented synthetic control
+- matrix-completion synthetic control
+- DiD and event-study estimators
+- IV / 2SLS / LIML
+- doubly robust or orthogonal-score estimators
+
+Rationale:
+This is a natural applied econometrics extension, though a substantial part of this already exists in the sibling `synthlearners` repository.
+
+### 8. Formula and DataFrame ergonomics
+
+Improve usability for empirical workflows:
+
+- formula interface
+- automatic intercept handling
+- categorical encoding
+- missing-data policy
+- sample weights
+- grouped / clustered identifiers
+- pandas-friendly summaries
+
+Rationale:
+Econometrics users often work from tabular data first, not prebuilt dense matrices.
+
+## Suggested Implementation Order
+
+1. Estimator classes for core GLMs
+2. Regularization support
+3. True separable-objective and mini-batch support
+4. Inference utilities for MLE models
+5. Productized JAX autodiff bridge
+6. Evaluation and model-selection utilities
+7. Selective integration points with `synthlearners`
+8. Additional causal and panel estimators only where they belong in this repo
+
+## Repo Boundary
+
+Current working assumption:
+
+- `pyensmallen` should focus on optimization primitives, reusable objectives, supervised estimators, autodiff integration, and inference utilities.
+- `synthlearners` should remain the home for most panel and synthetic-control estimators, while depending on `pyensmallen` where useful.
+

README.md

Lines changed: 67 additions & 5 deletions
@@ -8,6 +8,7 @@ Lightweight python bindings for `ensmallen` library. Currently supports
   - constraints are either lp-ball (lasso, ridge, elastic-net) or simplex
 + (Generalized) Method of Moments estimation with ensmallen optimizers.
   - this uses ensmallen for optimization [and relies on `jax` for automatic differentiation to get gradients and jacobians]. This is the main use case for `pyensmallen` and is the reason for the bindings.
++ Estimator classes for linear, logistic, and Poisson regression with classical and robust inference for unregularized fits

 See [ensmallen docs](https://ensmallen.org/docs.html) for details. The `notebooks/` directory walks through several statistical examples.

@@ -25,15 +26,76 @@ Then,
 __from pypi__

 ```
-pip install pyensmallen
+uv pip install pyensmallen
 ```

 __from source__
 1. Install `armadillo` and `ensmallen` for your system (build from source, or via conda-forge; I went with the latter)
 2. git clone this repository
-3. `pip install -e .`
-4. Profit? Or at least minimize loss?
+3. If you are using `uv`:
+   - `uv pip install --python .venv/bin/python meson meson-python ninja pybind11`
+   - `uv pip install --python .venv/bin/python --no-build-isolation -e .`
+4. If you are using vanilla `pip` in an activated environment:
+   - `python -m pip install meson meson-python ninja pybind11`
+   - `python -m pip install --no-build-isolation -e .`
+5. Profit? Or at least minimize loss?
+
+__full development environment__
+
+To install everything required to run tests and notebooks:
+
+```bash
+uv pip install --python .venv/bin/python meson meson-python ninja pybind11
+uv pip install --python .venv/bin/python --no-build-isolation -e ".[full]"
+```
+
+Vanilla `pip` equivalent:
+
+```bash
+python -m pip install meson meson-python ninja pybind11
+python -m pip install --no-build-isolation -e ".[full]"
+```
+
+The `full` extra includes the Python dependencies used by:
+
+- the test suite
+- GMM and autodiff examples
+- benchmark notebooks
+- plotting and notebook tooling
+
+__documentation__
+
+### doc-generation
+
+The repository includes a Quarto documentation site in `docs/`. The docs are built from three sources:
+
+- hand-written Quarto pages in `docs/*.qmd`
+- generated API reference pages in `docs/reference/*.qmd`, built from Python and pybind11 docstrings with `quartodoc`
+- executed notebook pages in `docs/notebooks/*.ipynb`
+
+Use the render script instead of calling `quarto render` directly:
+
+```bash
+scripts/render_docs.sh
+```
+
+The script does the following:
+
+- uses the repository `.venv` as the Quarto Python runtime
+- forces JAX onto CPU so notebook execution is stable during docs builds
+- copies the tracked notebooks from `notebooks/` into `docs/notebooks/`
+- runs `quartodoc` to regenerate the API reference pages from docstrings
+- runs `quarto render docs` to execute the notebooks and build the site
+
+If you need the full docs toolchain first:
+
+```bash
+uv pip install --python .venv/bin/python meson meson-python ninja pybind11
+uv pip install --python .venv/bin/python --no-build-isolation -e ".[full]"
+```
+
+The rendered site lands in `docs/_site/`. The generated API source pages land in `docs/reference/`.

 __from wheel__
-- download the appropriate `.whl` for your system from the more recent release listed in `Releases` and run `pip install ./pyensmallen...` OR
-- copy the download url and run `pip install https://github.com/apoorvalal/pyensmallen/releases/download/<version>/pyensmallen-<version>-<pyversion>-linux_x86_64.whl`
+- download the appropriate `.whl` for your system from the more recent release listed in `Releases` and run `uv pip install ./pyensmallen...` OR
+- copy the download url and run `uv pip install https://github.com/apoorvalal/pyensmallen/releases/download/<version>/pyensmallen-<version>-<pyversion>-linux_x86_64.whl`

docs/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/.quarto/

docs/_quarto.yml

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+project:
+  type: website
+  output-dir: _site
+  render:
+    - index.qmd
+    - benchmarks.qmd
+    - optimizers.qmd
+    - estimators.qmd
+    - notebooks.qmd
+    - reference/*.qmd
+    - notebooks/example.ipynb
+    - notebooks/banana.ipynb
+    - notebooks/gmm.ipynb
+    - notebooks/autodiff_mnl.ipynb
+    - notebooks/regularization_comparison.ipynb
+
+metadata-files:
+  - reference/_sidebar.yml
+
+website:
+  title: pyensmallen
+  navbar:
+    left:
+      - href: index.qmd
+        text: Home
+      - href: optimizers.qmd
+        text: Optimizers
+      - href: benchmarks.qmd
+        text: Benchmarks
+      - href: estimators.qmd
+        text: Estimators
+      - href: reference/index.qmd
+        text: API
+      - href: notebooks.qmd
+        text: Notebooks
+  page-footer:
+    left: "pyensmallen documentation"
+
+jupyter: python3
+
+execute:
+  enabled: true
+  warning: false
+  error: false
+
+format:
+  html:
+    theme: cosmo
+    css:
+      - styles.css
+      - reference/_styles-quartodoc.css
+    toc: true
+
+quartodoc:
+  package: pyensmallen
+  dir: reference
+  title: API Reference
+  sidebar: reference/_sidebar.yml
+  css: reference/_styles-quartodoc.css
+  parser: numpy
+  dynamic: true
+  sections:
+    - title: Estimators
+      desc: Supervised estimator classes with inference helpers.
+      contents:
+        - LinearRegression
+        - LogisticRegression
+        - PoissonRegression
+    - title: Optimizers
+      desc: Low-level optimizer bindings exposed from ensmallen.
+      contents:
+        - L_BFGS
+        - FrankWolfe
+        - SimplexFrankWolfe
+        - Adam
+        - AdaMax
+        - AMSGrad
+        - OptimisticAdam
+        - Nadam
+    - title: Objectives and GMM
+      desc: Low-level objectives and the GMM estimator interface.
+      contents:
+        - linear_obj
+        - logistic_obj
+        - poisson_obj
+        - EnsmallenEstimator

docs/assets/benchmark_conv.png

405 KB

docs/assets/benchmark_time.png

946 KB
