Refactor formula parsing #1118

leostimpfle · 2025-12-28T10:59:55Z

This is a proof of concept for a refactor of PyFixest's formula parsing. The PR introduces a new module parse that refactors formula parsing from the ground up.

The core logic is implemented in pyfixest.estimation.formula.parse.parse which takes in a formula string and returns a collection of parsed formulas represented by pyfixest.estimation.formula.parse.Formula.
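
To make the input/output shape concrete, here is a minimal sketch of what a parse step of this kind might look like. It is not the PR's implementation: ParsedFormula and toy_parse are hypothetical names, and only a single formula with an optional fixed-effects part and an optional IV part is handled (no multiple-estimation syntax, so the returned list always has one element). Treating '0' as "no fixed effects" is an assumption taken from the commit title "Encode no fixed effects as None instead of '0'".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedFormula:
    """Hypothetical container for illustration; not pyfixest's Formula class."""

    dependent: str
    independent: str
    fixed_effects: str | None = None
    endogenous: str | None = None
    instruments: str | None = None


def toy_parse(formula: str) -> list[ParsedFormula]:
    """Split 'Y ~ X | fe | endog ~ inst' into its parts (illustrative only)."""
    parts = [p.strip() for p in formula.split("|")]
    if len(parts) > 3:
        raise ValueError(f"Too many '|' separators in {formula!r}")
    if parts[0].count("~") != 1:
        raise ValueError("The first part must contain exactly one '~'")
    dependent, independent = (s.strip() for s in parts[0].split("~"))

    fixed_effects = endogenous = instruments = None
    for part in parts[1:]:
        if "~" in part:
            # IV part: exactly one tilde, and only one IV part is allowed.
            if part.count("~") != 1 or endogenous is not None:
                raise ValueError(f"Invalid IV part: {part!r}")
            endogenous, instruments = (s.strip() for s in part.split("~"))
        else:
            # Fixed effects must come before the IV part and appear only once,
            # so 'Y ~ X | f1 | f2' is rejected early.
            if fixed_effects is not None or endogenous is not None:
                raise ValueError(f"Unexpected part without '~': {part!r}")
            # Assumption: '0' means "no fixed effects" and is stored as None.
            fixed_effects = None if part == "0" else part

    return [
        ParsedFormula(dependent, independent, fixed_effects, endogenous, instruments)
    ]
```

For example, toy_parse("Y ~ X1 | f1 | X2 ~ Z1") yields one ParsedFormula with fixed_effects "f1", endogenous "X2" and instruments "Z1", while toy_parse("Y ~ X | f1 | f2") raises a ValueError.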

All references to the old FormulaParser are bypassed, mostly by importing the new class under the old FixestFormula name: from pyfixest.estimation.formula.parse import Formula as FixestFormula.

codecov · 2025-12-28T11:19:16Z

Codecov Report

❌ Patch coverage is 87.34694% with 62 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pyfixest/estimation/formula/factor_interaction.py	64.78%	25 Missing ⚠️
pyfixest/did/saturated_twfe.py	0.00%	11 Missing ⚠️
pyfixest/estimation/formula/model_matrix.py	94.36%	8 Missing ⚠️
pyfixest/estimation/model_matrix_fixest_.py	12.50%	7 Missing ⚠️
pyfixest/estimation/formula/utils.py	33.33%	6 Missing ⚠️
pyfixest/estimation/formula/parse.py	97.42%	5 Missing ⚠️

Flag	Coverage Δ
core-tests	`72.10% <87.34%> (?)`
tests-extended	`?`
tests-vs-r	`18.51% <30.81%> (+0.95%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
pyfixest/did/did2s.py	`89.32% <100.00%> (+71.39%)`	⬆️
pyfixest/errors/__init__.py	`100.00% <100.00%> (ø)`
pyfixest/estimation/FixestMulti_.py	`77.95% <100.00%> (+2.42%)`	⬆️
pyfixest/estimation/FormulaParser.py	`49.50% <100.00%> (-17.17%)`	⬇️
pyfixest/estimation/fegaussian_.py	`86.66% <100.00%> (+26.66%)`	⬆️
pyfixest/estimation/feglm_.py	`79.16% <100.00%> (+52.60%)`	⬆️
pyfixest/estimation/feiv_.py	`86.79% <100.00%> (+66.03%)`	⬆️
pyfixest/estimation/felogit_.py	`88.23% <100.00%> (+35.29%)`	⬆️
pyfixest/estimation/feols_.py	`86.80% <100.00%> (+26.28%)`	⬆️
pyfixest/estimation/feols_compressed_.py	`80.00% <100.00%> (+58.78%)`	⬆️
... and 13 more

... and 30 files with indirect coverage changes

s3alfisc · 2025-12-28T12:14:11Z

After a first look at the code base: much better and much cleaner than before. No fundamental suggestions for improvement from my side. Thank you!

leostimpfle · 2025-12-28T16:31:58Z

Note to self: Look into allowing formulaic's multi-stage formula notation in the parser. For more background, see matthewwardrop/formulaic#108 and matthewwardrop/formulaic#24.

s3alfisc · 2026-01-01T13:36:25Z

@leostimpfle I fixed a bug in the bin() function and adjusted the tests. You can run them via pixi r -e dev pytest tests/test_i.py. They currently give two failures:

      40 passed
       2 failed
         - tests/test_i.py:286 test_factor_x_factor[Y ~ i(f_str, i.g)-Y ~ i(f_str, g)]
         - tests/test_i.py:304 test_factor_x_factor_with_fe[Y ~ i(f_str, i.g) | fe1-Y ~ i(f_str, g) | fe1]
E           AssertionError: Name mismatch:
E               py=['f_str::apple:g::X', 'f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y']
E               r=['f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y', 'f_str::cherry:g::Z']
E           assert ['f_str::appl...na:g::Z', ...] == ['f_str::appl...ry:g::X', ...]
E             
E             At index 0 diff: 'f_str::apple:g::X' != 'f_str::apple:g::Y'
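
The two name lists differ in exactly one entry each, which is easier to see with a quick comparison. The snippet below only reuses the names from the assertion output above; it shows which cell of the interaction each side omits and says nothing about where the cause lies.

```python
# Coefficient names copied from the assertion output above.
py_names = [
    "f_str::apple:g::X", "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y",
]
r_names = [
    "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y", "f_str::cherry:g::Z",
]

# Names present on one side only, in their original order.
only_in_py = [name for name in py_names if name not in r_names]
only_in_r = [name for name in r_names if name not in py_names]

print(only_in_py)  # ['f_str::apple:g::X']
print(only_in_r)   # ['f_str::cherry:g::Z']
```

So the Python side keeps apple:X and omits cherry:Z, while the R side does the opposite.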

leostimpfle · 2026-01-04T15:21:20Z

@s3alfisc I've gone over the PR once more today. The biggest change is that I have rewritten pyfixest.estimation.formula.factor_interaction.factor_interaction to make the implementation (hopefully) easier to follow. More importantly, I have removed the monkey-patch of formulaic, so we no longer need the drop argument discussed in matthewwardrop/formulaic#263.

I've also let Claude improve the docstrings and renamed a few functions/attributes for purely aesthetic reasons. The PR is now ready from my side (although there are many other potential improvements around formula parsing and model matrix construction: #1125, #1126, #1127, #1130).

s3alfisc · 2026-01-04T19:02:38Z

Review in 4 steps:

parse function
ModelMatrix class
new python i() interaction
downstream impact on did2s and saturated_twfe

s3alfisc · 2026-01-04T21:18:36Z

parse

I am still not a big fan of Formula.first_stage and Formula.second_stage not containing fixed effects - potentially misleading to users despite documentation? Maybe we should add Formula.first_stage_no_fixed_effects etc. as extra attributes to make more explicit what type of formula users are dealing with?
Can you specify the reason why FORMULAIC_FEATURE_FLAG is DefaultFormulaParser.FeatureFlags.ALL in several spots in the code base? Why is it needed? Are there potential downsides?
Is the sort argument in parse still needed?

I committed a few changes. I hope all of these make sense to you @leostimpfle and are more or less self-explanatory from the commit messages?

leostimpfle · 2026-01-05T08:58:08Z

I am still not a big fan of Formula.first_stage and Formula.second_stage not containing fixed effects - potentially misleading to users despite documentation? Maybe we should add Formula.first_stage_no_fixed_effects etc. as extra attributes to make more explicit what type of formula users are dealing with?

Agreed that this is somewhat unintuitive. An alternative to changing the attribute names could be to include the encoded fixed effects directly in the formula. For example, instead of formula_kwargs = {'second_stage': 'Y ~ X1', 'fixed_effects': 'f1 + f2'}, we could use formula_kwargs = {'second_stage': 'Y ~ X1 + __fixed_effect__(f1) + __fixed_effect__(f2)'} (where the sentinel __fixed_effect__ indicates the integer encoding of fixed effects). The main point is that the latter formula is what we already pass implicitly to formulaic, so in this approach we should call the attribute second_stage_formulaic.
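
As a rough sketch of that proposal, the combined string could be built from the two existing pieces as follows. The helper name embed_fixed_effects is hypothetical and not part of the PR; only the sentinel name __fixed_effect__ and the example strings come from the comment above, and splitting on '+' assumes additive fixed effects only.

```python
from __future__ import annotations


def embed_fixed_effects(
    second_stage: str,
    fixed_effects: str | None,
    sentinel: str = "__fixed_effect__",
) -> str:
    """Append sentinel-wrapped fixed effects to a formula string (illustrative)."""
    if fixed_effects is None:
        # No fixed effects: the formula is passed through unchanged.
        return second_stage
    # Assumption: fixed effects are a plain '+'-separated list of column names.
    terms = [t.strip() for t in fixed_effects.split("+") if t.strip()]
    wrapped = " + ".join(f"{sentinel}({term})" for term in terms)
    return f"{second_stage} + {wrapped}"


formula_kwargs = {"second_stage": "Y ~ X1", "fixed_effects": "f1 + f2"}
second_stage_formulaic = embed_fixed_effects(
    formula_kwargs["second_stage"], formula_kwargs["fixed_effects"]
)
print(second_stage_formulaic)
# Y ~ X1 + __fixed_effect__(f1) + __fixed_effect__(f2)
```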

Can you specify the reason why FORMULAIC_FEATURE_FLAG is DefaultFormulaParser.FeatureFlags.ALL in several spots in the code base? Why is it needed? Are there potential downsides?

This is a hangover from my early attempts to use formulaic's multistage syntax (see #1125). DefaultFormulaParser.FeatureFlags.ALL indicates that the multistage syntax is enabled but the FORMULAIC_FEATURE_FLAG is set to DefaultFormulaParser.FeatureFlags.DEFAULT (i.e., multistage syntax is disabled). For clarity, I have removed references to FORMULAIC_FEATURE_FLAG in the parser for now.

Is the sort argument in parse still needed?

Not needed, and I have removed it.

I committed a few changes. I hope all of these make sense to you @leostimpfle and are more or less self-explanatory from the commit messages?

Yes, all good. Thanks @s3alfisc!

leostimpfle added 4 commits December 28, 2025 09:53

Bypass FormulaParser

e6a7587

Reverse order to match hard-coded targets

f94a814

Fix pre-commit

3118d18

Freeze _MultipleEstimation

d0a8821

leostimpfle added 2 commits December 28, 2025 16:19

Sort independents by default for tests against fixest

100c357

Encode no fixed effects as None instead of '0'

f4b2ea0

leostimpfle and others added 20 commits December 28, 2025 18:03

Fix if fixed effects are None

2e93cbe

Fix encoding for multiple estimation of fixed effects

bf82eb6

Replace typing.Optional with union type

a928a6b

Close #1117

ce13140

Reorder checks to comply with test failures

c4d750a

Add new model matrix functionality

8e8e5fe

Add singleton warning

f75da04

Various fixes (did2s and i()-syntax still failing)

761ea08

Fix pre-commit

d79f4e9

Retain nulls in fixed effect encoding

c24f969

Refactor fixest::i, closes #782, fixes #921, fixes #1109

972eb66

Fix pre-commit

415f5bc

Deal with log-related infinities

e23e7b2

Drop intercept after matrix construction for fixed effects

9219a81

Monkey patch formulaic

f3b7e67

Encode fixed effects only when non-numeric

986d21d

Fix inference of reduced_rank

be7aa93

Use to_numpy

0e0402d

fix binning to keep values not specified in binning as is instead of NaN

0e7facf

adjust tests for i-interaction

31714ea

Fix pre-commit

ec30160

s3alfisc mentioned this pull request Jan 4, 2026

Feature/demean accelerated #995

Open

leostimpfle mentioned this pull request Jan 4, 2026

Advanced binning functionality for fixest::i syntax #1130

Open

leostimpfle added 2 commits January 4, 2026 15:50

Improve docs and function/attribute names

3395ce5

Merge branch 'master' into formula

5661b85

s3alfisc added 10 commits January 4, 2026 20:20

fix incorrect test expectation with IV and fixed effects

e67810e

fix incorrect ordering of fixed effect and IV part of formula

0b4de2d

test for expected behavior of 0 fixed effects in formula syntax

7065321

clarification on overlap between independent, endogenous, instruments

aa093f6

clarifications on overlap of dependent, endogenous, instruments

292b496

fix silent pass through of incorrect syntax of Y ~ X | f1 | f2 by catching it early

a520f06

only one tilde in part 2 permitted (same motif as before)

4ce3c29

is_multiple only checks dependent, independent, fixed effects for multiple estimation

532049b

consolidate multiple estimation flag setting & checks

3704dd9

add examples to specifications

1ee80af

leostimpfle added 3 commits January 5, 2026 09:11

Fix pre-commit

c21b0e9

Remove sort

65da109

Remove FORMULAIC_FEATURE_FLAG

647ad27

This was linked to issues Jan 6, 2026

IV with multiple endogenous variables #1117

Open

Mimic fixest::i() by relying on formulaic stateful transforms #782

Open

i() does not accept two categorical terms #921

Open

Reference level not omitted in i() syntax #1109

Open

s3alfisc mentioned this pull request Jan 11, 2026

pyfixest is MUCH slower than (Stata + Julia) reghdfejl #1042

Open

leostimpfle mentioned this pull request Jan 12, 2026

Sum of variables not parsed correctly in multiple estimation syntax #1137

Open

Fix #1137

5731196

leostimpfle linked an issue Jan 12, 2026 that may be closed by this pull request

Sum of variables not parsed correctly in multiple estimation syntax #1137

Open
