@leostimpfle
Collaborator

This is a proof of concept for a refactor of PyFixest's formula parsing. The PR introduces a new module parse that refactors formula parsing from the ground up.

The core logic is implemented in pyfixest.estimation.formula.parse.parse which takes in a formula string and returns a collection of parsed formulas represented by pyfixest.estimation.formula.parse.Formula.

All references to the old FormulaParser are bypassed, mostly by aliasing the new class to the old FixestFormula name with imports of the form from pyfixest.estimation.formula.parse import Formula as FixestFormula.

@codecov

codecov bot commented Dec 28, 2025

Codecov Report

❌ Patch coverage is 87.34694% with 62 lines in your changes missing coverage. Please review.

Files with missing lines                            Patch %   Lines
pyfixest/estimation/formula/factor_interaction.py   64.78%    25 Missing ⚠️
pyfixest/did/saturated_twfe.py                      0.00%     11 Missing ⚠️
pyfixest/estimation/formula/model_matrix.py         94.36%    8 Missing ⚠️
pyfixest/estimation/model_matrix_fixest_.py         12.50%    7 Missing ⚠️
pyfixest/estimation/formula/utils.py                33.33%    6 Missing ⚠️
pyfixest/estimation/formula/parse.py                97.42%    5 Missing ⚠️

Flag             Coverage Δ
core-tests       72.10% <87.34%> (?)
tests-extended   ?
tests-vs-r       18.51% <30.81%> (+0.95%) ⬆️

Files with missing lines                      Coverage Δ
pyfixest/did/did2s.py                         89.32% <100.00%> (+71.39%) ⬆️
pyfixest/errors/__init__.py                   100.00% <100.00%> (ø)
pyfixest/estimation/FixestMulti_.py           77.95% <100.00%> (+2.42%) ⬆️
pyfixest/estimation/FormulaParser.py          49.50% <100.00%> (-17.17%) ⬇️
pyfixest/estimation/fegaussian_.py            86.66% <100.00%> (+26.66%) ⬆️
pyfixest/estimation/feglm_.py                 79.16% <100.00%> (+52.60%) ⬆️
pyfixest/estimation/feiv_.py                  86.79% <100.00%> (+66.03%) ⬆️
pyfixest/estimation/felogit_.py               88.23% <100.00%> (+35.29%) ⬆️
pyfixest/estimation/feols_.py                 86.80% <100.00%> (+26.28%) ⬆️
pyfixest/estimation/feols_compressed_.py      80.00% <100.00%> (+58.78%) ⬆️
... and 13 more

... and 30 files with indirect coverage changes

@s3alfisc
Member

After a first look at the code base: much better and much cleaner than before. No fundamental suggestions for improvement from my side. Thank you!

@leostimpfle
Collaborator Author

Note to self: Look into allowing formulaic's multistage formula notation in the parser. For more background, see matthewwardrop/formulaic#108 and matthewwardrop/formulaic#24.

@s3alfisc
Member

s3alfisc commented Jan 1, 2026

@leostimpfle I fixed a bug in the bin() function and adjusted the tests. You can run them via pixi r -e dev pytest tests/test_i.py. They currently give two errors:

      40 passed
       2 failed
         - tests/test_i.py:286 test_factor_x_factor[Y ~ i(f_str, i.g)-Y ~ i(f_str, g)]
         - tests/test_i.py:304 test_factor_x_factor_with_fe[Y ~ i(f_str, i.g) | fe1-Y ~ i(f_str, g) | fe1]
E           AssertionError: Name mismatch:
E               py=['f_str::apple:g::X', 'f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y']
E               r=['f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y', 'f_str::cherry:g::Z']
E           assert ['f_str::appl...na:g::Z', ...] == ['f_str::appl...ry:g::X', ...]
E             
E             At index 0 diff: 'f_str::apple:g::X' != 'f_str::apple:g::Y'
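
To make the mismatch easier to read, the two name lists from the assertion message can be compared as sets. This is only a sketch over the values printed above; it does not touch PyFixest or R. The result shows which interaction cell each side omits, which would be consistent with the two implementations dropping a different reference cell, although the thread does not state the cause.

```python
# Column names copied verbatim from the assertion message above.
py_names = [
    "f_str::apple:g::X", "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y",
]
r_names = [
    "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y", "f_str::cherry:g::Z",
]

# Names produced by only one of the two implementations.
only_in_py = sorted(set(py_names) - set(r_names))
only_in_r = sorted(set(r_names) - set(py_names))

print("only in py:", only_in_py)  # ['f_str::apple:g::X']
print("only in r: ", only_in_r)   # ['f_str::cherry:g::Z']
```

Both sides produce eight columns; they differ only in that the Python side keeps apple:X and omits cherry:Z, while the R side does the reverse.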

@leostimpfle
Collaborator Author

@s3alfisc I've gone over the PR once more today. The biggest change is that I have rewritten pyfixest.estimation.formula.factor_interaction.factor_interaction to make the implementation (hopefully) easier to follow. More importantly, I have removed the monkey-patch of formulaic, so we do not actually need the drop argument discussed in matthewwardrop/formulaic#263.

I've also let Claude improve the docstrings and renamed a few functions/attributes for purely aesthetic reasons. The PR is now ready from my side (although there are many other potential improvements around formula parsing and model matrix construction: #1125, #1126, #1127, #1130).

@s3alfisc
Member

s3alfisc commented Jan 4, 2026

Review in 4 steps:

  • parse function
  • ModelMatrix class
  • new python i() interaction
  • downstream impact on did2s and saturated_twfe

@s3alfisc
Member

s3alfisc commented Jan 4, 2026

parse

  • I am still not a big fan of Formula.first_stage and Formula.second_stage not containing fixed effects - potentially misleading to users despite documentation? Maybe we should add Formula.first_stage_no_fixed_effects etc. as extra attributes to make more explicit what type of formula users are dealing with?
  • Can you specify the reason for the FORMULAIC_FEATURE_FLAG is DefaultFormulaParser.FeatureFlags.ALL in several spots in the code base? Why is it needed? Are there potential downsides?
  • Is the sort argument in parse still needed?

I committed a few changes. I hope all of these make sense to you @leostimpfle and are more or less self-explanatory from the commit messages?

@leostimpfle
Collaborator Author

leostimpfle commented Jan 5, 2026

  • I am still not a big fan of Formula.first_stage and Formula.second_stage not containing fixed effects - potentially misleading to users despite documentation? Maybe we should add Formula.first_stage_no_fixed_effects etc. as extra attributes to make more explicit what type of formula users are dealing with?

Agreed that this is somewhat unintuitive. An alternative to changing the attribute names could be to include the encoded fixed effects directly in the formula. For example, instead of formula_kwargs = {'second_stage': 'Y ~ X1', 'fixed_effects': 'f1 + f2'}, we could use formula_kwargs = {'second_stage': 'Y ~ X1 + __fixed_effect__(f1) + __fixed_effect__(f2)'}, where the sentinel __fixed_effect__ indicates the integer encoding of fixed effects. The main point is that the latter formula is what we already pass implicitly to formulaic, so in this approach we should call the attribute second_stage_formulaic.
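
As a rough illustration of the rewrite described here, the following sketch folds a separate fixed_effects entry into second_stage by wrapping each term in the __fixed_effect__ sentinel. The helper name inline_fixed_effects is hypothetical and not part of the PR, and splitting on "+" is a simplifying assumption that ignores anything more complex than a plain sum of terms.

```python
def inline_fixed_effects(formula_kwargs: dict[str, str]) -> dict[str, str]:
    """Fold a separate 'fixed_effects' entry into 'second_stage'.

    Hypothetical helper for illustration only; not part of PyFixest.
    """
    kwargs = dict(formula_kwargs)  # work on a copy, leave the input untouched
    fixed_effects = kwargs.pop("fixed_effects", "")
    # Assumption: fixed effects are a plain "+"-separated list of terms.
    terms = [t.strip() for t in fixed_effects.split("+") if t.strip()]
    if terms:
        wrapped = " + ".join(f"__fixed_effect__({t})" for t in terms)
        kwargs["second_stage"] = f"{kwargs['second_stage']} + {wrapped}"
    return kwargs


before = {"second_stage": "Y ~ X1", "fixed_effects": "f1 + f2"}
after = inline_fixed_effects(before)
print(after)
# {'second_stage': 'Y ~ X1 + __fixed_effect__(f1) + __fixed_effect__(f2)'}
```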

  • Can you specify the reason for the FORMULAIC_FEATURE_FLAG is DefaultFormulaParser.FeatureFlags.ALL in several spots in the code base? Why is it needed? Are there potential downsides?

This is a hangover from my early attempts to use formulaic's multistage syntax (see #1125). DefaultFormulaParser.FeatureFlags.ALL indicates that the multistage syntax is enabled, but FORMULAIC_FEATURE_FLAG is set to DefaultFormulaParser.FeatureFlags.DEFAULT (i.e., multistage syntax is disabled). For clarity, I have removed references to FORMULAIC_FEATURE_FLAG in the parser for now.
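
The distinction between the two flag values can be sketched with a stand-in enum. This is not formulaic's actual definition: the class below is a hypothetical stdlib enum.Flag, and every member name other than DEFAULT and ALL is an assumption made for illustration. It only mirrors the behavior described above, where ALL includes the multistage feature and DEFAULT does not.

```python
from enum import Flag, auto


class FeatureFlags(Flag):
    """Stand-in for illustration only; NOT formulaic's real class."""

    TWOSIDED = auto()    # assumed member name
    MULTIPART = auto()   # assumed member name
    MULTISTAGE = auto()  # assumed member name
    DEFAULT = TWOSIDED | MULTIPART
    ALL = TWOSIDED | MULTIPART | MULTISTAGE


def multistage_enabled(flags: FeatureFlags) -> bool:
    """Return True if the multistage bit is set in `flags`."""
    return bool(flags & FeatureFlags.MULTISTAGE)


print(multistage_enabled(FeatureFlags.ALL))      # True
print(multistage_enabled(FeatureFlags.DEFAULT))  # False
```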

  • Is the sort argument in parse still needed?

Not needed, and I have removed it.

I committed a few changes. I hope all of these make sense to you @leostimpfle and are more or less self-explanatory from the commit messages?

Yes, all good. Thanks @s3alfisc!

3 participants