
Commit 5b3a6c4

2 parents 9bceb42 + af18638 commit 5b3a6c4

File tree

5 files changed

+178
-3
lines changed

guides/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# Practical companion to "Verifying the existence of maximum likelihood estimates for generalized linear models" (Correia, Guimarães, Zylkin)
2+
3+
This companion consists of three documents, plus a suite of test datasets, that complement the paper:
4+
5+
> Sergio Correia, Paulo Guimarães, Thomas Zylkin: "Verifying the existence of maximum likelihood estimates for generalized linear models"
6+
7+
The documents are:
8+
9+
1. [*Primer on nonexistence of estimates and statistical separation for Poisson models*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_primer.md): introductory guide to understanding the non-existence problem, with a focus on Poisson models. Also discusses how to detect this issue, and explains solutions including our "iterative rectifier" method.
10+
2. [*Examples of nonexistence of estimates for Poisson, Logit, and Multinomial Logit models*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_examples.md): discusses several canonical examples of non-existence and how our "iterative rectifier" addresses them. Examples include Logit, Multinomial Logit, and Poisson. It also presents seventeen new Poisson examples that can be used to test software implementations of existing separation algorithms as well as to benchmark the performance of future algorithms.
11+
3. [*Nonexistence of estimates of Poisson models across different statistical packages*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md): documents how non-existence affects some of the most popular statistical packages (Stata, R, Julia, Matlab), with either non-convergence or convergence to incorrect solutions.
12+
13+
Also see:
14+
15+
- [*Main page for the `ppmlhdfe` Stata package*](https://github.com/sergiocorreia/ppmlhdfe), including some [undocumented options](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/undocumented.md) that can be used to illustrate and diagnose non-existence issues.
16+
- [*Suite of 17 Poisson examples exhibiting non-existence*](https://github.com/sergiocorreia/ppmlhdfe/tree/master/guides/separation_datasets)
17+

guides/code/7-example-tobit.do

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
* ===========================================================================
2+
* Example: using the iterative rectifier algorithm to detect separation in Tobit models
3+
* ===========================================================================
4+
cls
5+
clear all
6+
7+
* Load SST data (see ppml help file)
8+
use http://personal.lse.ac.uk/tenreyro/mock
9+
10+
* Notice that -tobit- converges to the wrong solution
11+
tobit y x z, ll(0)
12+
13+
* ppml help file example
14+
ppml y x z, check
15+
tobit y `e(included)' if e(sample)==1, ll(0)
16+
17+
* replicate the example using -ppmlhdfe-
18+
ppmlhdfe y x z
19+
tobit y x z if e(sample), ll(0)
20+
21+
* we can also use ppmlhdfe's diagnostic tool to detect which variables are driving separation and identify directions-of-recession
22+
ppmlhdfe y x z, tagsep(sep) zvar(z) r2
23+
24+
* Open question to readers: should we add a -check- option equivalent to:
25+
* ppmlhdfe y x*, tagsep(sep) zvar(z) r2
26+
* (in the spirit of ppml's check option)
27+
28+
exit
3.04 KB
Binary file not shown.

guides/nonexistence_benchmarks.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
66

77
To the best of our knowledge, no existing statistical software addresses the separation problem in a robust way, more so when working with fixed effects. In this post, we study a few simple examples of separation, and how they affect some of the most popular statistical packages.
88

9-
We also include [18 example datasets](https://github.com/sergiocorreia/ppmlhdfe/tree/master/test/separation_datasets) that can be used by package developers to test for correctness of their programs, and we invite further additions to this list.
9+
We also include [17 example datasets](https://github.com/sergiocorreia/ppmlhdfe/tree/master/guides/separation_datasets) that can be used by package developers to test for correctness of their programs, and we invite further additions to this list.
1010

1111
Note that this is in no way a critique of the packages discussed below, which are in our opinion of excellent quality.
1212
Merely, we are using them to show the fact that separation is not only a [theoretical](http://scorreia.com/research/separation.pdf) problem, but a practical one.

guides/nonexistence_examples.md

Lines changed: 132 additions & 2 deletions
@@ -2,7 +2,7 @@
22

33
- About `ppmlhdfe`: [Github Readme](https://github.com/sergiocorreia/ppmlhdfe/tree/master?tab=readme-ov-file#ppmlhdfe-poisson-pseudo-likelihood-regression-with-multiple-levels-of-fixed-effects) | [Working Paper](https://arxiv.org/abs/1903.01690) | [Stata Journal](https://doi.org/10.1177/1536867X20909691) | [Help File](http://scorreia.com/help/ppmlhdfe.html) | [Undocumented Options](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/undocumented.md)
44
- About Nonexistence: [Working Paper](https://arxiv.org/abs/1903.01633) | [Primer](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_primer.md) | [Examples](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_examples.md) | [Software Benchmarks](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md)
5-
- Sections: [Logit](#logit--logistic) | [Multinomial Logit](#multinomial-logit) | [Poisson](#poisson) | [Seventeen Examples](#seventeen-poisson-examples) | [References](#references)
5+
- Sections: [Logit](#logit--logistic) | [Multinomial Logit](#multinomial-logit) | [Poisson](#poisson) | [Seventeen Poisson Examples](#seventeen-poisson-examples) | [Tobit](#tobit-type-i-tobit-model) | [References](#references)
66

77
*(These examples complement [Verifying the existence of maximum likelihood estimates for generalized linear models](https://arxiv.org/abs/1903.01633); please see the links above for related guides.)*
88

@@ -355,6 +355,136 @@ As we can see, `ppmlhdfe` drops two observations as well as the variable `x2`. A
355355

356356
If you are a Stata user, you can run the script [`6-cgz-poisson-benchmarks.do`](code/6-cgz-poisson-benchmarks.do) in order to run all seventeen tests. Alternatively, it should be feasible to construct an equivalent for-loop in any statistical programming language.
357357

358+
## Tobit (Type I Tobit model)
359+
360+
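Santos Silva and Tenreyro (2011) discuss a Tobit model left-censored at zero that suffers from nonexistence. In particular, the model's likelihood keeps increasing as `b_{z} -> -∞`; i.e. as the coefficient for `z` diverges toward negative infinity (note the increasingly negative estimates for `z` in the output below), so no finite maximum exists.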
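
To see why the maximum is never attained, the following minimal Python sketch evaluates a Type I Tobit log-likelihood on a small made-up dataset (not the `mock` data) in which `z` is nonzero only for censored observations. The function names, the toy data, and the fixed values of the other parameters are assumptions chosen for illustration; this is not how `tobit` or `ppmlhdfe` work internally.

```python
import math


def norm_logpdf(t):
    """Log density of the standard normal distribution."""
    return -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(t):
    """Log CDF of the standard normal via erfc (fine for moderate t only)."""
    return math.log(0.5 * math.erfc(-t / math.sqrt(2.0)))


def tobit_loglik(y, x, z, b0, bx, bz, sigma):
    """Log-likelihood of a Tobit model left-censored at zero."""
    total = 0.0
    for yi, xi, zi in zip(y, x, z):
        xb = b0 + bx * xi + bz * zi
        if yi > 0:
            # Uncensored: normal density of the residual.
            total += norm_logpdf((yi - xb) / sigma) - math.log(sigma)
        else:
            # Censored at zero: probability that the latent outcome is <= 0.
            total += norm_logcdf(-xb / sigma)
    return total


# Hypothetical toy data: z equals 1 only for two censored observations.
y = [0.0, 0.0, 0.0, 1.2, 2.5, 0.7, 3.1, 1.9]
x = [0.3, -0.5, 0.1, 0.4, 1.0, -0.2, 1.5, 0.8]
z = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Hold the other parameters fixed (arbitrary values) and push b_z toward -inf.
grid = [0.0, -2.0, -4.0, -6.0, -8.0]
logliks = [tobit_loglik(y, x, z, 1.0, 0.5, bz, 1.5) for bz in grid]

# Value approached in the limit: drop the observations with z != 0.
keep = [i for i, zi in enumerate(z) if zi == 0.0]
limit = tobit_loglik([y[i] for i in keep], [x[i] for i in keep],
                     [z[i] for i in keep], 1.0, 0.5, 0.0, 1.5)

for bz, ll in zip(grid, logliks):
    print(f"b_z = {bz:6.1f}  log-likelihood = {ll:.6f}")
print(f"limit without separated observations = {limit:.6f}")
```

Under these assumptions, the log-likelihood rises with every more negative value of `b_z` and approaches, without reaching, the value obtained after dropping the separated observations.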
361+
362+
```stata
363+
. use http://personal.lse.ac.uk/tenreyro/mock
364+
365+
. tobit y x z, ll(0)
366+
<some output omitted...>
367+
Tobit regression Number of obs = 100
368+
Uncensored = 96
369+
Limits: Lower = 0 Left-censored = 4
370+
Upper = +inf Right-censored = 0
371+
372+
LR chi2(2) = 8.52
373+
Prob > chi2 = 0.0141
374+
Log likelihood = -191.09891 Pseudo R2 = 0.0218
375+
376+
------------------------------------------------------------------------------
377+
y | Coefficient Std. err. t P>|t| [95% conf. interval]
378+
-------------+----------------------------------------------------------------
379+
x | .2096114 .1752055 1.20 0.234 -.1380783 .557301
380+
z | -10.18944 382.0137 -0.03 0.979 -768.2833 747.9044
381+
_cons | 1.703695 .1754911 9.71 0.000 1.355438 2.051951
382+
-------------+----------------------------------------------------------------
383+
var(e.y)| 3.003131 .435032 2.252828 4.003321
384+
------------------------------------------------------------------------------
385+
```
386+
387+
In this example, the model "converged" at `b_{z} ≈ -10.19`, but a stricter tolerance returns a quite different value (`b_{z} ≈ -14.33`, with a far larger standard error) even though the log likelihood is unchanged:
388+
389+
390+
```stata
391+
. tobit y x z, ll(0) nrtol(1e-12)
392+
<some output omitted...>
393+
Tobit regression Number of obs = 100
394+
Uncensored = 96
395+
Limits: Lower = 0 Left-censored = 4
396+
Upper = +inf Right-censored = 0
397+
398+
LR chi2(2) = 8.52
399+
Prob > chi2 = 0.0141
400+
Log likelihood = -191.09891 Pseudo R2 = 0.0218
401+
402+
------------------------------------------------------------------------------
403+
y | Coefficient Std. err. t P>|t| [95% conf. interval]
404+
-------------+----------------------------------------------------------------
405+
x | .2095681 .1752046 1.20 0.235 -.1381197 .5572558
406+
z | -14.32966 463697.2 -0.00 1.000 -920206.2 920177.6
407+
_cons | 1.703772 .1754901 9.71 0.000 1.355518 2.052027
408+
-------------+----------------------------------------------------------------
409+
var(e.y)| 3.003099 .4350229 2.252811 4.003267
410+
------------------------------------------------------------------------------
411+
```
412+
413+
In this specific example, the `ppml` command can detect this issue:
414+
415+
```stata
416+
. ppml y x z, check
417+
418+
note: checking the existence of the estimates
419+
420+
Number of regressors excluded to ensure that the estimates exist: 1
421+
Excluded regressors: z
422+
Number of observations excluded: 2
423+
```
424+
425+
However, as noted in our [discussion of software packages](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md), the `check` option is only able to detect some specific instances of separation.
426+
427+
A more general alternative is to repurpose `ppmlhdfe` to detect and exclude separation. Notice how the subsequent `tobit` regression omits `z` due to collinearity and also excludes the two separated observations:
428+
429+
```stata
430+
. ppmlhdfe y x z
431+
(simplex method dropped 2 separated observations)
432+
note: 1 variable omitted because of collinearity: z
433+
<regression output omitted...>
434+
435+
. tobit y x z if e(sample), ll(0)
436+
note: z omitted because of collinearity.
437+
<some output omitted...>
438+
Tobit regression Number of obs = 98
439+
Uncensored = 96
440+
Limits: Lower = 0 Left-censored = 2
441+
Upper = +inf Right-censored = 0
442+
443+
LR chi2(1) = 1.42
444+
Prob > chi2 = 0.2331
445+
Log likelihood = -191.09891 Pseudo R2 = 0.0037
446+
447+
------------------------------------------------------------------------------
448+
y | Coefficient Std. err. t P>|t| [95% conf. interval]
449+
-------------+----------------------------------------------------------------
450+
x | .2095681 .1752046 1.20 0.235 -.1381645 .5573006
451+
z | 0 (omitted)
452+
_cons | 1.703772 .1754901 9.71 0.000 1.355473 2.052072
453+
-------------+----------------------------------------------------------------
454+
var(e.y)| 3.003099 .4350229 2.252728 4.003415
455+
------------------------------------------------------------------------------
456+
```
457+
458+
Finally, notice that we could even use `ppmlhdfe`'s diagnostic tool to identify the specific directions of recession, and the exact linear combination of regressors that is driving the separation:
459+
460+
```stata
461+
. ppmlhdfe y x z, tagsep(sep) zvar(z) r2
462+
(identifying separated observations instead of running regressions)
463+
<some output omitted...>
464+
(ReLU method dropped 2 separated observations in 1 iterations)
465+
466+
Verifying certificate of separation:
467+
. reghdfe z x z, noabsorb
468+
HDFE Linear regression Number of obs = 100
469+
Absorbing 1 HDFE group F( 2, 97) = .
470+
Prob > F = .
471+
R-squared = 1.0000
472+
Adj R-squared = 1.0000
473+
Within R-sq. = 1.0000
474+
Root MSE = 0.0000
475+
476+
------------------------------------------------------------------------------
477+
z | Coefficient Std. err. t P>|t| [95% conf. interval]
478+
-------------+----------------------------------------------------------------
479+
x | 2.08e-17 4.38e-17 0.48 0.636 -6.62e-17 1.08e-16
480+
z | 1 3.11e-16 3.2e+15 0.000 1 1
481+
_cons | 5.86e-17 4.40e-17 1.33 0.186 -2.87e-17 1.46e-16
482+
------------------------------------------------------------------------------
483+
```
484+
485+
As shown in the last regression, only `z` has a non-zero coefficient in the certificate of separation, so `ppmlhdfe` identifies `b_{z}` as the only coefficient that does not exist.
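
The conditions that such a certificate has to satisfy can also be checked directly. The Python sketch below assumes the sign convention in which the certificate is a linear combination of the regressors that is zero whenever `y > 0`, non-negative whenever `y = 0`, and strictly positive for at least one observation (flipping the sign of `gamma` gives the mirrored convention). The function name `check_certificate`, the toy data, and the tolerance are hypothetical and not part of `ppmlhdfe`.

```python
def check_certificate(y, rows, gamma, tol=1e-9):
    """Return (is_certificate, separated_indices) for cert = rows . gamma.

    Assumed convention: cert == 0 where y > 0, cert >= 0 where y == 0,
    and cert > 0 for at least one observation.
    """
    cert = [sum(g * v for g, v in zip(gamma, row)) for row in rows]
    separated = []
    for i, (yi, ci) in enumerate(zip(y, cert)):
        if yi > 0 and abs(ci) > tol:
            return False, []  # must vanish on positive outcomes
        if yi == 0 and ci < -tol:
            return False, []  # may not be negative on zero outcomes
        if yi == 0 and ci > tol:
            separated.append(i)  # strictly positive: separated observation
    return bool(separated), separated


# Hypothetical toy data; columns of each row are (x, z, constant).
toy_y = [0.0, 0.0, 0.0, 1.2, 2.5, 0.7, 3.1, 1.9]
toy_rows = [
    (0.3, 1.0, 1.0), (-0.5, 1.0, 1.0), (0.1, 0.0, 1.0), (0.4, 0.0, 1.0),
    (1.0, 0.0, 1.0), (-0.2, 0.0, 1.0), (1.5, 0.0, 1.0), (0.8, 0.0, 1.0),
]

# Candidate direction that loads only on z, then one that loads only on x.
ok, separated = check_certificate(toy_y, toy_rows, (0.0, 1.0, 0.0))
bad, _ = check_certificate(toy_y, toy_rows, (1.0, 0.0, 0.0))
print(ok, separated, bad)
```

With `gamma = (0, 1, 0)`, which mirrors the coefficients in the regression above, the certificate is simply the regressor `z` itself, and the observations where it is positive are the separated ones.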
486+
487+
358488
## References
359489

360490
- Palmgren (1981). "Models for the analysis of contingency tables with quantitative outcome variables". Biometrika, 68(3):563–576. https://www.jstor.org/stable/2335606
@@ -364,7 +494,7 @@ If you are a Stata user, you can run the script [`6-cgz-poisson-benchmarks.do`](
364494
- Correia, Guimarães, and Zylkin (2019). "Verifying the existence of maximum likelihood estimates for generalized linear models". arXiv Working Paper: https://arxiv.org/abs/1903.01633
365495
- Kosmidis and Schumacher (2021). "`detectseparation`: Detect and Check for Separation and Infinite Maximum Likelihood Estimates". https://cran.r-project.org/web/packages/detectseparation/
366496
- Kosmidis (2017). "`brglm2`: Bias Reduction in Multinomial Models". https://cran.r-project.org/web/packages/brglm2/vignettes/multinomial.html
367-
- Geyer (2009). "Likelihood Inference in Exponential Families and Directions of Recession." University of Minnesota, School of Statistics. http://www.stat.umn.edu/geyer/5421/notes/infinity.pdf
497+
- Geyer (2009). "Likelihood Inference in Exponential Families and Directions of Recession." University of Minnesota, School of Statistics. https://arxiv.org/pdf/0901.0455
368498
- Geyer (2025). Course notes. https://www.stat.umn.edu/geyer/5421/notes/infinity.html#complete-separation-example-of-agresti
369499
- Agresti (2012). "Categorical Data Analysis", 3rd Edition. Wiley.
370500
- Agresti (2015). "Foundations of Linear and Generalized Linear Models." Wiley.
