Merge branch 'master' of https://github.com/sergiocorreia/ppmlhdfe

sergiocorreia · sergiocorreia · commit 5b3a6c425e0e · 2025-11-02T01:01:55.000-05:00
diff --git a/guides/README.md b/guides/README.md
@@ -0,0 +1,17 @@
+# Practical companion to "Verifying the existence of maximum likelihood estimates for generalized linear models" (Correia, Guimarães, Zylkin)
+
+This companion consists of three documents, plus a suite of test datasets, that complement the paper:
+
+> Sergio Correia, Paulo Guimarães, Thomas Zylkin: "Verifying the existence of maximum likelihood estimates for generalized linear models"
+
+The documents are:
+
+1. [*Primer on nonexistence of estimates and statistical separation for Poisson models*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_primer.md): introductory guide to understanding the non-existence problem, with a focus on Poisson models. Also discusses how to detect this issue, and explains solutions including our "iterative rectifier" method.
+2. [*Examples of nonexistence of estimates for Poisson, Logit, and Multinomial Logit models*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_examples.md): discusses several canonical examples of non-existence and how our "iterative rectifier" addresses them. Examples include Logit, Multinomial Logit, and Poisson. Further presents seventeen new Poisson examples that can be used to test software implementations of existing separation algorithms as well as to benchmark the performance of future algorithms.
+3. [*Nonexistence of estimates of Poisson models across different statistical packages*](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md): documents how non-existence affects some of the most popular statistical packages (Stata, R, Julia, Matlab), with either non-convergence or convergence to incorrect solutions.
+
+Also see:
+
+- [*Main page for the `ppmlhdfe` Stata package*](https://github.com/sergiocorreia/ppmlhdfe), including some [undocumented options](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/undocumented.md) that can be used to illustrate and diagnose non-existence issues.
+- [*Suite of 17 Poisson examples exhibiting non-existence*](https://github.com/sergiocorreia/ppmlhdfe/tree/master/guides/separation_datasets)
+
diff --git a/guides/code/7-example-tobit.do b/guides/code/7-example-tobit.do
@@ -0,0 +1,28 @@
+* ===========================================================================
+* Example: using the iterative rectifier algorithm to detect separation in Tobit models
+* ===========================================================================
+cls
+clear all
+
+* Load SST data (see ppml help file)
+use http://personal.lse.ac.uk/tenreyro/mock
+
+* Notice that -tobit- converges to the wrong solution
+tobit y x z, ll(0)
+
+* ppml help file example
+ppml y x z, check
+tobit y `e(included)' if e(sample)==1, ll(0)
+
+* replicate the example using -ppmlhdfe-
+ppmlhdfe y x z
+tobit y x z if e(sample), ll(0)
+
+* we can also use ppmlhdfe's diagnostic tool to detect which variables are driving separation and identify directions-of-recession
+ppmlhdfe y x z, tagsep(sep) zvar(z) r2
+
+* Open question to readers: should we add a -check- option equivalent to:
+* ppmlhdfe y x*, tagsep(sep) zvar(z) r2
+* (in the spirit of ppml's check option)
+
+exit
diff --git a/guides/input/mock-sst2011-tobit.dta b/guides/input/mock-sst2011-tobit.dta
diff --git a/guides/nonexistence_benchmarks.md b/guides/nonexistence_benchmarks.md
@@ -6,7 +6,7 @@
 
 To the best of our knowledge, no existing statistical software addresses the separation problem in a robust way, more so when working with fixed effects. In this post, we study a few simple examples of separation, and how they affect some of the most popular statistical packages.
 
-We also include [18 example datasets](https://github.com/sergiocorreia/ppmlhdfe/tree/master/test/separation_datasets) that can be used by package developers to test for correctness of their programs, and we invite further additions to this list.
+We also include [17 example datasets](https://github.com/sergiocorreia/ppmlhdfe/tree/master/guides/separation_datasets) that can be used by package developers to test for correctness of their programs, and we invite further additions to this list.
 
 Note that this is in no way a critique of the packages discussed below, which are in our opinion of excellent quality.
 Merely, we are using them to show the fact that separation is not only a [theoretical](http://scorreia.com/research/separation.pdf) problem, but a practical one.
diff --git a/guides/nonexistence_examples.md b/guides/nonexistence_examples.md
@@ -2,7 +2,7 @@
 
 - About `ppmlhdfe`: [Github Readme](https://github.com/sergiocorreia/ppmlhdfe/tree/master?tab=readme-ov-file#ppmlhdfe-poisson-pseudo-likelihood-regression-with-multiple-levels-of-fixed-effects) | [Working Paper](https://arxiv.org/abs/1903.01690) | [Stata Journal](https://doi.org/10.1177/1536867X20909691) | [Help File](http://scorreia.com/help/ppmlhdfe.html) | [Undocumented Options](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/undocumented.md)
 - About Nonexistence: [Working Paper](https://arxiv.org/abs/1903.01633) | [Primer](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_primer.md) | [Examples](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_examples.md) | [Software Benchmarks](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md)
-- Sections: [Logit](#logit--logistic) | [Multinomial Logit](#multinomial-logit) | [Poisson](#poisson) | [Seventeen Examples](#seventeen-poisson-examples) | [References](#references)
+- Sections: [Logit](#logit--logistic) | [Multinomial Logit](#multinomial-logit) | [Poisson](#poisson) | [Seventeen Poisson Examples](#seventeen-poisson-examples) | [Tobit](#tobit-type-i-tobit-model) | [References](#references)
 
 *(These examples complement [Verifying the existence of maximum likelihood estimates for generalized linear models](https://arxiv.org/abs/1903.01633); please see the links above for related guides.)*
 
@@ -355,6 +355,136 @@ As we can see, `ppmlhdfe` drops two observations as well as the variable `x2`. A
 
 If you are a Stata user, you can run the script [`6-cgz-poisson-benchmarks.do`](code/6-cgz-poisson-benchmarks.do) in order to run all seventeen tests. Alternatively, it should be feasible to construct an equivalent for-loop in any statistical programming language.
 
+## Tobit (Type I Tobit model)
+
+Santos Silva and Tenreyro (2011) discuss a Tobit model left-censored at zero that suffers from nonexistence. In particular, the model's likelihood is maximized only in the limit `b_{z} -> -∞`; i.e., as the coefficient for `z` diverges to negative infinity.
+
+```stata
+. use http://personal.lse.ac.uk/tenreyro/mock
+
+. tobit y x z, ll(0)
+<some output omitted...>
+Tobit regression                                    Number of obs     =    100
+                                                           Uncensored =     96
+Limits: Lower =    0                                    Left-censored =      4
+        Upper = +inf                                   Right-censored =      0
+
+                                                    LR chi2(2)        =   8.52
+                                                    Prob > chi2       = 0.0141
+Log likelihood = -191.09891                         Pseudo R2         = 0.0218
+
+------------------------------------------------------------------------------
+           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
+-------------+----------------------------------------------------------------
+           x |   .2096114   .1752055     1.20   0.234    -.1380783     .557301
+           z |  -10.18944   382.0137    -0.03   0.979    -768.2833    747.9044
+       _cons |   1.703695   .1754911     9.71   0.000     1.355438    2.051951
+-------------+----------------------------------------------------------------
+     var(e.y)|   3.003131    .435032                      2.252828    4.003321
+------------------------------------------------------------------------------
+```
+
+In this example, the model "converged" at `b_{z} ≈ -10.19`, but a stricter tolerance returns a quite different value (`b_{z} ≈ -14.33`, with a far larger standard error) at the same log likelihood:
+
+
+```stata
+. tobit y x z, ll(0) nrtol(1e-12)
+<some output omitted...>
+Tobit regression                                    Number of obs     =    100
+                                                           Uncensored =     96
+Limits: Lower =    0                                    Left-censored =      4
+        Upper = +inf                                   Right-censored =      0
+
+                                                    LR chi2(2)        =   8.52
+                                                    Prob > chi2       = 0.0141
+Log likelihood = -191.09891                         Pseudo R2         = 0.0218
+
+------------------------------------------------------------------------------
+           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
+-------------+----------------------------------------------------------------
+           x |   .2095681   .1752046     1.20   0.235    -.1381197    .5572558
+           z |  -14.32966   463697.2    -0.00   1.000    -920206.2    920177.6
+       _cons |   1.703772   .1754901     9.71   0.000     1.355518    2.052027
+-------------+----------------------------------------------------------------
+     var(e.y)|   3.003099   .4350229                      2.252811    4.003267
+------------------------------------------------------------------------------
+```
+
+In this specific example, the `ppml` command can detect this issue:
+
+```stata
+. ppml y x z, check
+
+note: checking the existence of the estimates
+
+Number of regressors excluded to ensure that the estimates exist: 1
+Excluded regressors:  z
+Number of observations excluded: 2
+```
+
+However, as noted in our [discussion of software packages](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/nonexistence_benchmarks.md), the `check` option of `ppml` can only detect some specific instances of separation.
+
+A more general alternative is to repurpose `ppmlhdfe` to detect and exclude separation. Notice how the subsequent `tobit` regression omits `z` due to collinearity and also excludes the two separated observations:
+
+```stata
+. ppmlhdfe y x z
+(simplex method dropped 2 separated observations)
+note: 1 variable omitted because of collinearity: z
+<regression output omitted...>
+
+. tobit y x z if e(sample), ll(0)
+note: z omitted because of collinearity.
+<some output omitted...>
+Tobit regression                                    Number of obs     =     98
+                                                           Uncensored =     96
+Limits: Lower =    0                                    Left-censored =      2
+        Upper = +inf                                   Right-censored =      0
+
+                                                    LR chi2(1)        =   1.42
+                                                    Prob > chi2       = 0.2331
+Log likelihood = -191.09891                         Pseudo R2         = 0.0037
+
+------------------------------------------------------------------------------
+           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
+-------------+----------------------------------------------------------------
+           x |   .2095681   .1752046     1.20   0.235    -.1381645    .5573006
+           z |          0  (omitted)
+       _cons |   1.703772   .1754901     9.71   0.000     1.355473    2.052072
+-------------+----------------------------------------------------------------
+     var(e.y)|   3.003099   .4350229                      2.252728    4.003415
+------------------------------------------------------------------------------
+```
+
+Finally, notice that we could even use the diagnostic tool in `ppmlhdfe` to identify the specific directions of recession, and the exact linear combination of regressors that drives the separation:
+
+```stata
+. ppmlhdfe y x z, tagsep(sep) zvar(z) r2
+ (identifying separated observations instead of running regressions)
+<some output omitted...>
+(ReLU method dropped 2 separated observations in 1 iterations)
+
+Verifying certificate of separation:
+. reghdfe z x z, noabsorb
+HDFE Linear regression                            Number of obs   =        100
+Absorbing 1 HDFE group                            F(   2,     97) =          .
+                                                  Prob > F        =          .
+                                                  R-squared       =     1.0000
+                                                  Adj R-squared   =     1.0000
+                                                  Within R-sq.    =     1.0000
+                                                  Root MSE        =     0.0000
+
+------------------------------------------------------------------------------
+           z | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
+-------------+----------------------------------------------------------------
+           x |   2.08e-17   4.38e-17     0.48   0.636    -6.62e-17    1.08e-16
+           z |          1   3.11e-16  3.2e+15   0.000            1           1
+       _cons |   5.86e-17   4.40e-17     1.33   0.186    -2.87e-17    1.46e-16
+------------------------------------------------------------------------------
+```
+
+As shown in the last regression, `ppmlhdfe` identifies that only the coefficient on `z` does not exist: `z` is the only regressor with a non-zero coefficient in the certificate of separation.
+
+
 ## References
 
 - Palmgren (1981). "Models for the analysis of contingency tables with quantitative outcome variables". Biometrika, 68(3):563–576. https://www.jstor.org/stable/2335606
@@ -364,7 +494,7 @@ If you are a Stata user, you can run the script [`6-cgz-poisson-benchmarks.do`](
 - Correia, Guimarães, and Zylkin (2019). "Verifying the existence of maximum likelihood estimates for generalized linear models". arXiv Working Paper: https://arxiv.org/abs/1903.01633
 - Kosmidis and Schumacher (2021). "`detectseparation`: Detect and Check for Separation and Infinite Maximum Likelihood Estimates". https://cran.r-project.org/web/packages/detectseparation/
 - Kosmidis (2017). "`brglm2`: Bias Reduction in Multinomial Models". https://cran.r-project.org/web/packages/brglm2/vignettes/multinomial.html
-- Geyer (2009). "Likelihood Inference in Exponential Families and Directions of Recession." University of Minnesota, School of Statistics. http://www.stat.umn.edu/geyer/5421/notes/infinity.pdf
+- Geyer (2009). "Likelihood Inference in Exponential Families and Directions of Recession." University of Minnesota, School of Statistics. https://arxiv.org/pdf/0901.0455
 - Geyer (2025). Course notes. https://www.stat.umn.edu/geyer/5421/notes/infinity.html#complete-separation-example-of-agresti
 - Agresti (2012). "Categorical Data Analysis", 3rd Edition. Wiley.
 - Agresti (2015). "Foundations of Linear and Generalized Linear Models." Wiley.