Statistics

Descriptive statistics

Basic example

Gretl offers many statistical functions, but the summary command is the simplest way to compute descriptive statistics for a list of series. The following example uses the --simple option to compute only the most basic statistics:

open abdata.gdt --quiet
list Y = IND YEAR n
summary Y --simple

# Store the results as a matrix
matrix stats = $result
print stats

The output is:

                 Mean     Median       S.D.        Min        Max
IND             5.123      5.000      2.678      1.000      9.000
YEAR             1980       1980      2.583       1976       1984
n               1.056     0.8272      1.342     -2.263      4.687

stats (3 x 5)

             Mean       Median         S.D.          Min          Max 
 IND       5.1232       5.0000       2.6781       1.0000       9.0000 
YEAR       1980.0       1980.0       2.5830       1976.0       1984.0 
   n       1.0560      0.82724       1.3415      -2.2634       4.6873
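
The individual statistics are also available as functions, which is handy when only a single number is needed. The following is a minimal sketch, assuming the built-in functions mean(), median(), sd(), min() and max() applied to the series n from the same dataset. The scalar names n_mean, n_med and n_sd are chosen here for illustration only. The values should agree with the n row of the table above.

```hansl
open abdata.gdt --quiet

# Each function takes a series and returns a scalar
scalar n_mean = mean(n)
scalar n_med = median(n)
scalar n_sd = sd(n)

printf "mean = %.4f, median = %.4f, S.D. = %.4f\n", n_mean, n_med, n_sd
printf "min = %.4f, max = %.4f\n", min(n), max(n)
```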

Grouped statistics

By means of the --by=Series option, you can also compute statistics for each category of some other variable. The following example prints basic statistics for series n and w for each value of series IND (industry ID):

set verbose off
open abdata.gdt --quiet
list Y = n w
summary Y --by=IND --simple

The output for the first three industries is:

IND = 1 (n = 122):

                 Mean     Median       S.D.        Min        Max
n               1.234      1.095      1.172    -0.5942      4.099
w               3.186      3.183     0.1511      2.757      3.581

IND = 2 (n = 88):

                 Mean     Median       S.D.        Min        Max
n               1.039     0.9792      1.387     -2.104      3.223
w               3.410      3.409     0.1363      2.870      3.812

IND = 3 (n = 89):

                 Mean     Median       S.D.        Min        Max
n              0.7006     0.4324      1.199     -1.726      3.030
w               3.287      3.331     0.1640      2.910      3.614
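
If the group statistics are needed as numbers rather than as printed tables, one possible approach is to restrict the sample group by group. This is only a sketch: the loop over the first three industries and the matrix name ids are illustrative choices, not part of the original example.

```hansl
set verbose off
open abdata.gdt --quiet

matrix ids = values(IND)    # distinct industry IDs, sorted in ascending order
loop i = 1..3
    # keep only the observations of the i-th industry
    smpl IND == ids[i] --restrict --replace
    printf "IND = %d: mean(n) = %.3f, mean(w) = %.3f\n", ids[i], mean(n), mean(w)
endloop
smpl full    # restore the full sample
```

The printed means should match the Mean column of the summary output for the corresponding industry.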

Aggregation

The aggregate() function aggregates data by group (similar to pivot tables) using an aggregation function of your choice. The following example computes the mean of series n and w for each unique combination of the discrete series IND and YEAR (only the initial rows are shown):

open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR

matrix mean_values = aggregate(Y, groupby, "mean")
printf "\n%12.2f\n", mean_values

The output is:

         IND        YEAR       count           n           w
        1.00     1976.00        8.00        0.89        3.12
        1.00     1977.00       16.00        1.34        3.11
        1.00     1978.00       17.00        1.37        3.09
          .
          .
          .
        2.00     1976.00        8.00        1.51        3.58
        2.00     1977.00       12.00        1.14        3.50
        2.00     1978.00       12.00        1.13        3.44
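
Since aggregate() returns an ordinary matrix, the result can be processed further with matrix functions. The sketch below assumes that the first column holds the IND values, as in the output above, and uses selifr() to keep the rows of a single industry. It also shows the counting-only variant, where null is passed instead of a list. The names ind1 and counts are illustrative.

```hansl
open abdata.gdt --quiet
list Y = n w
list groupby = IND YEAR
matrix mean_values = aggregate(Y, groupby, "mean")

# Keep only the rows belonging to industry 1 (first column holds IND)
matrix ind1 = selifr(mean_values, mean_values[,1] .= 1)
printf "%12.2f\n", ind1

# Passing null as the first argument only counts the observations per group
matrix counts = aggregate(null, IND)
print counts
```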

Custom aggregate function

You can also pass your own custom aggregate function to aggregate(). The function must return a scalar value. Here is an example for the inter-quartile range:

set verbose off

function scalar iqr (const series y)
    /* Compute the interquartile range. */
    
    scalar result = quantile(y, 0.75) - quantile(y, 0.25)
    return result
end function

open mroz87.gdt --quiet
matrix result = aggregate(FAMINC, CIT, iqr)
printf "%12.2f\n", result

This returns the following table:

       byvar       count        f(x)
        0.00      269.00     9591.00
        1.00      484.00    13349.25
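
A custom function can be cross-checked by calling it directly on a restricted sample. The sketch below repeats the iqr() definition from above so that it runs on its own, and evaluates it for the observations with CIT equal to 0. Under the assumption that aggregate() applies the function to exactly this sub-sample, the result should reproduce the first row of the table.

```hansl
set verbose off

function scalar iqr (const series y)
    # interquartile range: 75th minus 25th percentile
    return quantile(y, 0.75) - quantile(y, 0.25)
end function

open mroz87.gdt --quiet
smpl CIT == 0 --restrict    # keep only the CIT = 0 observations
printf "IQR of FAMINC for CIT = 0: %.2f (obs = %d)\n", iqr(FAMINC), $nobs
smpl full                   # restore the full sample
```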

OLS regression

Estimation

The following example shows how to run a simple OLS regression and how to store post-estimation information.

set verbose off

open abdata.gdt --quiet

ols ys const n w   # optionally append --robust

matrix coeff = $coeff  # point estimates
matrix stderr = $stderr  # std. error
series uhat = $uhat  # residuals
series yhat = $yhat  # fitted values

# Print values
print coeff ~ stderr
print ys yhat uhat --byobs --range=:5

The output is:

Model 1: Pooled OLS, using 1031 observations
Included 140 cross-sectional units
Time-series length: minimum 7, maximum 9
Dependent variable: ys

             coefficient   std. error   t-ratio    p-value
  --------------------------------------------------------
  const      4.60388       0.0351033    131.2      0.0000  ***
  n          0.00626942    0.00217539     2.882    0.0040  ***
  w          0.00875566    0.0110959      0.7891   0.4302 

Mean dependent var   4.638015   S.D. dependent var   0.093961
Sum squared resid    9.015800   S.E. of regression   0.093650
R-squared            0.008551   Adjusted R-squared   0.006622
F(2, 1028)           4.433125   P-value(F)           0.012105
Log-likelihood       980.1866   Akaike criterion    −1954.373
Schwarz criterion   −1939.558   Hannan-Quinn        −1948.751
rho                  0.802880   Durbin-Watson        0.305346

      4.6039     0.035103 
   0.0062694    0.0021754 
   0.0087557     0.011096 

              ys         yhat         uhat
1:1                                       
1:2     4.561294     4.636576     -0.07528
1:3     4.578384     4.636651     -0.05827
1:4     4.601245     4.636334     -0.03509
1:5     4.610656     4.636581     -0.02592
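
The stored accessors are enough to rebuild other columns of the regression table. As a sketch, the t-ratios are the coefficients divided element-wise by their standard errors, and two-sided p-values follow from the t distribution with $df degrees of freedom. The matrix names tratio and pval are illustrative.

```hansl
set verbose off
open abdata.gdt --quiet
ols ys const n w --quiet

# element-wise division gives one t-ratio per coefficient
matrix tratio = $coeff ./ $stderr

# two-sided p-values from the t distribution with $df degrees of freedom
matrix pval = 2 * pvalue(t, $df, abs(tratio))
print tratio pval
```

The values should agree with the t-ratio and p-value columns printed by ols.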

Specification tests

The modtest command provides various specification tests that can be run after a model has been estimated. The reset command runs Ramsey's RESET test. Here are examples:

open abdata.gdt --quiet
ols ys const n w

modtest --normality --quiet
modtest --white --quiet
reset --squares-only --quiet
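
After a test command, the test statistic and its p-value can be picked up through the $test and $pvalue accessors, for example to build your own report. The following is a minimal sketch for two of the tests above:

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

modtest --white --quiet
printf "White test: statistic = %g, p-value = %.4f\n", $test, $pvalue

reset --squares-only --quiet
printf "RESET (squares only): statistic = %g, p-value = %.4f\n", $test, $pvalue
```

Each accessor refers to the most recent test, so read the values before running the next test command.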

Hypothesis testing

Gretl allows you to test hypotheses in a simple manner.

`omit` variables

First, you can call the omit command to test zero restrictions on coefficients. Here is a simple example that tests the removal of two variables by means of an F-test:

list X = n w
ols ys const X

# Test the restriction but do not re-estimate the model
omit X --test-only

# Test the restriction and re-estimate the model
omit X

The output is:

Test on Model 3:

  Null hypothesis: the regression parameters are zero for the variables
    n, w
  Test statistic: F(2, 1028) = 4.43312, p-value 0.0121052
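
The same joint hypothesis can be written as an explicit set of zero restrictions. This sketch assumes plain (non-robust) OLS standard errors, in which case the F statistic should coincide with the one reported by omit:

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

# Joint null: both slope coefficients are zero
restrict --quiet
    b[n] = 0
    b[w] = 0
end restrict
printf "F = %g, p-value = %g\n", $test, $pvalue
```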

Set of linear restrictions

The restrict block provides a powerful apparatus for testing sets of (non-)linear restrictions. The following example uses the --quiet option to suppress detailed output. You may also try the --bootstrap option:

restrict --quiet # --bootstrap
    b[w] = 0.005
    b[n] - b[w] = 0
end restrict

This returns:

Restriction set
 1: b[w] = 0.005
 2: b[n] - b[w] = 0

Test statistic: F(2, 1028) = 0.224807, with p-value = 0.798709
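
For linear restrictions, the F statistic can also be computed by hand from $coeff and $vcv. The sketch below writes the two restrictions as R*b = q, with the coefficients ordered as (const, n, w), and uses the standard Wald form divided by the number of restrictions. The names R, q, d, J, Fstat and pv are illustrative, and the result should match the restrict output above when non-robust standard errors are used.

```hansl
open abdata.gdt --quiet
ols ys const n w --quiet

# Row 1: b[w] = 0.005; row 2: b[n] - b[w] = 0
matrix R = {0, 0, 1; 0, 1, -1}
matrix q = {0.005; 0}

matrix d = R * $coeff - q    # deviation from the null
scalar J = rows(R)           # number of restrictions
scalar Fstat = (d' * inv(R * $vcv * R') * d) / J
scalar pv = pvalue(F, J, $df, Fstat)
printf "F(%d, %d) = %g, p-value = %g\n", J, $df, Fstat, pv
```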

Non-parametric test for differences between variables

This example shows how to run non-parametric tests for differences between variables. It uses simulated series.

##################
## Non-parametric difference tests
##################
set seed 1234    # only to ensure replicability
nulldata 100     # cross-sectional dataset

# Create some random variables
series y = normal(0, 2)   # expected value 0
series x = normal(10, 2)  # expected value 10
list L = y x              # a list of series, handy for the commands below

# Stats and plot
summary L --simple
boxplot L --output=display
freq y --normal --plot=display

# Non-parametric difference tests
help difftest			# see the help for information

difftest y x --sign   # Sign test -- less powerful
printf "\nP-value of the Sign-test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --rank-sum   # Wilcoxon rank-sum test (aka Mann-Whitney U test)
printf "\nP-value of the Wilcoxon rank-sum test = %.2f (test-stat = %g)\n", $pvalue, $test

difftest y x --signed-rank   # Wilcoxon signed-rank test

Parametric regression-based test for differences between categories

This example loads the well-known cross-sectional MROZ dataset. By means of an OLS regression with a level dummy, we test whether men earn higher wages in large cities than in small cities.

open mroz87.gdt

boxplot HW CIT --factorized --output=display \
  { set title "Wages of men in small and large cities" font ',15'; }

# Regress husband's wage on CIT (0: lives in small city, 1: lives in large city)
ols HW const CIT --robust   # standard errors robust to possible heteroskedasticity
printf "\nThe null hypothesis that wages in large cities are equal \n\
  to wages in small cities can be rejected at the %.2f pct. \n\
  significance level\n", 100 * $pvalue

# Run a restriction by hand
help restrict

# Test the null that hourly wages are on average $3 higher in large cities
restrict --bootstrap
    b[CIT] = 3
end restrict
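
With a constant and a single 0/1 dummy as the only regressors, the OLS slope equals the difference between the two group means. The following sketch checks this for the regression above by computing the means on restricted samples. The scalar names m1 and m0 are illustrative.

```hansl
open mroz87.gdt --quiet

smpl CIT == 1 --restrict
scalar m1 = mean(HW)    # mean wage, large cities

smpl CIT == 0 --restrict --replace
scalar m0 = mean(HW)    # mean wage, small cities

smpl full               # restore the full sample
ols HW const CIT --quiet
printf "difference in means = %g, b[CIT] = %g\n", m1 - m0, $coeff(CIT)
```

The two printed numbers should coincide, which shows that the regression-based test is a test on the difference in mean wages.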
